Descriptor Generation
We provide the interface DescriptorGenerator to define the high-level
behavior for transforming input blob data, in the form of a
smqtk_dataprovider.DataElement 1, into a descriptor (feature vector).
This interface also descends from the
smqtk_dataprovider.ContentTypeValidator 2 interface so that
implementations can declare which input data content types they can
accept for processing.
Thus, input DataElement instances must be of a content type that the
DescriptorGenerator supports; otherwise, an exception is raised when
the offending data element is reached.
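The lazy failure behavior described above can be illustrated with a minimal, self-contained sketch. It does not use the real smqtk classes: the `Blob` tuple, the `VALID_TYPES` set, and the `describe_all` and `collect_until_error` functions are hypothetical stand-ins, and the "descriptor" is a toy value. The point is only that a generator-based implementation raises when iteration reaches the unsupported element, not when the call is made.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# Hypothetical stand-in for a data element: (content_type, payload) pairs.
Blob = Tuple[str, bytes]

# Assumed set of supported content types for this toy generator.
VALID_TYPES: Set[str] = {"image/png", "image/jpeg"}


def describe_all(blobs: Iterable[Blob]) -> Iterator[List[float]]:
    """Yield one toy descriptor per blob, validating content types lazily."""
    for content_type, payload in blobs:
        if content_type not in VALID_TYPES:
            # Raised only once iteration reaches the offending element.
            raise ValueError(f"Unsupported content type: {content_type}")
        # Toy descriptor: payload length stands in for a real feature vector.
        yield [float(len(payload))]


def collect_until_error(blobs: Iterable[Blob]) -> Tuple[List[List[float]], str]:
    """Consume the generator, keeping results produced before any failure."""
    produced: List[List[float]] = []
    error = ""
    try:
        for vec in describe_all(blobs):
            produced.append(vec)
    except ValueError as ex:
        error = str(ex)
    return produced, error
```

With this sketch, elements ahead of an unsupported one are still described, which is worth keeping in mind when results are consumed incrementally.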
Descriptors may be generated most simply as numpy.ndarray arrays via
the DescriptorGenerator.generate_arrays() method.
An additional layer of wrapping into a DescriptorElement may be
invoked via DescriptorGenerator.generate_elements().
- 1
TODO: fill in appropriate link to DataElement interface under https://smqtk-dataprovider.readthedocs.io/
- 2
TODO: fill in appropriate link to ContentTypeValidator interface under https://smqtk-dataprovider.readthedocs.io/
Bundled Implementation Model Details
The DescriptorGenerator interface does not define a model building
method, but some implementations require internal models.
Below are explanations of how to build or obtain models for
DescriptorGenerator implementations that require a model.
Caffe 1.0 Default Image Net
The CaffeDescriptorGenerator implementation does not come with a method of training its own models, but requires model files provided by Caffe:
the network model file and the image mean binary protobuf file.
The Caffe source tree provides two scripts to download the specific files (relative to the caffe source tree):
# Downloads the network model file
scripts/download_model_binary.py models/bvlc_reference_caffenet
# Downloads the ImageNet mean image binary protobuf file
data/ilsvrc12/get_ilsvrc_aux.sh
These scripts effectively just download files from a specific source.
If the Caffe source tree is not available, the model files can be downloaded from the following URLs:
Reference
- class smqtk_descriptors.interfaces.descriptor_generator.DescriptorGenerator(*args: Any, **kwargs: Any)
Base abstract Feature Descriptor interface.
- generate_arrays(data_iter: Iterable[smqtk_dataprovider.interfaces.data_element.DataElement]) -> Iterable[numpy.ndarray]
Generate descriptor vector elements for all input data elements.
Descriptor arrays yielded out will be parallel in association with the data elements input.
Selective Iteration
For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single array out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_arrays([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_arrays([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators.
- Parameters
data_iter – Iterable of DataElement instances to be described.
- Raises
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
- Returns
Iterator of result numpy.ndarray instances.
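The "Selective Iteration" note can be demonstrated with plain Python generators. The sketch below uses a hypothetical `toy_generate` function (not the real `generate_arrays`) with clean-up code placed after its final `yield`, and records whether that code ran under each access pattern.

```python
from typing import Iterable, Iterator, List

# Records whether the post-yield clean-up code has executed.
cleanup_log: List[str] = []


def toy_generate(items: Iterable[int]) -> Iterator[int]:
    """Hypothetical generator with clean-up work after its final yield."""
    for item in items:
        yield item * 2
    # Runs only if the caller iterates past the last yielded value.
    cleanup_log.append("cleaned up")


# Pattern 1: next() stops at the first yield, so the clean-up has not run.
gen = toy_generate([21])
first_via_next = next(gen)
ran_after_next = list(cleanup_log)  # snapshot taken before pattern 2

# Pattern 2: list() exhausts the generator, so the clean-up runs.
first_via_list = list(toy_generate([21]))[0]
ran_after_list = list(cleanup_log)
```

Both patterns return the same value, but only the list expansion drives the generator to completion. A generator that is abandoned after `next()` is closed at its suspended `yield`, so the trailing code is skipped.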
- generate_elements(data_iter: Iterable[smqtk_dataprovider.interfaces.data_element.DataElement], descr_factory: smqtk_descriptors.descriptor_element_factory.DescriptorElementFactory = <smqtk_descriptors.descriptor_element_factory.DescriptorElementFactory object>, overwrite: bool = False) -> Generator[smqtk_descriptors.interfaces.descriptor_element.DescriptorElement, None, None]
Generate DescriptorElement instances for the input data elements, generating new descriptors for those elements that need them, or optionally all input data elements.
Descriptor elements yielded out will be parallel in association with the data elements input. Descriptor element UUIDs are inherited from the data element it was generated from.
Selective Iteration
For situations when it is desired to access specific generator returns, like when only one data element is provided in order to get a single element out, it is strongly recommended to expand the returned generator into a sequence type first. For example, expanding out the generator’s returns into a list (list(g.generate_elements([e]))[0]) is recommended over just getting the “next” element of the returned generator (next(g.generate_elements([e]))). Expansion into a sequence allows the generator to fully execute, which includes any functionality after the final yield statement in any of the underlying iterators that may perform required clean-up.
Non-redundant Processing
Certain descriptor element implementations, as dictated by the input factory, may be connected to persistent storage in the background. Because of this, some descriptor elements may already “have” a vector on construction. This method, by default, only computes new descriptor vectors for data elements whose associated descriptor element does not report as already containing a vector. If the overwrite flag is True then descriptors are computed for all input data elements and are set to their respective descriptor elements regardless of existing vector storage.
- Parameters
data_iter – Iterable of DataElement instances to be described.
descr_factory – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
overwrite – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate descriptors for all input data elements, overwriting the vectors previously stored in the factory-produced descriptor elements.
- Raises
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
IndexError – Underlying vector-producing generator either under or over produced vectors.
- Returns
Iterator of result DescriptorElement instances. UUIDs of generated DescriptorElement instances will reflect the UUID of the DataElement it was generated from.
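The "Non-redundant Processing" behavior and the `overwrite` flag can be sketched with toy stand-ins. Everything in this block is hypothetical: `ToyElement` imitates a descriptor element backed by shared storage, and `toy_generate_elements` is a simplified per-element loop written for illustration, not the library's implementation.

```python
from typing import Callable, Dict, Iterator, List, Optional

Vector = List[float]


class ToyElement:
    """Hypothetical stand-in for a descriptor element with backing storage."""

    def __init__(self, uuid: str, store: Dict[str, Vector]) -> None:
        self.uuid = uuid  # inherited from the data element's UUID
        self._store = store

    def has_vector(self) -> bool:
        return self.uuid in self._store

    def vector(self) -> Optional[Vector]:
        return self._store.get(self.uuid)

    def set_vector(self, vec: Vector) -> None:
        self._store[self.uuid] = vec


def toy_generate_elements(
    data: Dict[str, bytes],
    store: Dict[str, Vector],
    compute: Callable[[bytes], Vector],
    overwrite: bool = False,
) -> Iterator[ToyElement]:
    """Yield one element per input, computing only where needed."""
    for uuid, payload in data.items():
        elem = ToyElement(uuid, store)
        # Skip computation when a vector already exists, unless overwriting.
        if overwrite or not elem.has_vector():
            elem.set_vector(compute(payload))
        yield elem


calls: List[str] = []  # tracks which payloads were actually described


def compute(payload: bytes) -> Vector:
    calls.append(payload.decode())
    return [float(len(payload))]


store: Dict[str, Vector] = {"a": [9.0]}  # "a" already has a stored vector
data = {"a": b"xx", "b": b"yyy"}

default_run = [e.vector() for e in toy_generate_elements(data, store, compute)]
calls_default = list(calls)
overwrite_run = [
    e.vector() for e in toy_generate_elements(data, store, compute, overwrite=True)
]
```

In the default run only the element without a stored vector is computed, while the overwrite run recomputes both and replaces the previously stored vector.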
- generate_one_array(data_elem: smqtk_dataprovider.interfaces.data_element.DataElement) -> numpy.ndarray
Convenience wrapper around generate_arrays for the single-input case.
See the documentation for the DescriptorGenerator.generate_arrays() method for more information.
- Parameters
data_elem (smqtk.representation.DataElement) – DataElement instance to be described.
- Raises
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
- Returns
Descriptor vector for the given data as a numpy.ndarray instance.
- Return type
numpy.ndarray
- generate_one_element(data_elem: smqtk_dataprovider.interfaces.data_element.DataElement, descr_factory: smqtk_descriptors.descriptor_element_factory.DescriptorElementFactory = <smqtk_descriptors.descriptor_element_factory.DescriptorElementFactory object>, overwrite: bool = False) -> smqtk_descriptors.interfaces.descriptor_element.DescriptorElement
Convenience wrapper around generate_elements for the single-input case.
See the documentation for the DescriptorGenerator.generate_elements() method for more information.
- Parameters
data_elem – DataElement instance to be described.
descr_factory – DescriptorElementFactory instance to drive the generation of element instances with some parametrization.
overwrite – By default, if a factory-produced DescriptorElement reports as containing a vector, we do not compute a descriptor again for the associated data element. If this is True, however, we will generate descriptors for all input data elements, overwriting the vectors previously stored in the factory-produced descriptor elements.
- Raises
IndexError – Underlying vector-producing generator either under or over produced vectors.
RuntimeError – Descriptor extraction failure of some kind.
ValueError – Given data element content was not of a valid type with respect to this descriptor generator implementation.
- Returns
Result DescriptorElement instance. UUID of the generated DescriptorElement instance will reflect the UUID of the DataElement it was generated from.
- class smqtk_descriptors.impls.descriptor_generator.caffe1.CaffeDescriptorGenerator(*args: Any, **kwargs: Any)
Compute images against a Caffe model, extracting a layer as the content descriptor.
- Parameters
network_prototxt – Data element containing the text file defining the network layout.
network_model – Data element containing the trained .caffemodel file to use.
image_mean – Optional data element containing the image mean .binaryproto or .npy file.
return_layer – The label of the layer we take data from to compose output descriptor vector.
batch_size – The maximum number of images to process in one feed forward of the network. This is especially important for GPUs since they can only process a batch that will fit in the GPU memory space.
use_gpu – If Caffe should try to use the GPU.
gpu_device_id – Integer ID of the GPU device to use. Only used if use_gpu is True.
network_is_bgr – If the network is expecting BGR format pixels. For example, the BVLC default caffenet does (thus the default is True).
data_layer – String label of the network’s data layer. We assume it is ‘data’ by default.
load_truncated_images – If we should be lenient and force loading of truncated image bytes. This is False by default.
pixel_rescale – Re-scale image pixel values before being transformed by caffe (before mean subtraction, etc) into the given tuple (min, max) range. By default, images are loaded in the [0, 255] range. Refer to the image mean being used for desired input pixel scale.
input_scale – Optional floating-point scalar value to scale values of caffe network input data AFTER mean subtraction. This value is directly multiplied against the pixel values.
threads – Optional specific number of threads to use for data loading and pre-processing. If this is None or 0, we introspect the current system thread capacity and use that.
- Raises
AssertionError – Optionally provided image mean protobuf consisted of more than one image, or its shape was neither 1 nor 3 channels.
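The ordering implied by the pixel_rescale and input_scale descriptions (re-scale first, then mean subtraction, then multiplication by the input scale) can be sketched on plain Python values. This is a rough illustration only: the `rescale` and `preprocess` helpers are hypothetical, the linear min-max mapping is an assumption about what "re-scale" means here, and real inputs would be image arrays rather than flat lists.

```python
from typing import List, Optional, Sequence, Tuple


def rescale(
    value: float,
    out_range: Tuple[float, float],
    in_range: Tuple[float, float] = (0.0, 255.0),
) -> float:
    """Linearly map a value from in_range to out_range (assumed behavior)."""
    in_min, in_max = in_range
    out_min, out_max = out_range
    return (value - in_min) / (in_max - in_min) * (out_max - out_min) + out_min


def preprocess(
    pixels: Sequence[float],
    mean: Sequence[float],
    pixel_rescale: Optional[Tuple[float, float]] = None,
    input_scale: Optional[float] = None,
) -> List[float]:
    """Apply the three steps in the order the parameter docs describe."""
    out: List[float] = []
    for pixel, mean_value in zip(pixels, mean):
        value = float(pixel)
        if pixel_rescale is not None:
            value = rescale(value, pixel_rescale)  # step 1: before mean subtraction
        value -= mean_value  # step 2: mean subtraction
        if input_scale is not None:
            value *= input_scale  # step 3: after mean subtraction
        out.append(value)
    return out
```

The sketch also shows why the mean must match the chosen pixel scale: a mean expressed in the [0, 255] range would dominate pixels that were re-scaled to a unit range.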
- get_config() -> Dict[str, Any]
Return a JSON-compliant dictionary that could be passed to this class’s from_config method to produce an instance with identical configuration.
In the common case, this involves naming the keys of the dictionary based on the initialization argument names as if it were to be passed to the constructor via dictionary expansion.
- Returns
JSON type compliant configuration dictionary.
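The get_config / from_config round trip described above can be sketched with a toy class. `ToyGenerator` is hypothetical and is not part of smqtk_descriptors; its `from_config` here is a simplified stand-in that only performs dictionary expansion into the constructor, and the argument values are arbitrary.

```python
import json
from typing import Any, Dict


class ToyGenerator:
    """Hypothetical configurable class used only to illustrate the pattern."""

    def __init__(self, return_layer: str, batch_size: int, use_gpu: bool) -> None:
        self.return_layer = return_layer
        self.batch_size = batch_size
        self.use_gpu = use_gpu

    def get_config(self) -> Dict[str, Any]:
        # Keys mirror the constructor argument names so that the dictionary
        # can be expanded directly into the constructor.
        return {
            "return_layer": self.return_layer,
            "batch_size": self.batch_size,
            "use_gpu": self.use_gpu,
        }

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ToyGenerator":
        # Simplified: construct via dictionary expansion.
        return cls(**config)


original = ToyGenerator(return_layer="fc7", batch_size=8, use_gpu=False)

# JSON compliance means the configuration survives serialization unchanged.
serialized = json.dumps(original.get_config())
clone = ToyGenerator.from_config(json.loads(serialized))
```

Keeping the configuration JSON-compliant is what allows it to be stored in a file and used later to rebuild an equivalent instance.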