src.model.deeplearn.sequencer.dl_offline_sequencer

Classes

DLOfflineSequencer(X, y, batch_size, **kwargs)

class src.model.deeplearn.sequencer.dl_offline_sequencer.DLOfflineSequencer(X, y, batch_size, **kwargs)

Author:: Alberto M. Esmoris Pena

A deep learning sequencer that decorates a backbone sequencer and serializes the data so it can be read and fed directly into a neural network. This sequencer allows using training datasets that cannot be held entirely in memory.

The internal structure of an offline storage file (ofs) is:

– ofs["data"]

Group containing the chunks of training data.

—ofs["data"]["chunk_idx"]

Group containing the training data of the chunk with the given index. Note that "chunk_idx" ranges from 0 to the number of chunks minus one, i.e., \(0, 1, 2, \ldots, \text{number of chunks}-1\).

—ofs["data"]["chunk_idx"]["elem_idx"]

All the training data elements in the chunk. Note that "elem_idx" identifies one of the input elements. For example, assume a neural network that receives two inputs, e.g., the structure space and the feature space. Then, the tensor representing the structure space will have "elem_idx" = "X0" and the tensor representing the feature space "elem_idx" = "X1". For convenience the elements are named as \(X_0, \ldots, X_{\text{number of data elements}-1}\).

—ofs["references"]

Group containing the chunks of training references.

—ofs["references"]["chunk_idx"]

Group containing the training references of the chunk with the given index. Note that "chunk_idx" ranges from 0 to the number of chunks minus one, i.e., \(0, 1, 2, \ldots, \text{number of chunks}-1\).

—ofs["references"]["chunk_idx"]["elem_idx"]

The training reference elements in the chunk. For standard point-wise classification tasks there is only one single element named "Y0". In other cases, the elements will be indexed as \(Y_0, \ldots, Y_{\text{number of reference elements}-1}\).

—ofs["ordered_keys"]

The keys for the data elements, given in the same order in which the sequencer must yield them so they can be fed straightforwardly into the neural network.
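The hierarchy described above can be sketched with plain nested dictionaries. This is only an illustration of the layout, not the real storage backend: the helper `build_layout`, the use of string keys for chunk indices, and zero-based chunk numbering are assumptions made for the example.

```python
# Plain dicts stand in for the storage groups; this is NOT the real backend.
def build_layout(num_chunks, data_keys, ref_keys):
    """Return a nested dict mirroring the documented group hierarchy."""
    ofs = {
        "data": {},
        "references": {},
        # Order in which the data elements are fed to the network
        "ordered_keys": list(data_keys),
    }
    for chunk_idx in range(num_chunks):
        # Assumption: chunk groups are keyed by the chunk index as a string
        ofs["data"][str(chunk_idx)] = {key: [] for key in data_keys}
        ofs["references"][str(chunk_idx)] = {key: [] for key in ref_keys}
    return ofs


# Two inputs (structure space "X0", feature space "X1") and one reference "Y0"
layout = build_layout(num_chunks=2, data_keys=["X0", "X1"], ref_keys=["Y0"])
print(sorted(layout["data"]))
print(sorted(layout["data"]["0"]))
print(layout["ordered_keys"])
```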

Variables:

offline_storage (str) – Path to the offline storage file.
chunk_size (int) – How many batches per chunk. Note that each chunk must be small enough to be stored in the available memory, i.e., each chunk will be held in memory during training.
chunk_randomization (bool) – When enabled (True), the offline training sequencer will iterate sequentially over the chunks in the first pass but in a different random order in each pass after the first one. If disabled (False, default), the offline sequencer will always iterate sequentially (i.e., for all passes).
batch_randomization (bool) – When enabled (True), the offline training sequencer will iterate sequentially over the elements of a batch in the first pass but in a different random order in each pass after the first one. If disabled (False, default), the offline sequencer will always iterate sequentially (i.e., for all passes). Note that a pass refers to calling __getitem__ once for the whole dataset, i.e., passes are chunk-wise, not batch-wise.
offline_pcloud (None or list of str) – Paths to the many point clouds that must be pre-processed and included into the offline storage file. Note that only the neural network pre-processor will be applied, i.e., previous components of the pipeline (wrt the deep learning model) are applied to the original input point cloud but not to the extra point clouds specified through this argument.
disable_offline_storage_writing (bool) – Whether to allow writing operations in the offline storage (typically inserting new data, False by default) or not (True).
mh – The model handler handling the deep learning model for which the offline sequencer must yield training data and references. The model handler is optional in general but necessary for most real training pipelines.
instantiated (bool) – Flag tracking whether the offline sequencer has been instantiated (True, after calling __init__) or not (False, at the beginning of __init__).
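The randomization rule described for chunk_randomization (sequential first pass, shuffled later passes) can be sketched as follows. The function `chunk_order`, its `pass_idx` argument, and the use of `random.Random` are hypothetical names chosen for this illustration; the actual implementation may differ.

```python
import random


def chunk_order(num_chunks, pass_idx, chunk_randomization, rng):
    """Return the order in which chunks are visited during a given pass."""
    order = list(range(num_chunks))
    # The first pass (pass_idx == 0) is always sequential; later passes are
    # shuffled only when randomization is enabled.
    if chunk_randomization and pass_idx > 0:
        rng.shuffle(order)
    return order


rng = random.Random(0)
first_pass = chunk_order(4, 0, True, rng)   # sequential
later_pass = chunk_order(4, 1, True, rng)   # some permutation of 0..3
no_shuffle = chunk_order(4, 3, False, rng)  # sequential in every pass
```

The same rule would apply within a batch when batch_randomization is enabled, with elements in place of chunks.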

__init__(X, y, batch_size, **kwargs)

Initialize the member attributes of DLOfflineSequencer.

Parameters:

X – The input data.
y – The input reference values.
batch_size (int) – The number of elements per batch.
kwargs – The key-word specification to parametrize the sequencer.

set_input_data(X, y)

Delegate the logic to the backbone.

See DLAbstractSequencer.set_input_data().

get_input_data()

Delegate the logic to the backbone.

See DLAbstractSequencer.get_input_data().

getitem_training(idx): See DLAbstractSequencer.getitem_training().

on_epoch_end_training(): See DLAbstractSequencer.on_epoch_end_training().

getitem_predict(idx)

Delegate the logic to the backbone.

See DLAbstractSequencer.getitem_predict().

on_epoch_end_predict()

Delegate the logic to the backbone.

See DLAbstractSequencer.on_epoch_end_predict().

is_offline_storage_open()

Check whether the offline storage is currently open.

Returns:: True if the offline storage is currently open, False otherwise.
Return type:: bool

open_offline_storage()

Open the offline storage.

Returns:: Nothing, but the offline storage is assigned to the member cache variable self.osf (offline storage file).

are_pclouds_loaded_in_offline_storage()

Check whether offline point clouds have been loaded into the offline storage (True) or not (False).

Returns:: True if offline point clouds are loaded in the offline storage, False otherwise.
Return type:: bool

load_pclouds_in_offline_storage()

Load the point clouds (the one already loaded in the backbone and those whose paths are given in the offline_pcloud list). Note that if disable_offline_storage_writing is set to True, no point cloud will be loaded into the offline storage at all.

Returns:: Nothing, but the offline storage file will contain training data derived from the point clouds in the offline sequencer.

load_backbone_pcloud_in_offline_storage()

Load the training data from the point cloud currently handled by the backbone sequencer into the offline storage file.

Returns:: Nothing, but the point cloud in the backbone will be loaded into the offline storage file (osf).

insert_chunk_into_offline_storage_file(chunk_idx, chunk, chunk_ref, chunk_sw=None)

Insert the given chunk into the offline storage file.

Parameters:

chunk_idx (int) – Index of the chunk to be inserted.
chunk (list of list) – The data of the chunk to be inserted. Note that chunk[k][j] reads the k-th element of the j-th batch.
chunk_ref (list) – The references of the chunk to be inserted. Note that chunk_ref[j] reads the references of the j-th batch.
chunk_sw (list or None) – Optional per-batch sample weights, one entry per emitted batch (or empty when the backbone did not supply sample weights). Persisted as the W0 dataset inside the reference group so the masking contract survives a round-trip through offline storage.

Returns:

The index for the next chunk.

Return type:

int
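A minimal sketch of the indexing convention, using a dictionary as a stand-in for the offline storage file. The function `insert_chunk` below is hypothetical, and returning `chunk_idx + 1` as the next index is an assumption; only the `chunk[k][j]` and `chunk_ref[j]` conventions and the "W0" name come from the description above.

```python
def insert_chunk(storage, chunk_idx, chunk, chunk_ref, chunk_sw=None):
    """Store one chunk in a dict-based stand-in and return the next index."""
    key = str(chunk_idx)
    # chunk[k] holds all batches of the k-th input element, stored as "Xk",
    # so chunk[k][j] is the k-th element of the j-th batch.
    storage["data"][key] = {f"X{k}": list(elem) for k, elem in enumerate(chunk)}
    # chunk_ref[j] holds the references of the j-th batch, stored as "Y0".
    refs = {"Y0": list(chunk_ref)}
    if chunk_sw:  # None or an empty list means no sample weights were supplied
        refs["W0"] = list(chunk_sw)
    storage["references"][key] = refs
    return chunk_idx + 1  # assumption: chunks are numbered consecutively


storage = {"data": {}, "references": {}}
# Two input elements (structure, features), two batches each
chunk = [["struct_b0", "struct_b1"], ["feat_b0", "feat_b1"]]
next_idx = insert_chunk(storage, 0, chunk, ["y_b0", "y_b1"])
```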

load_pcloud_in_backbone(pcloud_path)

Load the point cloud at the given path into the backbone, after pre-processing it with the pre-processor in the model’s architecture.

Parameters:: pcloud_path – The path of the point cloud to be loaded (after pre-processing) in the backbone.
Returns:: Nothing, but the backbone is updated with the point cloud in the given path.

is_offline_storage_file_initialized()

Check whether the offline storage file (osf) has been initialized (i.e., it has the expected HDF5 groups associated with the corresponding keys: references, data, ordered_keys).

Returns:: True if the offline storage file has been initialized, False otherwise.
Return type:: bool

initialize_offline_storage_file()

Initialize the offline storage file registering groups "data" and "references" and the dataset "ordered_keys".

Returns:: Nothing at all, but the offline storage file is modified.
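The initialization and its corresponding check can be sketched on any mapping-like object. Here a plain dictionary stands in for the storage file, and the helper names `initialize` and `is_initialized` are hypothetical; the three keys are the ones named in the descriptions above.

```python
EXPECTED_KEYS = ("data", "references", "ordered_keys")


def initialize(storage):
    """Register the two groups and the ordered-keys entry if missing."""
    storage.setdefault("data", {})
    storage.setdefault("references", {})
    storage.setdefault("ordered_keys", [])


def is_initialized(storage):
    """Return True only when every expected key is present."""
    return all(key in storage for key in EXPECTED_KEYS)


storage = {}
before = is_initialized(storage)
initialize(storage)
after = is_initialized(storage)
```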

getitem_training_from_offline_storage(idx)

Get the next training data item from the offline storage to feed it into the neural network.

Parameters:: idx – The index of the batch that must be obtained. Note that for the offline sequencer this index is ignored and, instead, it is tracked internally considering chunk and batch indices. Consequently, calling offline_sequencer[i] will yield the same result as calling offline_sequencer[j] even if \(i \neq j\).
Returns:: The next batch as a tuple (X, y), where X is the training data itself and y are the references.
Return type:: tuple
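The index-ignoring behavior can be illustrated with a toy class that keeps its own chunk and batch cursors. `CursorSequencer` is a hypothetical stand-in written for this example, with in-memory lists in place of the offline storage; it is not the class documented here.

```python
class CursorSequencer:
    """Toy sequencer that ignores the requested index and tracks its own."""

    def __init__(self, chunks):
        # chunks[c] is a list of (X, y) batches belonging to chunk c
        self.chunks = chunks
        self.chunk_idx = 0
        self.batch_idx = 0

    def __getitem__(self, idx):
        # idx is deliberately ignored; the internal cursors decide the batch
        batch = self.chunks[self.chunk_idx][self.batch_idx]
        self.batch_idx += 1
        if self.batch_idx == len(self.chunks[self.chunk_idx]):
            # Current chunk exhausted: move to the next one, wrapping around
            self.batch_idx = 0
            self.chunk_idx = (self.chunk_idx + 1) % len(self.chunks)
        return batch


seq = CursorSequencer([[("x0", "y0"), ("x1", "y1")], [("x2", "y2")]])
a = seq[7]  # first batch, regardless of the index
b = seq[7]  # second batch, even though the index is unchanged
c = seq[0]  # first batch of the second chunk
d = seq[0]  # wraps around to the first batch again
```

The returned batch depends only on how many items were requested before, which is why the order of requests matters and the index value does not.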

load_current_chunk_in_cache()

Load the current chunk in the offline sequencer cache.

Returns:: Nothing, but the member attributes self.current_chunk and self.current_ref_chunk are updated internally.

init_random_indices(): See DLAbstractSequencer.init_random_indices().

apply_random_indices(): See DLAbstractSequencer.apply_random_indices().

post_process_output(z_rf)

Some backbones might need to post-process the output after computing the predictions. This method delegates the post-processing logic to the backbone.

See DLAbstractSequencer.post_process_output().