Machine learning
Machine learning was defined by Tom Mitchell in 1997 in terms of computer programs that learn from experience \(E\) with respect to some task \(T\) and some performance measure \(P\), such that their performance at \(T\), as measured by \(P\), improves with experience \(E\). Here we explain how machine learning can be used to solve tasks related to point clouds.
Note that applying machine learning models to point clouds requires point-wise
features, because the models classify each point from its features. The
features can typically be specified through the fnames key in the JSON as a
list of strings, where each string is the name of an attribute in the input
LAS/LAZ point cloud. Alternatively, features can also be mined using the data
mining methods in the VL3D framework. See the data mining documentation.
Models
Random forest classifier
The RandomForestClassificationModel can be used to solve point-wise
classification tasks. A random forest is an ensemble of decision trees in
which each individual tree is trained on a different random subset of the
training dataset. For classification, the final prediction of a random forest
is the most frequent prediction among all the trees in the ensemble. In the
VL3D framework, scikit-learn (sklearn) is used as the backend for the random
forest implementation. A RandomForestClassificationModel can be defined as
shown in the JSON below:
{
    "train": "RandomForestClassifier",
    "fnames": ["AUTO"],
    "training_type": "base",
    "random_seed": null,
    "shuffle_points": false,
    "model_args": {
        "n_estimators": 4,
        "criterion": "entropy",
        "max_depth": 20,
        "min_samples_split": 5,
        "min_samples_leaf": 1,
        "min_weight_fraction_leaf": 0.0,
        "max_features": "sqrt",
        "max_leaf_nodes": null,
        "min_impurity_decrease": 0.0,
        "bootstrap": true,
        "oob_score": false,
        "n_jobs": 4,
        "warm_start": false,
        "class_weight": null,
        "ccp_alpha": 0.0,
        "max_samples": 0.8
    },
    "hyperparameter_tuning": null,
    "importance_report_path": "RF_importance.log",
    "importance_report_permutation": true,
    "decision_plot_path": "RF_decision.svg",
    "decision_plot_trees": 3,
    "decision_plot_max_depth": 5
}
The JSON above defines a RandomForestClassificationModel made of
four decision trees. It is trained in a straightforward way with no
hyperparameter tuning. Once the model is trained, the importance of each
feature is written to a text file called RF_importance.log, while three
decision trees in the ensemble are graphically represented up to depth five.
The graphical representation of the decision trees is exported to the
RF_decision.svg file.
Arguments
- fnames: The names of the features to be considered when training the model. If "AUTO", the features considered by the last component that operated on the features will be selected.
- training_type: Either "base", "autoval", or "stratified_kfold". For further details, read the training strategies section.
- random_seed: Can be used to specify an integer seed for any randomness-based computation, mostly for reproducibility purposes.
- shuffle_points: Whether to randomly shuffle the points (true) or not (false). It only has an effect when using the "autoval" or "stratified_kfold" training strategies.
- class_names: Optional argument that can be used to name the classes. When given, it must be a list with as many strings as classes, such that class \(c\) is named by the \(c\)-th string in the list.
- model_args: The arguments governing the random forest model. See the sklearn documentation on random forests for further details.
- hyperparameter_tuning: Can be used to specify a hyperparameter tuning strategy based on automated machine learning (AutoML) methods. For further details, read the hyperparameter tuning section.
- importance_report_path: Path to a file where the computed feature importances will be exported.
- importance_report_permutation: True to also compute the permutation importance, false to skip it. Permutation importance is more robust than the straightforward (impurity-based) importance. However, it increases the computational cost since it computes feature-wise permutations many times. See the sklearn documentation on permutation importance for further details.
- decision_plot_path: Path to a file where the requested plots representing the decision trees in the ensemble will be written. The path refers to the base file; one file per tree will be written with "_n" appended to the name, where n is the number of the tree.
- decision_plot_trees: How many decision trees must be plotted. Using -1 implies plotting all the decision trees in the ensemble.
- decision_plot_max_depth: The maximum depth to consider for the graphical representation of the trees.
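Since sklearn is the backend, the "model_args" above correspond to the constructor arguments of sklearn's RandomForestClassifier. Below is a minimal sketch of this correspondence, assuming the point-wise features X and labels y have already been loaded; the forwarding shown here is an illustration, not VL3D's literal code.

```python
# Illustrative sketch: the "model_args" dictionary from the JSON above,
# forwarded to the sklearn backend (JSON null/true/false become None/True/False).
from sklearn.ensemble import RandomForestClassifier

model_args = {
    "n_estimators": 4, "criterion": "entropy", "max_depth": 20,
    "min_samples_split": 5, "min_samples_leaf": 1,
    "min_weight_fraction_leaf": 0.0, "max_features": "sqrt",
    "max_leaf_nodes": None, "min_impurity_decrease": 0.0,
    "bootstrap": True, "oob_score": False, "n_jobs": 4,
    "warm_start": False, "class_weight": None, "ccp_alpha": 0.0,
    "max_samples": 0.8,
}

rf = RandomForestClassifier(**model_args)
# rf.fit(X, y) would then train on point-wise features X and labels y.
```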
Output
The table below is an example of the reported feature importances on a dataset where the geometric features have been transformed through PCA. The “PERM. IMP.” columns refer to the permutation importance (mean and standard deviation, respectively).
| FEATURE | FEATURE IMPORTANCE | PERM. IMP. MEAN | PERM. IMP. STDEV |
|---|---|---|---|
| PCA_9 | 0.022607 | 0.014063 | 0.000040 |
| PCA_6 | 0.025638 | 0.018803 | 0.000036 |
| PCA_8 | 0.028034 | 0.020091 | 0.000058 |
| PCA_7 | 0.032169 | 0.023295 | 0.000054 |
| PCA_11 | 0.036930 | 0.021372 | 0.000036 |
| PCA_3 | 0.047025 | 0.028464 | 0.000019 |
| PCA_10 | 0.052353 | 0.016763 | 0.000031 |
| PCA_5 | 0.062458 | 0.039308 | 0.000041 |
| PCA_12 | 0.062910 | 0.019000 | 0.000044 |
| PCA_4 | 0.068408 | 0.042190 | 0.000069 |
| PCA_2 | 0.078614 | 0.053191 | 0.000059 |
| PCA_1 | 0.482856 | 0.322271 | 0.000300 |
Training strategies
The VL3D framework supports three different strategies for training machine learning models: base, auto-validation, and stratified K-folding. The base strategy is a straightforward training of the model. The auto-validation strategy holds out a subset of the training dataset for later evaluation. Stratified K-folding combines stratification and K-folding to obtain an initial quantification of the model’s variance.
Base training
Base training is simple: the model is trained on the input training dataset in
a straightforward way. It is enough to set "training_type": "base" in the
JSON file; nothing else needs to be done.
Auto-validation
In auto-validation training, a subset of the training dataset is explicitly held out during training so that it represents previously unseen data. As this subset is not used during training, it can be used to compute a reasonable initial estimation of the generalization capabilities of the model. While this validation is not sufficient on its own, because the validation data typically comes from a distribution very similar to the training data, it yields valuable information directly inside the training pipeline. After all, a model that does not generalize well on a similar data distribution is unlikely to generalize well to datasets with different characteristics.
To use an auto-validation training strategy, one must set
"training_type": "autoval" in the model training component. On top of
that, the arguments detailed below can be used to configure the
auto-validation.
{
    "training_type": "autoval",
    "autoval_metrics": ["OA", "MCC"],
    "autoval_size": 0.2,
    "shuffle_points": true
}
- autoval_metrics: A list of strings representing the metrics to compute for the auto-validation. The following can be used:
  - "OA": Overall accuracy.
  - "P": Precision.
  - "R": Recall.
  - "F1": F1 score (harmonic mean of precision and recall).
  - "IoU": Intersection over union (also known as Jaccard index).
  - "wP": Weighted precision (weighted by the number of true instances for each class).
  - "wR": Weighted recall (weighted by the number of true instances for each class).
  - "wF1": Weighted F1 score (weighted by the number of true instances for each class).
  - "wIoU": Weighted intersection over union (weighted by the number of true instances for each class).
  - "MCC": Matthews correlation coefficient.
  - "Kappa": Cohen’s kappa score.
- autoval_size: How much of the training data to consider for the auto-validation subset. It must be a number in \((0, 1]\) when given as a float (ratio) or in \((0, m]\) when given as an int (number of points, with \(m\) the size of the training dataset).
- shuffle_points: Whether to randomly shuffle the points (true) or not (false).
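For intuition, the sketch below reproduces the auto-validation logic directly with sklearn (the framework's backend). The synthetic dataset and the model stand in for the point-wise features and the configured model; VL3D's internals may differ.

```python
# Hold out 20% of the data ("autoval_size": 0.2) with shuffling
# ("shuffle_points": true), then report "OA" and "MCC" on the held-out part.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=True)

model = RandomForestClassifier(n_estimators=4).fit(X_train, y_train)
y_pred = model.predict(X_val)
print("OA :", accuracy_score(y_val, y_pred))
print("MCC:", matthews_corrcoef(y_val, y_pred))
```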
Stratified K-folding
Understanding the stratified K-folding strategy for training requires understanding stratification and K-folding.
Stratification consists of dividing the data into subsets (strata). Then, a subset of test points is built by selecting points from each stratum, so that the test points follow a class distribution approximately proportional to the distribution in the original dataset. Therefore, the test points are expected to offer a reliable representation of the input dataset.
K-folding consists of dividing the data into \(K\) different subsets called folds. Then, \(K\) iterations are computed such that each one considers a different fold as the test set and the other \(K-1\) folds as the training dataset.
Stratified K-folding is K-folding with stratified folds, i.e., each fold is also a stratum, and therefore it better preserves the proportionality of the original class distribution.
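For intuition, the sketch below reproduces the stratified K-folding logic with sklearn's StratifiedKFold (sklearn is the framework's backend). Synthetic data stands in for the point-wise features; VL3D's internals may differ.

```python
# Each of the 5 stratified folds is used once as the test set; the spread
# of the scores gives an initial quantification of the model's variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = RandomForestClassifier(n_estimators=8).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
print(f"OA: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```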
To use a stratified K-folding strategy one must set
"training_type": "stratified_kfold" in the model training component. On top
of that, the arguments detailed below can be used to configure the
stratified K-folding.
{
    "training_type": "stratified_kfold",
    "autoval_metrics": ["OA", "MCC"],
    "num_folds": 5,
    "shuffle_points": true,
    "stratkfold_report_path": "stratkfold_report.log",
    "stratkfold_plot_path": "stratkfold_plot.svg"
}
- autoval_metrics: A list of strings representing the metrics to compute on the test subsets during the stratified K-folding. The same metrics as for auto-validation can be used: "OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC", and "Kappa".
- num_folds: How many folds to use. Note that stratified K-folding only makes sense for two or more folds.
- shuffle_points: When true, the points will be randomly shuffled before computing the folds. However, points within the same split are not shuffled.
- stratkfold_report_path: The path where a text report summarizing the stratified K-folding will be exported.
- stratkfold_plot_path: The path where a plot representing the results of the stratified K-folding will be written.
Training data pipelines
The VL3D framework supports the definition of simple sequential pipelines for
training data. These pipelines will NOT be applied when predicting. They can
be used to transform the input data \(X\) and the reference values
\(y\). Each consecutive component in the pipeline will be executed on the
transformed training data as returned by the previous component. Finally, the
model will be trained on the training data as it is after calling the last
component (see TrainingDataComponent).
Training data pipelines can be defined inside a component with a
"train" key (see
Random forest classifier).
To define a training data pipeline it is necessary to add an entry
"training_data_pipeline": [...], where the list contains
dictionaries sequentially specifying the components of the pipeline. Each
dictionary must contain a "component" key whose value is a string with
the name of the component, and a "component_args" key whose value will
typically be a dictionary with the parameters governing the component.
Class-wise sampler
The class-wise sampler (ClasswiseSampler) can be used to sample
points from the input training dataset such that a target number of points per
class is selected. The class-wise sampler can work with or without replacement.
With replacement, repeated points might be selected. The JSON below shows
an example of how to define a training data pipeline with a
ClasswiseSampler component.
"training_data_pipeline": [
{
"component": "ClasswiseSampler",
"component_args": {
"target_class_distribution": [2000000, 2000000, 2000000, 2000000],
"replace": false
}
}
]
Arguments
- target_class_distribution: Number of points for each class (in a classification task) or each continuous variable (in a regression task).
- replace: Boolean flag governing whether sampling with replacement is enabled (true) or not (false).
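The sketch below illustrates the idea behind the class-wise sampler with plain numpy. The classwise_sample helper is hypothetical, written only to mirror the described behavior; it is not VL3D's actual implementation.

```python
import numpy as np

def classwise_sample(y, target_class_distribution, replace=False, seed=None):
    """Hypothetical sketch: return indices selecting
    target_class_distribution[c] points of each class c."""
    rng = np.random.default_rng(seed)
    indices = []
    for c, target in enumerate(target_class_distribution):
        class_idx = np.flatnonzero(y == c)
        # Without replacement, no more points than available can be drawn.
        n = target if replace else min(target, class_idx.size)
        indices.append(rng.choice(class_idx, size=n, replace=replace))
    return np.concatenate(indices)

y = np.array([0, 0, 0, 0, 1, 1, 2])
idx = classwise_sample(y, target_class_distribution=[2, 2, 1], seed=0)
```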
Synthetic minority oversampling technique (SMOTE)
The synthetic minority oversampling technique (SMOTE) can be used to
synthetically generate points for underrepresented classes. SMOTE works by
considering the nearest neighbors of each point and generating synthetic
points along the segments between the point and its nearest neighbors. In
doing so, underrepresented classes can be prioritized over overrepresented
ones to address class imbalance. See the
Imbalanced learn documentation on SMOTE
for further details. The JSON below shows an example of how to define a
training data pipeline with a SMOTE component.
"training_data_pipeline": [
{
"component": "SMOTE",
"component_args": {
"sampling_strategy": "auto",
"random_state": null,
"k_neighbors": 5,
"n_jobs": 16
}
}
]
Arguments
See the Imbalanced learn documentation on SMOTE for the arguments.
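For reference, the same arguments can be exercised against imbalanced-learn directly. The sketch below is a minimal, self-contained example on synthetic imbalanced data, not VL3D's actual invocation:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A 90/10 imbalanced binary dataset standing in for point-wise features.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))
smote = SMOTE(sampling_strategy="auto", k_neighbors=5)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))  # the minority class is oversampled to match the majority
```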
Hyperparameter tuning
The model’s parameters are automatically derived from the data when training the model. The model’s hyperparameters, however, are not handled by the training algorithm; they must be chosen by the data scientist. Fortunately, it is possible to use AutoML methods to ease the process of hyperparameter optimization. These methods enable the data scientist to define search procedures that automatically find an adequate set of hyperparameters.
The hyperparameter tuning strategies can be included in any training component
based on machine learning. To achieve this, simply give the specification
of the tuning strategy as the value of the "hyperparameter_tuning"
key.
Grid search
Grid search can be used to automatically find the best combination for a set
of hyperparameters. More concretely, a set family must be given such that
each of its elements is a set defining the values for a given hyperparameter.
The grid search algorithm will explore all the potential combinations derived
from the Cartesian product of these sets. For example, a grid of two
hyperparameters can be defined in JSON with
"criterion": ["gini", "entropy"] and
"max_depth": [5, 10, 15]. The HyperGridSearch component will
explore the combinations given by the Cartesian product, i.e.,
[("gini", 5), ("gini", 10), ("gini", 15), ("entropy", 5), ("entropy", 10), ("entropy", 15)].
The arguments detailed below can be used to configure an arbitrary grid search on the hyperparameters of the machine learning model.
"hyperparameter_tuning": {
"tuner": "GridSearch",
"hyperparameters": ["n_estimators", "max_depth", "max_samples"],
"scores": "f1_macro",
"num_folds": 5,
"grid": {
"n_estimators": [2, 4, 8, 16],
"max_depth": [15, 20, 27],
"max_samples": [0.6, 0.8, 0.9]
},
"nthreads": -1,
"pre_dispatch": 8,
"report_path": "hyper_grid_search.log"
}
- hyperparameters: A list with the names of the hyperparameters to be considered.
- scores: It can be null (or not given), a string, a list, or a dictionary where the keys are the desired names for the scores and the values are the internal names. The values of the dictionary, the elements of the list, and the single string must match the scoring parameter conventions of scikit-learn. Note that when many scores are given, only the first one will be considered to automatically determine the best model arguments.
- num_folds: How many folds to consider when validating the model following a K-folding strategy for each node explored during the grid search.
- grid: A key-value specification of the search space. Each key must be the name of a hyperparameter and each value must be a list of the values to be explored during the grid search.
- nthreads: How many threads to use when computing the grid search. Note that the model might run in parallel too. In that case, it is important to consider the sum of the threads used by the model and the threads used by the grid search.
- pre_dispatch: How many jobs will be dispatched during the parallel execution. It can be used to prevent dispatching more jobs than desired, e.g., to avoid resource exhaustion.
- report_path: When given, a text report about the grid search will be exported to the file pointed to by the path.
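Since sklearn is the framework's backend, the arguments above plausibly map onto sklearn's GridSearchCV as sketched below; this mapping is an assumption for illustration, not a statement about VL3D's internals.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={                        # "grid"
        "n_estimators": [2, 4, 8, 16],
        "max_depth": [15, 20, 27],
        "max_samples": [0.6, 0.8, 0.9],
    },
    scoring="f1_macro",                 # "scores"
    cv=5,                               # "num_folds"
    n_jobs=-1,                          # "nthreads"
    pre_dispatch=8,                     # "pre_dispatch"
)
# search.fit(X, y); search.best_params_ would hold the best combination.
```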
Random search
Random search can be used to automatically find a good combination of values
for some given hyperparameters. A HyperRandomSearch will run many iterations,
drawing at each one a random value for each hyperparameter. The
arguments detailed below can be used to configure an arbitrary random search on
the hyperparameters of the machine learning model.
"hyperparameter_tuning": {
"tuner": "RandomSearch",
"hyperparameters": ["n_estimators", "max_depth", "ccp_alpha", "min_impurity_decrease", "criterion"],
"scores": {
"F1": "f1_macro",
"wF1": "f1_weighted"
},
"iterations": 32,
"num_folds": 5,
"distributions": {
"n_estimators": {
"distribution": "randint",
"start": 2,
"end": 17
},
"max_depth": {
"distribution": "randint",
"start": 10,
"end": 31
},
"ccp_alpha": {
"distribution": "uniform",
"start": 0.0,
"offset": 0.05
},
"min_impurity_decrease": {
"distribution": "normal",
"mean": 0.01,
"stdev": 0.001
},
"criterion": ["gini", "entropy", "log_loss"]
},
"report_path": "random_search.log",
"nthreads": -1,
"pre_dispatch": 8
}
- hyperparameters: A list with the names of the hyperparameters to be considered.
- scores: It can be null (or not given), a string, a list, or a dictionary where the keys are the desired names for the scores and the values are the internal names. The values of the dictionary, the elements of the list, and the single string must match the scoring parameter conventions of scikit-learn. Note that when many scores are given, only the first one will be considered to automatically determine the best model arguments.
- iterations: How many iterations of random search must be computed. At each iteration, a random value is drawn for each tuned hyperparameter.
- num_folds: How many folds to consider when validating the model following a K-folding strategy for each node explored during the random search.
- distributions: The specification of the random distributions from which values are drawn at each random search iteration (see the sketch after this list). Distributions can be specified in four different ways:
  - As a list: the elements of the list will be uniformly sampled.
  - As a uniform discrete random variable: the values will be drawn from a uniform discrete distribution. The start value is included, the end value is excluded.
  - As a uniform continuous random variable: the values will be drawn from a uniform continuous distribution. The included values range from start to start + offset (inclusive).
  - As a normal continuous random variable: the values will be drawn from a normal distribution with the given mean and standard deviation.
- nthreads: How many threads to use when computing the random search. Note that the model might run in parallel too. In that case, it is important to consider the sum of the threads used by the model and the threads used by the random search.
- pre_dispatch: How many jobs will be dispatched during the parallel execution. It can be used to prevent dispatching more jobs than desired, e.g., to avoid resource exhaustion.
- report_path: When given, a text report about the random search will be exported to the file pointed to by the path.
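For intuition, the distribution specifications above can be written in scipy.stats terms, which is also the form accepted by sklearn's RandomizedSearchCV. The mapping below is an illustrative assumption, not VL3D's actual implementation:

```python
from scipy.stats import norm, randint, uniform

param_distributions = {
    "n_estimators": randint(2, 17),                         # start included, end excluded
    "ccp_alpha": uniform(loc=0.0, scale=0.05),              # uniform on [start, start + offset]
    "min_impurity_decrease": norm(loc=0.01, scale=0.001),   # given mean and stdev
    "criterion": ["gini", "entropy", "log_loss"],           # lists are sampled uniformly
}
# These could be passed as param_distributions to RandomizedSearchCV.
```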
Decorators
Furthest point sampling decorator
The FPSDecoratorTransformer can be used to decorate a model such that
the computations take place in a transformed space of reduced
dimensionality. Typically, the domain of a model is the entire point cloud, let
us say \(m\) points. When using a FPSDecoratedModel, this domain
will be transformed to a subset of the original point cloud with \(R\)
points, such that \(m \geq R\). Decorating a model this way can
be useful to reduce the execution time of model training, or to speed up the
computation of a hyperparameter tuning procedure. A decorated model can
use the alternative point cloud of \(R\) points for training but not for
predicting, or it can work on the alternative representation for both
operations.
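For intuition, the sketch below is a hypothetical numpy implementation of exact furthest point sampling; the "fast": true option corresponds to a stochastic approximation not shown here.

```python
import numpy as np

def furthest_point_sampling(P, R, seed=0):
    """Select R representative points from the (m, 3) array P."""
    rng = np.random.default_rng(seed)
    selected = np.empty(R, dtype=int)
    selected[0] = rng.integers(P.shape[0])  # arbitrary first point
    dist = np.linalg.norm(P - P[selected[0]], axis=1)
    for i in range(1, R):
        selected[i] = np.argmax(dist)  # point furthest from all selected so far
        dist = np.minimum(dist, np.linalg.norm(P - P[selected[i]], axis=1))
    return P[selected]

P = np.random.default_rng(0).random((1100, 3))
subset = furthest_point_sampling(P, R=1100 // 11)  # cf. "num_points": "m/11"
```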
{
    "train": "FPSDecorated",
    "fps_decorator": {
        "num_points": "m/11",
        "fast": true,
        "num_encoding_neighbors": 1,
        "num_decoding_neighbors": 1,
        "release_encoding_neighborhoods": false,
        "threads": 16,
        "representation_report_path": "*/fps_repr/model_representation_points.las"
    },
    "undecorated_predictions": true,
    "decorated_model": {
        "train": "RandomForestClassifier",
        "fnames": ["AUTO"],
        "training_type": "stratified_kfold",
        "autoval_metrics": ["OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC"],
        "num_folds": 3,
        "random_seed": null,
        "shuffle_points": true,
        "stratkfold_report_path": "*/stratkfold_report.log",
        "stratkfold_plot_path": "*/stratkfold_plot.svg",
        "model_args": {
            "n_estimators": 64,
            "criterion": "entropy",
            "max_depth": 20,
            "min_samples_split": 10,
            "min_samples_leaf": 1,
            "min_weight_fraction_leaf": 0.0,
            "max_features": null,
            "max_leaf_nodes": null,
            "min_impurity_decrease": 0.0,
            "bootstrap": true,
            "oob_score": false,
            "n_jobs": 12,
            "warm_start": false,
            "class_weight": null,
            "ccp_alpha": 0.0,
            "max_samples": 0.8
        },
        "importance_report_path": "*/RF_importance.log",
        "importance_report_permutation": false,
        "decision_plot_path": "*/RF_decision.svg",
        "decision_plot_trees": 5,
        "decision_plot_max_depth": 7,
        "hyperparameter_tuning": {
            "tuner": "RandomSearch",
            "hyperparameters": ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf", "class_weight", "max_samples"],
            "scores": {
                "F1": "f1_macro",
                "OA": "accuracy"
            },
            "iterations": 24,
            "num_folds": 3,
            "distributions": {
                "n_estimators": {
                    "distribution": "randint",
                    "start": 12,
                    "end": 96
                },
                "max_depth": {
                    "distribution": "randint",
                    "start": 5,
                    "end": 25
                },
                "min_samples_split": {
                    "distribution": "randint",
                    "start": 4,
                    "end": 16
                },
                "min_samples_leaf": {
                    "distribution": "randint",
                    "start": 1,
                    "end": 8
                },
                "class_weight": ["balanced", "balanced_subsample", null],
                "max_samples": {
                    "distribution": "uniform",
                    "start": 0.4,
                    "offset": 0.5
                }
            },
            "report_path": "*/random_search.log",
            "nthreads": -1,
            "pre_dispatch": 2
        }
    }
}
Arguments
- fps_decorator: The specification of the furthest point sampling (FPS) decoration carried out through the FPSDecoratorTransformer.
  - num_points: The target number of points \(R\) for the transformed point cloud. It can be an integer or an expression that will be evaluated with \(m\) representing the number of points of the original point cloud, e.g., "m/2" will downscale the point cloud to half the number of points.
  - fast: Whether to use exact furthest point sampling (false) or a faster stochastic approximation (true).
  - num_encoding_neighbors: How many closest neighbors in the original point cloud are considered for each point in the transformed point cloud to reduce from the original space to the transformed one.
  - num_decoding_neighbors: How many closest neighbors in the transformed point cloud are considered for each point in the original point cloud to propagate back from the transformed space to the original one.
  - release_encoding_neighborhoods: Whether the encoding neighborhoods can be released after computing the transformation (true) or not (false). Releasing these neighborhoods means the FPSDecoratorTransformer.reduce() method must not be called afterwards, otherwise errors will arise. Setting this flag to true can help save memory when needed.
  - threads: The number of parallel threads to consider for the parallel computations. Note that -1 means using as many threads as there are available cores.
  - representation_report_path: Where to export the transformed point cloud. In general, it should be null to prevent unnecessary operations. However, it can be enabled (by giving any valid path to write a point cloud file) to visualize the points that are seen by the model.
- undecorated_predictions: When true, the decorator is applied only for training, and predictions are computed on the original, undecorated point cloud. When false, the decorated representation is also used for predicting.
- decorated_model: A typical machine learning model specification. See the Random forest classifier for an example.
Minimum distance decimator decorator
The MinDistDecimatorDecorator can be used to decorate a model such
that the computations take place in a transformed space of reduced
dimensionality. Typically, the domain of a model is the entire point cloud, let
us say \(m\) points. When using a MinDistDecoratedModel, this
domain will be transformed to a subset of the original point cloud with
\(R \leq m\) points. Decorating a model this way can be useful
to reduce the execution time of model training, or to speed up the computation
of a hyperparameter tuning procedure. A decorated model can use the alternative
point cloud of \(R\) points for training but not for predicting, or it can
work on the alternative representation for both operations.
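For intuition, the sketch below is a hypothetical greedy implementation of minimum distance decimation with a KD-tree; min_dist_decimate is an invented helper, not VL3D's actual code.

```python
import numpy as np
from scipy.spatial import cKDTree

def min_dist_decimate(P, min_distance):
    """Keep a point only if no previously kept point lies within min_distance."""
    tree = cKDTree(P)
    keep = np.ones(len(P), dtype=bool)
    for i in range(len(P)):
        if keep[i]:
            # Discard all later points closer than min_distance to point i.
            for j in tree.query_ball_point(P[i], min_distance):
                if j > i:
                    keep[j] = False
    return P[keep]

P = np.random.default_rng(0).random((10000, 3))
R_points = min_dist_decimate(P, min_distance=0.03)  # cf. "min_distance": 0.03
```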
{
    "train": "MinDistDecorated",
    "mindist_decorator": {
        "min_distance": 0.03,
        "num_encoding_neighbors": 1,
        "num_decoding_neighbors": 1,
        "release_encoding_neighborhoods": false,
        "threads": -1,
        "representation_report_path": "*/mindist_repr/model_representation_points.las"
    },
    "undecorated_predictions": true,
    "decorated_model": {
        "train": "RandomForestClassifier",
        "fnames": ["AUTO"],
        "training_type": "stratified_kfold",
        "autoval_metrics": ["OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC"],
        "num_folds": 3,
        "random_seed": null,
        "shuffle_points": true,
        "stratkfold_report_path": "*/stratkfold_report.log",
        "stratkfold_plot_path": "*/stratkfold_plot.svg",
        "model_args": {
            "n_estimators": 64,
            "criterion": "entropy",
            "max_depth": 20,
            "min_samples_split": 10,
            "min_samples_leaf": 1,
            "min_weight_fraction_leaf": 0.0,
            "max_features": null,
            "max_leaf_nodes": null,
            "min_impurity_decrease": 0.0,
            "bootstrap": true,
            "oob_score": false,
            "n_jobs": 12,
            "warm_start": false,
            "class_weight": null,
            "ccp_alpha": 0.0,
            "max_samples": 0.8
        },
        "importance_report_path": "*/RF_importance.log",
        "importance_report_permutation": false,
        "decision_plot_path": "*/RF_decision.svg",
        "decision_plot_trees": 5,
        "decision_plot_max_depth": 7,
        "hyperparameter_tuning": {
            "tuner": "RandomSearch",
            "hyperparameters": ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf", "class_weight", "max_samples"],
            "scores": {
                "F1": "f1_macro",
                "OA": "accuracy"
            },
            "iterations": 24,
            "num_folds": 3,
            "distributions": {
                "n_estimators": {
                    "distribution": "randint",
                    "start": 12,
                    "end": 96
                },
                "max_depth": {
                    "distribution": "randint",
                    "start": 5,
                    "end": 25
                },
                "min_samples_split": {
                    "distribution": "randint",
                    "start": 4,
                    "end": 16
                },
                "min_samples_leaf": {
                    "distribution": "randint",
                    "start": 1,
                    "end": 8
                },
                "class_weight": ["balanced", "balanced_subsample", null],
                "max_samples": {
                    "distribution": "uniform",
                    "start": 0.4,
                    "offset": 0.5
                }
            },
            "report_path": "*/random_search.log",
            "nthreads": -1,
            "pre_dispatch": 2
        }
    }
}
Arguments
- mindist_decorator: See the minimum distance decorated miner documentation for the arguments.
- undecorated_predictions: When true, the decorator is applied only for training, and predictions are computed on the original, undecorated point cloud. When false, the decorated representation is also used for predicting.
- decorated_model: A typical machine learning model specification. See the Random forest classifier for an example.
Working example
This example shows how to define two different pipelines: one to train a model
and export it as a PredictivePipeline, and another to use the
predictive pipeline to compute a leaf-wood segmentation on another point cloud.
Readers are referred to the pipelines documentation to
read more about how pipelines work and to see further examples.
Training pipeline
The training pipeline will train two models, each on a different input point cloud. In this case, the input point clouds are specified using a URL format, i.e., the framework will automatically download them, provided the given links point to a valid and accessible LAS/LAZ file. The output for each model will be written to a different directory.
The pipeline starts by computing the geometric features for several different
radii. Then, the geometric features are exported to a file inside the pcloud
folder in the corresponding output directory (see the
sequential pipeline documentation to understand
how the * works when specifying output paths). Next, a univariate
imputation is applied to NaN values, followed by a standardization. Afterward,
a PCA transformer takes as many principal components as necessary to project
the features onto an orthogonal basis that explains at least \(99\%\) of
the variance. These transformed features are exported to the
geomfeats_transf.las file before training a random forest classifier on them.
The random forest model is trained using stratified K-folding to assess its
variance and potential generalization. On top of that, grid search is used
as a hyperparameter tuning strategy to automatically find a good combination
of max tree depth, max number of samples per tree, and number of decision
trees in the ensemble. Finally, the model is exported to a predictive pipeline
that can later be used to compute predictions on previously unseen point
clouds.
{
    "in_pcloud": [
        "https://3dweb.geog.uni-heidelberg.de/trees_leafwood/PinSyl_KA09_T048_2019-08-20_q1_TLS-on_c_t.laz",
        "https://3dweb.geog.uni-heidelberg.de/trees_leafwood/PinSyl_KA10_03_2019-07-30_q2_TLS-on_c_t.laz"
    ],
    "out_pcloud": [
        "out/training/PinSyl_KA09_T048_pca_RF/*",
        "out/training/PinSyl_KA10_03_pca_RF/*"
    ],
    "sequential_pipeline": [
        {
            "miner": "GeometricFeatures",
            "radius": 0.05,
            "fnames": ["linearity", "planarity", "surface_variation", "eigenentropy", "omnivariance", "verticality", "anisotropy"]
        },
        {
            "miner": "GeometricFeatures",
            "radius": 0.1,
            "fnames": ["linearity", "planarity", "surface_variation", "eigenentropy", "omnivariance", "verticality", "anisotropy"]
        },
        {
            "miner": "GeometricFeatures",
            "radius": 0.2,
            "fnames": ["linearity", "planarity", "surface_variation", "eigenentropy", "omnivariance", "verticality", "anisotropy"]
        },
        {
            "writer": "Writer",
            "out_pcloud": "*pcloud/geomfeats.las"
        },
        {
            "imputer": "UnivariateImputer",
            "fnames": ["AUTO"],
            "target_val": "NaN",
            "strategy": "mean",
            "constant_val": 0
        },
        {
            "feature_transformer": "Standardizer",
            "fnames": ["AUTO"],
            "center": true,
            "scale": true
        },
        {
            "feature_transformer": "PCATransformer",
            "out_dim": 0.99,
            "whiten": false,
            "random_seed": null,
            "fnames": ["AUTO"],
            "report_path": "*report/pca_projection.log",
            "plot_path": "*plot/pca_projection.svg"
        },
        {
            "writer": "Writer",
            "out_pcloud": "*pcloud/geomfeats_transf.las"
        },
        {
            "train": "RandomForestClassifier",
            "fnames": ["AUTO"],
            "training_type": "stratified_kfold",
            "random_seed": null,
            "shuffle_points": true,
            "num_folds": 5,
            "model_args": {
                "n_estimators": 4,
                "criterion": "entropy",
                "max_depth": 20,
                "min_samples_split": 5,
                "min_samples_leaf": 1,
                "min_weight_fraction_leaf": 0.0,
                "max_features": "sqrt",
                "max_leaf_nodes": null,
                "min_impurity_decrease": 0.0,
                "bootstrap": true,
                "oob_score": false,
                "n_jobs": 4,
                "warm_start": false,
                "class_weight": null,
                "ccp_alpha": 0.0,
                "max_samples": 0.8
            },
            "autoval_metrics": ["OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC", "Kappa"],
            "stratkfold_report_path": "*report/RF_stratkfold_report.log",
            "stratkfold_plot_path": "*plot/RF_stratkfold_plot.svg",
            "hyperparameter_tuning": {
                "tuner": "GridSearch",
                "hyperparameters": ["n_estimators", "max_depth", "max_samples"],
                "nthreads": -1,
                "num_folds": 5,
                "pre_dispatch": 8,
                "grid": {
                    "n_estimators": [2, 4, 8, 16],
                    "max_depth": [15, 20, 27],
                    "max_samples": [0.6, 0.8, 0.9]
                },
                "report_path": "*report/RF_hyper_grid_search.log"
            },
            "importance_report_path": "*report/LeafWood_Training_RF_importance.log",
            "importance_report_permutation": true,
            "decision_plot_path": "*plot/LeafWood_Training_RF_decision.svg",
            "decision_plot_trees": 3,
            "decision_plot_max_depth": 5
        },
        {
            "writer": "PredictivePipelineWriter",
            "out_pipeline": "*pipe/LeafWood_Training_RF.pipe",
            "include_writer": false,
            "include_imputer": true,
            "include_feature_transformer": true,
            "include_miner": true
        }
    ]
}
The table below is the report describing the findings of the grid search strategy. On the left side of the table, the columns correspond to the optimized hyperparameters. On the right side, they give the mean and standard deviation of the accuracy (as a percentage) and the mean and standard deviation of the training time (in seconds).
| max_depth | max_samples | n_estimators | mean_test | std_test | mean_time | std_time |
|---|---|---|---|---|---|---|
| 27 | 0.6 | 2 | 92.643 | 0.107 | 21.757 | 1.034 |
| 27 | 0.8 | 2 | 92.825 | 0.024 | 27.677 | 1.595 |
| 27 | 0.9 | 2 | 92.918 | 0.098 | 29.347 | 2.284 |
| 20 | 0.6 | 2 | 93.758 | 0.076 | 21.022 | 0.919 |
| 20 | 0.8 | 2 | 93.976 | 0.095 | 25.462 | 0.869 |
| 20 | 0.9 | 2 | 94.015 | 0.050 | 28.862 | 2.233 |
| 15 | 0.9 | 2 | 94.067 | 0.135 | 27.748 | 0.815 |
| 15 | 0.6 | 2 | 94.068 | 0.036 | 17.009 | 0.179 |
| 15 | 0.8 | 2 | 94.142 | 0.040 | 26.068 | 0.904 |
| 15 | 0.8 | 4 | 94.521 | 0.025 | 29.552 | 1.008 |
| 15 | 0.6 | 4 | 94.535 | 0.023 | 21.353 | 3.019 |
| 15 | 0.9 | 4 | 94.555 | 0.030 | 30.338 | 2.784 |
| 15 | 0.6 | 8 | 94.676 | 0.027 | 46.888 | 1.147 |
| 15 | 0.8 | 8 | 94.683 | 0.046 | 57.878 | 1.811 |
| 15 | 0.9 | 8 | 94.717 | 0.046 | 62.278 | 2.398 |
| 15 | 0.9 | 16 | 94.762 | 0.022 | 115.058 | 3.096 |
| 15 | 0.6 | 16 | 94.776 | 0.033 | 84.547 | 2.639 |
| 15 | 0.8 | 16 | 94.786 | 0.022 | 105.056 | 2.721 |
| 20 | 0.6 | 4 | 95.063 | 0.013 | 24.861 | 0.706 |
| 20 | 0.8 | 4 | 95.162 | 0.020 | 29.255 | 1.180 |
| 27 | 0.6 | 4 | 95.164 | 0.022 | 24.244 | 0.791 |
| 20 | 0.9 | 4 | 95.177 | 0.018 | 33.292 | 1.051 |
| 27 | 0.8 | 4 | 95.293 | 0.018 | 32.017 | 0.842 |
| 27 | 0.9 | 4 | 95.364 | 0.038 | 33.952 | 1.576 |
| 20 | 0.6 | 8 | 95.402 | 0.028 | 47.750 | 2.282 |
| 20 | 0.8 | 8 | 95.453 | 0.022 | 63.293 | 1.517 |
| 20 | 0.9 | 8 | 95.475 | 0.022 | 66.069 | 0.669 |
| 20 | 0.6 | 16 | 95.524 | 0.023 | 88.514 | 1.857 |
| 20 | 0.8 | 16 | 95.572 | 0.027 | 108.852 | 3.814 |
| 20 | 0.9 | 16 | 95.604 | 0.024 | 126.188 | 4.900 |
| 27 | 0.6 | 8 | 95.767 | 0.016 | 50.634 | 1.783 |
| 27 | 0.8 | 8 | 95.892 | 0.031 | 63.077 | 2.125 |
| 27 | 0.9 | 8 | 95.944 | 0.019 | 68.445 | 1.141 |
| 27 | 0.6 | 16 | 95.978 | 0.014 | 89.364 | 1.832 |
| 27 | 0.8 | 16 | 96.100 | 0.021 | 113.879 | 2.932 |
| 27 | 0.9 | 16 | 96.144 | 0.017 | 109.126 | 10.376 |
Predictive pipeline
The predictive pipeline will use the model trained on the first point cloud to
compute a leaf-wood segmentation on the second point cloud. The input point
cloud will be downloaded from the given URL. The classified point cloud will be
written to the predicted.las file, including some extra information like the
point-wise fail/success mask (because ClassifiedPcloudWriter is used
instead of Writer). The predicted labels will also be exported to
the single-column file predictions.lbl. Finally, the
ClassificationEvaluator component is used to analyze the predictions
with respect to the expected values. In doing so, many reports and plots
are generated, including confusion matrices and the requested evaluation
metrics.
{
    "in_pcloud": [
        "https://3dweb.geog.uni-heidelberg.de/trees_leafwood/PinSyl_KA10_03_2019-07-30_q2_TLS-on_c_t.laz"
    ],
    "out_pcloud": [
        "out/prediction/PinSyl_KA09_T048_pca_RF/PinSyl_KA10_03/*"
    ],
    "sequential_pipeline": [
        {
            "predict": "PredictivePipeline",
            "model_path": "out/training/PinSyl_KA09_T048_pca_RF/pipe/LeafWood_Training_RF.pipe"
        },
        {
            "writer": "ClassifiedPcloudWriter",
            "out_pcloud": "*predicted.las"
        },
        {
            "writer": "PredictionsWriter",
            "out_preds": "*predictions.lbl"
        },
        {
            "eval": "ClassificationEvaluator",
            "class_names": ["wood", "leaf"],
            "metrics": ["OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC", "Kappa"],
            "class_metrics": ["P", "R", "F1", "IoU"],
            "report_path": "*report/global_eval.log",
            "class_report_path": "*report/class_eval.log",
            "confusion_matrix_report_path": "*report/confusion_matrix.log",
            "confusion_matrix_plot_path": "*plot/confusion_matrix.svg",
            "class_distribution_report_path": "*report/class_distribution.log",
            "class_distribution_plot_path": "*plot/class_distribution.svg"
        }
    ]
}
The table below exemplifies the evaluation metrics, describing how well the predictions match the expected values.
| OA | P | R | F1 | IoU | wP | wR | wF1 | wIoU | MCC | Kappa |
|---|---|---|---|---|---|---|---|---|---|---|
| 90.724 | 91.997 | 87.590 | 89.245 | 80.813 | 91.099 | 90.724 | 90.469 | 82.811 | 79.466 | 78.592 |
The figure below represents the computed leaf-wood segmentation directly visualized in the point cloud.
Visualization of the output obtained after computing a leaf-wood segmentation on a dataset containing a previously unseen tree. On the left, red points represent misclassified points while gray points represent successfully classified points. In the middle, the point-wise predicted labels (green for leaf, brown for wood). On the right, the point-wise reference labels.