.. _Machine learning page:

Machine learning
******************

Tom Mitchell (1997) defined machine learning in terms of computer programs that learn from experience :math:`E` with respect to some task :math:`T` and some performance measure :math:`P`, i.e., programs whose performance at :math:`T`, as measured by :math:`P`, improves with :math:`E`. Here we will explain how machine learning can be used to solve tasks related to point clouds. Note that applying machine learning models to point clouds requires point-wise features because the models classify the points from their features. The features can typically be specified through the ``fnames`` key in the JSON as a list of strings where each string is the name of an attribute in the input LAS/LAZ point cloud. Alternatively, features can also be mined using the data mining methods in the VL3D framework. See :ref:`data mining documentation `.


Models
===========

.. _Random forest classifier:

Random forest classifier
---------------------------

The :class:`.RandomForestClassificationModel` can be used to solve point-wise classification tasks. The random forest is an ensemble of decision trees such that each individual decision tree is trained on a different random subset of the training dataset. The final prediction of a random forest model for classification is computed as the most frequent prediction considering all the trees in the ensemble. In the VL3D framework, `scikit learn (sklearn) `_ is used as the backend for the random forest implementation. A :class:`.RandomForestClassificationModel` can be defined as shown in the JSON below:

.. code-block:: json

    {
        "train": "RandomForestClassifier",
        "fnames": ["AUTO"],
        "training_type": "base",
        "random_seed": null,
        "shuffle_points": false,
        "model_args": {
            "n_estimators": 4,
            "criterion": "entropy",
            "max_depth": 20,
            "min_samples_split": 5,
            "min_samples_leaf": 1,
            "min_weight_fraction_leaf": 0.0,
            "max_features": "sqrt",
            "max_leaf_nodes": null,
            "min_impurity_decrease": 0.0,
            "bootstrap": true,
            "oob_score": false,
            "n_jobs": 4,
            "warm_start": false,
            "class_weight": null,
            "ccp_alpha": 0.0,
            "max_samples": 0.8
        },
        "hyperparameter_tuning": null,
        "importance_report_path": "RF_importance.log",
        "importance_report_permutation": true,
        "decision_plot_path": "RF_decision.svg",
        "decision_plot_trees": 3,
        "decision_plot_max_depth": 5
    }

The JSON above defines a :class:`.RandomForestClassificationModel` made of four decision trees. It is trained in a straightforward way with no hyperparameter tuning. Once the model is trained, the feature importance for each feature is written to a text file called `RF_importance.log`, while three decision trees in the ensemble are graphically represented up to depth five. The graphical representation of the decision trees is exported to the `RF_decision.svg` file.

**Arguments**

-- ``fnames``
    The names of the features to be considered to train the model. If ``"AUTO"``, the features considered by the last component that operated over the features will be selected.

-- ``training_type``
    Either ``"base"``, ``"autoval"``, or ``"stratified_kfold"``. For further details, read the :ref:`training strategies section `.

-- ``random_seed``
    Can be used to specify an integer seed for any randomness-based computation. Mostly to be used for reproducibility purposes.

-- ``shuffle_points``
    Whether to randomly shuffle the points (True) or not (False). It only has an effect when using the ``"autoval"`` or ``"stratified_kfold"`` training strategies.

-- ``class_names``
    Optional argument that can be used to name the classes.
    When given, it must be a list with as many strings as classes such that the class :math:`c` will be named by the :math:`c`-th string in the list.

-- ``model_args``
    The arguments governing the Random Forest model. See the `sklearn documentation on Random Forest `_ for further details.

-- ``hyperparameter_tuning``
    This argument can be used to specify a hyperparameter tuning strategy using automated machine learning (AutoML) methods. For further details, read the :ref:`hyperparameter tuning section `.

-- ``importance_report_path``
    Path to a file where the computed feature importances will be exported.

-- ``importance_report_permutation``
    True to enable permutation importance, false to ignore it. Permutation importance is more robust than the straightforward importance. However, it increases the computational cost since it computes feature-wise permutations many times. See the `sklearn documentation on permutation importance `_ for further details.

-- ``decision_plot_path``
    Path to a file where the requested plots representing the decision trees in the ensemble will be written. The path refers to the base file. One file is written per plotted tree, named after the base file with ``"_n"`` appended, where n is the number of the tree.

-- ``decision_plot_trees``
    How many decision trees must be plotted. Using ``-1`` implies plotting all the decision trees in the ensemble.

-- ``decision_plot_max_depth``
    The max depth to consider for the graphical representation of the trees.

**Output**

The table below is an example of the reported feature importances on a dataset where the geometric features have been transformed through PCA. The "PERM. IMP." columns refer to the permutation feature importance (mean and standard deviation, respectively).

.. csv-table::
    :file: ../csv/ml_rfclassif_importances.csv
    :widths: 15 20 20 20
    :header-rows: 1


.. _Training strategies:

Training strategies
=====================

The VL3D framework supports three different strategies when training machine learning models. They are base, auto-validation, and stratified K-folding. The **base** strategy is a straightforward training of the model. The **auto-validation** strategy extracts a subset of the training dataset for later evaluation. The **stratified K-folding** combines stratification and K-folding to have an initial quantification of the model's variance.

Base training
---------------

Base training is simple. The model is trained considering the input training dataset in a straightforward way. Setting :code:`"training_type": "base"` in the JSON file is enough; nothing else needs to be done.

Auto-validation
-----------------

In auto-validation training, a subset of the training dataset is held out during training so it represents previously unseen data. As this subset is not used during training, it can be used to compute a reasonable initial estimation of the generalization capabilities of the model. This validation is not enough on its own because the validation data typically comes from a distribution very similar to that of the training data. Still, it yields valuable information directly inside the training pipeline. After all, if a model does not generalize well to a similar data distribution, it is not likely to generalize well to datasets with different characteristics.

To use an auto-validation training strategy one must set :code:`"training_type": "autoval"` in the model training component. On top of that, the arguments detailed below can be used to configure the auto-validation.

.. code-block:: json

    {
        "training_type": "autoval",
        "autoval_metrics": ["OA", "MCC"],
        "autoval_size": 0.2,
        "shuffle_points": true
    }

-- ``autoval_metrics``
    A list of strings representing the metrics to compute for the auto-validation. The following can be used:

    * ``"OA"`` Overall accuracy.
    * ``"P"`` Precision.
    * ``"R"`` Recall.
    * ``"F1"`` F1 score (harmonic mean of precision and recall).
    * ``"IoU"`` Intersection over union (also known as Jaccard index).
    * ``"wP"`` Weighted precision (weights by the number of true instances for each class).
    * ``"wR"`` Weighted recall (weights by the number of true instances for each class).
    * ``"wF1"`` Weighted F1 score (weights by the number of true instances for each class).
    * ``"wIoU"`` Weighted intersection over union (weights by the number of true instances for each class).
    * ``"MCC"`` Matthews correlation coefficient.
    * ``"Kappa"`` Cohen's kappa score.

-- ``autoval_size``
    How much training data to consider for the auto-validation subset. It must be given as a number inside :math:`(0, 1]` when given as a float (ratio) or inside :math:`(0, m]` when given as an int (number of points).

-- ``shuffle_points``
    Whether to randomly shuffle the points (True) or not (False).

.. _Stratified K-folding:

Stratified K-folding
----------------------

Understanding the stratified K-folding strategy for training requires understanding stratification and K-folding.

Stratification consists of dividing the data into subsets (strata). Then, a subset of test points is built by selecting points from each stratum. The idea is that they follow a class distribution approximately proportional to the distribution in the original dataset. Therefore, the test points are expected to offer a reliable representation of the input dataset.

K-folding consists of dividing the data into :math:`K` different subsets called folds. Then, :math:`K` iterations are computed such that each one considers a different fold as the test set and the other :math:`K-1` folds as the training dataset.

Stratified K-folding is K-folding with stratified folds, i.e., each fold is also considered a stratum, and therefore, it better preserves the proportionality of the original class distribution.
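As an illustration of the idea, the sketch below builds stratified folds in plain Python. It is not the VL3D implementation, which is not shown in this documentation. The function names ``stratified_kfold_indices`` and ``kfold_iterations`` are hypothetical, and dealing the points of each class to the folds in round-robin order is only one simple way of keeping the class proportions similar across folds.

```python
from collections import Counter, defaultdict
import random


def stratified_kfold_indices(labels, num_folds, shuffle=True, seed=None):
    """Split point indices into num_folds folds with similar class proportions.

    Hypothetical helper for illustration only (not part of VL3D).
    """
    if num_folds < 2:
        raise ValueError("stratified K-folding needs at least two folds")
    rng = random.Random(seed)
    # Each class acts as a stratum: collect the indices of its points.
    strata = defaultdict(list)
    for index, label in enumerate(labels):
        strata[label].append(index)
    folds = [[] for _ in range(num_folds)]
    for label in sorted(strata):
        indices = strata[label]
        if shuffle:
            rng.shuffle(indices)
        # Deal the points of this class to the folds in round-robin order.
        for position, index in enumerate(indices):
            folds[position % num_folds].append(index)
    return folds


def kfold_iterations(folds):
    """Yield (train_indices, test_indices), using each fold once as test set."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Toy labels: 80 % of the points in class 0 and 20 % in class 1.
labels = [0] * 40 + [1] * 10
folds = stratified_kfold_indices(labels, num_folds=5, seed=0)
for train, test in kfold_iterations(folds):
    print(len(train), len(test), Counter(labels[i] for i in test))
```

With these toy labels, every test fold holds eight points of class 0 and two points of class 1, which matches the 80/20 class distribution of the whole dataset.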

To use a stratified K-folding strategy one must set ``"training_type": "stratified_kfold"`` in the model training component. On top of that, the arguments detailed below can be used to configure the stratified K-folding.

.. code-block:: json

    {
        "training_type": "stratified_kfold",
        "autoval_metrics": ["OA", "MCC"],
        "num_folds": 5,
        "shuffle_points": true,
        "stratkfold_report_path": "stratkfold_report.log",
        "stratkfold_plot_path": "stratkfold_plot.svg"
    }

-- ``autoval_metrics``
    A list of strings representing the metrics to compute on the test subsets during the stratified K-folding. The following can be used:

    * ``"OA"`` Overall accuracy.
    * ``"P"`` Precision.
    * ``"R"`` Recall.
    * ``"F1"`` F1 score (harmonic mean of precision and recall).
    * ``"IoU"`` Intersection over union (also known as Jaccard index).
    * ``"wP"`` Weighted precision (weights by the number of true instances for each class).
    * ``"wR"`` Weighted recall (weights by the number of true instances for each class).
    * ``"wF1"`` Weighted F1 score (weights by the number of true instances for each class).
    * ``"wIoU"`` Weighted intersection over union (weights by the number of true instances for each class).
    * ``"MCC"`` Matthews correlation coefficient.
    * ``"Kappa"`` Cohen's kappa score.

-- ``num_folds``
    How many folds to use. Note that stratified K-folding only makes sense for two or more folds.

-- ``shuffle_points``
    When true, the points will be randomly sampled before computing the subsets. However, points in the same split are not shuffled.

-- ``stratkfold_report_path``
    The path where a text report summarizing the stratified K-folding will be exported.

-- ``stratkfold_plot_path``
    The path where a plot representing the results of the stratified K-folding will be written.


.. _Training data pipelines:

Training data pipelines
==========================

The VL3D framework supports the definition of simple sequential pipelines for training data. These pipelines will NOT be applied when predicting.
They can be used to transform the input data :math:`X` and the reference values :math:`y`. Each consecutive component in the pipeline will be executed on the transformed training data as returned by the previous component. Finally, the model will be trained on the training data as it is after calling the last component (see :class:`.TrainingDataComponent`).

Training data pipelines can be defined inside a component with a :code:`"train"` key (see :ref:`Random forest classifier `). To define a training data pipeline it is necessary to add an entry ``"training_data_pipeline" : [...]``, where the list contains dictionaries sequentially specifying the components in the pipeline. Each dictionary must contain a ``"component"`` key whose value is a string with the name of the component and a ``"component_args"`` that will typically be a dictionary with the parameters governing the component.

Class-wise sampler
--------------------

The class-wise sampler (:class:`.ClasswiseSampler`) can be used to sample points from the input training dataset such that a target number of points per class is selected. The class-wise sampler can work with or without replacement. In the first case, repeated points might be considered.

The JSON below shows an example of how to define a training data pipeline with a :class:`.ClasswiseSampler` component.

.. code-block:: json

    "training_data_pipeline": [
        {
            "component": "ClasswiseSampler",
            "component_args": {
                "target_class_distribution": [2000000, 2000000, 2000000, 2000000],
                "replace": false
            }
        }
    ]

**Arguments**

-- ``target_class_distribution``
    Number of points for each class (in a classification task) or each continuous variable (in a regression task).

-- ``replace``
    Boolean flag governing whether replacement is enabled (``true``) or not (``false``).
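
The following sketch illustrates what class-wise sampling does, using only the Python standard library. It is an illustration under stated assumptions rather than the :class:`.ClasswiseSampler` code: the function name ``classwise_sample`` is hypothetical, and capping the sample at the class size when replacement is disabled is an assumed behavior.

```python
import random


def classwise_sample(labels, target_class_distribution, replace=False, seed=None):
    """Return point indices with the requested number of points per class.

    Hypothetical helper for illustration only (not part of VL3D).
    target_class_distribution[c] is the number of points wanted for class c.
    """
    rng = random.Random(seed)
    selected = []
    for cls, target in enumerate(target_class_distribution):
        candidates = [i for i, label in enumerate(labels) if label == cls]
        if not candidates:
            continue  # No points of this class are available.
        if replace:
            # With replacement: the same point may be selected several times.
            selected.extend(rng.choices(candidates, k=target))
        else:
            # Without replacement (assumed behavior): a class cannot yield
            # more points than it contains, so the target is capped.
            selected.extend(rng.sample(candidates, min(target, len(candidates))))
    return selected


# Toy labels: 100 points of class 0 and only 5 points of class 1.
labels = [0] * 100 + [1] * 5
without = classwise_sample(labels, [10, 10], replace=False, seed=1)
with_rep = classwise_sample(labels, [10, 10], replace=True, seed=1)
print(len(without), len(with_rep))
```

In this toy case the sampler without replacement returns 15 indices (10 of class 0 and all 5 of class 1), while the sampler with replacement returns 20 indices because points of class 1 are repeated to reach the target.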

Synthetic minority oversampling technique (SMOTE)
-----------------------------------------------------

The synthetic minority oversampling technique (:class:`.SMOTE`) can be used to synthetically generate points for underrepresented classes. SMOTE works by taking a point of an underrepresented class, finding its nearest neighbors in the same class, and generating synthetic points between the point and its nearest neighbors. In doing so, underrepresented classes can be prioritized over overrepresented classes to address class imbalance. See the `Imbalanced learn documentation on SMOTE `_ for further details.

The JSON below shows an example of how to define a training data pipeline with a :class:`.SMOTE` component.

.. code-block:: json

    "training_data_pipeline": [
        {
            "component": "SMOTE",
            "component_args": {
                "sampling_strategy": "auto",
                "random_state": null,
                "k_neighbors": 5,
                "n_jobs": 16
            }
        }
    ]

**Arguments**

See the `Imbalanced learn documentation on SMOTE `_ for the arguments.


.. _Hyperparameter tuning:

Hyperparameter tuning
========================

The model's parameters are automatically derived from the data when training the model. However, the model's hyperparameters are not handled by the training algorithm. Instead, they must be handled by the data scientist. Nevertheless, it is possible to use AutoML methods to ease the process of hyperparameter optimization. These methods enable the data scientist to define search procedures that automatically find an adequate set of hyperparameters. The hyperparameter strategies can be included in any training component based on machine learning. To achieve this, simply give the specification of the hyperparameter strategy associated with the ``"hyperparameter_tuning"`` key.

.. _Grid search:

Grid search
--------------

Grid search can be used to automatically find the best combination for a set of hyperparameters. More concretely, a family of sets must be given, each set defining the values for a given hyperparameter.
The grid search algorithm will explore all the potential combinations derived from the Cartesian product of these sets. For example, a grid of two hyperparameters can be defined in JSON with ``"criterion": ["gini", "entropy"]`` and ``"max_depth": [5, 10, 15]``. The :class:`.HyperGridSearch` component will explore the combinations given by the Cartesian product, i.e., ``[("gini", 5), ("gini", 10), ("gini", 15), ("entropy", 5), ("entropy", 10), ("entropy", 15)]``. The arguments detailed below can be used to configure an arbitrary grid search on the hyperparameters of the machine learning model.

.. code-block:: json

    "hyperparameter_tuning": {
        "tuner": "GridSearch",
        "hyperparameters": ["n_estimators", "max_depth", "max_samples"],
        "scores": "f1_macro",
        "num_folds": 5,
        "grid": {
            "n_estimators": [2, 4, 8, 16],
            "max_depth": [15, 20, 27],
            "max_samples": [0.6, 0.8, 0.9]
        },
        "nthreads": -1,
        "pre_dispatch": 8,
        "report_path": "hyper_grid_search.log"
    }

-- ``hyperparameters``
    A list with the names of the hyperparameters to be considered.

-- ``scores``
    It can be null (or not given), a string, a list, or a dictionary where the keys are the desired names for the scores and the values are the internal names. The values of the dictionary, those of the list, and the single string must match the `convention of scoring parameters of scikit learn `_. Note that when many scores are given, only the first one will be considered to automatically determine the best model arguments.

-- ``num_folds``
    How many folds to consider when validating the model following a K-folding strategy for each node explored during the grid search.

-- ``grid``
    A key-value specification of the search space. Each key must be the name of a hyperparameter and each value must be a list of the values to be explored during the grid search.

-- ``nthreads``
    How many threads to use when computing the grid search. Note that the model might be run in parallel too.
    In that case, it is important to consider the sum of the threads used by the model and the threads used by the grid search.

-- ``pre_dispatch``
    How many jobs will be dispatched during the parallel execution. It can be used to prevent dispatching more jobs than desired, e.g., to avoid resource exhaustion.

-- ``report_path``
    When given, a text report about the grid search will be exported to the file pointed to by the path.

Random search
---------------

Random search can be used to automatically find the best combination for some given hyperparameters. A :class:`.HyperRandomSearch` will run many iterations and at each one it will draw a random value for each hyperparameter. The arguments detailed below can be used to configure an arbitrary random search on the hyperparameters of the machine learning model.

.. code-block:: json

    "hyperparameter_tuning": {
        "tuner": "RandomSearch",
        "hyperparameters": ["n_estimators", "max_depth", "ccp_alpha", "min_impurity_decrease", "criterion"],
        "scores": {
            "F1": "f1_macro",
            "wF1": "f1_weighted"
        },
        "iterations": 32,
        "num_folds": 5,
        "distributions": {
            "n_estimators": {
                "distribution": "randint",
                "start": 2,
                "end": 17
            },
            "max_depth": {
                "distribution": "randint",
                "start": 10,
                "end": 31
            },
            "ccp_alpha": {
                "distribution": "uniform",
                "start": 0.0,
                "offset": 0.05
            },
            "min_impurity_decrease": {
                "distribution": "normal",
                "mean": 0.01,
                "stdev": 0.001
            },
            "criterion": ["gini", "entropy", "log_loss"]
        },
        "report_path": "random_search.log",
        "nthreads": -1,
        "pre_dispatch": 8
    }

-- ``hyperparameters``
    A list with the names of the hyperparameters to be considered.

-- ``scores``
    It can be null (or not given), a string, a list, or a dictionary where the keys are the desired names for the scores and the values are the internal names. The values of the dictionary, those of the list, and the single string must match the `convention of scoring parameters of scikit learn `_.
    Note that when many scores are given, only the first one will be considered to automatically determine the best model arguments.

-- ``iterations``
    How many iterations of random search must be computed. At each iteration a random value is taken for each tuned hyperparameter.

-- ``num_folds``
    How many folds to consider when validating the model following a K-folding strategy for each node explored during the random search.

-- ``distributions``
    The specification of the random distributions to take the values at each random search iteration. Distributions can be specified in four different ways:

    #. List
        In this case, the elements of the list will be uniformly sampled.

    #. Uniform discrete random variable
        In this case, the values will be taken from a uniform discrete random distribution. The ``start`` value is included, the ``end`` value is excluded.

    #. Uniform continuous random variable
        In this case, the values will be taken from a uniform continuous random distribution. The included values range from ``start`` to ``start + offset`` (inclusive).

    #. Normal continuous random variable
        In this case, the values will be taken from a normal continuous random distribution with given mean and standard deviation.

-- ``nthreads``
    How many threads to use when computing the random search. Note that the model might be run in parallel too. In that case, it is important to consider the sum of the threads used by the model and the threads used by the random search.

-- ``pre_dispatch``
    How many jobs will be dispatched during the parallel execution. It can be used to prevent dispatching more jobs than desired, e.g., to avoid resource exhaustion.

-- ``report_path``
    When given, a text report about the random search will be exported to the file pointed to by the path.


Decorators
================

.. _FPS decorated model:

Furthest point sampling decorator
--------------------------------------------

The :class:`.FPSDecoratorTransformer` can be used to decorate a model such that the computations can take place in a transformed space of reduced dimensionality. Typically, the domain of a model is the entire point cloud, let us say :math:`m` points. When using a :class:`.FPSDecoratedModel` this domain will be transformed to a subset of the original point cloud with :math:`R` points, such that :math:`m \geq R`. Decorating a model with this decorator can be useful to reduce the execution time of model training, or to speed up the computation of a hyperparameter tuning procedure. A decorated model can use the alternative point cloud of :math:`R` points for training but not for predicting, or it can work on the alternative representation for both operations.

.. code-block:: json

    {
        "train": "FPSDecorated",
        "fps_decorator": {
            "num_points": "m/11",
            "fast": true,
            "num_encoding_neighbors": 1,
            "num_decoding_neighbors": 1,
            "release_encoding_neighborhoods": false,
            "threads": 16,
            "representation_report_path": "*/fps_repr/model_representation_points.las"
        },
        "undecorated_predictions": true,
        "decorated_model": {
            "train": "RandomForestClassifier",
            "fnames": ["AUTO"],
            "training_type": "stratified_kfold",
            "autoval_metrics": ["OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC"],
            "num_folds": 3,
            "random_seed": null,
            "shuffle_points": true,
            "stratkfold_report_path": "*/stratkfold_report.log",
            "stratkfold_plot_path": "*/stratkfold_plot.svg",
            "model_args": {
                "n_estimators": 64,
                "criterion": "entropy",
                "max_depth": 20,
                "min_samples_split": 10,
                "min_samples_leaf": 1,
                "min_weight_fraction_leaf": 0.0,
                "max_features": null,
                "max_leaf_nodes": null,
                "min_impurity_decrease": 0.0,
                "bootstrap": true,
                "oob_score": false,
                "n_jobs": 12,
                "warm_start": false,
                "class_weight": null,
                "ccp_alpha": 0.0,
                "max_samples": 0.8
            },
            "importance_report_path": "*/RF_importance.log",
            "importance_report_permutation": false,
            "decision_plot_path": "*/RF_decision.svg",
            "decision_plot_trees": 5,
            "decision_plot_max_depth": 7,
            "hyperparameter_tuning": {
                "tuner": "RandomSearch",
                "hyperparameters": ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf", "class_weight", "max_samples"],
                "scores": {
                    "F1": "f1_macro",
                    "OA": "accuracy"
                },
                "iterations": 24,
                "num_folds": 3,
                "distributions": {
                    "n_estimators": {
                        "distribution": "randint",
                        "start": 12,
                        "end": 96
                    },
                    "max_depth": {
                        "distribution": "randint",
                        "start": 5,
                        "end": 25
                    },
                    "min_samples_split": {
                        "distribution": "randint",
                        "start": 4,
                        "end": 16
                    },
                    "min_samples_leaf": {
                        "distribution": "randint",
                        "start": 1,
                        "end": 8
                    },
                    "class_weight": ["balanced", "balanced_subsample", null],
                    "max_samples": {
                        "distribution": "uniform",
                        "start": 0.4,
                        "offset": 0.5
                    }
                },
                "report_path": "*/random_search.log",
                "nthreads": -1,
                "pre_dispatch": 2
            }
        }
    }

**Arguments**

-- ``fps_decorator``
    The specification of the furthest point sampling (FPS) decoration carried out through the :class:`.FPSDecoratorTransformer`.

    -- ``num_points``
        The target number of points :math:`R` for the transformed point cloud. It can be an integer or an expression that will be evaluated with :math:`m` representing the number of points of the original point cloud, e.g., ``"m/2"`` will downscale the point cloud to half the number of points.

    -- ``fast``
        Whether to use exact furthest point sampling (``false``) or a faster stochastic approximation (``true``).

    -- ``num_encoding_neighbors``
        How many closest neighbors in the original point cloud are considered for each point in the transformed point cloud to reduce from the original space to the transformed one.

    -- ``num_decoding_neighbors``
        How many closest neighbors in the transformed point cloud are considered for each point in the original point cloud to propagate back from the transformed space to the original one.

    -- ``release_encoding_neighborhoods``
        Whether the encoding neighborhoods can be released after computing the transformation (``true``) or not (``false``). Releasing these neighborhoods means the :meth:`.FPSDecoratorTransformer.reduce` method must not be called, otherwise errors will arise. Setting this flag to true can help to save memory when needed.

    -- ``threads``
        The number of parallel threads to consider for the parallel computations. Note that ``-1`` means using as many threads as available cores.

    -- ``representation_report_path``
        Where to export the transformed point cloud. In general, it should be ``null`` to prevent unnecessary operations. However, it can be enabled (by giving any valid path to write a point cloud file) to visualize the points that are seen by the model.

-- ``undecorated_predictions``
    Whether to apply the FPS decorator for predictions (``true``) or only for training (``false``).

-- ``decorated_model``
    A typical machine learning model specification. See :ref:`the Random forest classifier ` for an example.

.. _Mindist decimator decorated model:

Minimum distance decimator decorator
----------------------------------------

The :class:`.MinDistDecimatorDecorator` can be used to decorate a model such that the computations can take place in a transformed space of reduced dimensionality. Typically, the domain of a model is the entire point cloud, let us say :math:`m` points. When using a :class:`.MinDistDecoratedModel` this domain will be transformed to a subset of the original point cloud with :math:`R \leq m` points. Decorating a model with this decorator can be useful to reduce the execution time of model training, or to speed up the computation of a hyperparameter tuning procedure. A decorated model can use the alternative point cloud of :math:`R` points for training but not for predicting, or it can work on the alternative representation for both operations.

.. code-block:: json

    {
        "train": "MinDistDecorated",
        "mindist_decorator": {
            "min_distance": 0.03,
            "num_encoding_neighbors": 1,
            "num_decoding_neighbors": 1,
            "release_encoding_neighborhoods": false,
            "threads": -1,
            "representation_report_path": "*/mindist_repr/model_representation_points.las"
        },
        "undecorated_predictions": true,
        "decorated_model": {
            "train": "RandomForestClassifier",
            "fnames": ["AUTO"],
            "training_type": "stratified_kfold",
            "autoval_metrics": ["OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC"],
            "num_folds": 3,
            "random_seed": null,
            "shuffle_points": true,
            "stratkfold_report_path": "*/stratkfold_report.log",
            "stratkfold_plot_path": "*/stratkfold_plot.svg",
            "model_args": {
                "n_estimators": 64,
                "criterion": "entropy",
                "max_depth": 20,
                "min_samples_split": 10,
                "min_samples_leaf": 1,
                "min_weight_fraction_leaf": 0.0,
                "max_features": null,
                "max_leaf_nodes": null,
                "min_impurity_decrease": 0.0,
                "bootstrap": true,
                "oob_score": false,
                "n_jobs": 12,
                "warm_start": false,
                "class_weight": null,
                "ccp_alpha": 0.0,
                "max_samples": 0.8
            },
            "importance_report_path": "*/RF_importance.log",
            "importance_report_permutation": false,
            "decision_plot_path": "*/RF_decision.svg",
            "decision_plot_trees": 5,
            "decision_plot_max_depth": 7,
            "hyperparameter_tuning": {
                "tuner": "RandomSearch",
                "hyperparameters": ["n_estimators", "max_depth", "min_samples_split", "min_samples_leaf", "class_weight", "max_samples"],
                "scores": {
                    "F1": "f1_macro",
                    "OA": "accuracy"
                },
                "iterations": 24,
                "num_folds": 3,
                "distributions": {
                    "n_estimators": {
                        "distribution": "randint",
                        "start": 12,
                        "end": 96
                    },
                    "max_depth": {
                        "distribution": "randint",
                        "start": 5,
                        "end": 25
                    },
                    "min_samples_split": {
                        "distribution": "randint",
                        "start": 4,
                        "end": 16
                    },
                    "min_samples_leaf": {
                        "distribution": "randint",
                        "start": 1,
                        "end": 8
                    },
                    "class_weight": ["balanced", "balanced_subsample", null],
                    "max_samples": {
                        "distribution": "uniform",
                        "start": 0.4,
                        "offset": 0.5
                    }
                },
                "report_path": "*/random_search.log",
                "nthreads": -1,
                "pre_dispatch": 2
            }
        }
    }

**Arguments**

-- ``mindist_decorator``
    See the :ref:`minimum distance decorated miner documentation about arguments `.

-- ``undecorated_predictions``
    Whether to apply the minimum distance decimator decorator for predictions (``true``) or only for training (``false``).

-- ``decorated_model``
    A typical machine learning model specification. See :ref:`the Random forest classifier ` for an example.


Working example
=================

This example shows how to define two different pipelines, one to train a model and export it as a :class:`.PredictivePipeline`, the other to use the predictive pipeline to compute a leaf-wood segmentation on another point cloud. Readers are referred to the :ref:`pipelines documentation ` to read more about how pipelines work and to see more examples.

Training pipeline
--------------------

The training pipeline will train two models, each on a different input point cloud. In this case, the input point clouds are specified using a URL format, i.e., the framework will automatically download them, provided the given links point to a valid and accessible LAS/LAZ file. The output for each model will be written to a different directory.

The pipeline starts by computing the geometric features for several radii. Then, the geometric features are exported to a file inside the `pcloud` folder in the corresponding output directory (see the :ref:`sequential pipeline documentation ` to understand how the ``*`` works when specifying output paths). Then, a univariate imputation is applied to NaN values, followed by a standardization. Afterward, a PCA transformer takes as many principal components as necessary to project the features on an orthogonal basis that explains at least :math:`99\%` of the variance. These transformed features are exported to the `geomfeats_transf.las` file before training a random forest classifier on them. The random forest model is trained using stratified K-folding to assess its variance and potential generalization.
On top of that, grid search is used as a hyperparameter tuning strategy to automatically find a good combination of max tree depth, max number of samples per tree, and the number of decision trees in the ensemble. Finally, the model is exported to a predictive pipeline that can later be used to compute predictions on previously unseen point clouds.

.. code-block:: json

    {
        "in_pcloud": [
            "https://3dweb.geog.uni-heidelberg.de/trees_leafwood/PinSyl_KA09_T048_2019-08-20_q1_TLS-on_c_t.laz",
            "https://3dweb.geog.uni-heidelberg.de/trees_leafwood/PinSyl_KA10_03_2019-07-30_q2_TLS-on_c_t.laz"
        ],
        "out_pcloud": [
            "out/training/PinSyl_KA09_T048_pca_RF/*",
            "out/training/PinSyl_KA10_03_pca_RF/*"
        ],
        "sequential_pipeline": [
            {
                "miner": "GeometricFeatures",
                "radius": 0.05,
                "fnames": ["linearity", "planarity", "surface_variation", "eigenentropy", "omnivariance", "verticality", "anisotropy"]
            },
            {
                "miner": "GeometricFeatures",
                "radius": 0.1,
                "fnames": ["linearity", "planarity", "surface_variation", "eigenentropy", "omnivariance", "verticality", "anisotropy"]
            },
            {
                "miner": "GeometricFeatures",
                "radius": 0.2,
                "fnames": ["linearity", "planarity", "surface_variation", "eigenentropy", "omnivariance", "verticality", "anisotropy"]
            },
            {
                "writer": "Writer",
                "out_pcloud": "*pcloud/geomfeats.las"
            },
            {
                "imputer": "UnivariateImputer",
                "fnames": ["AUTO"],
                "target_val": "NaN",
                "strategy": "mean",
                "constant_val": 0
            },
            {
                "feature_transformer": "Standardizer",
                "fnames": ["AUTO"],
                "center": true,
                "scale": true
            },
            {
                "feature_transformer": "PCATransformer",
                "out_dim": 0.99,
                "whiten": false,
                "random_seed": null,
                "fnames": ["AUTO"],
                "report_path": "*report/pca_projection.log",
                "plot_path": "*plot/pca_projection.svg"
            },
            {
                "writer": "Writer",
                "out_pcloud": "*pcloud/geomfeats_transf.las"
            },
            {
                "train": "RandomForestClassifier",
                "fnames": ["AUTO"],
                "training_type": "stratified_kfold",
                "random_seed": null,
                "shuffle_points": true,
                "num_folds": 5,
                "model_args": {
                    "n_estimators": 4,
                    "criterion": "entropy",
                    "max_depth": 20,
                    "min_samples_split": 5,
                    "min_samples_leaf": 1,
                    "min_weight_fraction_leaf": 0.0,
                    "max_features": "sqrt",
                    "max_leaf_nodes": null,
                    "min_impurity_decrease": 0.0,
                    "bootstrap": true,
                    "oob_score": false,
                    "n_jobs": 4,
                    "warm_start": false,
                    "class_weight": null,
                    "ccp_alpha": 0.0,
                    "max_samples": 0.8
                },
                "autoval_metrics": ["OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC", "Kappa"],
                "stratkfold_report_path": "*report/RF_stratkfold_report.log",
                "stratkfold_plot_path": "*plot/RF_stratkfold_plot.svg",
                "hyperparameter_tuning": {
                    "tuner": "GridSearch",
                    "hyperparameters": ["n_estimators", "max_depth", "max_samples"],
                    "nthreads": -1,
                    "num_folds": 5,
                    "pre_dispatch": 8,
                    "grid": {
                        "n_estimators": [2, 4, 8, 16],
                        "max_depth": [15, 20, 27],
                        "max_samples": [0.6, 0.8, 0.9]
                    },
                    "report_path": "*report/RF_hyper_grid_search.log"
                },
                "importance_report_path": "*report/LeafWood_Training_RF_importance.log",
                "importance_report_permutation": true,
                "decision_plot_path": "*plot/LeafWood_Training_RF_decision.svg",
                "decision_plot_trees": 3,
                "decision_plot_max_depth": 5
            },
            {
                "writer": "PredictivePipelineWriter",
                "out_pipeline": "*pipe/LeafWood_Training_RF.pipe",
                "include_writer": false,
                "include_imputer": true,
                "include_feature_transformer": true,
                "include_miner": true
            }
        ]
    }

The table below is the report describing the findings of the grid search strategy. On the left side of the table, the columns correspond to the optimized hyperparameters. The right side reports the mean and standard deviation of the accuracy (percentage), and the mean and standard deviation of the training time (seconds).

.. csv-table::
    :file: ../csv/ml_rfclassif_hypergridsearch.csv
    :widths: 10 10 10 10 10 10 10
    :header-rows: 1

Predictive pipeline
----------------------

The predictive pipeline will use the model trained on the first point cloud to compute leaf-wood segmentation on the second point cloud. The input point cloud will be downloaded from the given URL.
The classified point cloud will be written to the `predicted.las` file,
including extra information such as the point-wise fail/success mask (because
:class:`.ClassifiedPcloudWriter` is used instead of :class:`.Writer`). The
predicted labels will also be exported to the single-column file
`predictions.lbl`. Finally, the :class:`.ClassificationEvaluator` component is
used to analyze the predictions with respect to the expected values. In doing
so, several reports and plots are generated, including confusion matrices and
the requested evaluation metrics.

.. code-block:: json

    {
      "in_pcloud": [
        "https://3dweb.geog.uni-heidelberg.de/trees_leafwood/PinSyl_KA10_03_2019-07-30_q2_TLS-on_c_t.laz"
      ],
      "out_pcloud": [
        "out/prediction/PinSyl_KA09_T048_pca_RF/PinSyl_KA10_03/*"
      ],
      "sequential_pipeline": [
        {
          "predict": "PredictivePipeline",
          "model_path": "out/training/PinSyl_KA09_T048_pca_RF/pipe/LeafWood_Training_RF.pipe"
        },
        {
          "writer": "ClassifiedPcloudWriter",
          "out_pcloud": "*predicted.las"
        },
        {
          "writer": "PredictionsWriter",
          "out_preds": "*predictions.lbl"
        },
        {
          "eval": "ClassificationEvaluator",
          "class_names": ["wood", "leaf"],
          "metrics": ["OA", "P", "R", "F1", "IoU", "wP", "wR", "wF1", "wIoU", "MCC", "Kappa"],
          "class_metrics": ["P", "R", "F1", "IoU"],
          "report_path": "*report/global_eval.log",
          "class_report_path": "*report/class_eval.log",
          "confusion_matrix_report_path": "*report/confusion_matrix.log",
          "confusion_matrix_plot_path": "*plot/confusion_matrix.svg",
          "class_distribution_report_path": "*report/class_distribution.log",
          "class_distribution_plot_path": "*plot/class_distribution.svg"
        }
      ]
    }

The table below exemplifies the evaluation metrics describing how good the
predictions are with respect to the expected values.

.. csv-table::
    :file: ../csv/ml_rfclassif_predict_global_eval.csv
    :widths: 9 9 9 9 9 9 9 9 9 9 9
    :header-rows: 1

The figure below represents the computed leaf-wood segmentation directly
visualized in the point cloud.

.. figure:: ../img/rfclassif_unseen.png
    :scale: 25
    :alt: Figure representing the leaf-wood segmentation computed on a point
        cloud with previously unseen data.

    Visualization of the output obtained after computing a leaf-wood
    segmentation on a dataset containing a previously unseen tree. On the left,
    the red points represent misclassified points while the gray points
    represent successfully classified points. In the middle, the point-wise
    predicted labels (green for leaf, brown for wood). On the right, the
    point-wise reference labels.
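The fail/success mask shown in the figure and the overall accuracy (``OA``) and
intersection over union (``IoU``) metrics requested in the evaluator can be
related to each other with a small sketch. The Python snippet below is only an
illustration and not the implementation used by the framework: the function
names (``success_mask``, ``overall_accuracy``, ``class_iou``), the label
encoding (0 for wood, 1 for leaf, following the order of ``class_names``), and
the toy labels are all assumptions made for the example.

```python
def success_mask(predicted, reference):
    """Point-wise mask: True where the predicted label matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return [p == r for p, r in zip(predicted, reference)]


def overall_accuracy(mask):
    """Fraction of successfully classified points, in [0, 1]."""
    return sum(mask) / len(mask) if mask else 0.0


def class_iou(predicted, reference, label):
    """Intersection over union for one class label."""
    pairs = list(zip(predicted, reference))
    # Points where both prediction and reference carry the label
    intersection = sum(1 for p, r in pairs if p == label and r == label)
    # Points where at least one of them carries the label
    union = sum(1 for p, r in pairs if p == label or r == label)
    return intersection / union if union else 0.0


# Hypothetical toy labels (assumed encoding: 0 = wood, 1 = leaf)
reference = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]

mask = success_mask(predicted, reference)
print("success mask:", mask)
print("OA:", overall_accuracy(mask))
print("IoU (leaf):", class_iou(predicted, reference, 1))
```

In this toy case, the points where the mask is false correspond to the red
points in the left panel of the figure, and the overall accuracy is simply the
fraction of points where the mask is true.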