.. _Clustering page: Clustering *************** Clustering consists of receiving an input point cloud and finding points that can be grouped together into clusters. The clusters are typically represented with integer values :math:`c \in \mathbb{Z}_{\geq -1}`, where the value :math:`-1` is used to represent no cluster or noise cluster. Clustering computations are carried out through :class:`.Clusterer` components. They can be included in pipelines. Besides, it is possible to define a :ref:`post-processing pipeline ` that will be computed immediately after the clustering. Clusterers ============= .. _DBScan clusterer: DBScan clusterer ------------------- The :class:`.DBScanClusterer` can be used to compute Density-Based Spatial Clustering of Applications with Noise (DBSCAN) in the point cloud. The DBSCAN algorithm considers a minimum number of points and a radius to define the neighborhood for spatial queries. It finds kernel points, those in whose neighborhood there are at least the minimum number of points. Then, it considers any accessible point to be in the cluster. The accessible points that are not kernel points are included in the cluster because they are close enough to the kernel points. However, these points are not considered to include new points as only kernel points can be used to determine whether a point is accessible or not. The JSON below shows how to define a DBScan clustering component: .. code-block:: json { "clustering": "fpsdecorated", "fps_decorator": { "num_points": "m/10", "fast": true, "num_encoding_neighbors": 1, "num_decoding_neighbors": 1, "release_encoding_neighborhoods": false, "threads": 16, "representation_report_path": null }, "decorated_clusterer": { "clustering": "dbscan", "cluster_name": "cluster_wood", "precluster_name": "prediction", "precluster_domain": [1], "min_points": 200, "radius": 3.0, "post_clustering": [ { "post-processor": "ClusterEnveloper", "envelopes": [ { "type": "AlignedBoundingBox", "compute_volume": true, "output_path": "*/aabbs.csv", "separator": "," }, { "type": "BoundingBox", "compute_volume": true, "output_path": "*/bboxes.csv", "separator": "," } ] } ] } } The JSON above defines a :class:`.DBScanClusterer` that is decorated to work on a FPS representation of the input point cloud (see :ref:`the FPS decorated clusterer documentation` for further details). It will consider all the points predicted as wood to compute clusters of wood. The minimum number of points in the neighborhoods of kernel points is set to 200 points, and the radius of the neighborhood to 3 meters. After computing the clustering, a post-processing pipeline will be run to find the axis-aligned and oriented bounding boxes, respectively. For each bounding box the vertices, the cluster label, and the enclosed volume will be exported to a CSV file using ``,`` as separator. **Arguments** -- ``cluster_name`` The name of the clusters. For instance, when finding clusters of buildings it is a good idea to use ``"cluster_building"`` as the cluster name. -- ``precluster_name`` In case the point cloud is already clustered, the precluster name specifies the point-wise attribute to consider to filter the points. Note that both reference and predicted classes can be seen as clusters. -- ``precluster_domain`` If the preclustering filter is applied, then the precluster domain list can be used to specify the values of interest (any point that is not in the domain will be ignored). If not given and preclustering filter is requested, then the precluster domain will be automatically inferred as the set of unique values for the attribute. -- ``min_points`` The minimum number of points that must belong to a neighborhood to pass the kernel point check. -- ``radius`` The radius of the neighborhood for kernel point checks. -- ``post_clustering`` The post-clustering pipeline. It can be ``null`` or a list of :ref:`post-clustering components `. .. _Bivariate critical clusterer: Bivariate critical clusterer ------------------------------- The :class:`.BivariateCriticalClusterer` can be used to compute one cluster per critical point in the given point cloud. The bivariate critical clustering algorithm considers the requested type of critical point (e.g., min or max) and then does a distance-based clustering. It is called bivariate because it works on three variables :math:`(x, y, z)` (they can be the coordinates or any other variables, including those from the feature space). Assuming Monge form :math:`z = \hat{z}(x, y)`, the critical points will correspond to the minima or maxima of the third variable :math:`z` on the surface defined by the other two :math:`(x, y)`. The clusters for each critical point can be computed through the nearest neighbor method or using a distance-based region growing algorithm. The JSON below shows how to define a bivariate critical clustering component: .. code-block:: json { "clustering": "BivariateCritical", "cluster_name": "avocado", "precluster_name": "prediction", "precluster_domain": [1], "critical_type": "max", "filter_criticals": false, "radius": 2.0, "x": "x", "y": "y", "z": "z", "strategy": { "type": "RecursiveRegionGrowing", "initial_radius": 0.25, "growing_radius": 0.25, "max_iters": 20000, "first_stage": false, "first_stage_correction": false, "second_stage": true, "nn_correction": false }, "label_criticals": false, "critical_label_name": "max", "chunk_size": 50000, "nthreads": 8, "kdt_nthreads": 2 } The JSON above defines a :class:`.BivariateCriticalClusterer` that will consider the point-wise coordinates (i.e., the structure space of the point cloud) as the variables for the clustering. The critical points will be searched on a radius of :math:`2` meters and the clustering strategy will be the recursive region growing algorithm. Furthermore, the clustering will be computed in parallel dividing the workload in chunks of :math:`50,000` points using :math:`8` parallel jobs, each with :math:`2` threads for each KDTree-based spatial query. **Arguments** -- ``cluster_name`` The name of the clusters. For instance, when finding clusters of avocados it is a good idea to use ``"avocado"`` as the cluster name. -- ``precluster_name`` In case the point cloud is already clustered, the precluster name specifies the point-wise attribute to consider to filter the points. Note that both reference and predicted classes can be seen as clusters. -- ``precluster_domain`` If the preclustering filter is applied, then the precluster domain list can be used to specify the values of interest (any point that is not in the domain will be ignored). -- ``crtical_type`` The type of critical point to be considered, either ``"min"`` or ``"max"``. -- ``filter_criticals`` Boolean mask governing whether to filter critical points in case there is more than one in a neighborhood of the given radius. When ``true`` all the crtitical points in the same neighborhood but one will be discarded (this happens when both are maxima or minima, e.g., they have the same :math:`z` coordinate). When ``false`` all the critical points will be kept, even if this implies more than one per neighborhood. -- ``radius`` The radius for the neighborhood analysis used to find the critical points. Note that it is a 2D radius because only the first and second variables (i.e., :math:`(x, y)`) are considered for the spatial queries. -- ``x`` The first variable, e.g., ``"x"``. -- ``y`` The second variable, e.g., ``"y"``. -- ``z`` The third variable, e.g., ``"z"``. Note that critical points will be minima or maxima with respect to this variable. -- ``strategy`` The strategy specification for the clustering. It can be either the nearest neighbor method or the recursive region growing algorithm. -- **Nearest neighbor method** -- ``type`` It must be ``"NearestNeighbor"``. -- **Recursive region growing algorithm** -- ``type`` It must be ``"RecursiveRegionGrowing"``. -- ``initial_radius`` The radius for the initial clusters. The distance is computed considering the :math:`(x, y)` variables, i.e., in 2D. All the points that are inside this radius with respect to a critical point will belong to its cluster. -- ``growing_radius`` The radius for each region growing step. Again, distances will be computed in the plane defined by the first two variables :math:`(x, y)`. -- ``max_iters`` The maximum number of iterations for the clustering. If zero, then the clustering will run until full convergence is achieved. -- ``first_stage`` Whether to compute the first stage of the region growing strategy (``true``) or not (``false``). This stage considers any point in the cluster to analyze its neighborhood. Any point in the neighborhood that is at the same height or lower (when max critical points are considered) or at the same height or above (when min critical points are considered) will be included in the cluster. The aforementioned process will be computed recursively until not more points are added to a cluster or the max number of iterations is reached. -- ``first_stage_correction`` Whether to apply a concave hull-based correction to the first stage (``true``) or not (``false``). The correction consists of computing the 2D concave hull for each cluster (on the :math:`x` and :math:`y` coordinates) and then include in the cluster any point that is inside the concave hull. -- ``second_stage`` Whether to compute the second stage of the region growing strategy. In this stage, the clusters grow considering the neighborhood for each point in the cluster, and doing so recursively to account for newly added points during the iterations of the second stage too. The second stage stops when not even a single cluster grew or when the maximum number of iterations is reached. -- ``nn_correction`` Whether to apply the nearest neighbor method after finishing the clustering (``true``) or not (``false``). If applied, it will consider all the points that have not been clustered by the region growing algorithm and assign them to the cluster of its closest clustered neighbor. -- ``label_criticals`` Whether to label the critical points (``true``) or not (``false``). Note that the point cloud will be updated with an extra feature that will be a label specifying whether it is a critical point or not. -- ``critical_label_name`` The name for the feature that labels critical points. -- ``chunk_size`` How many points per chunk for the parallel exeuction. If zero, then all the points are consireded in a single chunk. -- ``nthreads`` How many chunks will be computed in parallel at the same time. -- ``kdt_nthreads`` How many threads will be used to speedup the computation of the spatial queries through the KDTree. Note that the final number of parallel jobs will be :math:`\text{nthreads} \times \text{kdt_nthreads}`. **Output** The figure below represents the bivariate critical clusters computed on the maxima of a previously classified avocado plantation. There are approximately as many clusters as avocados, with each cluster generally corresponding to a different avocado. .. figure:: ../img/bivarcrit_avocados.png :scale: 50 :alt: Figure representing avocado cluster. Visualization of the avocado clusters computed through the bivariate critical clustering algorithm with the region growing strategy. .. _Simple feature clusterer: Simple features clusterer ----------------------------- The :class:`.SimpleFeaturesClusterer` can be used to compute clusters by filtering points considering relational conditions. These relationals can be specified similar to how it is done for :ref:`the cluster selector post-processor `. The JSON below shows how to define a simple features clusterer: .. code-block:: json { "clustering": "SimpleFeatures", "cluster_name": "change_cluster", "cluster_value": 1, "filters": [ { "attribute": "chg_euclid", "relational": "greater_than_or_equal_to", "target": 0.33, "action": "preserve" } ] } The JSON above defines a :class:`.SimpleFeaturesClusterer` that makes a cluster with all the points that have Euclidean distance-based point-wise change quantification greater than :math:`0.33` meters. The points that do not belong to the cluster will have a value of zero, those that belong to the cluster will have a value of one. The name of the feature representing the cluster is ``"change_cluster"``. **Arguments** -- ``cluster_name`` The name of the feature representing the cluster. -- ``cluster_value`` The value for the points that belong to the cluster. -- ``filters`` See :ref:`cluster selector filters documentation `. .. _Simple curve extractor: Simple curve extractor ------------------------- The :class:`.SimpleCurveExtractor` extracts clean, parameterized 3D curves from previously classified or clusterized point clouds. It implements a 6-stage pipeline: adaptive stabilization, soft 2.5D skeleton extraction, monotonic topological pruning, resilient geometry refinement, multi-format output, and optional reverse label propagation. The JSON below shows how to define a simple curve extractor: .. code-block:: json { "clustering": "SimpleCurveExtractor", "cluster_name": "CURVE", "curve_source": "classification", "curve_classes": [4], "min_voxel_size": 0.1, "merge_radius": 15.0, "z_tolerance": 2.0, "min_segment_length": 3.0, "spatial_scale": 1.5, "min_curve_length": null, "min_twig_length": null, "rdp_tolerance": null, "curve_pcloud_step": null, "snap_densify_step": null, "snap_max_shift": null, "snap_target_radius": null, "snap_pull_radius": null, "hallucination_drop_densify_step": null, "hallucination_drop_radius": null, "hallucination_min_supported_length": null, "endpoint_extension_step": null, "endpoint_extension_radius": null, "endpoint_extension_max_length": null, "cid_relabel_z_disjoint_threshold": null, "cid_gap_split_dz_threshold": null, "max_merge_iterations": 3, "tangent_smoothing_iterations": 3, "occupancy_threshold": 0.1, "chain_radius_factor": 3.0, "angular_threshold_k": 1.5, "betweenness_percentile": 0.05, "ridge_fraction": 0.5, "output_point_spacing": 0.0, "z_consistency_sigma": 0.0, "snap_enable": true, "snap_min_neighbors": 4, "snap_smoothing_iterations": 2, "snap_passes": 30, "snap_post_passes": 20, "hallucination_drop_enable": true, "hallucination_drop_threshold": 0.025, "snap_truncate_iterations": 15, "endpoint_extension_enable": true, "cid_relabel_z_disjoint_enable": true, "cid_gap_split_enable": true, "cid_gap_split_planar_radius": 15.0, "merge_orphans_enable": true, "final_cleanup": false, "slope_output_form": "angle_deg", "propagate_labels": true, "curve_pcloud": "*/curve_pcloud.las", "output_gpkg": "*/break_curves.gpkg", "output_shp": "*/break_curves.shp", "output_csv": "*/break_curves.csv", "epsg": 25829, "nthreads": -1, "post_clustering": null } The JSON above defines a :class:`.SimpleCurveExtractor` that will extract 3D curves from points classified as class :math:`3` (e.g., roads). The adaptive stabilization will run up to :math:`3` merge iterations with :math:`3` tangent smoothing iterations. The skeleton will be extracted using a ``z_tolerance`` of :math:`2.0` (soft-Z sigma) and an occupancy threshold of :math:`0.1`. Topological pruning removes twigs shorter than :math:`50 \times \text{min_voxel_size}` (i.e. :math:`5.0` m at the default voxel size) and prunes low-centrality edges below the :math:`5`-th percentile. The final geometry is simplified with an RDP tolerance derived from ``min_voxel_size`` (:math:`0.1` m by default). Extracted curves will be exported to GeoPackage, Shapefile, and CSV formats in the EPSG:25830 coordinate reference system. Computations will use all available cores. Example configuration notes: - All hidden/advanced kwargs are listed at their default values so the JSON above can be used as a "copy this, customise the line you care about" template. - ``rdp_tolerance`` and ``min_twig_length`` use the sentinel ``null``. When omitted (or ``null``) they are derived from ``min_voxel_size`` so they track the input coordinate scale (see their individual argument entries for the derivation). - ``snap_densify_step``, ``hallucination_drop_densify_step``, and ``endpoint_extension_step`` likewise use the sentinel ``null``. When unset they each resolve to :math:`5 \times \text{min_voxel_size}` (i.e. :math:`0.5` m at the default voxel size), matching the densification step on the production fixture. - ``cid_gap_split_planar_radius`` also defaults to ``null`` and resolves to ``merge_radius`` when the latter is set. If ``merge_radius`` is itself ``null`` it falls back to the default :math:`15.0` m so the gap-split pass keeps its metric-aligned planar radius. .. note:: Some kwargs use the sentinel ``null`` and resolve at instance construction to a fixed coefficient times ``min_voxel_size`` so the algorithm scales with the input voxel resolution. * ``min_twig_length`` -- :math:`50 \times \text{min_voxel_size}` (= :math:`5.0` m at default). * ``rdp_tolerance`` -- :math:`\text{min_voxel_size}` (= :math:`0.1` m at default). * ``min_curve_length`` -- :math:`500 \times \text{min_voxel_size}` (= :math:`50.0` m at default). * ``curve_pcloud_step`` -- :math:`1 \times \text{min_voxel_size}` (= :math:`0.1` at default). * ``snap_densify_step`` -- :math:`5 \times \text{min_voxel_size}` (= :math:`0.5` m at default). * ``snap_max_shift`` -- :math:`35 \times \text{min_voxel_size}` (= :math:`3.5` m at default). * ``snap_target_radius`` -- :math:`30 \times \text{min_voxel_size}` (= :math:`3.0` m at default). * ``snap_pull_radius`` -- :math:`400 \times \text{min_voxel_size}` (= :math:`40.0` m at default). * ``hallucination_drop_densify_step`` -- :math:`5 \times \text{min_voxel_size}` (= :math:`0.5` m at default). * ``hallucination_drop_radius`` -- :math:`25 \times \text{min_voxel_size}` (= :math:`2.5` m at default). * ``hallucination_min_supported_length`` -- :math:`1 \times \text{min_voxel_size}` (= :math:`0.1` m at default). * ``endpoint_extension_step`` -- :math:`5 \times \text{min_voxel_size}` (= :math:`0.5` m at default). * ``endpoint_extension_radius`` -- :math:`25 \times \text{min_voxel_size}` (= :math:`2.5` m at default). * ``endpoint_extension_max_length`` -- :math:`150 \times \text{min_voxel_size}` (= :math:`15.0` m at default). * ``cid_relabel_z_disjoint_threshold`` -- :math:`50 \times \text{min_voxel_size}` (= :math:`5.0` m at default). * ``cid_gap_split_dz_threshold`` -- :math:`30 \times\text{min_voxel_size}` (= :math:`3.0` m at default). **Arguments** *Input configuration* -- ``curve_source`` The source attribute from which curve points are selected. It can be ``"classification"`` to use the point cloud's classification attribute, ``"predictions"`` (or equivalently ``"prediction"``) to use predicted labels, or any custom attribute name to access an extra dimension by name from the LAS structure. The three built-in keywords (``"classification"``, ``"prediction"``, ``"predictions"``) are matched case-insensitively; a custom attribute name is passed through verbatim to the LAS-dimension lookup and is therefore case-sensitive. When the predictions array is two-dimensional and of a floating-point dtype (i.e. per-class probabilities), hard labels are derived via :math:`\operatorname{argmax}` along the class axis. -- ``curve_classes`` A list of integer class or cluster IDs that identify curve points. All points whose ``curve_source`` attribute matches one of these values will be used as input for the curve extraction pipeline. *Stage 1: Adaptive stabilization* -- ``max_merge_iterations`` The maximum number of convergence iterations for the adaptive stabilization stage. During each iteration, disconnected endpoint pairs are scored and merged when their combined distance, tangent alignment, and component size score exceeds the acceptance threshold. Default is :math:`3`. -- ``tangent_smoothing_iterations`` The number of iterations for tangent field smoothing within the stabilization stage. Higher values produce smoother tangent estimates at the cost of reduced locality. Default is :math:`3`. -- ``min_voxel_size`` The minimum allowed voxel size for adaptive voxelization. Prevents pathological cases where duplicate points or noise produce an extremely small median nearest-neighbor distance. The actual voxel size is computed as :math:`V = \max(1.5 \times \text{median_nn}, \text{min_voxel_size})`. ``min_voxel_size`` also acts as the reference scale for several scale-coupled downstream thresholds (``min_twig_length``, ``rdp_tolerance``, the :math:`\Delta z_{\max}` floor, and the Z-jump bridge floor), so **pick a value consistent with the input coordinate scale**. Default is :math:`0.1`. *Stage 2: Soft 2.5D skeleton extraction* -- ``z_tolerance`` Soft-Z sigma :math:`\sigma_z` used by the occupancy weighting. It controls the Gaussian weighting of point heights when building the 2.5D occupancy grid: :math:`\text{occupancy}(\text{cell}) = \sum_i \exp\left(-\frac{(Z_i - Z_{\text{ref}})^2}{2\,\sigma_z^2}\right)`. This parameter also governs the maximum Z-difference threshold :math:`\Delta z_{\max}` used by the cylindrical merge criterion and the three Z-artifact detectors (bridge Z-jump split, vertical column split, Z-reversal split; see ``merge_radius``). Default is :math:`2.0`. -- ``occupancy_threshold`` The binary threshold applied to the soft-Z occupancy values. Cells with occupancy below this threshold are considered unoccupied. Default is :math:`0.1`. -- ``spatial_scale`` The skeleton-extraction voxel size, independent of the stabilization voxel size. When set to :math:`0.0` (default), the stabilization voxel size is used. For wide curves (e.g., roads), setting this to a value larger than the default (e.g., :math:`1.0`--:math:`2.0` meters) produces cleaner centerlines by reducing the skeleton grid resolution. -- ``merge_radius`` The endpoint merge radius. When set (not ``null``), endpoint pairs are tested against a **cylindrical** proximity criterion: the 2D (XY-plane) distance must be within the merge radius :math:`r_m`, and the absolute Z-difference must not exceed a maximum threshold :math:`\Delta z_{\max}`: .. math:: \| \mathbf{p}_a^{xy} - \mathbf{p}_b^{xy} \|_2 \;\le\; r_m \qquad\text{and}\qquad | z_a - z_b | \;\le\; \Delta z_{\max} where the Z-difference threshold is derived from ``z_tolerance``: .. math:: \Delta z_{\max} = \max\!\bigl(\min(3\,\sigma_z,\; r_m),\; 10 \times \text{min_voxel_size}\bigr) The :math:`\max(\cdot,\, 10\,V_{\min})` floor prevents degenerate behaviour when :math:`\sigma_z = 0` and scales with ``min_voxel_size`` so the threshold is portable across coordinate scales (metre, sub-metre, or very-large clouds). At the default ``min_voxel_size = 0.1`` the floor is :math:`1\,\text{m}`, matching the legacy metre-scale behaviour. This cylindrical test is used in three stages: endpoint merging (connecting different curve components), post-chain merging (reconnecting same-curve segments), and post-resolve merging (reconnecting endpoints created by cross-feature truncation). Three post-processing artifact detectors are derived from :math:`\Delta z_{\max}` to remove elevation artifacts produced by multi-terrace skeleton connections: 1. **Bridge Z-jump split.** Operates on each polyline using the consecutive XY spacings :math:`\Delta xy_k` and their median :math:`\widetilde{\Delta xy}`. A segment :math:`k` is flagged as a Z-jump bridge when either of two criteria holds. The **moderate-bridge** criterion flags short gaps with a large vertical jump: .. math:: \Delta xy_k > \max\!\bigl(3\,\tilde{\Delta xy},\; 10 \times \text{min_voxel_size}\bigr) \quad\text{and}\quad |\Delta z_k| > \max\!\left( \frac{\Delta z_{\max}}{2},\; 1\right) The :math:`\max(\cdot, 10\,V_{\min})` XY floor and the :math:`\max(\cdot, 1)` Z floor prevent degenerate thresholds when the polyline is very tight or :math:`\Delta z_{\max}` is small. The Z floor of :math:`1` is an absolute constant from the code and does not scale with ``min_voxel_size``. The **long-bridge** criterion flags long gaps with a moderate vertical jump: .. math:: \Delta xy_k > \ell_{xy} \quad\text{and}\quad |\Delta z_k| > \ell_z where the thresholds track the pipeline's spatial and Z-noise scales: .. math:: \ell_{xy} = \begin{cases} \tfrac{2}{3}\,r_m & \text{if } r_m > 0 \\ 10 & \text{otherwise} \end{cases} \qquad \ell_z = \begin{cases} \sigma_z & \text{if } \sigma_z > 0 \\ 2 & \text{otherwise} \end{cases} At the legacy test defaults (:math:`r_m = 15`, :math:`\sigma_z = 2`) these recover :math:`\ell_{xy} = 10\,\text{m}` and :math:`\ell_z = 2\,\text{m}`. Flagged indices split the polyline into independent sub-features; the short arc immediately surrounding the jump is then removed by the ``min_curve_length`` / short-curve-artifact filter downstream. 2. **Vertical column split.** A windowed grade analysis detects contiguous runs where Z changes rapidly relative to 2D movement. Let :math:`N` be the number of polyline vertices and :math:`w = \min(100,\, \lfloor N/4 \rfloor)`; the detector is only engaged when :math:`w \ge 10`. For each window start :math:`i` the grade is .. math:: G_w(i) = \frac{|z_{i+w} - z_i|} {\max\!\bigl(\sum_{k=i}^{i+w-1} \Delta xy_k,\; 0.1\bigr)} Let :math:`I = \{i : G_w(i) > \tan(\theta)\}` be the set of vertical-grade hits. Consecutive hits are grouped into runs: two successive indices :math:`i_j,\, i_{j+1} \in I` belong to the same run when :math:`i_{j+1} - i_j \le w`; a gap :math:`> w` starts a new run. For each run the polyline is split both at the run-start index :math:`i_{\text{start}}` and at :math:`i_{\text{end}} + w` (clamped to the polyline's interior), so the vertical column is isolated as a short sub-feature the downstream ``min_curve_length`` / short-curve-artifact filter then drops. Here :math:`\theta` is fixed at :math:`\pi/4` (giving :math:`\tan(\theta) = 1` — steeper than :math:`45^{\circ}`). 3. **Z-reversal split.** A drawdown analysis detects features whose elevation profile climbs to an interior peak and then drops by more than :math:`\Delta z_{\max}`: .. math:: D(i) = \max_{j \le i} z_j \;-\; z_i \qquad\Longrightarrow\qquad \text{split where } D(i) > \Delta z_{\max} The feature is split at the peak preceding the excessive drawdown. Peaks at the very start or end of the feature (monotonic descents) are not considered reversals. When ``null`` (default) or set to a non-positive value, no merging or artifact splitting is performed — the internal enable flag is ``merge_radius is not None and merge_radius > 0``. *Stage 3: Monotonic topological pruning* Stage 3 operates on the **skeleton graph** :math:`G = (V, E)` produced by Stage 2 — an undirected radius graph whose nodes are the extracted skeleton points and whose edges connect skeleton points that were judged topologically adjacent. Given a node :math:`i \in V`, we write :math:`\deg(i)` for its degree (the number of edges incident to :math:`i` in :math:`E`) and :math:`\mathcal{N}(i)` for its neighbor set; these are the only graph primitives the rest of the stage relies on. Stage 3 runs four sequential operations — twig pruning, junction detection, betweenness-based chain pruning, and cycle validation — each of which consumes the output of the previous step. -- ``min_twig_length`` Minimum length (in point-cloud units) of a **twig** — a chain of degree-2 nodes emanating from a degree-1 endpoint — below which the twig is removed. Twig tracing, implemented in ``TopologicalPruner::pruneTwigs``, starts from every node :math:`e` with :math:`\deg(e) = 1` and walks the unique chain :math:`e = v_0 \to v_1 \to \dots \to v_m` through degree-2 nodes, stopping at the first stop node — either a junction (flagged by ``angular_threshold_k`` below), a graph boundary, or another degree :math:`\ne 2` node. The twig length is its cumulative Euclidean arc length .. math:: L(v_0,\dots,v_m) = \sum_{k=0}^{m-1} \bigl\lVert \mathbf{x}_{v_{k+1}} - \mathbf{x}_{v_k}\bigr\rVert_2. If :math:`L < \text{min_twig_length}`, every edge along the chain is removed from :math:`G`. Chains whose both endpoints are degree-1 are deduplicated (traced once) so they are not removed twice. When set to ``null`` (default) the value is derived as :math:`50 \times \text{min_voxel_size}` (i.e. :math:`5.0` at the default :math:`\text{min_voxel_size} = 0.1`). The sentinel is resolved in :meth:`__init__` and the derived value is what ``pruneTwigs`` actually receives. -- ``angular_threshold_k`` Robust multiplier :math:`k` that selects which high-degree nodes are classified as **junctions**. The detector, in ``TopologicalPruner::detectJunctions``, proceeds as follows. 1. **Candidate set.** Let :math:`C = \{ i \in V : \deg(i) \ge 3 \}` be the set of candidate nodes. A degree-1 or degree-2 node cannot express branching, so it is never a junction candidate. 2. **Per-node angular spread.** For each candidate :math:`i \in C`, compute the unit directions from :math:`i` to each of its neighbors, .. math:: \mathbf{d}_{i \to j} = \frac{\mathbf{x}_j - \mathbf{x}_i} {\lVert \mathbf{x}_j - \mathbf{x}_i\rVert_2}, \qquad j \in \mathcal{N}(i), (neighbors at exactly the same position as :math:`i` are discarded via a :math:`10^{-12}` length check). The **angular spread** :math:`\alpha(i)` is the *maximum* pairwise angle among those unit directions: .. math:: \alpha(i) = \max_{\substack{ j_1, j_2 \in \mathcal{N}(i) \\ j_1 < j_2}} \arccos\!\bigl( \mathbf{d}_{i \to j_1} \cdot \mathbf{d}_{i \to j_2} \bigr). The dot product is clamped to :math:`[-1, 1]` before the :math:`\arccos` to absorb floating-point drift. Candidates with fewer than two finite unit directions receive :math:`\alpha(i) = 0`. 3. **Robust global threshold.** Let :math:`m = \operatorname{median}_{i \in C}\alpha(i)` and .. math:: s = \operatorname{median}_{i \in C} \bigl|\alpha(i) - m\bigr| be the median and median absolute deviation (MAD) of the candidate spreads. The junction threshold is .. math:: \tau_{\alpha} = m + k \cdot s. Using the MAD rather than the standard deviation makes :math:`\tau_{\alpha}` robust to a handful of extreme Y-shaped splits that would otherwise inflate a mean/stddev estimate. 4. **Flagging.** A candidate :math:`i \in C` is marked as a junction iff :math:`\alpha(i) > \tau_{\alpha}`. The resulting junction mask is consumed by ``pruneTwigs`` (stop criterion) and by the segment/curve-ID assignment downstream. Raising :math:`k` makes the threshold stricter and produces fewer junctions — more branches are then treated as plain degree-2 chain vertices. Default is :math:`1.5`. -- ``betweenness_percentile`` Percentile cutoff :math:`p \in [0, 1]` on **edge betweenness centrality**. For an undirected graph, `Brandes' (2001) `_ unweighted edge betweenness is .. math:: c_B(e) = \sum_{\substack{s, t \in V \\ s \ne t}} \frac{\sigma_{st}(e)}{\sigma_{st}}, where :math:`\sigma_{st}` is the number of shortest paths from :math:`s` to :math:`t` and :math:`\sigma_{st}(e)` is the number of those paths that traverse edge :math:`e`. *Shortest* is measured by **hop count**: every edge has unit length. Intuitively, :math:`c_B(e)` counts how often :math:`e` appears on a shortest path between arbitrary pairs of nodes — bridge-like edges on the backbone of :math:`G` score high, peripheral side-roads score low. Brandes' algorithm computes all :math:`c_B(e)` in :math:`O(|V|\,|E|)` total. For each chosen source :math:`s \in V`, a single **breadth-first search (BFS)** from :math:`s` populates four arrays over nodes :math:`v \in V`: - :math:`d_s(v)` — the number of edges on a shortest :math:`s \to v` path; - :math:`\sigma_s(v)` — the count of shortest :math:`s \to v` paths, accumulated during the BFS layer sweep via :math:`\sigma_s(w) \mathrel{+}= \sigma_s(v)` whenever :math:`d_s(w) = d_s(v) + 1` and :math:`w \in \mathcal{N}(v)`; - :math:`P_s(v)` — the list of predecessors of :math:`v` on its shortest paths from :math:`s`; - :math:`\delta_s(v)` — the source-dependent *dependency* of :math:`v`, computed bottom-up after the BFS in decreasing-:math:`d_s` order by .. math:: \delta_s(v) = \sum_{w : v \in P_s(w)} \frac{\sigma_s(v)}{\sigma_s(w)} \bigl(1 + \delta_s(w)\bigr). The edge dependency contributed by source :math:`s` to edge :math:`(v, w)` with :math:`v \in P_s(w)` is :math:`(\sigma_s(v)/\sigma_s(w))\,(1 + \delta_s(w))`, and the final edge centrality is the sum over all chosen sources :math:`s`. *(Breadth-first search*, BFS, is the canonical graph traversal that visits all nodes at hop distance 1 from the source first, then all at hop distance 2, and so on; it yields shortest-path distances in unweighted graphs in :math:`O(|V| + |E|)` per source.)* **Where the centrality is computed.** Running Brandes' on the raw skeleton graph is wasteful because every degree-2 chain adds redundant edges that all carry the same score. Stage 3 therefore first **contracts** the graph: each maximal degree-2 chain between two non-degree-2 endpoints :math:`a, b` collapses to a single edge :math:`(a, b)` in a contracted graph :math:`\tilde{G}`, with that edge's weight irrelevant for the BFS (edges are still unit-length during the shortest-path computation). Centrality is then computed on :math:`\tilde{G}`, the score of each contracted edge is *mapped back* to all the original edges along the corresponding chain, and the pruning threshold is applied in the original graph. **Pruning.** Let :math:`c_{(1)} \le c_{(2)} \le \dots \le c_{(M)}` be the contracted-edge scores sorted ascending (one score per contracted edge, :math:`M` = ``contractedChainCount``). The code selects the threshold index .. math:: \operatorname{pIdx} = \lfloor p \cdot (M-1) \rfloor \in \{0, 1, \dots, M-1\}, (0-based, clamped to :math:`M-1`) and sets :math:`\tau_B = c_{(\operatorname{pIdx} + 1)}` (1-based order statistic), which is the standard *left-tail* percentile estimate for sample size :math:`M`. Every original edge whose corresponding contracted edge has score :math:`c \le \tau_B` is removed from :math:`G`. The :math:`\le` comparison has two notable edge cases: - :math:`p = 0` does **not** disable the prune. With :math:`\operatorname{pIdx} = 0`, :math:`\tau_B = c_{(1)} = \min_i c_i`, so the minimum-score edge is removed and *all* other edges tied with that minimum are removed as well. In a graph with no ties at the minimum, exactly one contracted edge (and all original edges it represents) is pruned per pass. - :math:`p = 1` gives :math:`\tau_B = c_{(M)} = \max_i c_i`, so every contracted edge satisfies :math:`c \le \tau_B` and the entire backbone is stripped. This is degenerate and is not a supported operating point. With :math:`p = 0.05` (default), roughly the bottom 5 % of backbone edges are pruned each pass. Default :math:`p = 0.05`. -- ``ridge_fraction`` Relative **distance-transform (DT)** threshold used to validate (or break) cycles in the pruned graph. Each skeleton node :math:`v` inherits from Stage 2 a scalar DT value :math:`dt(v) \ge 0` produced by the exact Felzenszwalb-Huttenlocher 2D Euclidean Distance Trasnsform (EDT) on the soft-Z binary occupancy grid. The EDT is seeded on **background** (unoccupied) cells (the standard medial-axis convention), so if :math:`\mathrm{cell}(v)` is the 2D voxel containing :math:`v` and :math:`B` is the set of background cells of the pre-thinning occupancy grid, each cell :math:`q` is assigned .. math:: E^2(q) = \min_{p \in B}(q - p)^2, and the per-node DT is .. math:: dt(v) = \sqrt{E^2\!\bigl(\mathrm{cell}(v)\bigr)} \cdot V, where :math:`V` is the voxel size. The ``validateCycles`` helper consumes these values verbatim. Because skeleton nodes are, by construction, interior points of the occupied region surviving Zhang-Suen thinning, they sit at positive :math:`dt`: thick medial ridges carry large values, boundary-adjacent nodes carry small ones. A genuine ridge-like curve therefore has uniformly high :math:`dt` along its length; spurious loops closed across narrow chokepoints contain at least one node whose :math:`dt` is comparatively small. Let :math:`m_{dt} = \operatorname{median}_{v \in V} dt(v)` and define the global DT threshold .. math:: \tau_{dt} = \text{ridge_fraction} \cdot m_{dt}. Cycles are discovered by an iterative (stack-based) depth-first traversal of the currently active subgraph (:math:`\deg(v) > 0` nodes). The implementation does not maintain the classical *discovery/finish-time* coloring, so "back edge" here is a **DFS-tree heuristic**: an edge :math:`(u, v)` examined during traversal is flagged when .. math:: v \in \operatorname{visited} \quad\text{and}\quad v \neq \operatorname{parent}(u) \quad\text{and}\quad \operatorname{depth}(v) < \operatorname{depth}(u), i.e. it re-enters an already-visited node that is not the parent in the DFS tree and that lies at a strictly smaller DFS depth. On a plain undirected graph traversed with an explicit stack this captures the typical cycle-closing edges without the overhead of the two-color scheme. Let :math:`C` be the cycle this back edge closes, reconstructed by walking the DFS parent pointers from :math:`u` back to :math:`v`. The cycle is characterised by its minimum edge DT .. math:: \operatorname{minDT}(C) = \min_{(a, b) \in C} \min\!\bigl(dt(a),\, dt(b)\bigr), i.e. the DT of the weakest edge along :math:`C` (edge DT is defined as the minimum of its two endpoint DTs). If :math:`\operatorname{minDT}(C) < \tau_{dt}` the cycle is flagged as *spurious* — at least one of its edges is too close to the boundary to represent a real loop — and the **weakest edge** of :math:`C` (the :math:`(a, b) \in C` achieving the :math:`\operatorname{minDT}`) is removed, breaking the cycle. If :math:`\operatorname{minDT}(C) \ge \tau_{dt}`, the cycle is preserved. Raising ``ridge_fraction`` makes :math:`\tau_{dt}` larger relative to the median, so more cycles fail the check and are broken; lowering it preserves more cycles. With the default :math:`0.5`, a cycle must have every edge at DT no smaller than half the median nodal DT to survive. **Operating regime.** Zhang-Suen thinning preserves only cells that retain at least one background neighbour, so every surviving skeleton cell is at squared cell-distance :math:`\ge 1` from the nearest background cell. Scaled by the voxel size this gives a hard lower bound .. math:: \min_{v} dt(v) \;\ge\; V, i.e. the smallest per-node DT is at least the voxel size :math:`V` (``spatial_scale`` when set, otherwise the Stage-1 adaptive voxel size). Because :math:`\tau_{dt} = \text{ridge\_fraction} \cdot m_{dt}` must exceed :math:`\min_v dt(v) \ge V` for any cycle edge to qualify for removal, the effective no-fire threshold is .. math:: \text{ridge\_fraction} \;\lesssim\; \frac{V}{m_{dt}}. On typical breakline inputs :math:`m_{dt}` sits roughly at :math:`1.4\,V` (so :math:`V / m_{dt} \approx 0.7`), which means the documented default :math:`0.5` is effectively a **no-op** for cycle validation on those inputs — the step becomes active only once :math:`\text{ridge\_fraction} \gtrsim 0.7`. The practical useful band is narrow and data-dependent (typically :math:`[0.8,\, 1.0]`); values above :math:`\approx 1.0` tend to remove cycle edges that are load-bearing in the final polyline chaining and introduce gaps and near self-intersections downstream. Sweep a handful of values for a new input if you need cycle validation to fire; otherwise leave at the default and rely on the twig-pruning and betweenness-pruning stages to handle spurious topology. *Stage 4: Resilient geometry refinement* -- ``rdp_tolerance`` The tolerance for Ramer-Douglas-Peucker (RDP) simplification, in point cloud units. Larger values produce fewer output points. When set to ``null`` (default) the value is derived as :math:`\text{rdp_tolerance} = 1 \times \text{min_voxel_size}` (i.e. :math:`0.1` at the default voxel size). It is kept as a separate attribute so it can be tuned independently from ``min_voxel_size`` if the application requires a different simplification scale. -- ``output_point_spacing`` The target spacing for PCHIP resampling of the final curves. Each segment will be resampled to :math:`\max(3, \lceil L / \text{spacing} \rceil)` points, where :math:`L` is the arc length. When set to :math:`0.0`, the spacing is automatically derived from the voxel size computed during stabilization. Default is :math:`0.0`. -- ``z_consistency_sigma`` The width of the 1D Gaussian filter applied along the arc-length of each segment to enforce Z consistency. Prevents grade drift in sparse regions. When set to :math:`0.0`, the width is automatically computed as :math:`5 \times \text{voxel_size}`. Default is :math:`0.0`. *Post-filtering* -- ``min_segment_length`` The minimum 3D arc-length for a curve segment to be retained in the output. Segments shorter than this value are discarded after geometry refinement. Useful for removing small fragments from noisy skeletons. When set to :math:`0.0` (default), no filtering is applied. -- ``chain_radius_factor`` Multiplier applied to the skeleton voxel size to compute the chain radius for connecting adjacent segments into continuous polylines. The chain radius is :math:`r_c = \text{chain_radius_factor} \times V_s` where :math:`V_s` is the skeleton voxel size (``spatial_scale``). Segment endpoints are tested using the same cylindrical proximity criterion as ``merge_radius``, with the chain radius :math:`r_c` as the 2D threshold and :math:`\Delta z_{\max} = \max\!\bigl(\min(3\,\sigma_z,\; 2\,r_c),\; 10 \times \text{min_voxel_size}\bigr)` as the Z-difference bound. Default is :math:`3.0`. -- ``min_curve_length`` The minimum arc-length for a curve to be retained in the final output. Curves whose arc-length is below this value are discarded after all chaining, merging, and geometric cleanup. Arc-length is measured as the 2D (XY-plane footprint) length :math:`L_{\text{2D}}`. For a polyline with :math:`N` vertices :math:`\mathbf{p}_1, \ldots, \mathbf{p}_N`, the 2D arc-length (XY-plane footprint) is .. math:: L_{\text{2D}} = \sum_{i=1}^{N-1} \sqrt{(x_{i+1}-x_i)^2 + (y_{i+1}-y_i)^2} and the 3D arc-length (full Euclidean) is .. math:: L_{\text{3D}} = \sum_{i=1}^{N-1} \sqrt{(x_{i+1}-x_i)^2 + (y_{i+1}-y_i)^2 + (z_{i+1}-z_i)^2} In regions where the Z coordinate oscillates (e.g., due to noisy PCHIP interpolation in steep terrain), :math:`L_{\text{3D}}` can be significantly larger than :math:`L_{\text{2D}}`, causing visually short curves to survive the filter. For this reason the default metric is :math:`L_{\text{2D}}`. When set to :math:`0.0` no curve-level filtering is applied (and the ``adopt-short-CIDs`` pass is also disabled). Default ``null`` (sentinel) -- derived as :math:`500 \times \text{min_voxel_size}` at instance construction (= :math:`50.0` m at the production ``min_voxel_size = 0.1`` fixture, matching the inicorta-tuned value). **Note:** ``min_curve_length`` is scene-coupled (depends on the characteristic length of the curves being extracted), not just scale-coupled; override explicitly when the scene's curve-length distribution differs from inicorta's. The kwarg is also load-bearing for the ``adopt-short-CIDs`` pass (gated by ``min_curve_length > 0``), so changing it can alter intermediate CID groupings, not just the final filter. *Snap-to-input pass* After the chain-merge cascade and resilient geometry refinement, the snap-to-input pass shifts interior polyline vertices toward the input curve-class points the extraction missed. Endpoints are deliberately never moved. The pass runs in two stages: a pre-sanitise pulse of ``snap_passes`` and a post-sanitise pulse of ``snap_post_passes``. The sanitiser between them clips loop-prone polyline material and the post-pulse re-fits the clipped polylines toward newly-missed points. -- ``snap_enable`` When ``true`` (default) the snap-to-input pass is active. -- ``snap_max_shift`` Maximum 2D distance (in metres) a single interior vertex can be shifted in one snap pass. Smaller caps reduce the number of near-self-crossings the sanitiser later clips, which in turn preserves more coverage. Default ``null`` (sentinel) -- derived as :math:`35 \times \text{min_voxel_size}` at instance construction. -- ``snap_min_neighbors`` Minimum number of missed input points that must pull a given interior vertex before that vertex is shifted. Acts as a sparsity filter that suppresses noisy single-point pulls. Default :math:`4`. -- ``snap_smoothing_iterations`` Number of 3-point moving-average passes applied to the per-vertex delta sequence before the shifts are committed. Prevents adjacent vertices from shifting in opposing directions. Default :math:`2`. -- ``snap_target_radius`` Coverage radius (m) used to classify input points as MISSED (eligible to pull a vertex). A curve-class input point is MISSED iff its distance to every densified polyline vertex exceeds this radius. Default ``null`` (sentinel) -- derived as :math:`30 \times \text{min_voxel_size}` at instance construction. -- ``snap_pull_radius`` Maximum distance (m) from a missed point to its nearest interior polyline vertex; missed points farther than this are ignored. Larger values pull more distant misses toward their nearest polyline; gain saturates once this exceeds typical inter-cluster spacing. Default ``null`` (sentinel) -- derived as :math:`400 \times \text{min_voxel_size}` at instance construction. -- ``snap_densify_step`` Spacing (m) of the densified polyline used to classify input points as covered or missed. When set to ``null`` (default) the value is derived as :math:`5 \times \text{min_voxel_size}` (i.e. :math:`0.5` m at the default voxel size). -- ``snap_passes`` Number of times the snap pass is applied BEFORE the H25 sanitiser. Each pass re-evaluates which input points are still MISSED, so successive passes progressively close the remaining coverage gap. Default :math:`30`. -- ``snap_post_passes`` Number of additional snap passes applied AFTER the sanitiser. The sanitiser clips loop-prone polyline material the pre-sanitise snap may have created; the post-pass re-fits the clipped polylines toward newly-missed points. The per-polyline SI guard prevents any post-pass shift from introducing new self-crossings. Setting to :math:`0` reproduces the snap behaviour. Default :math:`20`. *Hallucination-drop and snap/truncate iteration* After all snap passes, a hallucination-drop pass identifies and truncates polyline material with no input curve-class support. The snap/truncate iteration loop then alternates one snap pass and one drop pass for several cycles, so coverage gains from the snap compound while the hallucination invariant is preserved (no hallucinated material survives any cycle). -- ``hallucination_drop_enable`` When ``true`` (default), every feature whose hallucinated polyline length exceeds ``hallucination_drop_threshold`` of its total length is dropped at the end of the global-optimisation step. Hallucinated material is polyline length without any input curve-class support within ``hallucination_drop_radius``. -- ``hallucination_drop_radius`` Neighbourhood radius (m) used to classify a densified polyline vertex as supported. Default ``null`` (sentinel) -- derived as :math:`25 \times \text{min_voxel_size}` at instance construction. -- ``hallucination_drop_threshold`` Per-feature hallucination ratio above which the feature is truncated at its hallucinated runs (or dropped if no supported run survives). Default :math:`0.025` -- identical to the validation metric's ``--hallucination-feature-threshold``. -- ``hallucination_drop_densify_step`` Polyline densification step (m) used to compute the per-feature hallucination ratio. When set to ``null`` (default) the value is derived as :math:`5 \times \text{min_voxel_size}` (i.e. :math:`0.5` m at the default voxel size), matching the validation metric's densification step. -- ``hallucination_min_supported_length`` Minimum length (m) of a supported sub-polyline kept when a hallucinated feature is split. Hallucinated features above the per-feature threshold are truncated at their hallucinated runs; each contiguous supported run becomes a new feature, but only if its 2D length is at least this many metres. Shorter supported runs are discarded. Default ``null`` (sentinel) -- derived as :math:`1 \times \text{min_voxel_size}` at instance construction. -- ``snap_truncate_iterations`` Number of snap/truncate cycles run after the initial truncation pass. Each cycle performs one :meth:`_snap_polylines_to_input` followed by one :meth:`_drop_hallucinated_features` so coverage gains from snap shifts compound without ever leaving hallucinated material in the output. Default :math:`15` -- empirically saturates the residual coverage budget at modest runtime cost. *Endpoint-extension pass* After the snap-truncate loop saturates, a few input points remain MISSED just past the polyline endpoints because the snap pass deliberately never moves endpoints. The endpoint- extension pass walks each endpoint outward along its tangent in steps of ``endpoint_extension_step`` while the next step still has at least one input curve-class point within ``endpoint_extension_radius``. The walk stops at the first unsupported step or after ``endpoint_extension_max_length`` metres, so by construction the extension never introduces hallucinated material. -- ``endpoint_extension_enable`` When ``true`` (default) the endpoint-extension pass is active. The pass cannot introduce hallucinated material because it stops at the first unsupported step. Default ``true``. -- ``endpoint_extension_radius`` Neighbourhood radius (m) used to test whether a candidate extension point is supported by the input cloud. Default ``null`` (sentinel) -- derived as :math:`25 \times \text{min_voxel_size}` at instance construction. -- ``endpoint_extension_step`` Distance (m) between successive extension steps. Smaller values track curved input clusters more precisely; larger values are faster. When set to ``null`` (default) the value is derived as :math:`5 \times \text{min_voxel_size}` (i.e. :math:`0.5` m at the default voxel size). -- ``endpoint_extension_max_length`` Maximum extension length (m) per endpoint. Caps the walk so a particularly long stretch of supported material does not run away. Default ``null`` (sentinel) -- derived as :math:`150 \times \text{min_voxel_size}` at instance It should be long enough to add meaningful coverage past saturated endpoints. The per-step crossing guard (with in-walk segment registration) plus the post-extension truncation pass keep self-intersections at zero, gaps at zero, and hallucinated features at zero even at this length. *z-disjoint CID relabel* After hallucination-drop, any ``CURVE_ID`` that still groups two or more features whose z-extents are separated by more than ``cid_relabel_z_disjoint_threshold`` metres is broken into multiple ``CURVE_IDs`` (one per connected z-component). This eliminates phantom gap pairs that arise when split-z-jump or split-z-reversal cuts a polyline into vertically disjoint halves but leaves both halves with the original ``CURVE_ID``. -- ``cid_relabel_z_disjoint_enable`` When ``true`` (default) the z-disjoint CID relabel pass is active. Eliminates phantom same-CID gap pairs left behind by Z-jump and Z-reversal splits. Default ``true``. -- ``cid_relabel_z_disjoint_threshold`` Z separation (m) above which two features sharing a ``CURVE_ID`` are considered disjoint and split apart into two ``CURVE_IDs``. Default ``null`` (sentinel) -- derived as :math:`50 \times \text{min_voxel_size}` at instance construction . Well above ``z_tolerance`` (so legitimate Z-noise stays grouped) and well below the observed bad-pair gap on the production fixture. *Endpoint-pair gap splitter* After all geometry passes, this final pass splits any same-CID feature pair whose closest endpoints are within ``cid_gap_split_planar_radius`` in 2D AND more than ``cid_gap_split_dz_threshold`` apart in elevation. The metric counts such pairs as gaps, but the existing z-disjoint relabel only fires when the features' z-extents are fully disjoint; the endpoint-pair splitter closes the loophole where a polyline dips below another within the cluster but emerges with z-disjoint endpoints. -- ``cid_gap_split_enable`` When ``true`` (default) the endpoint-pair gap splitter is active. Default ``true``. -- ``cid_gap_split_planar_radius`` Maximum 2D distance (m) between the closest endpoints of two same-CID features for the pair to be considered a candidate gap. When set to ``null`` (default) the value is derived from ``merge_radius`` (:math:`R_{\text{merge}}` when set, else :math:`15.0` m fallback) so the planar threshold tracks the metric's gap radius. -- ``cid_gap_split_dz_threshold`` Minimum :math:`|\Delta z|` (m) between the closest endpoints of two same-CID features for the pair to be flagged as a vertical gap and split apart. Default ``null`` (sentinel) -- derived as :math:`30 \times \text{min_voxel_size}` at instance construction. *Orphan-fragment merge* Final cleanup pass that reassigns short whole-CID fragments to a neighbouring ``CURVE_ID`` when they sit planar-close AND z-overlapping with that neighbour. Geometry is never modified; only the CID label of the orphan changes. Two safety nets prevent undoing prior splits: the geometric guard (planar-close AND z-overlap) is the strict negation of ``cid_relabel_z_disjoint_*`` (which fires only on z-extent separation); and the length cap prevents undoing the endpoint-pair splitter, which operates on full-length features. -- ``merge_orphans_enable`` When ``true`` (default) the orphan-fragment merge pass is active. Default ``true``. *Final cleanup* Optional pass that runs at the tail of the global-optimization phase (after the post-optimization snap/sanitize/snap chain) and re-applies three cleanups whose first invocation in the pipeline ran *before* the merge and snap cascades and whose invariants the cascades can violate. The three operations are: 1. ``split_z_reversals`` at the standard :math:`\Delta z_{\text{full}} = \mathtt{\_max\_dz}( \text{merge\_radius})` threshold. The post-optimization ``post_chain_merge_until_stable`` driver bridges same-CID features across different elevations and can introduce drawdowns far above the pre-merge split's threshold. 2. ``drop_cov_refine_by_predicate`` with the cross-CID coverage-refine predicate. Snap shifts can drift coverage_refine helpers across an other-CID feature after the pre-snap drop already ran. 3. ``_relabel_disconnected_curves``. Adoption can place a feature into a ``CURVE_ID`` whose other members are within ``merge_radius`` at the moment of adoption; subsequent endpoint extension and snap can then drift the components apart. Between operations 1 and 2 the driver also drops post-split sub-features whose 2D length falls below ``min_segment_length`` (post-split tails shorter than the SCE min-segment threshold). The pass is **off by default** so production pipelines, the :class:`.SimpleCurveEvaluator` workflow, and the bench probes preserve their established behaviour exactly. The ``SimpleCurveExtractionTest`` opts in (``"final_cleanup": true``) because its strict geometric checks (cross-CID, disconnected components, z-reversals) are not part of the production KPI set and thus tolerate the small coverage / feature-count perturbations the cleanup can introduce. -- ``final_cleanup`` When ``true``, run the optional final-cleanup cascade described above at the tail of the global-optimization phase. When ``false`` (default), the cascade is skipped and the global-optimization output is forwarded unchanged. Default ``false``. *Output configuration* -- ``output_gpkg`` The output path for the GeoPackage file. If ``null`` or not specified, no GeoPackage is written. The GeoPackage contains LineStringZ geometries with curve and segment metadata. -- ``output_shp`` The output path for the Shapefile. If ``null`` or not specified, no Shapefile is written. The Shapefile contains PolyLineZ geometries (Shape Type 13) and produces ``.shp``, ``.shx``, ``.dbf``, and ``.prj`` files. -- ``output_csv`` The output path for the CSV file. If ``null`` or not specified, no CSV is written. The CSV contains per-point records with columns ``CurveID``, ``SegmentID``, ``SequenceID``, ``X``, ``Y``, ``Z``, and ``Slope``. -- ``curve_pcloud`` The output path for the curve point cloud in LAS format. If ``null`` or not specified, no point cloud is written. When set, each curve is resampled along its arc-length at ``curve_pcloud_step`` spacing to produce :math:`N_i = \max\!\bigl(2,\, \lceil L_i / \Delta s \rceil + 1\bigr)` points per curve, where :math:`L_i` is the 3D arc-length and :math:`\Delta s` is ``curve_pcloud_step``. Point coordinates are linearly interpolated along the arc parameterization :math:`\bigl(X(s),\, Y(s),\, Z(s)\bigr)`. The output LAS inherits the CRS from the input point cloud (via GeoTIFF VLRs) and always includes ``CURVE_ID`` (int32), ``SEG_ID`` (int32), and ``SLOPE`` (float64) as required extra dimensions, plus four length columns ``CURVE_LENGTH_3D``, ``CURVE_LENGTH_2D``, ``SEG_LENGTH_3D``, ``SEG_LENGTH_2D`` (all ``float32``). ``CURVE_LENGTH_*`` carries the per-``CURVE_ID`` total length broadcast to every point of every feature in that curve; ``SEG_LENGTH_*`` carries the per-feature length broadcast to every point of that feature. ``CURVE_ID`` and ``SEG_ID`` in the LAS are renumbered to contiguous, gap-free 0-based ranges immediately before writing. -- ``curve_pcloud_step`` The distance between consecutive points in the curve point cloud, in the same spatial units as the input point cloud. Smaller values produce denser point clouds with finer curve representation. Default ``null`` (sentinel) -- derived as :math:`1 \times \text{min_voxel_size}` at instance construction. -- ``slope_output_form`` Output convention for the ``SLOPE`` column written by the curve point cloud. Internal computations always use the project- canonical ``slope = dZ/ds_3D = sin(theta)`` (matches ``computeGrade`` in ``cpp/src/alg/CurveSimplifier.tpp``; bounded in :math:`[-1, 1]`; well-defined on near-vertical segments). The writer can emit one of three conventions: * ``"sin"`` (default) -- emit the cached :math:`\sin(\theta) = dZ/ds_{\text{3D}}` value. Bounded in :math:`[-1, 1]`; uniform color ramp; the most stable form. * ``"tan"`` -- recompute :math:`\tan(\theta) = dZ/dxy` directly from the resampled output coordinates with a planar guard :math:`dxy \geq 10^{-9}` and a magnitude clip at :math:`\pm 10^{6}`. Matches civil-engineering "grade"; can blow up on near-vertical sub-polylines (a :math:`89.99^\circ` segment reads as :math:`\approx 5730`, crushing any color ramp). * ``"angle_deg"`` -- recompute the inclination angle in degrees as :math:`\arctan_2(dZ, dxy) \cdot 180 / \pi`. Unconditionally bounded in :math:`[-90^\circ, +90^\circ]`; the most interpretable form for visualization in QGIS, CloudCompare, or any tool that maps a numeric column to a color ramp. The ``"tan"`` and ``"angle_deg"`` conventions recompute the slope directly from the output coordinates (rather than from the cached :math:`\sin` via the analytical conversion :math:`\tan = \sin / \sqrt{1 - \sin^{2}}`) to avoid catastrophic cancellation near vertical. Internal slope computations are not affected by this setting. -- ``epsg`` The EPSG code for the coordinate reference system of the output files. If not specified, the CRS is inherited from the input LAS file. -- ``propagate_labels`` Whether to propagate curve and segment IDs back to the original point cloud (``true``) or not (``false``). When enabled, each point in the original cloud receives ``curve_id`` and ``segment_id`` attributes via nearest-neighbor assignment from the extracted curves. Points farther than :math:`2 \times \text{voxel_size}` remain unassigned. Disabled by default to save memory when only file outputs are needed. *General* -- ``cluster_name`` The name for the cluster attribute in the output point cloud. Default is ``"CLUSTER"``. -- ``nthreads`` The number of threads for parallel computations. The value ``-1`` means using as many threads as available cores. -- ``post_clustering`` The post-clustering pipeline. It can be ``null`` or a list of :ref:`post-clustering components `. **Output** The :class:`.SimpleCurveExtractor` can produce up to four output file formats: ``CURVE_ID`` and ``SEG_ID`` in every output are renumbered into contiguous, gap-free 0-based ranges immediately before writing. After the renumbering pass the surviving curves carry IDs ``0..N-1`` (where ``N`` is the number of distinct curves), and within each curve the segments carry IDs ``0..N_i-1`` (where ``N_i`` is the number of features of curve ``i``). The four length attributes ``CURVE_LENGTH_3D``, ``CURVE_LENGTH_2D``, ``SEG_LENGTH_3D``, and ``SEG_LENGTH_2D`` are always written into every output that supports per-feature attributes. - **GeoPackage** (``.gpkg``): Contains a ``road_edges`` layer with LineStringZ geometries. Each feature includes metadata fields such as ``CURVE_ID``, ``SEG_ID``, ``CONF_GEOM`` (geometry confidence), ``CONF_TOP`` (topology confidence), ``STAB_TOP`` (topology stability), ``SRC_PTS`` (source point count), and (when enabled) ``CURVE_LENGTH_3D`` / ``CURVE_LENGTH_2D`` (per-``CURVE_ID`` total lengths broadcast to every feature of the curve), ``SEG_LENGTH_3D`` / ``SEG_LENGTH_2D`` (per-feature 3D / planar polyline lengths), ``AVG_GRADE`` (mean absolute slope), and ``MAX_GRADE`` (maximum absolute slope). - **Shapefile** (``.shp``): Contains PolyLineZ geometries (Shape Type 13) with the same attribute fields as the GeoPackage output. Field names are abbreviated to fit the 10-character dbase limit: ``CRV_LEN3D`` / ``CRV_LEN2D`` / ``SEG_LEN3D`` / ``SEG_LEN2D`` (1:1 to the GeoPackage / LAS names ``CURVE_LENGTH_3D`` etc.). Produces the standard ``.shp``, ``.shx``, ``.dbf``, and ``.prj`` companion files. - **CSV** (``.csv``): A per-point tabular format with columns ``CurveID``, ``SegmentID``, ``SequenceID``, ``X``, ``Y``, ``Z``, and ``Slope``. Coordinates are written with 6 decimal places and slope values with 8 decimal places. A header row is included. The CSV does not carry the length columns. - **LAS point cloud** (``.las``): A georeferenced 3D point cloud where each curve is resampled at ``curve_pcloud_step`` spacing. Each point carries ``CURVE_ID`` (int32), ``SEG_ID`` (int32), and ``SLOPE`` (float64) as required extra byte dimensions, plus four length columns: ``CURVE_LENGTH_3D`` / ``CURVE_LENGTH_2D`` (float32, per-``CURVE_ID`` totals broadcast to every point of the curve) and ``SEG_LENGTH_3D`` / ``SEG_LENGTH_2D`` (float32, per-feature totals broadcast to every point of the feature). ``SLOPE`` follows the convention selected by ``slope_output_form``: the project-canonical :math:`dZ/ds_{\text{3D}} = \sin(\theta)` by default (``"sin"``), the civil-engineering :math:`\tan(\theta) = dZ/dxy` (``"tan"``), or the inclination angle in degrees :math:`\arctan_2(dZ, dxy) \cdot 180 / \pi` (``"angle_deg"``). See the ``slope_output_form`` argument for details. The CRS is inherited from the input LAS file via its GeoTIFF VLRs. This format enables direct 3D overlay with the source point cloud in tools like CloudCompare or QGIS for visual quality assessment. **Output** The figure below represents the curves extracted from the breaklines of a 3D point cloud. The visualization was computed in QGIS. .. figure:: ../img/extracted_curve.jpg :scale: 50 :alt: Figure representing the extracted curves. Visualization of the extracted cruves representing breaklines. .. _Clustering post-processing: Post-processing ================== Once clusters are available, it might be useful to apply some post-processing to derive useful information from them. That is what post-processing components aim to do. They can be defined in the ``"post_clustering"`` list. The execution will start by the first component in the list and will end at the last one, i.e., it is sequential with respect to the order in the list. ClusterEnveloper -------------------- The :class:`.ClusterEnveloper` can be used to find the geometry enclosing the points. Envelopes often support the estimation of the enclosed volume. They can be exported to output files. The JSON below shows how to define a cluster enveloper post-processor inside the post-processing pipeline of a clusterer. .. code-block:: json "post_clustering": [ { "post-processor": "ClusterEnveloper", "envelopes": [ { "type": "AlignedBoundingBox", "compute_volume": true, "output_path": "*/aabbs.csv", "separator": "," }, { "type": "BoundingBox", "compute_volume": true, "output_path": "*/bboxes.csv", "separator": "," } ] } ] The JSON above defines a :class:`.ClusterEnveloper` that will compute the axis-aligned bounding box and also the oriented bounding box. The volume will be computed for each envelope and they will be exported to a CSV file. More concretely, each row in the CSV will contain three coordinates representing a vertex, an integer representing the cluster label, and the estimated volume. All the values will be separated by ``,``. **Arguments** -- ``envelopes`` List of envelopes that must be computed by the :class:`.ClusterEnveloper` -- ``type`` The type of the envelope, it can be ``"AxisAlignedBoundingBox"`` or ``"BoundingBox"`` (the edges of the last one are not necessarily parallel to the vectors of the canonical basis). -- ``compute_volume`` Flag to specify whether to compute the volume enclosed by the envelope (``true``) or not (``false``). -- ``output_path`` The path where the envelope's data will be exported (typically in CSV format). -- ``separator`` The separator to be used in the output CSV file. ClusterMarker ------------------ The :class:`.ClusterMarker` can be used to compute a point for each cluster to represent it. These points can be exported as CSV files or shapefiles. The JSON below shows how to define a cluster marker post-processor inside the post-processing pipeline of a clusterer. .. code-block:: json "post_clustering": [ { "post-processor": "ClusterMarker", "strategy": "centroid", "proj_str": null, "epsg": 32630, "output_path": "*/marks.shp", "nthreads": -1 } ] The JSON above defines a :class:`.ClusterMarker` that will compute the centroid for each cluster as a marker. The output will be exported in the EPSG:32630 CRS to a shapefile called *marks.shp*. The computations will be carried out using all the cores available in the system. **Arguments** -- ``strategy`` The strategy that must be used to compute the markers. Supported strategies are ``"centroid"``, (see :meth:`.ClusterMarker.compute_cluster_centroid`), ``"midrange"`` (see :meth:`.ClusterMarker.compute_cluster_midrange`), ``"medianoid"`` (see :meth:`.ClusterMarker.compute_cluster_medianoid`), ``"medoid"`` (see :meth:`.ClusterMarker.compute_cluster_medoid`), and ``"geometric_median"`` (see :meth:`.ClusterMarker.compute_cluster_geometric_median`). -- ``proj_str`` The coordinate reference system specified as a projection string (proj). -- ``epsg`` Can be optionally used to specified the CRS following the European Petroleum Survey Groups (EPSG) standard format. -- ``nthreads`` The number of threads that must be used for the parallel computations. Note that -1 means using as many threads as available cores. -- ``output_path`` The output path where the markers must be written. Note that if the extension is ``.shp``, then the output will be written as a shapefile. Otherwise, it will be written as an ASCII CSV file. ClusterSelector ------------------ The :class:`.ClusterSelector` can be used to discard those clusters that do not satisfy satisfy some given criteria. It consists of a list of filters that check some given relationals and either preserve or discard those clusters that satisfy them. The JSON below shows how to define a cluster enveloper post-processor inside the post-processing pipeline of a clusterer. .. code-block:: json "post_clustering": [ { "post-processor": "ClusterSelector", "filters": [ { "attribute": "number_of_points", "relational": "greater_than_or_equal_to", "target": 256, "action": "preserve" }, { "attribute": "surface_area", "relational": "less_than", "target": 1, "action": "discard" }, { "attribute": "z_length", "relational": "less_than", "target": 3, "action": "discard" } ] } ] The JSON above defines a :class:`.ClusterSelector` that will preserve only the clusters with at least :math:`256` points and discard those whose convex hull has a surface area less than :math:`1` meter or whose length along the :math:`z`-axis is below the :math:`3` meters. **Arguments** .. _Cluster selector filters: -- ``filters`` The list of filters based on relational conditions to decide whether to preserve or discard those clusters who satisfy them. -- ``attribute`` The attribute involved in the relational condition as the left-hand side. Supported attributes are ``"number_of_points"``, ``"surface_area"``, ``"volume"``, ``"surface_density"``, ``"volume_density"``, ``"x_length"``, ``"y_length"`` and ``"z_length"``. Their description can be read in the documentation of :class:`.ClusterSelector`. -- ``relational`` The relational governing the condition. Supported relationals are ``"not_equals"`` :math:`x \neq y`, ``"equals"`` :math:`x = y`, ``"less_than"`` :math:`x < y`, ``"less_than_or_equal_to"`` :math:`x \leq y`, ``"greater_than"`` :math:`x > y`, ``"greater_than_or_equal_to"`` :math:`x \geq y`, ``"in"`` :math:`x \in S`, ``"not_in"`` :math:`x \notin S`, and ``"inside"`` :math:`x \in [a, b] \subset \mathbb{R}`. -- ``target`` The right-hand side of the relational condition. -- ``action`` Whether to ``"preserve"`` or ``"discard"`` the clusters that satisfy the relational condition. .. _FPS decorated clusterer: Decorators =============== Furthest point sampling decorator -------------------------------------------- The :class:`.FPSDecoratorTransformer` can be used to decorate a data clusterer such that the computations can take place in a transformed space of reduced dimensionality. Typically, the domain of a clustering is the entire point cloud, let us say :math:`m` points. When using a :class:`.FPSDecoratedClusterer` this domain will be transformed to a subset of the original point cloud with :math:`R` points, such that :math:`m \geq R`. Decorating a clusterer with this decorator can be useful to reduce its execution time. .. code-block:: json { "clustering": "fpsdecorated", "fps_decorator": { "num_points": "m/10", "fast": true, "num_encoding_neighbors": 1, "num_decoding_neighbors": 1, "release_encoding_neighborhoods": false, "threads": 16, "representation_report_path": null }, "decorated_clusterer": { "clustering": "dbscan", "cluster_name": "cluster_wood", "precluster_name": "prediction", "precluster_domain": [1], "min_points": 200, "radius": 3.0, "post_clustering": [ { "post-processor": "ClusterEnveloper", "envelopes": [ { "type": "AlignedBoundingBox", "compute_volume": true, "output_path": "*/aabbs.csv", "separator": "," }, { "type": "BoundingBox", "compute_volume": true, "output_path": "*/bboxes.csv", "separator": "," } ] } ] } } **Arguments** -- ``fps_decorator`` The specification of the furthest point sampling (FPS) decoration carried out through the :class:`.FPSDecoratorTransformer`. -- ``num_points`` The target number of points :math:`R` for the transformed point cloud. It can be an integer or an expression that will be evaluated with :math:`m` representing the number of points of the original point cloud, e.g., ``"m/2"`` will downscale the point cloud to half the number of points. -- ``fast`` Whether to use exact furthest point sampling (``false``) or a faster stochastic approximation (``true``). -- ``num_encoding_neighbors`` How many closest neighbors in the original point cloud are considered for each point in the transformed point cloud to reduce from the original space to the transformed one. -- ``num_decoding_neighbors`` How many closest neighbors in the transformed point cloud are considered for each point in the original point cloud to propagate back from the transformed space to the original one. -- ``release_encoding_neighborhoods`` Whether the encoding neighborhoods can be released after computing the transformation (``true``) or not (``false``). Releasing these neighborhoods means the :meth:`.FPSDecoratorTransformer.reduce` method must not be called, otherwise errors will arise. Setting this flag to true can help saving memory when needed. -- ``threads`` The number of parallel threads to consider for the parallel computations. Note that ``-1`` means using as many threads as available cores. -- ``representation_report_path`` Where to export the transformed point cloud. In general, it should be ``null`` to prevent unnecessary operations. However, it can be enabled (by given any valid path to write a point cloud file) to visualize the points that are seen by the data miner. -- ``decorated_clusterer`` A typical clustering specification. See :ref:`the DBScan clusterer ` for an example. .. _Mindist decimator decorated clusterer: Minimum distance decimator decorator ----------------------------------------- The :class:`.MinDistDecimatorDecorator` can be used to decorate a clusterer such that the computations can take place in a transformed space of reduced dimensionality. Typically, the domain of a clusterer is the entire point cloud, let us say :math:`m` points. When using a :class:`.MinDistDecoratedClusterer` this domain will be transformed to a subset of the original point cloud with :math:`R \leq m` points. This representation is achieved by assuring that no point is closer to its closest neighbor than a given minimum distance threshold. Decorating a clusterer with this decorator can be useful to reduce its execution time. .. code-block:: json { "clustering": "MinDistDecorated", "mindist_decorator": { "min_distance": 0.3, "num_encoding_neighbors": 1, "num_decoding_neighbors": 1, "release_encoding_neighborhoods": false, "threads": -1, "representation_report_path": null }, "decorated_clusterer": { "clustering": "dbscan", "cluster_name": "bunny", "precluster_name": "classification", "min_points": 3, "radius": 3.0 } } **Arguments** -- ``mindist_decorator`` See the :ref:`minimum distance decorated miner documentation about arguments ` . -- ``decorated_clusterer`` A typical clustering specification. See :ref:`the DBScan clusterer ` for an example.