src.model.deeplearn.arch.transf_octorf_pwise_classif

Classes

GVABlock(arch, depth[, input_dim, output_dim])

SharedMLPBlock(arch, depth[, input_dim, ...])

TransfOctoRFPwiseClassif(**kwargs)

TransformerBlock(arch, depth[, input_dim, ...])

class src.model.deeplearn.arch.transf_octorf_pwise_classif.TransfOctoRFPwiseClassif(**kwargs)
Author:

Alberto M. Esmoris Pena

Transformer, SharedMLP, or grouped vector attention (GVA) operator followed by a classification head for the TransfOctoRF pipeline. Consumes K-nearest-neighbor centroids with their features, RF outputs, uncertainties, and inter-point coordinates.

Inputs

Let \(B\) be the batch size, \(K\) the number of neighbors (including the center point itself), \(n_f\) the number of input features, and \(n_c\) the number of classes. The architecture receives three inputs:

  • \(\mathbf{F} \in \mathbb{R}^{B \times K \times n_f}\) — per-neighbor feature vectors (RF outputs, mined features, etc.).

  • \(\mathbf{X} \in \mathbb{R}^{B \times K \times 3}\) — per-neighbor 3D coordinates.

  • \(\mathbf{m} \in \{0, 1\}^{B \times K}\) — padding mask where \(m_{bk} = 1\) if neighbor \(k\) is valid for sample \(b\), and \(m_{bk} = 0\) otherwise.

Transformer Path

When operator="transformer", the hidden representation is computed as follows. Let \(n_h\) be the hidden dimensionality and \(H\) the number of attention heads.

1. Feature projection. A linear layer projects the input features to the hidden dimension. When \(n_f = n_h\), the projection is the identity: \(\mathbf{H}^{(0)} = \mathbf{F}\). Otherwise:

\[\mathbf{H}^{(0)} = \mathbf{F} \, \mathbf{W}_{\text{proj}} + \mathbf{b}_{\text{proj}}, \quad \mathbf{W}_{\text{proj}} \in \mathbb{R}^{n_f \times n_h}\]

2. Positional encoding. Relative coordinates with respect to the center point \(\mathbf{x}_0\) are computed and passed through a two-layer MLP with ReLU activations:

\[\boldsymbol{\delta}_k = \mathbf{x}_k - \mathbf{x}_0, \quad k = 0, \ldots, K-1\]
\[\mathbf{PE} = \text{ReLU}\!\bigl( \text{ReLU}(\boldsymbol{\delta} \, \mathbf{W}_1 + \mathbf{b}_1) \, \mathbf{W}_2 + \mathbf{b}_2 \bigr), \quad \mathbf{W}_1 \in \mathbb{R}^{3 \times d_{\text{pe}}},\; \mathbf{W}_2 \in \mathbb{R}^{d_{\text{pe}} \times n_h}\]

where \(d_{\text{pe}}\) is the positional encoding intermediate dimensionality (pos_enc_units). The PE MLP weights are shared across all depths. The positional encoding is added to the projected features:

\[\mathbf{H}^{(1)} = \mathbf{H}^{(0)} + \mathbf{PE}\]

3. Multi-head self-attention. The attention is computed with \(H\) heads, each with key dimensionality \(d_k = n_h / H\). For each head \(h\):

\[\mathbf{Q}_h = \mathbf{H}^{(1)} \mathbf{W}_h^Q, \quad \mathbf{K}_h = \mathbf{H}^{(1)} \mathbf{W}_h^K, \quad \mathbf{V}_h = \mathbf{H}^{(1)} \mathbf{W}_h^V\]
\[\text{Attn}_h = \text{softmax}\!\left( \frac{\mathbf{Q}_h \mathbf{K}_h^\top}{\sqrt{d_k}} + \mathbf{M} \right) \mathbf{V}_h\]

where \(\mathbf{M}\) is the attention mask derived from \(\mathbf{m}\): positions where \(m_{bk} = 0\) receive \(-\infty\) to zero out their attention weight after softmax. The heads are concatenated and linearly projected:

\[\text{MHA}(\mathbf{H}^{(1)}) = \text{Concat}(\text{Attn}_1, \ldots, \text{Attn}_H) \, \mathbf{W}^O, \quad \mathbf{W}^O \in \mathbb{R}^{n_h \times n_h}\]

4. Residual connections and layer normalization.

\[\mathbf{H}^{(2)} = \text{LayerNorm}\!\bigl( \mathbf{H}^{(1)} + \text{MHA}(\mathbf{H}^{(1)}) \bigr)\]

5. Feed-forward network (FFN).

\[\text{FFN}(\mathbf{z}) = \text{ReLU}(\mathbf{z} \, \mathbf{W}_3 + \mathbf{b}_3) \, \mathbf{W}_4 + \mathbf{b}_4, \quad \mathbf{W}_3 \in \mathbb{R}^{n_h \times 2n_h},\; \mathbf{W}_4 \in \mathbb{R}^{2n_h \times n_h}\]

with dropout applied after the FFN and a second residual + LayerNorm:

\[\mathbf{H}^{(3)} = \text{LayerNorm}\!\bigl( \mathbf{H}^{(2)} + \text{Dropout}(\text{FFN}(\mathbf{H}^{(2)})) \bigr)\]

The block output is \(\mathbf{Z} = \mathbf{H}^{(3)} \in \mathbb{R}^{B \times K \times n_h}\). The notation \(\mathbf{H}^{(i)}\) denotes intra-block intermediates; \(\mathbf{Z}^{(d)}\) in the multi-depth section denotes the output of block \(d\).
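The five steps above can be traced end to end with plain NumPy. This is a minimal sketch under stated assumptions, not the Keras implementation: the weights are random stand-ins, LayerNorm uses \(\gamma = 1\) and \(\beta = 0\), dropout is omitted (inference behavior), and all sizes and helper names (dense, split_heads, n_heads) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, n_f, n_h, n_heads, d_pe = 2, 5, 7, 8, 2, 16  # illustrative sizes
d_k = n_h // n_heads

def dense(d_in, d_out):
    """Random (W, b) pair standing in for a trained linear layer."""
    return rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out)

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis with gamma = 1 and beta = 0."""
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def split_heads(t):
    """(B, K, n_h) -> (B, n_heads, K, d_k)."""
    return t.reshape(B, K, n_heads, d_k).transpose(0, 2, 1, 3)

F = rng.normal(size=(B, K, n_f))   # per-neighbor features
X = rng.normal(size=(B, K, 3))     # per-neighbor coordinates
m = np.ones((B, K), dtype=bool)    # padding mask
m[1, 3:] = False                   # sample 1: last two neighbors are padding

# 1. Feature projection.
W_proj, b_proj = dense(n_f, n_h)
H0 = F @ W_proj + b_proj

# 2. Positional encoding from coordinates relative to the center point k = 0.
(W1, b1), (W2, b2) = dense(3, d_pe), dense(d_pe, n_h)
delta = X - X[:, :1, :]
PE = np.maximum(np.maximum(delta @ W1 + b1, 0.0) @ W2 + b2, 0.0)
H1 = H0 + PE

# 3. Masked multi-head self-attention.
WQ, WK, WV, WO = (dense(n_h, n_h)[0] for _ in range(4))
Q, Kh, V = split_heads(H1 @ WQ), split_heads(H1 @ WK), split_heads(H1 @ WV)
logits = Q @ Kh.transpose(0, 1, 3, 2) / np.sqrt(d_k)        # (B, n_heads, K, K)
logits = np.where(m[:, None, None, :], logits, -np.inf)    # mask padded keys
A = softmax(logits)
mha = (A @ V).transpose(0, 2, 1, 3).reshape(B, K, n_h) @ WO

# 4. Residual connection and layer normalization.
H2 = layer_norm(H1 + mha)

# 5. Feed-forward network, second residual, second layer normalization.
(W3, b3), (W4, b4) = dense(n_h, 2 * n_h), dense(2 * n_h, n_h)
ffn = np.maximum(H2 @ W3 + b3, 0.0) @ W4 + b4
Z = layer_norm(H2 + ffn)           # block output, shape (B, K, n_h)
```

Padded keys end up with exactly zero attention weight, while padded query rows still produce values; those rows are ignored later by the masked pooling steps.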

SharedMLP Path

When operator="sharedmlp", the hidden representation is computed by a stack of \(L\) shared fully-connected layers applied independently to each of the \(K\) neighbors (weight-shared across the neighbor dimension). Let \(d_1, \ldots, d_L\) be the layer widths (shared_mlp_layers) and \(d_{\text{in}}\) the block input width, where \(d_{\text{in}} = n_f\) for block \(d=0\) and \(d_{\text{in}} = n_h\) for blocks \(d > 0\).

For each layer \(\ell = 1, \ldots, L\):

\[\mathbf{h}_k^{(\ell)} = \text{Dropout}\!\bigl( \text{ReLU}\!\bigl( \text{BN}\!\bigl( \mathbf{h}_k^{(\ell-1)} \, \mathbf{W}^{(\ell)} + \mathbf{b}^{(\ell)} \bigr) \bigr) \bigr), \quad \mathbf{W}^{(\ell)} \in \mathbb{R}^{d_{\ell-1} \times d_\ell}\]

where \(\mathbf{h}_k^{(0)} = \mathbf{f}_k\) (the input feature vector for neighbor \(k\), with \(d_0 = d_{\text{in}}\)), and BN denotes batch normalization (enabled by default for SharedMLP, disabled for Transformer). A final linear projection maps to the hidden dimension:

\[\mathbf{p}_k = \mathbf{h}_k^{(L)} \, \mathbf{W}_{\text{proj}} + \mathbf{b}_{\text{proj}}, \quad \mathbf{W}_{\text{proj}} \in \mathbb{R}^{d_L \times n_h}\]

Plain SharedMLP (shared_mlp_residual=False). The output is the projection directly:

\[\mathbf{z}_k = \mathbf{p}_k\]

Residual SharedMLP (shared_mlp_residual=True, default). A parallel linear projection maps the block’s input features to the hidden dimension, ensuring that bounded signals such as RF pseudoprobabilities in \([0, 1]\) are directly accessible to the output head without passing through the nonlinear ReLU layers:

\[\mathbf{s}_k = \mathbf{f}_k \, \mathbf{W}_{\text{skip}} + \mathbf{b}_{\text{skip}}, \quad \mathbf{W}_{\text{skip}} \in \mathbb{R}^{d_{\text{in}} \times n_h}\]

The residual is added to the SharedMLP output:

\[\mathbf{z}_k = \mathbf{p}_k + \mathbf{s}_k\]

The block output is \(\mathbf{Z} \in \mathbb{R}^{B \times K \times n_h}\).
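The plain and residual variants differ only by the parallel skip projection, which the following NumPy sketch makes explicit. It is an illustration under assumptions rather than the actual layer code: weights are random, dropout is omitted, and batch normalization uses training-mode statistics over all (sample, neighbor) positions with \(\gamma = 1\), \(\beta = 0\). The helper names (dense, batch_norm_train, shared_mlp_block) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, n_f, n_h = 2, 5, 7, 8
widths = [16, 16]                  # stands in for shared_mlp_layers

def dense(d_in, d_out):
    """Random (W, b) pair standing in for a trained linear layer."""
    return rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out)

def batch_norm_train(x, eps=1e-3):
    """Per-channel normalization over all (sample, neighbor) positions (assumed)."""
    flat = x.reshape(-1, x.shape[-1])
    return (x - flat.mean(0)) / np.sqrt(flat.var(0) + eps)

def shared_mlp_block(F, layers, proj, skip=None):
    h = F
    for W, b in layers:            # same weights for every neighbor k
        h = np.maximum(batch_norm_train(h @ W + b), 0.0)
    W_p, b_p = proj
    z = h @ W_p + b_p              # p_k: projection to the hidden dimension
    if skip is not None:           # residual variant: z_k = p_k + s_k
        W_s, b_s = skip
        z = z + (F @ W_s + b_s)
    return z

F = rng.uniform(0.0, 1.0, (B, K, n_f))     # e.g. RF pseudoprobabilities in [0, 1]
dims = [n_f] + widths
layers = [dense(dims[i], dims[i + 1]) for i in range(len(widths))]
proj = dense(widths[-1], n_h)
skip = dense(n_f, n_h)

Z_plain = shared_mlp_block(F, layers, proj)          # shared_mlp_residual=False
Z_res = shared_mlp_block(F, layers, proj, skip)      # shared_mlp_residual=True
```

The difference Z_res - Z_plain is exactly the linear map of the input features, which is the sense in which the bounded input signals reach the output without passing through the ReLU layers.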

Grouped Vector Attention Path

When operator="gva" (or "grouped_vector_attention"), the hidden representation is computed using the Grouped Vector Attention mechanism from Point Transformer v2. Let \(G\) be the number of groups (mapped from num_heads) and \(d_g = n_h / G\) the per-group dimensionality.

1. Feature and coordinate projections. Three linear layers project the input features to query, key, and value representations, each of shape \((B, K, G, d_g)\):

\[\mathbf{Q} = \text{reshape}(\mathbf{F} \, \mathbf{W}^Q, \; (B, K, G, d_g))\]
\[\mathbf{K} = \text{reshape}(\mathbf{F} \, \mathbf{W}^K, \; (B, K, G, d_g))\]
\[\mathbf{V} = \text{reshape}(\mathbf{F} \, \mathbf{W}^V, \; (B, K, G, d_g))\]

2. Positional encoding. The shared PE is reshaped to \((B, K, G, d_g)\) for element-wise operations.

3. Relation-based weight vectors. For each neighbor \(k\), the attention weight vector is computed from the relative difference between the query (broadcast from the center point \(k=0\)) and each key, modulated by the positional encoding:

\[\mathbf{r}_{b,k} = \mathbf{Q}_{b,0} - \mathbf{K}_{b,k} + \mathbf{PE}_{b,k}\]
\[\mathbf{w}_{b,k} = \text{MLP}(\mathbf{r}_{b,k}), \quad \mathbf{w}_{b,k} \in \mathbb{R}^{G \times d_g}\]

The weight MLP is a configurable multi-layer perceptron with BN + ReLU activations (gva_weight_mlp_units). Unlike standard dot-product attention, where each head produces a single scalar weight per query-key pair, grouped vector attention produces a vector of \(d_g\) weights per group, so the channels within a group can be weighted individually. This makes it more expressive than scalar attention.

4. Scalar attention from weight vectors. The per-group weight vectors are reduced to scalar attention scores via channel summation, then normalized with softmax:

\[\alpha_{b,k,g} = \text{softmax}_k\!\left( \sum_{j=1}^{d_g} w_{b,k,g,j} \right)\]

Invalid neighbors (where \(m_{bk} = 0\)) receive \(-\infty\) before softmax to zero out their contribution.

5. Value modulation and aggregation. The value vectors are modulated by the positional encoding and weighted by the element-wise weight vectors:

\[\mathbf{o}_{b,g} = \sum_{k} \alpha_{b,k,g} \cdot \bigl( \mathbf{w}_{b,k,g} \odot (\mathbf{V}_{b,k,g} + \mathbf{PE}_{b,k,g}) \bigr)\]

The groups are concatenated and linearly projected:

\[\text{GVA}(\mathbf{F}) = \text{reshape}(\mathbf{o}, \; (B, n_h)) \, \mathbf{W}^O, \quad \mathbf{W}^O \in \mathbb{R}^{n_h \times n_h}\]

6. Broadcast and residual. The projected output \(\text{GVA}(\mathbf{F}) \in \mathbb{R}^{B \times n_h}\) is broadcast (repeated) to \((B, K, n_h)\) so that every neighbor position shares the same attended representation. A residual connection from the block input \(\mathbf{H}^{(\text{in})}\) and layer normalization are applied:

\[\mathbf{Z} = \text{LayerNorm}\!\bigl( \mathbf{H}^{(\text{in})} + \text{broadcast}_K\!\bigl(\text{GVA}(\mathbf{F})\bigr) \bigr)\]

The broadcast ensures compatibility with multi-depth stacking (which expects \((B, K, n_h)\) for masked mean pooling and context injection) and the output head (which applies masked max pooling over K).

Note that unlike the Transformer path, GVA has no separate FFN block — the weight MLP serves that role.

The block output is \(\mathbf{Z} \in \mathbb{R}^{B \times K \times n_h}\).
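The six GVA steps can be followed shape by shape in the NumPy sketch below. Several details are assumptions made only for illustration: the block input is already \(n_h\) wide, the weight MLP is a single hidden ReLU layer applied to each \(d_g\)-dimensional group vector (BN omitted), the PE is a random stand-in, and dropout is skipped. Names such as lin, to_groups, and rel are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, n_h, G = 2, 5, 8, 2
d_g = n_h // G

def lin(d_in, d_out):
    """Random weight matrix standing in for a trained linear layer."""
    return rng.normal(0.0, 0.1, (d_in, d_out))

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def to_groups(t):
    """(B, K, n_h) -> (B, K, G, d_g)."""
    return t.reshape(B, K, G, d_g)

F = rng.normal(size=(B, K, n_h))   # block input (assumed n_h wide)
PE = rng.normal(size=(B, K, n_h))  # stand-in for the shared positional encoding
m = np.ones((B, K), dtype=bool)
m[1, 3:] = False                   # sample 1: last two neighbors are padding

# 1-2. Query/key/value projections and PE, all reshaped into groups.
WQ, WK, WV, WO = (lin(n_h, n_h) for _ in range(4))
Q, Kg, V, PEg = (to_groups(t) for t in (F @ WQ, F @ WK, F @ WV, PE))

# 3. Relation vectors (center query minus each key, plus PE) and weight MLP.
rel = Q[:, :1] - Kg + PEg                          # (B, K, G, d_g)
Wa, Wb = lin(d_g, d_g), lin(d_g, d_g)
w = np.maximum(rel @ Wa, 0.0) @ Wb                 # (B, K, G, d_g)

# 4. Scalar scores by channel summation, masked softmax over neighbors k.
scores = np.where(m[:, :, None], w.sum(-1), -np.inf)   # (B, K, G)
e = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha = e / e.sum(axis=1, keepdims=True)

# 5. Value modulation, aggregation over k, and output projection.
o = (alpha[..., None] * w * (V + PEg)).sum(axis=1)     # (B, G, d_g)
gva = o.reshape(B, n_h) @ WO                           # (B, n_h)

# 6. Broadcast over K, residual from the block input, layer normalization.
Z = layer_norm(F + gva[:, None, :])                    # (B, K, n_h)
```

Because the query comes only from the center point, the attended vector gva has no neighbor axis; the broadcast in step 6 is what restores the \((B, K, n_h)\) shape.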

Multi-Depth Stacking (hidden_depth > 1)

When hidden_depth \(= D > 1\), the architecture stacks \(D\) blocks of the selected operator. Only block \(d=0\) maps from \(n_f\) to \(n_h\); blocks \(d > 0\) operate in \(n_h\) space.

Between consecutive blocks, a global context vector is computed by masked mean pooling over the \(K\) neighbors and broadcast-summed back, so that each neighbor’s representation is enriched with global neighborhood information:

\[\mathbf{g}^{(d)} = \frac{1}{\epsilon + \sum_k m_{bk}} \sum_{k: m_{bk}=1} \mathbf{z}_{bk}^{(d)}, \quad \mathbf{g}^{(d)} \in \mathbb{R}^{B \times 1 \times n_h}\]

where \(\epsilon\) is a small constant (\(10^{-8}\)) for numerical stability.

\[\mathbf{I}^{(d+1)} = \mathbf{Z}^{(d)} + \mathbf{1}_K \, \mathbf{g}^{(d)}\]

where \(\mathbf{1}_K \in \mathbb{R}^{K \times 1}\) is a column vector of ones that broadcasts \(\mathbf{g}^{(d)}\) from \((B, 1, n_h)\) to \((B, K, n_h)\).

For the Transformer and GVA paths, positional encoding is re-injected at each depth (shared PE MLP weights):

\[\mathbf{I}^{(d+1)} \mathrel{+}= \mathbf{PE}\]

Block \(d+1\) then processes \(\mathbf{I}^{(d+1)}\) to produce \(\mathbf{Z}^{(d+1)}\).
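As a concrete illustration of the context injection, the following NumPy sketch implements the masked mean pool and the broadcast-sum on a tiny hand-checkable tensor. The function names are hypothetical and the optional PE argument mirrors the re-injection described for the Transformer and GVA paths.

```python
import numpy as np

def masked_mean_pool(Z, m, eps=1e-8):
    """(B, K, n_h), (B, K) -> (B, 1, n_h): mean over valid neighbors only."""
    mf = m.astype(Z.dtype)[:, :, None]                 # (B, K, 1)
    return (Z * mf).sum(1, keepdims=True) / (eps + mf.sum(1, keepdims=True))

def inject_context(Z, m, PE=None):
    """Input of the next block: Z + broadcast(g), plus PE when given."""
    I = Z + masked_mean_pool(Z, m)                     # broadcast over K
    if PE is not None:
        I = I + PE
    return I

# One sample, three neighbors, four channels; the last neighbor is padding.
Z = np.arange(12, dtype=float).reshape(1, 3, 4)
m = np.array([[True, True, False]])

g = masked_mean_pool(Z, m)     # mean of rows [0,1,2,3] and [4,5,6,7]
I_next = inject_context(Z, m)
```

Here g is approximately [2, 3, 4, 5]: the padded row [8, 9, 10, 11] does not contribute, and the \(\epsilon\) term only guards against division by zero when a sample has no valid neighbor.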

Skip links (hidden_skip_links=True): all block outputs are summed and normalized before the output head:

\[\mathbf{Z}_{\text{final}} = \text{LayerNorm}\!\Bigl( \sum_{d=0}^{D-1} \mathbf{Z}^{(d)} \Bigr)\]

When hidden_skip_links=False, only the last block’s output is used: \(\mathbf{Z}_{\text{final}} = \mathbf{Z}^{(D-1)}\).

Shared weights (hidden_shared_weights=True): blocks \(d > 0\) share the same parameters, reducing the model size. Default is False (independent weights per depth).
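The stacking loop itself is short. The sketch below is a simplified stand-in, assuming each block is a per-point linear map with ReLU (the hypothetical make_block) instead of a real Transformer, SharedMLP, or GVA block, and omitting PE re-injection. It shows where the context injection and the skip-link sum enter.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def masked_mean(Z, m, eps=1e-8):
    mf = m[:, :, None].astype(float)
    return (Z * mf).sum(1, keepdims=True) / (eps + mf.sum(1, keepdims=True))

def forward(F, m, blocks, skip_links=True):
    """blocks[0] maps n_f -> n_h; blocks[1:] operate in n_h space."""
    Z = blocks[0](F)
    outputs = [Z]
    for block in blocks[1:]:
        Z = block(Z + masked_mean(Z, m))   # context injection, then next block
        outputs.append(Z)
    if skip_links:                         # sum all depths, then normalize
        return layer_norm(np.sum(outputs, axis=0))
    return Z                               # only the last block's output

rng = np.random.default_rng(0)
B, K, n_f, n_h, D = 2, 4, 6, 8, 3

def make_block(d_in, d_out):
    """Toy block (assumption): per-point linear layer followed by ReLU."""
    W = rng.normal(0.0, 0.3, (d_in, d_out))
    return lambda x: np.maximum(x @ W, 0.0)

blocks = [make_block(n_f, n_h)] + [make_block(n_h, n_h) for _ in range(D - 1)]
F = rng.normal(size=(B, K, n_f))
m = np.ones((B, K), dtype=bool)

Z_skip = forward(F, m, blocks, skip_links=True)
Z_last = forward(F, m, blocks, skip_links=False)
```

Passing the same callable several times in blocks[1:] would correspond to hidden_shared_weights=True in this toy setting.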

Gradient Analysis for Skip Links

Without skip links. The gradient from the loss \(\mathcal{L}\) to block \(d\) must traverse all intermediate blocks:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{(d)}} = \frac{\partial \mathcal{L}} {\partial \mathbf{Z}^{(D-1)}} \prod_{j=d}^{D-2} \frac{\partial \mathbf{Z}^{(j+1)}} {\partial \mathbf{Z}^{(j)}}\]

The product of Jacobians can cause gradient vanishing or explosion as \(D\) grows, since each factor depends on the nonlinear transformations inside block \(j+1\) and on the broadcast-sum context injection.

With skip links. The additive aggregation \(\mathbf{S} = \sum_{d=0}^{D-1} \mathbf{Z}^{(d)}\) gives every block a direct, unit-Jacobian path to the output:

\[\frac{\partial \mathbf{S}}{\partial \mathbf{Z}^{(d)}} = \mathbf{I} \quad \Longrightarrow \quad \frac{\partial \mathcal{L}}{\partial \mathbf{Z}^{(d)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{S}} + \text{(indirect terms)}\]

The direct term \(\partial \mathcal{L} / \partial \mathbf{S}\) does not depend on \(D\), so it cannot vanish because of depth. This is the same mechanism that stabilizes ResNet training: every block receives this direct gradient term regardless of its position in the stack.

The indirect terms arise because \(\mathbf{Z}^{(d)}\) also feeds the subsequent blocks. It enters \(\mathbf{I}^{(d+1)}\) both directly (the additive \(\mathbf{Z}^{(d)}\) term) and through the global context injection \(\mathbf{Z}^{(d)} \to \mathbf{g}^{(d)}\) (mean pool) \(\to \mathbf{I}^{(d+1)}\) (broadcast-sum); \(\mathbf{I}^{(d+1)}\) then determines \(\mathbf{Z}^{(d+1)}\), and so on recursively through all deeper blocks. These indirect paths compound through the chain of Jacobians \(\prod_{j} \partial \mathbf{Z}^{(j+1)} / \partial \mathbf{Z}^{(j)}\), so they can shrink or grow with depth just as in the case without skip links. The context-injection component is additionally attenuated, because the mean-pool bottleneck (\(K \to 1\)) scales the per-neighbor gradient signal by a factor of \(1/K\) (more precisely, one over the number of valid neighbors).
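The Jacobian of the context-injection step \(\mathbf{I}^{(d+1)} = \mathbf{Z}^{(d)} + \mathbf{1}_K \, \mathbf{g}^{(d)}\) can be checked numerically. The sketch below is a toy finite-difference check for a single channel of a single sample, assuming all \(K\) neighbors are valid; inject is a hypothetical helper. The Jacobian splits into an identity part (the additive \(\mathbf{Z}^{(d)}\) term) and a part with entries \(1/K\) (mean pool followed by broadcast).

```python
import numpy as np

def inject(z, m, eps=1e-8):
    """One channel of one sample: z is (K,), m is a (K,) 0/1 mask."""
    return z + (m * z).sum() / (eps + m.sum())

K = 4
m = np.ones(K)                          # all neighbors valid (assumption)
z = np.array([0.5, -1.0, 2.0, 0.0])

# Finite-difference Jacobian, J[k, k'] = d I_k / d z_k'.
h = 1e-6
J = np.stack(
    [(inject(z + h * e, m) - inject(z, m)) / h for e in np.eye(K)],
    axis=1,
)

direct = np.eye(K)                      # path through the additive Z term
context = np.ones((K, K)) / K           # path through mean pool + broadcast
```

Each entry of the context part is \(1/K\), which is the attenuation factor of the mean-pool path; the identity part is not attenuated.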

Impact of shared vs. independent weights. With independent weights, each block \(\text{Block}_d\) has its own parameters \(\boldsymbol{\theta}_d\). The gradients are:

\[\frac{\partial \mathcal{L}} {\partial \boldsymbol{\theta}_d} = \frac{\partial \mathcal{L}} {\partial \mathbf{Z}^{(d)}} \cdot \frac{\partial \mathbf{Z}^{(d)}} {\partial \boldsymbol{\theta}_d}\]

Each parameter set is updated based on the gradient from a single block output, allowing each depth to specialize.

With shared weights, blocks \(d = 1, \ldots, D-1\) all use the same parameters \(\boldsymbol{\theta}_{\text{shared}}\). The gradient accumulates contributions from every depth:

\[\frac{\partial \mathcal{L}} {\partial \boldsymbol{\theta}_{\text{shared}}} = \sum_{d=1}^{D-1} \frac{\partial \mathcal{L}} {\partial \mathbf{Z}^{(d)}} \cdot \frac{\partial \mathbf{Z}^{(d)}} {\partial \boldsymbol{\theta}_{\text{shared}}}\]

This sum accumulates the gradients of all depths into a single update: the shared parameters receive a richer but potentially conflicting signal, since different depths process inputs at different levels of global-context enrichment. The trade-off is reduced representational capacity (all depths perform the same learned function) in exchange for fewer parameters and stronger regularization.
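The accumulation over depths can be verified on a scalar toy model. The example below is only a sketch: two tanh "blocks" with one scalar parameter each, and a loss equal to the sum of both block outputs (mimicking the skip-link sum). It compares the gradient with respect to a shared parameter against the sum of the gradients obtained when each use of the parameter is perturbed separately.

```python
import numpy as np

def loss(theta_1, theta_2, z0=0.7):
    z1 = np.tanh(theta_1 * z0)     # toy block d = 1
    z2 = np.tanh(theta_2 * z1)     # toy block d = 2
    return z1 + z2                 # skip-link style sum of both outputs

def grad(f, x, h=1e-6):
    """Central finite difference of a scalar function."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

theta = 0.9
g_use_1 = grad(lambda t: loss(t, theta), theta)   # only block 1 uses t
g_use_2 = grad(lambda t: loss(theta, t), theta)   # only block 2 uses t
g_shared = grad(lambda t: loss(t, t), theta)      # both blocks share t
```

Up to finite-difference error, g_shared equals g_use_1 + g_use_2, which is the sum over depths in the shared-weights formula.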

Role of LayerNorm on Skip-Link Sum

When skip links are enabled, the magnitude of the sum \(\mathbf{S} = \sum_d \mathbf{Z}^{(d)}\) can grow up to linearly with \(D\). Without normalization, this would push the masked max-pool inputs to larger values, potentially saturating the softmax in the classification head and producing overconfident (near-one-hot) predictions even for uncertain samples.

The LayerNorm applied to \(\mathbf{S}\) normalizes each position’s feature vector across the \(n_h\) channels:

\[\hat{s}_{j} = \frac{s_{j} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad \mu = \frac{1}{n_h} \sum_{j=1}^{n_h} s_j, \quad \sigma^2 = \frac{1}{n_h} \sum_{j=1}^{n_h} (s_j - \mu)^2\]
\[\text{LayerNorm}(\mathbf{s}) = \gamma \odot \hat{\mathbf{s}} + \beta\]

This keeps the per-position mean and variance of the aggregated representation (before the learned scale \(\gamma\) and shift \(\beta\)) independent of \(D\), so the output head receives inputs on the same scale regardless of the hidden depth. Without this normalization, the classification head would have to adapt to a different input scale for every value of \(D\).
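A small NumPy check illustrates the scale invariance. It assumes the worst case in which all \(D\) block outputs are identical, so that the sum is exactly \(D\) times one output; layer_norm is a hypothetical helper with \(\gamma = 1\) and \(\beta = 0\).

```python
import numpy as np

def layer_norm(s, gamma=1.0, beta=0.0, eps=1e-6):
    """LayerNorm over the channel axis (last axis)."""
    mu = s.mean(-1, keepdims=True)
    var = s.var(-1, keepdims=True)
    return gamma * (s - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
z = rng.normal(size=(2, 5, 8))          # one block output, (B, K, n_h)

# Worst case (assumption): D identical block outputs, so S = D * z.
S_2, S_8 = 2 * z, 8 * z
ratio_raw = np.linalg.norm(S_8) / np.linalg.norm(S_2)
max_diff = np.abs(layer_norm(S_8) - layer_norm(S_2)).max()
```

The raw sums differ by a factor of 4 in norm, while the normalized tensors agree up to the small effect of \(\epsilon\).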

Batch Normalization

When enabled (default for SharedMLP), batch normalization is applied before the activation function at each layer. For each channel, let \(\{z_i\}_{i=1}^{B'}\) be the activations of a mini-batch, where \(B'\) is the number of values over which the statistics are computed:

\[\hat{z}_i = \frac{z_i - \mu_{\mathcal{B}}} {\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \quad \mu_{\mathcal{B}} = \frac{1}{B'} \sum_{i=1}^{B'} z_i, \quad \sigma_{\mathcal{B}}^2 = \frac{1}{B'} \sum_{i=1}^{B'} (z_i - \mu_{\mathcal{B}})^2\]
\[\text{BN}(z_i) = \gamma \, \hat{z}_i + \beta\]

where \(\gamma, \beta\) are learnable scale and shift parameters, and \(\epsilon\) is a small constant for numerical stability. During inference, running estimates of \(\mu\) and \(\sigma^2\) (computed via exponential moving average during training) are used instead of per-batch statistics.
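The difference between training and inference behavior can be sketched with a small NumPy class. This is an illustrative re-implementation, not the Keras layer: the class name, the momentum and \(\epsilon\) values, and the choice to pool statistics over all leading axes are assumptions.

```python
import numpy as np

class ToyBatchNorm:
    """Per-channel batch normalization over all leading axes (assumed)."""

    def __init__(self, n_channels, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(n_channels)
        self.beta = np.zeros(n_channels)
        self.running_mean = np.zeros(n_channels)
        self.running_var = np.ones(n_channels)
        self.momentum, self.eps = momentum, eps

    def __call__(self, z, training):
        if training:
            flat = z.reshape(-1, z.shape[-1])
            mu, var = flat.mean(0), flat.var(0)
            # Exponential moving average of the batch statistics.
            a = self.momentum
            self.running_mean = a * self.running_mean + (1.0 - a) * mu
            self.running_var = a * self.running_var + (1.0 - a) * var
        else:
            mu, var = self.running_mean, self.running_var
        return self.gamma * (z - mu) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
z = rng.normal(5.0, 2.0, size=(4, 6, 3))     # (B, K, channels)

bn = ToyBatchNorm(3)
out_train = bn(z, training=True)      # uses batch statistics, updates the EMA
out_infer = bn(z, training=False)     # uses the running estimates
```

After a single training step the running estimates are still close to their initial values, so out_infer differs markedly from out_train; the two only converge once the moving averages have tracked the data for many batches.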

Output Head

The output head is shared by all three operator paths. Let \(\mathbf{Z} \in \mathbb{R}^{B \times K \times n_h}\) be the hidden representation from the selected operator.

1. Masked max pooling. Aggregate over the \(K\) neighbors, ignoring padded positions:

\[\mathbf{g}_b = \max_{k \,\mid\, m_{bk} = 1} \mathbf{z}_{bk}, \quad \mathbf{g} \in \mathbb{R}^{B \times n_h}\]

2. Output dropout (optional). When output_dropout > 0, dropout is applied to the pooled representation before the classification layer.

3. Classification. A dense layer maps to the output space:

  • Multiclass (\(n_c > 2\)):

    \[\hat{\mathbf{y}}_b = \text{softmax}\!\bigl( \mathbf{g}_b \, \mathbf{W}_{\text{out}} + \mathbf{b}_{\text{out}} \bigr), \quad \mathbf{W}_{\text{out}} \in \mathbb{R}^{n_h \times n_c}\]
  • Binary (\(n_c = 2\)):

    \[\hat{p}_b = \sigma\!\bigl( \mathbf{g}_b \, \mathbf{w}_{\text{out}} + b_{\text{out}} \bigr), \quad \mathbf{w}_{\text{out}} \in \mathbb{R}^{n_h \times 1}\]

    where \(\sigma\) is the sigmoid function. The full probability vector \([1 - \hat{p}_b,\; \hat{p}_b]\) is reconstructed for export.
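The pooling and classification steps can be sketched in NumPy as follows. The functions are hypothetical stand-ins for the Keras layers, output dropout is omitted, and the binary branch is selected here by the width of the weight matrix purely for illustration.

```python
import numpy as np

def masked_max_pool(Z, m):
    """(B, K, n_h), (B, K) -> (B, n_h); padded neighbors never win the max."""
    return np.where(m[:, :, None], Z, -np.inf).max(axis=1)

def output_head(Z, m, W_out, b_out):
    g = masked_max_pool(Z, m)
    logits = g @ W_out + b_out
    if W_out.shape[1] == 1:                       # binary: sigmoid on one logit
        p = 1.0 / (1.0 + np.exp(-logits[:, 0]))
        return np.stack([1.0 - p, p], axis=-1)    # rebuild [1 - p, p]
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)           # multiclass: softmax

# One sample, three neighbors, two channels; the last neighbor is padding.
Z = np.array([[[1.0, 2.0], [5.0, 0.0], [9.0, 9.0]]])
m = np.array([[True, True, False]])
g = masked_max_pool(Z, m)                         # [[5., 2.]]

rng = np.random.default_rng(0)
y_multi = output_head(Z, m, rng.normal(size=(2, 3)), np.zeros(3))
y_bin = output_head(Z, m, rng.normal(size=(2, 1)), np.zeros(1))
```

The padded row [9, 9] holds the largest values but is excluded from the pool, so g is [5, 2].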

See Architecture.

Variables:
  • num_classes (int) – Number of output classes.

  • n_f (int) – Number of input features per neighbor.

  • K (int) – Number of neighbors (including the point itself).

  • n_h (int) – Hidden dimensionality.

  • operator (str) – The operator type: "transformer", "sharedmlp", or "gva" (grouped vector attention; "grouped_vector_attention" is accepted as an alias).

  • num_heads (int) – Number of attention heads/groups.

  • pos_enc_units (int) – PE intermediate dimensionality.

  • transformer_dropout (float) – Dropout rate for Transformer.

  • gva_dropout (float) – Dropout rate for GVA output projection.

  • gva_weight_mlp_units (list[int] or None) – Layer widths for GVA weight MLP.

  • shared_mlp_layers (list[int]) – Layer widths for SharedMLP stack.

  • shared_mlp_dropout (float) – Dropout rate for SharedMLP.

  • shared_mlp_residual (bool) – Enable residual skip in SharedMLP.

  • batch_normalization (bool or None) – Enable BN. If None, it is auto-selected: enabled for SharedMLP, disabled for Transformer.

  • hidden_depth (int) – Number of stacked blocks (>= 1).

  • hidden_skip_links (bool) – Additive skip links across depths.

  • hidden_shared_weights (bool) – Share weights for blocks d > 0.

  • output_dropout (float) – Dropout rate before classification head.

__init__(**kwargs)

Initialize a TransfOctoRFPwiseClassif.

Parameters:

kwargs – Architecture configuration.

build_input(**kwargs)

Build the three input layers.

Returns:

List of [features, coordinates, mask].

build_hidden(inputs, **kwargs)

Build the hidden layers with multi-depth stacking.

Each depth d produces a tensor of shape (B, K, n_h_list[d]), where n_h_list holds the hidden dimensionality of each depth. When dimensions vary across depths and skip links are enabled, each block output is projected to n_h_list[-1] before summation.

Parameters:

inputs – [features, coordinates, mask].

Returns:

Hidden representation (B, K, n_h).

build_pe_dim(coords, depth, dim)

Build positional encoding with an explicit output dimension.

Parameters:
  • coords – Coordinate tensor (B, K, 3).

  • depth – Depth index for layer naming.

  • dim – Output dimensionality.

Returns:

PE tensor (B, K, dim).

build_pe(coords, depth=0)

Build positional encoding for a specific depth.

Parameters:
  • coords – Coordinate tensor (B, K, 3).

  • depth – Depth index (determines output dim and layer naming).

Returns:

PE tensor (B, K, n_h_list[depth]).

build_block_d0(features, coords, mask, pe)

Build block d=0 (n_f -> n_h_list[0]).

Returns:

Z_0 tensor (B, K, n_h_list[0]).

build_deep_blocks()

Build block(s) for depths d > 0.

Each block maps n_h_list[d-1] -> n_h_list[d]. Returns a list of callables: with shared weights the list has length 1, otherwise it has length D-1.

masked_mean_pool(z, mask_float, depth)

Masked mean pool: (B, K, n_h) -> (B, 1, n_h).

build_transformer_block(x, mask, pe, depth, input_dim, output_dim=None)

Build one Transformer block.

Parameters:
  • x – Input tensor (B, K, input_dim).

  • mask – Boolean mask (B, K).

  • pe – Positional encoding (B, K, output_dim).

  • depth – Depth index for layer naming.

  • input_dim – Input feature dimension.

  • output_dim – Output dimension (default: n_h).

Returns:

Output tensor (B, K, output_dim).

build_attention_mask(mask, depth)

Build boolean attention mask: (B, K) -> (B, 1, K).

Keras 3 MultiHeadAttention expects attention_mask of shape (B, T, S). For self-attention with a per-key mask broadcast across all queries, (B, 1, K) is the canonical shape. Keras auto-expands the heads dimension internally.

A boolean mask applies where(mask, logits, -inf), which correctly suppresses padded positions. A 0/1 float mask would instead be added to the logits, which shifts them by at most 1 and does not suppress invalid positions.
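The difference between the two mask semantics is easy to reproduce outside Keras. The NumPy sketch below only emulates the two behaviors described above (select with -inf versus add); it does not call MultiHeadAttention, and uniform logits are used so the result can be checked by hand.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

m = np.array([[True, True, False]])      # (B, K): last neighbor is padding
attn_mask = m[:, None, :]                # (B, 1, K): one key mask for all queries
logits = np.zeros((1, 3, 3))             # (B, T, S) attention logits

# Boolean semantics: masked keys get -inf and therefore zero weight.
w_bool = softmax(np.where(attn_mask, logits, -np.inf))

# Additive 0/1 float semantics: valid keys gain +1, padded keys keep weight.
w_float = softmax(logits + attn_mask.astype(float))
```

With the boolean mask each query attends with weights [0.5, 0.5, 0]; with the additive 0/1 mask the padded key still receives about 0.155 of the attention.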

build_gva_block(features, mask, pe, depth, input_dim, output_dim=None)

Build one Grouped Vector Attention block (Point Transformer v2).

Parameters:
  • features – Input tensor (B, K, input_dim).

  • mask – Boolean mask (B, K).

  • pe – Positional encoding (B, K, output_dim).

  • depth – Depth index for layer naming.

  • input_dim – Input feature dimension.

  • output_dim – Output dimension (default: n_h).

Returns:

Output tensor (B, K, output_dim).

build_gva_mix(x_proj, agg_repeated, depth, n_h_d)

Build the in-block per-point mixer.

Replaces the trivial RepeatVector+Add path with a per-point feed-forward operator over the concatenation of the projected per-point features and the broadcast global GVA descriptor. The output is residual-added to x_proj outside this helper and then LayerNorm-stabilized.

Parameters:
  • x_proj – Projected per-point features (B, K, n_h_d).

  • agg_repeated – Broadcast global descriptor (B, K, n_h_d).

  • depth – Block depth index for layer naming.

  • n_h_d – Block hidden dimensionality.

Returns:

Mixed per-point tensor (B, K, n_h_d).

Return type:

keras tensor

build_shared_mlp_block(features, depth, input_dim, output_dim=None)

Build one SharedMLP block with optional residual.

Parameters:
  • features – Input tensor (B, K, input_dim).

  • depth – Depth index for layer naming.

  • input_dim – Input feature dimension.

  • output_dim – Output dimension (default: n_h).

Returns:

Output tensor (B, K, output_dim).

build_output(hidden, **kwargs)

Build the output head.

When point_wise_labels is False (default): masked max pool → Dense → (B, n_c).

When point_wise_labels is True: per-point Dense → masked zero-fill → (B, K, n_c). No pooling is applied; each point in the receptive field gets its own classification output.

Parameters:

hidden – Hidden representation (B, K, n_h).

Returns:

Output tensor (B, n_c) or (B, K, n_c).
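For the point-wise variant, a minimal NumPy sketch is given below. It assumes a multiclass softmax applied per point and a zero-fill applied after the activation; the actual order of activation and zero-fill in the implementation is not stated here, and pointwise_head is a hypothetical name.

```python
import numpy as np

def pointwise_head(Z, m, W_out, b_out):
    """(B, K, n_h) -> (B, K, n_c) with padded positions set to zero."""
    logits = Z @ W_out + b_out                    # per-point dense layer
    e = np.exp(logits - logits.max(-1, keepdims=True))
    probs = e / e.sum(-1, keepdims=True)
    return probs * m[:, :, None]                  # masked zero-fill

rng = np.random.default_rng(0)
B, K, n_h, n_c = 2, 4, 8, 3
Z = rng.normal(size=(B, K, n_h))
m = np.ones((B, K), dtype=bool)
m[0, 2:] = False                                  # sample 0: two padded points

Y = pointwise_head(Z, m, rng.normal(size=(n_h, n_c)), np.zeros(n_c))
```

No pooling over K takes place, so every valid point keeps its own probability vector while padded points are all zeros.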

build_output_mix(hidden, mask_float)

Build the pre-pool per-point mixer in the output head.

Uses the fused TORFOutputMixLayer which owns its Dense sub-layers as attributes. The fused layout replaced an earlier 14-Keras-layer unfused chain that triggered a glibc heap-corruption SIGABRT at process exit when instantiated at v2 scale (DALES, leaf_voxel_length=1.0). Behavior is identical to the unfused version.

Parameters:
  • hidden – Final hidden representation (B, K, n_h).

  • mask_float – Float mask (B, K).

Returns:

Mixed per-point tensor (B, K, n_h).

Return type:

keras tensor

class src.model.deeplearn.arch.transf_octorf_pwise_classif.TransformerBlock(arch, depth, input_dim=None, output_dim=None)
Author:

Alberto M. Esmoris Pena

Callable wrapper for a Transformer block at depth d > 0. Builds Keras layers once at construction and reuses them on every call, enabling true weight sharing when the same instance is called for multiple depths.
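The build-once, call-many pattern is what makes weight sharing work: calling the same instance at two depths reuses the same parameters, whereas two instances own separate ones. The toy class below illustrates the idea with NumPy; it is a hypothetical stand-in that copies only the call signature, ignores the mask, and is not the actual TransformerBlock.

```python
import numpy as np

class ToyBlock:
    """Hypothetical stand-in: parameters are created once, in __init__."""

    def __init__(self, dim, seed):
        self.W = np.random.default_rng(seed).normal(0.0, 0.1, (dim, dim))

    def __call__(self, x, mask, pe=None):
        # The toy ignores the mask; pe is added when provided.
        if pe is not None:
            x = x + pe
        return np.maximum(x @ self.W, 0.0)

shared = ToyBlock(dim=4, seed=0)
blocks_shared = [shared, shared]                      # same instance at both depths
blocks_indep = [ToyBlock(4, seed=1), ToyBlock(4, seed=2)]

x = np.ones((1, 3, 4))
mask = np.ones((1, 3), dtype=bool)
y1 = blocks_shared[0](x, mask)
y2 = blocks_shared[1](x, mask)                        # identical weights, same output
```

Creating the layers inside __call__ instead would allocate fresh parameters on every call and silently break the sharing.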

__init__(arch, depth, input_dim=None, output_dim=None)

Pre-build all Keras layers for this block.

__call__(x, mask, pe=None)

Apply the pre-built Transformer layers.

class src.model.deeplearn.arch.transf_octorf_pwise_classif.SharedMLPBlock(arch, depth, input_dim=None, output_dim=None)
Author:

Alberto M. Esmoris Pena

Callable wrapper for a SharedMLP block at depth d > 0. Builds Keras layers once at construction and reuses them on every call, enabling true weight sharing.

__init__(arch, depth, input_dim=None, output_dim=None)

Pre-build all Keras layers for this block.

__call__(x, mask, pe=None)

Apply the pre-built SharedMLP layers.

Parameters:
  • x – Input (B, K, n_h).

  • mask – Boolean mask (B, K).

  • pe – Positional encoding (ignored by SharedMLP).

Returns:

Output (B, K, n_h).

class src.model.deeplearn.arch.transf_octorf_pwise_classif.GVABlock(arch, depth, input_dim=None, output_dim=None)
Author:

Alberto M. Esmoris Pena

Callable wrapper for a Grouped Vector Attention block at depth d > 0. Builds Keras layers once at construction and reuses them on every call, enabling true weight sharing when the same instance is called for multiple depths.

__init__(arch, depth, input_dim=None, output_dim=None)

Pre-build all Keras layers for this block.

__call__(x, mask, pe=None)

Apply the pre-built GVA layers.

Parameters:
  • x – Input (B, K, n_h).

  • mask – Boolean mask (B, K).

  • pe – Positional encoding (B, K, n_h).

Returns:

Output (B, K, n_h).