
Spatio-Temporal Model Architecture

Model Architecture

The model takes daily SST (or similar) data in video format, x ∈ ℝ^{B × 1 × T × H × W}, together with a daily_mask indicating missing pixels and a land_mask_patch indicating land regions in the output. The model performs the following tasks:

  • Combines video encoder, temporal attention, spatial transformer, and decoder
  • Encodes 3D data (space, time) into spatio-temporal patches
  • Aggregates temporal information per spatial patch
  • Mixes spatial features across patches
  • Decodes back to original spatial resolution
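The inputs above can be sketched as tensors. This is an illustrative example only: the batch size, sequence length, and grid size are assumptions, and the zero-fill of missing pixels is one plausible pre-encoding step, not necessarily what the model does internally.

```python
import numpy as np

# Illustrative shapes (assumed values, not from the model config)
B, C, T, H, W = 2, 1, 30, 64, 64  # batch, channel, days, height, width

x = np.random.rand(B, C, T, H, W).astype(np.float32)  # SST video
daily_mask = np.random.rand(B, C, T, H, W) > 0.1      # True where a pixel is observed
land_mask_patch = np.zeros((H, W), dtype=bool)        # True over land in the output grid
land_mask_patch[:8, :] = True                         # e.g. a land strip along the top

# One common choice: zero-fill missing pixels before encoding
x_filled = np.where(daily_mask, x, 0.0)
print(x_filled.shape)  # (2, 1, 30, 64, 64)
```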

The architecture consists of the following steps:

# 1. Patch embedding:
X (VideoEncoder) ---------> X_patch

# 2. Add temporal encoding +
# 3. Temporal aggregation:
X_patch + PE (TemporalAttentionAggregator) ---------> X_temp_agg

# 4. Add spatial encoding +
# 5. Spatial transformer:
X_temp_agg + PE (SpatialTransformer) ---------> X_mixed

# 6. Decode to original resolution:
X_mixed (MonthlyConvDecoder) ---------> Output
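The pipeline above can be sketched end to end with NumPy, tracking only the tensor shapes. Every module here is a deliberately simplified stand-in (random linear projections, mean pooling over time, a single softmax-attention layer), and the patch size P and embedding dimension D are assumptions; the real VideoEncoder, TemporalAttentionAggregator, SpatialTransformer, and MonthlyConvDecoder are learned networks.

```python
import numpy as np

B, C, T, H, W = 2, 1, 30, 64, 64
P = 16                      # spatial patch size (assumed)
Hp, Wp = H // P, W // P     # patch grid: 4 x 4
D = 32                      # embedding dimension (assumed)

rng = np.random.default_rng(0)
x = rng.standard_normal((B, C, T, H, W)).astype(np.float32)

# 1. Patch embedding (VideoEncoder stand-in): split each frame into
#    P x P patches and project to D dims with a fixed random matrix.
patches = x.reshape(B, C, T, Hp, P, Wp, P).transpose(0, 2, 3, 5, 1, 4, 6)
patches = patches.reshape(B, T, Hp * Wp, C * P * P)
W_embed = rng.standard_normal((C * P * P, D)).astype(np.float32) / np.sqrt(C * P * P)
x_patch = patches @ W_embed                                  # (B, T, N, D), N = Hp*Wp

# 2 + 3. Temporal encoding + aggregation (attention stand-in): add a
#    sinusoidal time encoding, then mean-pool over T per spatial patch.
t = np.arange(T)[:, None]
pe_t = np.sin(t / 10000 ** (np.arange(D)[None, :] / D)).astype(np.float32)
x_temp_agg = (x_patch + pe_t[None, :, None, :]).mean(axis=1)  # (B, N, D)

# 4 + 5. Spatial encoding + transformer (one softmax self-attention layer
#    stand-in): mix features across the N spatial patches.
x_temp_agg = x_temp_agg + rng.standard_normal((Hp * Wp, D)).astype(np.float32)
scores = x_temp_agg @ x_temp_agg.transpose(0, 2, 1) / np.sqrt(D)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn = attn / attn.sum(axis=-1, keepdims=True)
x_mixed = attn @ x_temp_agg                                   # (B, N, D)

# 6. Decode (MonthlyConvDecoder stand-in): project each patch back to
#    P x P pixels and reassemble the full-resolution map.
W_dec = rng.standard_normal((D, P * P)).astype(np.float32) / np.sqrt(D)
out = (x_mixed @ W_dec).reshape(B, Hp, Wp, P, P).transpose(0, 1, 3, 2, 4)
out = out.reshape(B, H, W)
print(out.shape)  # (2, 64, 64)
```

Note that the decoder emits a single H × W map per sample rather than one per day, matching the daily-in, monthly-out structure the module names suggest.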

Model architecture description

We explain the model architecture in more detail in the code and math description document.
