Spatio-Temporal Model Architecture

The model takes daily SST (or similar) data in video format, x ∈ ℝ^{B × 1 × T ×
H × W}, along with a daily_mask indicating missing pixels and a
land_mask_patch indicating land regions in the output. The model performs the
following tasks:
- Combines a video encoder, temporal attention, a spatial transformer, and a decoder
- Encodes the 3D (time × space) data into spatio-temporal patches
- Aggregates temporal information per spatial patch
- Mixes spatial features across patches
- Decodes back to the original spatial resolution
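For concreteness, the inputs can be sketched as follows. The sizes of B, T, H, W and the mask conventions used here (True = missing pixel, True = land) are illustrative assumptions, not taken from the model code:

```python
import numpy as np

# Hypothetical sizes: batch, days, height, width
B, T, H, W = 2, 30, 64, 64

x = np.random.rand(B, 1, T, H, W).astype(np.float32)  # daily SST video
daily_mask = np.random.rand(B, 1, T, H, W) < 0.1      # assumed: True where a pixel is missing
land_mask_patch = np.zeros((H, W), dtype=bool)        # assumed: True over land in the output
land_mask_patch[:, :8] = True                         # e.g. a coastal strip

# One plausible preprocessing step: zero out missing pixels before encoding
x_filled = np.where(daily_mask, 0.0, x)
```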
The architecture consists of the following steps:
# 1. Patch embedding:
X --(VideoEncoder)--> X_patch
# 2. Add temporal encoding +
# 3. Temporal aggregation:
X_patch + PE --(TemporalAttentionAggregator)--> X_temp_agg
# 4. Add spatial encoding +
# 5. Spatial transformer:
X_temp_agg + PE --(SpatialTransformer)--> X_mixed
# 6. Decode to original resolution:
X_mixed --(MonthlyConvDecoder)--> Output
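The shape flow through steps 1–6 can be sketched as below. The patch size P, embed dimension D, and the simple operations standing in for each component are assumptions; the real VideoEncoder, TemporalAttentionAggregator, SpatialTransformer, and MonthlyConvDecoder are learned modules described in the code and math description document.

```python
import numpy as np

B, T, H, W = 2, 30, 64, 64   # batch, days, height, width (hypothetical)
P, D = 8, 128                # spatial patch size and embed dim (hypothetical)
Hp, Wp = H // P, W // P      # patch grid: 8 x 8 patches

x = np.random.rand(B, 1, T, H, W)

# 1. Patch embedding (VideoEncoder): pixels -> per-day patch tokens
x_patch = x.reshape(B, T, Hp, P, Wp, P).transpose(0, 1, 2, 4, 3, 5)
x_patch = x_patch.reshape(B, T, Hp * Wp, P * P) @ np.random.rand(P * P, D)

# 2-3. Temporal encoding + aggregation (TemporalAttentionAggregator):
#      collapse the time axis to one token per spatial patch
x_temp_agg = x_patch.mean(axis=1)        # stand-in for attention pooling

# 4-5. Spatial encoding + transformer (SpatialTransformer):
#      shape-preserving mixing across the Hp*Wp patch tokens
x_mixed = x_temp_agg @ np.random.rand(D, D)

# 6. Decode (MonthlyConvDecoder): tokens back to original resolution
out = (x_mixed @ np.random.rand(D, P * P)).reshape(B, Hp, Wp, P, P)
out = out.transpose(0, 1, 3, 2, 4).reshape(B, 1, H, W)
```

Note how the time axis T disappears after steps 2–3, so the decoder produces a single (monthly) field per sample rather than one per day.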
Model architecture description
The model architecture is explained in more detail in the code and math description document.