Spatio-Temporal Model Architecture

The model takes daily SST (or similar) data in video format, x ∈ ℝ^{B × 1 × T ×
H × W}, along with a daily_mask indicating missing pixels and a
land_mask_patch indicating land regions in the output. The model performs the
following tasks:
- Combines a video encoder, temporal attention, a spatial transformer, and a decoder
- Encodes the 3D (time × space) data into spatio-temporal patches
- Aggregates temporal information per spatial patch
- Mixes spatial features across patches
- Decodes back to the original spatial resolution
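For concreteness, the inputs can be sketched as follows. The sizes of B, T, H, W and the mask conventions used here (True = missing pixel, True = land) are illustrative assumptions, not taken from the model code:

```python
import numpy as np

# Hypothetical sizes: batch, days, height, width
B, T, H, W = 2, 30, 64, 64

x = np.random.rand(B, 1, T, H, W).astype(np.float32)  # daily SST video
daily_mask = np.random.rand(B, 1, T, H, W) < 0.1      # assumed: True where a pixel is missing
land_mask_patch = np.zeros((H, W), dtype=bool)        # assumed: True over land in the output
land_mask_patch[:, :8] = True                         # e.g. a coastal strip

# One plausible preprocessing step: zero out missing pixels before encoding
x_filled = np.where(daily_mask, 0.0, x)
```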
The architecture consists of the following steps:
# 1. Patch embedding:
X --(VideoEncoder)--> X_patch
# 2. Add temporal encoding +
# 3. Temporal aggregation:
X_patch + PE --(TemporalAttentionAggregator)--> X_temp_agg
# 4. Add spatial encoding +
# 5. Spatial transformer:
X_temp_agg + PE --(SpatialTransformer)--> X_mixed
# 6. Decode to original resolution:
X_mixed --(MonthlyConvDecoder)--> Output
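The shape flow through steps 1–6 can be sketched as below. The patch size P, embed dimension D, and the simple operations standing in for each component are assumptions; the real VideoEncoder, TemporalAttentionAggregator, SpatialTransformer, and MonthlyConvDecoder are learned modules described in the code and math description document.

```python
import numpy as np

B, T, H, W = 2, 30, 64, 64   # batch, days, height, width (hypothetical)
P, D = 8, 128                # spatial patch size and embed dim (hypothetical)
Hp, Wp = H // P, W // P      # patch grid: 8 x 8 patches

x = np.random.rand(B, 1, T, H, W)

# 1. Patch embedding (VideoEncoder): pixels -> per-day patch tokens
x_patch = x.reshape(B, T, Hp, P, Wp, P).transpose(0, 1, 2, 4, 3, 5)
x_patch = x_patch.reshape(B, T, Hp * Wp, P * P) @ np.random.rand(P * P, D)

# 2-3. Temporal encoding + aggregation (TemporalAttentionAggregator):
#      collapse the time axis to one token per spatial patch
x_temp_agg = x_patch.mean(axis=1)        # stand-in for attention pooling

# 4-5. Spatial encoding + transformer (SpatialTransformer):
#      shape-preserving mixing across the Hp*Wp patch tokens
x_mixed = x_temp_agg @ np.random.rand(D, D)

# 6. Decode (MonthlyConvDecoder): tokens back to original resolution
out = (x_mixed @ np.random.rand(D, P * P)).reshape(B, Hp, Wp, P, P)
out = out.transpose(0, 1, 3, 2, 4).reshape(B, 1, H, W)
```

Note how the time axis T disappears after steps 2–3, so the decoder produces a single (monthly) field per sample rather than one per day.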
Model architecture description
The model architecture is explained in more detail in the code and math description document.