
Predicting Extreme Rainfall Events Using Logistic Regression

A Single-Station Machine Learning Approach for Flood Early Warning

Research Question: Given only the meteorological conditions observable right now at a single ground station, can a machine learning model predict whether an extreme rainfall event will occur within the next hour?

1. Dataset Description

1.1 Data Source

Meteorological observations from Station 178, recorded at 10-minute intervals over a 20-year period (September 2005 – August 2025). The raw dataset contains 1,047,486 observations across 7 meteorological variables.

1.2 Measured Variables

| Variable | Description | Unit | Range |
|---|---|---|---|
| RH | Relative Humidity | % | 5 – 100 |
| TD | Dew Point Temperature | °C | -3 – 48 |
| TDmax | Daily Maximum Temperature | °C | -3 – 52 |
| TDmin | Daily Minimum Temperature | °C | -5 – 48 |
| WS | Wind Speed | m/s | 0 – 40 |
| WD | Wind Direction | degrees | 0 – 360 |
| Rain | Rainfall Intensity | mm/10min | 0 – 100 |

1.3 Target Variable Definition

The binary target variable was defined using a forward-looking window approach:

\[\text{target}_t = \begin{cases} 1 & \text{if } \max(\text{Rain}_{t+1}, \text{Rain}_{t+2}, \ldots, \text{Rain}_{t+6}) \geq 2.3 \text{ mm} \\ 0 & \text{otherwise} \end{cases}\]

Where:

  • Threshold: 2.3 mm/10min (95th percentile of rainy observations)

  • Forecast horizon: 6 timesteps = 60 minutes (1 hour ahead)

  • Interpretation: target = 1 means “at least one extreme rainfall measurement will occur within the next hour”
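The forward-looking window can be sketched in pandas. This is a minimal example with a hypothetical rainfall series (the station data itself is not reproduced here); only the threshold and horizon come from the definition above:

```python
import pandas as pd

# Hypothetical 10-minute rainfall series (mm/10min)
rain = pd.Series([0.0, 0.1, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

THRESHOLD = 2.3  # mm/10min (95th percentile of rainy observations)
HORIZON = 6      # 6 timesteps = 60 minutes

# Max rainfall over t+1 .. t+6: shift(-1) starts the window at t+1, the
# rolling max spans 6 steps, and shift(-(HORIZON-1)) re-aligns the window
# end back to row t.
future_max = (
    rain.shift(-1)
        .rolling(HORIZON, min_periods=HORIZON)
        .max()
        .shift(-(HORIZON - 1))
)

# Rows near the end of the series lack a full 1-hour lookahead and stay
# NaN; they are dropped rather than labeled.
target = (future_max.dropna() >= THRESHOLD).astype(int)
```

The first two rows are labeled 1 because the 3.0 mm reading falls inside their 1-hour lookahead; the third row's window contains only zeros.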

1.4 Class Distribution

| Class | Count | Percentage |
|---|---|---|
| Normal (target = 0) | 1,042,948 | 99.67% |
| Extreme (target = 1) | 3,454 | 0.33% |
| Imbalance ratio | 1 : 302 | |

2. Data Preprocessing and Feature Engineering Pipeline

2.1 Pipeline Overview

```mermaid
flowchart TD

%% =======================
%% NODES (UNCHANGED CONTENT)
%% =======================

A["RAW DATA<br/>(1,047,486 rows × 7 columns)"]

B["[1] DATA QUALITY CONTROL<br/>
• Out-of-bounds detection (27,126 flagged)<br/>
• Cross-validation rules (251 violations)<br/>
• Total NaNs introduced: 27,377"]

C["[2] MISSING DATA IMPUTATION<br/>
• Time-bounded ffill/bfill (≤ 1 hour gaps)<br/>
• Climatological median fill (rainy rows)<br/>
• Drop remaining dry NaN rows (920 rows)"]

D["[3] SMALL GAP FILLING<br/>
• Linear interpolation (≤ 2 hours)<br/>
• Large gaps (> 2 hours) left intact"]

E["[4] WIND DIRECTION DECOMPOSITION<br/>
• WD → WD_sin = sin(WD radians)<br/>
• WD → WD_cos = cos(WD radians)<br/>
• Raw WD column dropped"]

F["[5] FEATURE ENGINEERING<br/>
• Rain lag features (6)<br/>
• Meteorological lag features (30)<br/>
• Rain rolling stats (12)<br/>
• Meteorological rolling stats (36)"]

G["[6] LEAKAGE PREVENTION<br/>
• Raw Rain column dropped<br/>
• Rain features use shift(1): never include t=0<br/>
• Gap contamination zones removed (6 rows before, 36 after)"]

H["[7] VALIDATION<br/>
• Verify 0 NaNs remaining<br/>
• Verify 91 features + 1 target<br/>
• Verify no raw Rain in features"]

I["FINAL DATASET<br/>(1,046,402 rows × 92 columns)"]

%% =======================
%% FLOW (THICKER ARROWS)
%% =======================

A ==> B ==> C ==> D ==> E ==> F ==> G ==> H ==> I

%% =======================
%% STYLING
%% =======================

classDef raw fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px,color:#000;
classDef cleaning fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px,color:#000;
classDef engineering fill:#E8F5E9,stroke:#43A047,stroke-width:2px,color:#000;
classDef protection fill:#FCE4EC,stroke:#D81B60,stroke-width:2px,color:#000;
classDef validation fill:#F3E5F5,stroke:#8E24AA,stroke-width:2px,color:#000;
classDef final fill:#E0F2F1,stroke:#00897B,stroke-width:2px,color:#000;

class A raw;
class B,C,D cleaning;
class E,F engineering;
class G protection;
class H validation;
class I final;

linkStyle default stroke-width:3px
```

2.2 Data Quality Control

Phase 1 – Out-of-Bounds Detection: Domain-specific physical bounds were applied to each variable to flag impossible sensor readings as NaN.

| Variable | Valid Range | Readings Flagged |
|---|---|---|
| Rain | 0 – 100 mm | 0 |
| TD | -3 – 48 °C | 6,875 |
| TDmax | -3 – 52 °C | 6,779 |
| TDmin | -5 – 48 °C | 6,917 |
| RH | 5 – 100 % | 6,517 |
| WS | 0 – 40 m/s | 19 |
| WD | 0 – 360° | 19 |
| **Total** | | 27,126 |

Phase 2 – Cross-Validation Rules: Six physically impossible combination rules were applied (e.g., TD > TDmax, TDmin > TDmax, nighttime TD > 40 °C). An additional 251 rows were flagged.
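Both phases can be sketched as follows. The bounds come from the table above; `quality_control` and the two synthetic rows are illustrative, not the project's actual code, and only one of the six cross-validation rules is shown:

```python
import numpy as np
import pandas as pd

# Physical bounds from the table above; anything outside is flagged as NaN.
BOUNDS = {
    "Rain": (0, 100), "TD": (-3, 48), "TDmax": (-3, 52),
    "TDmin": (-5, 48), "RH": (5, 100), "WS": (0, 40), "WD": (0, 360),
}

def quality_control(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Phase 1: out-of-bounds readings -> NaN
    for col, (lo, hi) in BOUNDS.items():
        out.loc[~out[col].between(lo, hi), col] = np.nan
    # Phase 2 (one rule of six): TDmin > TDmax is physically impossible
    bad = out["TDmin"] > out["TDmax"]
    out.loc[bad, ["TDmin", "TDmax"]] = np.nan
    return out

# Tiny synthetic example: RH of 120 % and TDmin > TDmax are both flagged.
raw = pd.DataFrame({
    "Rain": [0.0, 0.2], "TD": [20.0, 25.0], "TDmax": [30.0, 24.0],
    "TDmin": [15.0, 26.0], "RH": [120.0, 80.0], "WS": [3.0, 5.0],
    "WD": [180.0, 90.0],
})
clean = quality_control(raw)
```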

"figure number 1"

2.3 Missing Data Imputation

A three-step strategy was designed to preserve all rainy observations (critical for rare-event modeling):

  1. Time-bounded fill (gap limit = 6 consecutive NaNs = 1 hour): Forward-fill then backward-fill within this limit. Recovered 31 of 307 rainy rows with NaN.

  2. Drop remaining dry NaN rows (920 rows): Only non-rainy rows were removed.

  3. Climatological median fill: For remaining rainy rows with NaN (276 rows), values were imputed using per-month climatological medians computed from clean data.
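The time-bounded fill and the climatological median fill can be sketched as below. `bounded_fill` and `climatological_fill` are illustrative names; the gap limit of 6 steps corresponds to the 1-hour cap, and the toy example uses a shorter limit so the median fill has something left to do:

```python
import pandas as pd

def bounded_fill(df: pd.DataFrame, limit: int = 6) -> pd.DataFrame:
    """Forward- then backward-fill, at most `limit` consecutive steps."""
    return df.ffill(limit=limit).bfill(limit=limit)

def climatological_fill(df: pd.DataFrame, col: str) -> pd.Series:
    """Fill remaining NaNs with the per-month median of the column."""
    monthly_median = df[col].groupby(df.index.month).transform("median")
    return df[col].fillna(monthly_median)

# Toy series: a 3-step gap, filled with limit=1 so one NaN survives
idx = pd.date_range("2024-01-01", periods=5, freq="10min")
df = pd.DataFrame({"RH": [50.0, None, None, None, 60.0]}, index=idx)
filled = climatological_fill(bounded_fill(df, limit=1), "RH")
```

The middle NaN survives the bounded fill and is then replaced by the January median (55.0).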

2.4 Feature Engineering

Starting from 7 base variables (with WD decomposed into WD_sin and WD_cos), 91 features were constructed:

| Category | Count | Description | Example |
|---|---|---|---|
| Current readings | 5 | Values at time t | RH, TD, WS, WD_sin, WD_cos |
| Daily context | 2 | Daily aggregates | TDmax, TDmin |
| Rain lags | 6 | Rain at t-k (k = 1, 3, 6, 12, 18, 36 steps) | Rain_lag_1 (10 min ago) |
| Meteorological lags | 30 | 5 variables × 6 lag steps | RH_lag_6 (1 hour ago) |
| Rain rolling stats | 12 | 3 windows × 4 aggregations | Rain_roll_max_6 (max rain in last hour) |
| Meteorological rolling stats | 36 | 3 variables × 3 windows × 4 aggregations | RH_roll_mean_18 (avg humidity in last 3h) |
| **Total** | 91 | | |
Lag steps: 1, 3, 6, 12, 18, 36 timesteps (10 min, 30 min, 1h, 2h, 3h, 6h)   

Rolling windows: 6, 18, 36 timesteps (1h, 3h, 6h)

Rolling aggregations: mean, std, max, min

Leakage prevention: All rain-derived features are computed on Rain.shift(1) to ensure the current timestep’s rainfall (which contributes to the target) is never included as a feature.
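The rain-feature construction, including the `shift(1)` guard, can be sketched with an illustrative helper (6 lags + 3 windows × 4 aggregations = 18 rain-derived columns):

```python
import pandas as pd

LAGS = (1, 3, 6, 12, 18, 36)   # 10 min, 30 min, 1 h, 2 h, 3 h, 6 h
WINDOWS = (6, 18, 36)          # 1 h, 3 h, 6 h
AGGS = ("mean", "std", "max", "min")

def rain_features(rain: pd.Series) -> pd.DataFrame:
    feats = {}
    # Lags: shift(k) with k >= 1 never includes the current timestep.
    for k in LAGS:
        feats[f"Rain_lag_{k}"] = rain.shift(k)
    # Rolling stats on shift(1), so every window ends at t-1, never at t.
    hist = rain.shift(1)
    for w in WINDOWS:
        roll = hist.rolling(w, min_periods=w)
        for agg in AGGS:
            feats[f"Rain_roll_{agg}_{w}"] = getattr(roll, agg)()
    return pd.DataFrame(feats)

X = rain_features(pd.Series(range(50), dtype=float))
```

On this monotone toy series, the 1-hour rolling max at row 10 is 9.0 — the value at t-1, confirming the window never touches the current reading.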

"figure number 2"

"figure number 3"

3. Temporal Train / Validation / Test Split

3.1 Split Design and Statistics

A strictly chronological split with purge gaps was applied to prevent data leakage:

```mermaid
flowchart LR

A["Train<br/>2005-09-01 → 2020-12-31<br/>76.7%"]

G1["6h purge"]

B["Validation<br/>2021-01-01 → 2023-06-30<br/>12.5%"]

G2["6h purge"]

C["Test<br/>2023-07-01 → 2025-08-30<br/>10.9%"]

A --> G1 --> B --> G2 --> C

classDef train fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px,color:#000000,font-weight:bold;
classDef val fill:#FFF8E1,stroke:#F9A825,stroke-width:2px,color:#000000,font-weight:bold;
classDef test fill:#E8F5E9,stroke:#43A047,stroke-width:2px,color:#000000,font-weight:bold;
classDef gap fill:#FAFAFA,stroke:#9E9E9E,stroke-dasharray:5 5,color:#000000,font-weight:bold;

class A train;
class B val;
class C test;
class G1,G2 gap;

linkStyle default stroke-width:2px
```

The 6-hour purge gap matches the maximum lag of 36 steps (6 hours), ensuring no test row’s lag features can reach into training data.

| Split | Period | Rows | Positives | Positive Rate | Neg:Pos |
|---|---|---|---|---|---|
| Train | 2005-09 to 2020-12 | 802,355 | 2,441 | 0.304% | 1:328 |
| Validation | 2021-01 to 2023-06 | 130,335 | 536 | 0.411% | 1:243 |
| Test | 2023-07 to 2025-08 | 113,640 | 477 | 0.420% | 1:238 |
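A sketch of the purged chronological split, assuming a DatetimeIndex at 10-minute resolution (`temporal_split` is an illustrative helper; the toy frame below spans only the train/validation boundary):

```python
import pandas as pd

PURGE = pd.Timedelta(hours=6)  # matches the maximum lag of 36 steps

def temporal_split(df: pd.DataFrame):
    train = df.loc[:"2020-12-31"]
    # Discard the first 6 h of each later block so no lag or rolling
    # feature can reach back across the boundary.
    val = df.loc[pd.Timestamp("2021-01-01") + PURGE : "2023-06-30"]
    test = df.loc[pd.Timestamp("2023-07-01") + PURGE :]
    return train, val, test

# Tiny synthetic index spanning the train/validation boundary
idx = pd.date_range("2020-12-31 23:00", "2021-01-01 08:00", freq="10min")
frame = pd.DataFrame({"x": 1.0}, index=idx)
train, val, test = temporal_split(frame)
```

Training rows stop at 23:50 on 2020-12-31 and validation rows begin at 06:00 on 2021-01-01 — the six intervening hours are dropped.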

3.2 Feature Scaling

StandardScaler was applied to normalize features to zero mean and unit variance. The scaler was fitted on the training set only to prevent information leakage. Validation and test sets were transformed using the same scaler parameters.
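Sketch of train-only scaler fitting, assuming scikit-learn; the arrays are synthetic stand-ins for the real feature matrices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=10.0, scale=3.0, size=(1000, 4))  # stand-in features
X_val = rng.normal(loc=10.0, scale=3.0, size=(200, 4))

scaler = StandardScaler().fit(X_train)  # statistics from the training block only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)       # same parameters, no refit
```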


4. Logistic Regression Model

Logistic Regression is a linear classification algorithm that models the probability of a binary outcome as a function of input features. Rather than predicting the class directly, it estimates the probability that a given observation belongs to the positive class (extreme rainfall) by passing a linear combination of features through the sigmoid function. The model learns a weight (coefficient) for each feature, quantifying its contribution to the prediction. During training, these weights are optimized by minimizing a regularized cross-entropy loss function, which penalizes both incorrect predictions and overly large coefficients to prevent overfitting.

Model Formula:

\[P(y = 1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where} \quad z = \mathbf{w}^\top \mathbf{x} + b = \sum_{j=1}^{p} w_j x_j + b\]
| Symbol | Definition |
|---|---|
| $P(y = 1 \mid \mathbf{x})$ | Predicted probability that the observation $\mathbf{x}$ belongs to the positive class (extreme rainfall within the next hour) |
| $\sigma(z)$ | The sigmoid (logistic) function, which maps any real-valued $z$ to the range $(0, 1)$ |
| $z$ | The logit – the linear combination of all input features before the sigmoid transformation |
| $\mathbf{x} = [x_1, x_2, \ldots, x_p]$ | The feature vector of a single observation, containing all $p = 91$ engineered features |
| $\mathbf{w} = [w_1, w_2, \ldots, w_p]$ | The weight (coefficient) vector – one learned weight per feature, representing its contribution to the prediction |
| $b$ | The bias (intercept) term, representing the model's baseline log-odds when all features are zero (learned value: $b = -2.0022$) |
| $p$ | The number of input features ($p = 91$) |

Training Objective (L2-Regularized Cross-Entropy Loss):

\[\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^{N} c_{y_i} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] + \frac{1}{2C} \lVert\mathbf{w}\rVert_2^2\]

| Symbol | Definition |
|---|---|
| $N$ | Total number of training samples |
| $y_i$ | True label of sample $i$ (0 = normal, 1 = extreme) |
| $\hat{y}_i$ | Predicted probability $P(y_i = 1 \mid \mathbf{x}_i)$ |
| $c_{y_i}$ | Class weight for sample $i$'s true class (~0.50 for normal, ~164.3 for extreme), compensating for class imbalance |
| $C$ | Inverse regularization strength – smaller $C$ applies a stronger L2 penalty on the weights (selected value: $C = 0.001$) |
| $\lVert\mathbf{w}\rVert_2^2$ | Squared L2 norm of the weight vector, penalizing large coefficients to reduce overfitting |

Classification Rule: An observation is classified as extreme if $P(y = 1 \mid \mathbf{x}) \geq \tau$, where $\tau = 0.5$ is the default decision threshold.

4.1 Model Configuration

| Parameter | Value | Rationale |
|---|---|---|
| C (regularization) | 0.001 | Optimal value from the regularization curve analysis (Cell 23b); outperformed the default C = 1.0 on both val F1 (+6.4%) and val PR-AUC (+2.9%) |
| solver | lbfgs | Second-order optimizer suited to medium-sized datasets |
| class_weight | balanced | Automatically upweights the minority class (~164×) to address class imbalance |
| max_iter | 2000 | Generous convergence limit; the model converged in 92 iterations |
| random_state | 42 | Reproducibility |
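The configuration above maps directly onto scikit-learn. It is fitted here on a small synthetic, imbalanced problem, since the station data is not bundled with this post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    C=0.001,                  # value selected in the regularization sweep
    solver="lbfgs",
    class_weight="balanced",  # upweights the rare extreme class
    max_iter=2000,
    random_state=42,
)

# Synthetic imbalanced stand-in for the real feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 1.5).astype(int)  # rare positive class
model.fit(X, y)
proba = model.predict_proba(X)[:, 1]
```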

Class weighting formula:

\[w_{\text{class}} = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_{\text{samples in class}}}\]
  • Weight for class 0 (Normal): ~0.50

  • Weight for class 1 (Extreme): ~164.3
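These weights follow directly from the training-set counts in Section 3.1:

```python
# class_weight="balanced": w_class = n_samples / (n_classes * n_in_class)
n_samples, n_pos = 802_355, 2_441   # training rows and positives
n_neg = n_samples - n_pos

w_extreme = n_samples / (2 * n_pos)  # ~164.3
w_normal = n_samples / (2 * n_neg)   # ~0.50
```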

4.2 Regularization Parameter Selection

The regularization parameter C was selected by training models across C = {0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0} and evaluating on the validation set.

| C | Iterations | Val F1 | Val PR-AUC |
|---|---|---|---|
| 0.001 | 92 | 0.0848 | 0.1047 |
| 0.01 | 222 | 0.0838 | 0.1040 |
| 0.1 | 592 | 0.0826 | 0.1028 |
| 0.5 | 1,122 | 0.0806 | 0.1021 |
| 1.0 | 1,422 | 0.0797 | 0.1017 |
| 5.0 | 1,874 | 0.0793 | 0.1009 |
| 10.0 | 2,000* | 0.0783 | 0.1007 |
| 100.0 | 2,000* | 0.0778 | 0.1004 |

\* Did not converge within max_iter = 2000.

C = 0.001 was selected as it achieved the highest validation F1 and PR-AUC, while converging 15x faster than C = 1.0.
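The sweep itself is a plain loop. In this sketch, scikit-learn's `average_precision_score` stands in for PR-AUC (a common proxy), and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score

def sweep_C(X_tr, y_tr, X_val, y_val,
            grid=(0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0)):
    results = {}
    for C in grid:
        m = LogisticRegression(C=C, solver="lbfgs", class_weight="balanced",
                               max_iter=2000, random_state=42).fit(X_tr, y_tr)
        p = m.predict_proba(X_val)[:, 1]
        results[C] = {"val_F1": f1_score(y_val, p >= 0.5),
                      "val_PR_AUC": average_precision_score(y_val, p)}
    return results

rng = np.random.default_rng(1)
X_tr, X_val = rng.normal(size=(300, 3)), rng.normal(size=(100, 3))
y_tr, y_val = (X_tr[:, 0] > 0.5).astype(int), (X_val[:, 0] > 0.5).astype(int)
results = sweep_C(X_tr, y_tr, X_val, y_val, grid=(0.001, 1.0))
```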

"figure number 4"

4.3 Validation Set Performance

| Metric | Value |
|---|---|
| Recall (Extreme class) | 0.8451 |
| Precision (Extreme class) | 0.0447 |
| F1 (Extreme class) | 0.0848 |
| ROC-AUC | 0.9548 |
| PR-AUC | 0.1047 (baseline = 0.0041) |
| Training time | 10.1 seconds |

Confusion Matrix (Validation Set):

| | Predicted: Normal | Predicted: Extreme |
|---|---|---|
| Actual: Normal | 120,108 (TN) | 9,691 (FP) |
| Actual: Extreme | 83 (FN) | 453 (TP) |

Interpretation: The model correctly identified 453 out of 536 extreme events (84.5% recall), at the cost of 9,691 false alarms. The PR-AUC of 0.1047 represents a 25.5x improvement over a random baseline (0.0041).
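These headline numbers can be re-derived from the confusion matrix alone:

```python
# Validation confusion matrix entries (TN listed for completeness)
TP, FP, FN, TN = 453, 9_691, 83, 120_108

recall = TP / (TP + FN)               # 453 / 536
precision = TP / (TP + FP)            # 453 / 10,144
f1 = 2 * TP / (2 * TP + FP + FN)      # harmonic mean of the two
```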

"figure number 5"

4.4 Test Set Performance

| Metric | Value |
|---|---|
| Recall (Extreme class) | 0.7358 |
| Precision (Extreme class) | 0.0504 |
| F1 (Extreme class) | 0.0943 |
| ROC-AUC | 0.9216 |
| PR-AUC | 0.0949 (baseline = 0.0042) |

"figure number 6"

Confusion Matrix (Test Set):

| | Predicted: Normal | Predicted: Extreme |
|---|---|---|
| Actual: Normal | 106,545 (TN) | 6,618 (FP) |
| Actual: Extreme | 126 (FN) | 351 (TP) |

Interpretation: On completely unseen data (2023–2025), the model correctly identified 351 out of 477 extreme events (73.6% recall). The PR-AUC of 0.0949 represents a 22.6x improvement over a random baseline.

4.5 Generalization: Validation vs. Test Comparison

| Metric | Validation | Test | Difference |
|---|---|---|---|
| F1 | 0.0848 | 0.0943 | +0.0094 |
| Precision | 0.0447 | 0.0504 | +0.0057 |
| Recall | 0.8451 | 0.7358 | -0.1093 |
| ROC-AUC | 0.9548 | 0.9216 | -0.0332 |
| PR-AUC | 0.1047 | 0.0949 | -0.0098 |

The model shows a 10.9% drop in recall on the test set compared to validation. ROC-AUC and PR-AUC remain strong, indicating good overall separation ability. The recall drop suggests that the 2023–2025 test period may contain extreme events with slightly different atmospheric signatures than the training period (2005–2020).

4.6 Threshold Sensitivity Analysis

The default classification threshold of 0.5 was analyzed against alternative thresholds to understand the precision-recall tradeoff:

Validation Set:

| Threshold | Recall | Precision | F1 | TP | FP | FN | Missed % |
|---|---|---|---|---|---|---|---|
| 0.05 | 1.000 | 0.0062 | 0.0123 | 536 | 85,872 | 0 | 0.0% |
| 0.10 | 0.998 | 0.0082 | 0.0163 | 535 | 64,477 | 1 | 0.2% |
| 0.15 | 0.993 | 0.0101 | 0.0200 | 532 | 52,119 | 4 | 0.7% |
| 0.20 | 0.989 | 0.0123 | 0.0242 | 530 | 42,727 | 6 | 1.1% |
| 0.30 | 0.951 | 0.0186 | 0.0365 | 510 | 26,863 | 26 | 4.9% |
| 0.40 | 0.892 | 0.0311 | 0.0601 | 478 | 14,887 | 58 | 10.8% |
| 0.50 | 0.845 | 0.0447 | 0.0848 | 453 | 9,691 | 83 | 15.5% |

Test Set:

| Threshold | Recall | Precision | F1 | TP | FP | FN | Missed % |
|---|---|---|---|---|---|---|---|
| 0.05 | 0.979 | 0.0065 | 0.0129 | 467 | 71,619 | 10 | 2.1% |
| 0.10 | 0.956 | 0.0085 | 0.0168 | 456 | 53,440 | 21 | 4.4% |
| 0.15 | 0.935 | 0.0104 | 0.0206 | 446 | 42,383 | 31 | 6.5% |
| 0.20 | 0.920 | 0.0128 | 0.0252 | 439 | 33,885 | 38 | 8.0% |
| 0.30 | 0.878 | 0.0208 | 0.0407 | 419 | 19,717 | 58 | 12.2% |
| 0.40 | 0.797 | 0.0373 | 0.0712 | 380 | 9,818 | 97 | 20.3% |
| 0.50 | 0.736 | 0.0504 | 0.0943 | 351 | 6,618 | 126 | 26.4% |

The analysis reveals that lowering the threshold substantially increases recall at the cost of more false alarms, while precision remains low at all thresholds. The default threshold of 0.5 was retained for this baseline model; optimal threshold selection is deferred to the final deployed model.
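The sweep behind both tables reduces to a small loop over candidate thresholds (illustrative helper, exercised here on a toy label/probability pair):

```python
import numpy as np

def threshold_table(y_true, proba,
                    thresholds=(0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50)):
    rows = []
    for tau in thresholds:
        pred = proba >= tau
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        rows.append((tau, recall, precision, tp, fp, fn))
    return rows

rows = threshold_table(np.array([1, 0, 0, 1]),
                       np.array([0.6, 0.4, 0.1, 0.2]))
```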

"figure number 7"

4.7 Feature Coefficients – What the Model Learned

The top 20 features by absolute coefficient magnitude reveal the atmospheric signals the model relies on:

| Rank | Feature | Coefficient | Interpretation |
|---|---|---|---|
| 1 | RH_roll_max_6 | +0.628 | Peak humidity in last 1h: higher = more likely extreme |
| 2 | WS_roll_max_36 | +0.619 | Peak wind speed in last 6h: higher = more likely extreme |
| 3 | RH_roll_max_18 | +0.530 | Peak humidity in last 3h: higher = more likely extreme |
| 4 | RH_roll_std_6 | +0.498 | Humidity fluctuation in last 1h: more volatile = more likely |
| 5 | RH_lag_6 | +0.478 | Humidity 1 hour ago: higher = more likely extreme |
| 6 | TD_roll_max_36 | -0.469 | Peak temperature in last 6h: higher = less likely (hot = dry) |
| 7 | WS_roll_max_18 | +0.466 | Peak wind speed in last 3h: higher = more likely |
| 8 | RH_roll_mean_6 | +0.446 | Average humidity in last 1h |
| 9 | RH_roll_max_36 | +0.443 | Peak humidity in last 6h |
| 10 | WD_sin_lag_1 | +0.437 | Wind direction component 10 min ago |
| 11 | RH_roll_mean_18 | +0.436 | Average humidity in last 3h |
| 12 | TD_roll_max_18 | -0.430 | Peak temperature in last 3h: higher = less likely |
| 13 | RH | +0.427 | Current humidity: higher = more likely |
| 14 | RH_lag_12 | +0.405 | Humidity 2 hours ago |
| 15 | WD_cos_lag_18 | -0.386 | Wind direction component 3 hours ago |
| 16 | RH_roll_min_6 | +0.385 | Minimum humidity in last 1h |
| 17 | TD_roll_max_6 | -0.374 | Peak temperature in last 1h: higher = less likely |
| 18 | RH_lag_3 | +0.370 | Humidity 30 minutes ago |
| 19 | TD_roll_std_36 | -0.347 | Temperature variability in last 6h |
| 20 | TDmax | -0.347 | Daily maximum temperature: higher = less likely |

Model intercept (bias): -2.0022 (the model starts with a strong prior toward “no extreme event”, consistent with the 0.33% base rate).
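Extracting such a ranking from a fitted model is a short helper over `model.coef_` (sketch with toy numbers; in the real pipeline `feature_names` would be the 91 engineered column names):

```python
import numpy as np
import pandas as pd

def top_coefficients(coef, feature_names, k=20):
    """Return the k weights with the largest absolute magnitude."""
    s = pd.Series(coef, index=feature_names)
    return s.reindex(s.abs().sort_values(ascending=False).index).head(k)

# Toy stand-in for model.coef_[0] and the engineered feature names
top = top_coefficients(np.array([0.427, -0.469, 0.628]),
                       ["RH", "TD_roll_max_36", "RH_roll_max_6"], k=2)
```

Note that ranking by absolute value keeps large negative weights (like the temperature features above) in the list.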

"figure number 8"

Physical interpretation: The model has learned a coherent meteorological story:

  • Humidity dominance: 11 of the top 20 features are humidity-related. Rising, high, and volatile humidity signals moisture-laden air masses.

  • Wind signals: Strong winds in recent hours indicate incoming storm systems.

  • Temperature as negative signal: High temperatures push predictions away from extreme rain, consistent with hot, dry conditions being incompatible with convective precipitation.

  • Temporal pattern: Features spanning 10 minutes to 6 hours capture the evolution of pre-storm atmospheric conditions.


5. Discussion

5.1 Why We Observed This Performance on the Test Set

The Logistic Regression model achieved strong discriminative ability on the validation set (ROC-AUC = 0.9548, Recall = 84.5%) but exhibited a notable decline on the test set (ROC-AUC = 0.9216, Recall = 73.6%). Several factors explain this performance gap.

Temporal distribution shift. The training data spans 2005–2020 and the test set covers 2023–2025. Extreme rainfall events in the test period may exhibit different atmospheric signatures than those in the training period. Climate variability, shifts in storm patterns, or changes in the frequency of convective versus frontal precipitation systems could all contribute to this distributional mismatch. The validation period (2021–2023) sits temporally closer to the training data and likely shares more similar weather patterns, explaining its stronger performance.

Linearity constraint. Logistic Regression models a linear decision boundary in feature space. In practice, extreme rainfall is likely governed by non-linear interactions between atmospheric variables – for instance, the combination of high humidity AND strong winds AND dropping temperature may be far more predictive than any of these signals individually. A linear model cannot capture such multiplicative interactions; it can only learn additive contributions from each feature independently. This fundamental limitation places a ceiling on the achievable performance.

Extreme class imbalance. With only 0.33% of rows belonging to the positive class, the model faces an inherently difficult classification task. The precision remained below 5% at the default threshold across both validation and test sets, indicating that the model fires a large number of false alarms for every true detection. This is not unique to our model – it reflects the difficulty of distinguishing genuine pre-storm atmospheric signatures from conditions that appear similar but do not produce extreme rainfall. The threshold sensitivity analysis (Section 4.6) confirmed that precision remains low at all operating points, suggesting this is a structural challenge rather than a threshold selection issue.

What the model gets right. Despite these limitations, the PR-AUC of 0.0949 on the test set represents a 22.6x improvement over a random classifier (baseline = 0.0042). The ROC-AUC of 0.9216 confirms that the model achieves strong overall separation between the two classes. The model successfully learns that rising humidity, strong winds, and low temperatures precede extreme events – a physically coherent signal that generalizes, even if imperfectly, to unseen data.

5.2 What This Project Successfully Demonstrated

A single ground station is sufficient for meaningful prediction. Without radar, satellite imagery, or numerical weather model output, a simple linear model using only local meteorological observations achieved 73.6% recall on completely unseen data. This demonstrates that ground-level atmospheric conditions carry substantial predictive information about imminent extreme rainfall.

Feature engineering transforms raw observations into predictive signals. Starting from 7 raw meteorological variables, the lag and rolling-window feature engineering pipeline produced 91 features that capture the temporal evolution of atmospheric conditions over the preceding 6 hours. This transformation was critical – the model’s top features (RH_roll_max_6, WS_roll_max_36) are engineered features, not raw measurements.

A leakage-free temporal evaluation pipeline. The strict chronological split with 6-hour purge gaps, training-only scaler fitting, and removal of the raw Rain column at $t=0$ ensures that all reported metrics reflect genuine out-of-sample predictive skill. No future information leaks into the model during training or evaluation.

Model interpretability reveals physically meaningful patterns. The learned coefficients tell a coherent meteorological story: humidity saturation, strong winds, and low temperatures precede extreme rainfall, while high temperatures and stable conditions suppress it. This interpretability provides confidence that the model has learned genuine atmospheric physics rather than statistical artifacts.

A strong baseline for future work. The Logistic Regression model establishes a performance floor that more complex models (Random Forest, gradient-boosted trees, neural networks) must exceed to justify their added complexity. Any model that cannot outperform this linear baseline on the same temporal split does not capture meaningful non-linear structure in the data.


6. Conclusion

This study investigated whether a machine learning model can predict extreme rainfall events (≥ 2.3 mm/10min) within the next hour using only ground-level meteorological observations from a single station. A Logistic Regression classifier was trained on 15 years of 10-minute resolution data (2005–2020) and evaluated on a held-out test period (2023–2025) with strict temporal separation.

The model achieved a test set recall of 73.6%, correctly identifying 351 out of 477 extreme events, with an ROC-AUC of 0.9216 and a PR-AUC of 0.0949 (22.6x above random baseline). Feature coefficient analysis revealed that the model relies primarily on humidity dynamics (11 of the top 20 features), wind intensity, and temperature as a negative indicator – signals consistent with established meteorological understanding of convective precipitation.

Limitations. The model’s precision remains low (5.0%), generating approximately 19 false alarms for every correctly detected event. This is a consequence of the extreme class imbalance (0.33% positive rate) combined with the linear model’s inability to capture non-linear feature interactions. Additionally, the 10.9% recall drop from validation to test suggests sensitivity to temporal distribution shift.

Future directions. Three avenues may improve upon this baseline: (1) non-linear models such as Random Forest or XGBoost, which can capture feature interactions (e.g., high humidity AND strong winds simultaneously); (2) threshold optimization tailored to the operational cost structure of false alarms versus missed events; and (3) incorporation of additional data sources such as radar reflectivity or neighboring station observations to provide spatial context that a single station cannot offer.

This post is licensed under CC BY 4.0 by the author.