
Predicting Extreme Rainfall Events Using Logistic Regression

A Single-Station Machine Learning Approach for Flood Early Warning

Research Question: Given only the meteorological conditions observable right now at a single ground station, can a machine learning model predict whether an extreme rainfall event will occur within the next hour?

1. Dataset Description

1.1 Data Source

Meteorological observations from Station 178, recorded at 10-minute intervals over a 20-year period (September 2005 – August 2025). The raw dataset contains 1,047,486 observations across 7 meteorological variables.

1.2 Measured Variables

| Variable | Description | Unit | Range |
|---|---|---|---|
| RH | Relative Humidity | % | 5 – 100 |
| TD | Dew Point Temperature | °C | -3 – 48 |
| TDmax | Daily Maximum Temperature | °C | -3 – 52 |
| TDmin | Daily Minimum Temperature | °C | -5 – 48 |
| WS | Wind Speed | m/s | 0 – 40 |
| WD | Wind Direction | degrees | 0 – 360 |
| Rain | Rainfall Intensity | mm/10min | 0 – 100 |

1.3 Target Variable Definition

The binary target variable was defined using a forward-looking window approach:

\[\text{target}_t = \begin{cases} 1 & \text{if } \max(\text{Rain}_{t+1}, \text{Rain}_{t+2}, \ldots, \text{Rain}_{t+6}) \geq 2.3 \text{ mm} \\ 0 & \text{otherwise} \end{cases}\]

Where:

  • Threshold: 2.3 mm/10min (95th percentile of rainy observations)

  • Forecast horizon: 6 timesteps = 60 minutes (1 hour ahead)

  • Interpretation: target = 1 means “at least one extreme rainfall measurement will occur within the next hour”
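The forward-looking window can be sketched in pandas. This is a minimal example with a hypothetical rainfall series (the station data itself is not reproduced here); only the threshold and horizon come from the definition above:

```python
import pandas as pd

# Hypothetical 10-minute rainfall series (mm/10min)
rain = pd.Series([0.0, 0.1, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

THRESHOLD = 2.3  # mm/10min (95th percentile of rainy observations)
HORIZON = 6      # 6 timesteps = 60 minutes

# Max rainfall over t+1 .. t+6: shift(-1) starts the window at t+1, the
# rolling max spans 6 steps, and shift(-(HORIZON-1)) re-aligns the window
# end back to row t.
future_max = (
    rain.shift(-1)
        .rolling(HORIZON, min_periods=HORIZON)
        .max()
        .shift(-(HORIZON - 1))
)

# Rows near the end of the series lack a full 1-hour lookahead and stay
# NaN; they are dropped rather than labeled.
target = (future_max.dropna() >= THRESHOLD).astype(int)
```

The first two rows are labeled 1 because the 3.0 mm reading falls inside their 1-hour lookahead; the third row's window contains only zeros.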

1.4 Class Distribution

| Class | Count | Percentage |
|---|---|---|
| Normal (target = 0) | 1,042,948 | 99.67% |
| Extreme (target = 1) | 3,454 | 0.33% |
| Imbalance ratio | 1 : 302 | |

2. Data Preprocessing and Feature Engineering Pipeline

2.1 Pipeline Overview

```mermaid
flowchart TD

%% =======================
%% NODES (UNCHANGED CONTENT)
%% =======================

A["RAW DATA<br/>(1,047,486 rows × 7 columns)"]

B["[1] DATA QUALITY CONTROL<br/>
• Out-of-bounds detection (27,126 flagged)<br/>
• Cross-validation rules (251 violations)<br/>
• Total NaNs introduced: 27,377"]

C["[2] MISSING DATA IMPUTATION<br/>
• Time-bounded ffill/bfill (≤ 1 hour gaps)<br/>
• Climatological median fill (rainy rows)<br/>
• Drop remaining dry NaN rows (920 rows)"]

D["[3] SMALL GAP FILLING<br/>
• Linear interpolation (≤ 2 hours)<br/>
• Large gaps (> 2 hours) left intact"]

E["[4] WIND DIRECTION DECOMPOSITION<br/>
• WD → WD_sin = sin(WD radians)<br/>
• WD → WD_cos = cos(WD radians)<br/>
• Raw WD column dropped"]

F["[5] FEATURE ENGINEERING<br/>
• Rain lag features (6)<br/>
• Meteorological lag features (30)<br/>
• Rain rolling stats (12)<br/>
• Meteorological rolling stats (36)"]

G["[6] LEAKAGE PREVENTION<br/>
• Raw Rain column dropped<br/>
• Rain features use shift(1): never include t=0<br/>
• Gap contamination zones removed (6 rows before, 36 after)"]

H["[7] VALIDATION<br/>
• Verify 0 NaNs remaining<br/>
• Verify 91 features + 1 target<br/>
• Verify no raw Rain in features"]

I["FINAL DATASET<br/>(1,046,402 rows × 92 columns)"]

%% =======================
%% FLOW (THICKER ARROWS)
%% =======================

A ==> B ==> C ==> D ==> E ==> F ==> G ==> H ==> I

%% =======================
%% STYLING
%% =======================

classDef raw fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px,color:#000;
classDef cleaning fill:#FFF3E0,stroke:#FB8C00,stroke-width:2px,color:#000;
classDef engineering fill:#E8F5E9,stroke:#43A047,stroke-width:2px,color:#000;
classDef protection fill:#FCE4EC,stroke:#D81B60,stroke-width:2px,color:#000;
classDef validation fill:#F3E5F5,stroke:#8E24AA,stroke-width:2px,color:#000;
classDef final fill:#E0F2F1,stroke:#00897B,stroke-width:2px,color:#000;

class A raw;
class B,C,D cleaning;
class E,F engineering;
class G protection;
class H validation;
class I final;

linkStyle default stroke-width:3px
```

2.2 Data Quality Control

Phase 1 – Out-of-Bounds Detection: Domain-specific physical bounds were applied to each variable to flag impossible sensor readings as NaN.

| Variable | Valid Range | Readings Flagged |
|---|---|---|
| Rain | 0 – 100 mm | 0 |
| TD | -3 – 48 °C | 6,875 |
| TDmax | -3 – 52 °C | 6,779 |
| TDmin | -5 – 48 °C | 6,917 |
| RH | 5 – 100 % | 6,517 |
| WS | 0 – 40 m/s | 19 |
| WD | 0 – 360° | 19 |
| **Total** | | 27,126 |

Phase 2 – Cross-Validation Rules: Six physically impossible combination rules were applied (e.g., TD > TDmax, TDmin > TDmax, nighttime TD > 40 °C). An additional 251 rows were flagged.
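Both phases can be sketched as follows. The bounds come from the table above; `quality_control` and the two synthetic rows are illustrative, not the project's actual code, and only one of the six cross-validation rules is shown:

```python
import numpy as np
import pandas as pd

# Physical bounds from the table above; anything outside is flagged as NaN.
BOUNDS = {
    "Rain": (0, 100), "TD": (-3, 48), "TDmax": (-3, 52),
    "TDmin": (-5, 48), "RH": (5, 100), "WS": (0, 40), "WD": (0, 360),
}

def quality_control(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Phase 1: out-of-bounds readings -> NaN
    for col, (lo, hi) in BOUNDS.items():
        out.loc[~out[col].between(lo, hi), col] = np.nan
    # Phase 2 (one rule of six): TDmin > TDmax is physically impossible
    bad = out["TDmin"] > out["TDmax"]
    out.loc[bad, ["TDmin", "TDmax"]] = np.nan
    return out

# Tiny synthetic example: RH of 120 % and TDmin > TDmax are both flagged.
raw = pd.DataFrame({
    "Rain": [0.0, 0.2], "TD": [20.0, 25.0], "TDmax": [30.0, 24.0],
    "TDmin": [15.0, 26.0], "RH": [120.0, 80.0], "WS": [3.0, 5.0],
    "WD": [180.0, 90.0],
})
clean = quality_control(raw)
```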

"figure number 1"

2.3 Missing Data Imputation

A three-step strategy was designed to preserve all rainy observations (critical for rare-event modeling):

  1. Time-bounded fill (gap limit = 6 consecutive NaNs = 1 hour): Forward-fill then backward-fill within this limit. Recovered 31 of 307 rainy rows with NaN.

  2. Drop remaining dry NaN rows (920 rows): Only non-rainy rows were removed.

  3. Climatological median fill: For remaining rainy rows with NaN (276 rows), values were imputed using per-month climatological medians computed from clean data.
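The time-bounded fill and the climatological median fill can be sketched as below. `bounded_fill` and `climatological_fill` are illustrative names; the gap limit of 6 steps corresponds to the 1-hour cap, and the toy example uses a shorter limit so the median fill has something left to do:

```python
import pandas as pd

def bounded_fill(df: pd.DataFrame, limit: int = 6) -> pd.DataFrame:
    """Forward- then backward-fill, at most `limit` consecutive steps."""
    return df.ffill(limit=limit).bfill(limit=limit)

def climatological_fill(df: pd.DataFrame, col: str) -> pd.Series:
    """Fill remaining NaNs with the per-month median of the column."""
    monthly_median = df[col].groupby(df.index.month).transform("median")
    return df[col].fillna(monthly_median)

# Toy series: a 3-step gap, filled with limit=1 so one NaN survives
idx = pd.date_range("2024-01-01", periods=5, freq="10min")
df = pd.DataFrame({"RH": [50.0, None, None, None, 60.0]}, index=idx)
filled = climatological_fill(bounded_fill(df, limit=1), "RH")
```

The middle NaN survives the bounded fill and is then replaced by the January median (55.0).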

2.4 Feature Engineering

Starting from 7 base variables (with WD decomposed into WD_sin and WD_cos), 91 features were constructed:

| Category | Count | Description | Example |
|---|---|---|---|
| Current readings | 5 | Values at time t | RH, TD, WS, WD_sin, WD_cos |
| Daily context | 2 | Daily aggregates | TDmax, TDmin |
| Rain lags | 6 | Rain at t-k (k = 1, 3, 6, 12, 18, 36 steps) | Rain_lag_1 (10 min ago) |
| Meteorological lags | 30 | 5 variables × 6 lag steps | RH_lag_6 (1 hour ago) |
| Rain rolling stats | 12 | 3 windows × 4 aggregations | Rain_roll_max_6 (max rain in last hour) |
| Meteorological rolling stats | 36 | 3 variables × 3 windows × 4 aggregations | RH_roll_mean_18 (avg humidity in last 3h) |
| **Total** | 91 | | |
Lag steps: 1, 3, 6, 12, 18, 36 timesteps (10 min, 30 min, 1h, 2h, 3h, 6h)   

Rolling windows: 6, 18, 36 timesteps (1h, 3h, 6h)

Rolling aggregations: mean, std, max, min

Leakage prevention: All rain-derived features are computed on Rain.shift(1) to ensure the current timestep’s rainfall (which contributes to the target) is never included as a feature.
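The rain-feature construction, including the `shift(1)` guard, can be sketched with an illustrative helper (6 lags + 3 windows × 4 aggregations = 18 rain-derived columns):

```python
import pandas as pd

LAGS = (1, 3, 6, 12, 18, 36)   # 10 min, 30 min, 1 h, 2 h, 3 h, 6 h
WINDOWS = (6, 18, 36)          # 1 h, 3 h, 6 h
AGGS = ("mean", "std", "max", "min")

def rain_features(rain: pd.Series) -> pd.DataFrame:
    feats = {}
    # Lags: shift(k) with k >= 1 never includes the current timestep.
    for k in LAGS:
        feats[f"Rain_lag_{k}"] = rain.shift(k)
    # Rolling stats on shift(1), so every window ends at t-1, never at t.
    hist = rain.shift(1)
    for w in WINDOWS:
        roll = hist.rolling(w, min_periods=w)
        for agg in AGGS:
            feats[f"Rain_roll_{agg}_{w}"] = getattr(roll, agg)()
    return pd.DataFrame(feats)

X = rain_features(pd.Series(range(50), dtype=float))
```

On this monotone toy series, the 1-hour rolling max at row 10 is 9.0 — the value at t-1, confirming the window never touches the current reading.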

"figure number 2"

"figure number 3"

3. Temporal Train / Validation / Test Split

3.1 Split Design and Statistics

A strictly chronological split with purge gaps was applied to prevent data leakage:

```mermaid
flowchart LR

A["Train<br/>2005-09-01 → 2020-12-31<br/>76.7%"]

G1["6h purge"]

B["Validation<br/>2021-01-01 → 2023-06-30<br/>12.5%"]

G2["6h purge"]

C["Test<br/>2023-07-01 → 2025-08-30<br/>10.9%"]

A --> G1 --> B --> G2 --> C

classDef train fill:#E3F2FD,stroke:#1E88E5,stroke-width:2px,color:#000000,font-weight:bold;
classDef val fill:#FFF8E1,stroke:#F9A825,stroke-width:2px,color:#000000,font-weight:bold;
classDef test fill:#E8F5E9,stroke:#43A047,stroke-width:2px,color:#000000,font-weight:bold;
classDef gap fill:#FAFAFA,stroke:#9E9E9E,stroke-dasharray:5 5,color:#000000,font-weight:bold;

class A train;
class B val;
class C test;
class G1,G2 gap;

linkStyle default stroke-width:2px
```

The 6-hour purge gap matches the maximum lag of 36 steps (6 hours), ensuring no test row’s lag features can reach into training data.

| Split | Period | Rows | Positives | Positive Rate | Neg:Pos |
|---|---|---|---|---|---|
| Train | 2005-09 to 2020-12 | 802,355 | 2,441 | 0.304% | 1:328 |
| Validation | 2021-01 to 2023-06 | 130,335 | 536 | 0.411% | 1:243 |
| Test | 2023-07 to 2025-08 | 113,640 | 477 | 0.420% | 1:238 |
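A sketch of the purged chronological split, assuming a DatetimeIndex at 10-minute resolution (`temporal_split` is an illustrative helper; the toy frame below spans only the train/validation boundary):

```python
import pandas as pd

PURGE = pd.Timedelta(hours=6)  # matches the maximum lag of 36 steps

def temporal_split(df: pd.DataFrame):
    train = df.loc[:"2020-12-31"]
    # Discard the first 6 h of each later block so no lag or rolling
    # feature can reach back across the boundary.
    val = df.loc[pd.Timestamp("2021-01-01") + PURGE : "2023-06-30"]
    test = df.loc[pd.Timestamp("2023-07-01") + PURGE :]
    return train, val, test

# Tiny synthetic index spanning the train/validation boundary
idx = pd.date_range("2020-12-31 23:00", "2021-01-01 08:00", freq="10min")
frame = pd.DataFrame({"x": 1.0}, index=idx)
train, val, test = temporal_split(frame)
```

Training rows stop at 23:50 on 2020-12-31 and validation rows begin at 06:00 on 2021-01-01 — the six intervening hours are dropped.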

3.2 Feature Scaling

StandardScaler was applied to normalize features to zero mean and unit variance. The scaler was fitted on the training set only to prevent information leakage. Validation and test sets were transformed using the same scaler parameters.
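Sketch of train-only scaler fitting, assuming scikit-learn; the arrays are synthetic stand-ins for the real feature matrices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=10.0, scale=3.0, size=(1000, 4))  # stand-in features
X_val = rng.normal(loc=10.0, scale=3.0, size=(200, 4))

scaler = StandardScaler().fit(X_train)  # statistics from the training block only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)       # same parameters, no refit
```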


4. Logistic Regression Model

Logistic Regression is a linear classification algorithm that models the probability of a binary outcome as a function of input features. Rather than predicting the class directly, it estimates the probability that a given observation belongs to the positive class (extreme rainfall) by passing a linear combination of features through the sigmoid function. The model learns a weight (coefficient) for each feature, quantifying its contribution to the prediction. During training, these weights are optimized by minimizing a regularized cross-entropy loss function, which penalizes both incorrect predictions and overly large coefficients to prevent overfitting.

Model Formula:

\[P(y = 1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where} \quad z = \mathbf{w}^\top \mathbf{x} + b = \sum_{j=1}^{p} w_j x_j + b\]
| Symbol | Definition |
|---|---|
| $P(y = 1 \mid \mathbf{x})$ | Predicted probability that the observation $\mathbf{x}$ belongs to the positive class (extreme rainfall within the next hour) |
| $\sigma(z)$ | The sigmoid (logistic) function, which maps any real-valued $z$ to the range $(0, 1)$ |
| $z$ | The logit – the linear combination of all input features before the sigmoid transformation |
| $\mathbf{x} = [x_1, x_2, \ldots, x_p]$ | The feature vector of a single observation, containing all $p = 91$ engineered features |
| $\mathbf{w} = [w_1, w_2, \ldots, w_p]$ | The weight (coefficient) vector – one learned weight per feature, representing its contribution to the prediction |
| $b$ | The bias (intercept) term, representing the model's baseline log-odds when all features are zero (learned value: $b = -2.0022$) |
| $p$ | The number of input features ($p = 91$) |

Training Objective (L2-Regularized Cross-Entropy Loss):

\[\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^{N} c_{y_i} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] + \frac{1}{2C} \lVert\mathbf{w}\rVert_2^2\]

| Symbol | Definition |
|---|---|
| $N$ | Total number of training samples |
| $y_i$ | True label of sample $i$ (0 = normal, 1 = extreme) |
| $\hat{y}_i$ | Predicted probability $P(y_i = 1 \mid \mathbf{x}_i)$ |
| $c_{y_i}$ | Class weight for sample $i$'s true class (~0.50 for normal, ~164.3 for extreme), compensating for class imbalance |
| $C$ | Inverse regularization strength – smaller $C$ applies a stronger L2 penalty on the weights (selected value: $C = 0.001$) |
| $\lVert\mathbf{w}\rVert_2^2$ | Squared L2 norm of the weight vector, penalizing large coefficients to reduce overfitting |

Classification Rule: An observation is classified as extreme if $P(y = 1 \mid \mathbf{x}) \geq \tau$, where $\tau = 0.5$ is the default decision threshold.

4.1 Model Configuration

| Parameter | Value | Rationale |
|---|---|---|
| C (regularization) | 0.001 | Optimal value from the regularization curve analysis (Cell 23b); outperformed the default C = 1.0 on both val F1 (+6.4%) and val PR-AUC (+2.9%) |
| solver | lbfgs | Second-order optimizer suited to medium-sized datasets |
| class_weight | balanced | Automatically upweights the minority class (~164×) to address class imbalance |
| max_iter | 2000 | Generous convergence limit; the model converged in 92 iterations |
| random_state | 42 | Reproducibility |
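The configuration above maps directly onto scikit-learn. It is fitted here on a small synthetic, imbalanced problem, since the station data is not bundled with this post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    C=0.001,                  # value selected in the regularization sweep
    solver="lbfgs",
    class_weight="balanced",  # upweights the rare extreme class
    max_iter=2000,
    random_state=42,
)

# Synthetic imbalanced stand-in for the real feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 1.5).astype(int)  # rare positive class
model.fit(X, y)
proba = model.predict_proba(X)[:, 1]
```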

Class weighting formula:

\[w_{\text{class}} = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_{\text{samples in class}}}\]
  • Weight for class 0 (Normal): ~0.50

  • Weight for class 1 (Extreme): ~164.3
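These weights follow directly from the training-set counts in Section 3.1:

```python
# class_weight="balanced": w_class = n_samples / (n_classes * n_in_class)
n_samples, n_pos = 802_355, 2_441   # training rows and positives
n_neg = n_samples - n_pos

w_extreme = n_samples / (2 * n_pos)  # ~164.3
w_normal = n_samples / (2 * n_neg)   # ~0.50
```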

4.2 Regularization Parameter Selection

The regularization parameter C was selected by training models across C = {0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0} and evaluating on the validation set.

| C | Iterations | Val F1 | Val PR-AUC |
|---|---|---|---|
| 0.001 | 92 | 0.0848 | 0.1047 |
| 0.01 | 222 | 0.0838 | 0.1040 |
| 0.1 | 592 | 0.0826 | 0.1028 |
| 0.5 | 1,122 | 0.0806 | 0.1021 |
| 1.0 | 1,422 | 0.0797 | 0.1017 |
| 5.0 | 1,874 | 0.0793 | 0.1009 |
| 10.0 | 2,000* | 0.0783 | 0.1007 |
| 100.0 | 2,000* | 0.0778 | 0.1004 |

\* Did not converge within max_iter = 2000.

C = 0.001 was selected as it achieved the highest validation F1 and PR-AUC, while converging 15x faster than C = 1.0.
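The sweep itself is a plain loop. In this sketch, scikit-learn's `average_precision_score` stands in for PR-AUC (a common proxy), and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score

def sweep_C(X_tr, y_tr, X_val, y_val,
            grid=(0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 100.0)):
    results = {}
    for C in grid:
        m = LogisticRegression(C=C, solver="lbfgs", class_weight="balanced",
                               max_iter=2000, random_state=42).fit(X_tr, y_tr)
        p = m.predict_proba(X_val)[:, 1]
        results[C] = {"val_F1": f1_score(y_val, p >= 0.5),
                      "val_PR_AUC": average_precision_score(y_val, p)}
    return results

rng = np.random.default_rng(1)
X_tr, X_val = rng.normal(size=(300, 3)), rng.normal(size=(100, 3))
y_tr, y_val = (X_tr[:, 0] > 0.5).astype(int), (X_val[:, 0] > 0.5).astype(int)
results = sweep_C(X_tr, y_tr, X_val, y_val, grid=(0.001, 1.0))
```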

"figure number 4"

4.3 Validation Set Performance

| Metric | Value |
|---|---|
| Recall (Extreme class) | 0.8451 |
| Precision (Extreme class) | 0.0447 |
| F1 (Extreme class) | 0.0848 |
| ROC-AUC | 0.9548 |
| PR-AUC | 0.1047 (baseline = 0.0041) |
| Training time | 10.1 seconds |

Confusion Matrix (Validation Set):

| | Predicted: Normal | Predicted: Extreme |
|---|---|---|
| Actual: Normal | 120,108 (TN) | 9,691 (FP) |
| Actual: Extreme | 83 (FN) | 453 (TP) |

Interpretation: The model correctly identified 453 out of 536 extreme events (84.5% recall), at the cost of 9,691 false alarms. The PR-AUC of 0.1047 represents a 25.5x improvement over a random baseline (0.0041).
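These headline numbers can be re-derived from the confusion matrix alone:

```python
# Validation confusion matrix entries (TN listed for completeness)
TP, FP, FN, TN = 453, 9_691, 83, 120_108

recall = TP / (TP + FN)               # 453 / 536
precision = TP / (TP + FP)            # 453 / 10,144
f1 = 2 * TP / (2 * TP + FP + FN)      # harmonic mean of the two
```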

"figure number 5"

4.4 Test Set Performance

| Metric | Value |
|---|---|
| Recall (Extreme class) | 0.7358 |
| Precision (Extreme class) | 0.0504 |
| F1 (Extreme class) | 0.0943 |
| ROC-AUC | 0.9216 |
| PR-AUC | 0.0949 (baseline = 0.0042) |

"figure number 6"

Confusion Matrix (Test Set):

| | Predicted: Normal | Predicted: Extreme |
|---|---|---|
| Actual: Normal | 106,545 (TN) | 6,618 (FP) |
| Actual: Extreme | 126 (FN) | 351 (TP) |

Interpretation: On completely unseen data (2023–2025), the model correctly identified 351 out of 477 extreme events (73.6% recall). The PR-AUC of 0.0949 represents a 22.6x improvement over a random baseline.

4.5 Generalization: Validation vs. Test Comparison

| Metric | Validation | Test | Difference |
|---|---|---|---|
| F1 | 0.0848 | 0.0943 | +0.0094 |
| Precision | 0.0447 | 0.0504 | +0.0057 |
| Recall | 0.8451 | 0.7358 | -0.1093 |
| ROC-AUC | 0.9548 | 0.9216 | -0.0332 |
| PR-AUC | 0.1047 | 0.0949 | -0.0098 |

The model shows a 10.9% drop in recall on the test set compared to validation. ROC-AUC and PR-AUC remain strong, indicating good overall separation ability. The recall drop suggests that the 2023–2025 test period may contain extreme events with slightly different atmospheric signatures than the training period (2005–2020).

4.6 Threshold Sensitivity Analysis

The default classification threshold of 0.5 was analyzed against alternative thresholds to understand the precision-recall tradeoff:

Validation Set:

| Threshold | Recall | Precision | F1 | TP | FP | FN | Missed % |
|---|---|---|---|---|---|---|---|
| 0.05 | 1.000 | 0.0062 | 0.0123 | 536 | 85,872 | 0 | 0.0% |
| 0.10 | 0.998 | 0.0082 | 0.0163 | 535 | 64,477 | 1 | 0.2% |
| 0.15 | 0.993 | 0.0101 | 0.0200 | 532 | 52,119 | 4 | 0.7% |
| 0.20 | 0.989 | 0.0123 | 0.0242 | 530 | 42,727 | 6 | 1.1% |
| 0.30 | 0.951 | 0.0186 | 0.0365 | 510 | 26,863 | 26 | 4.9% |
| 0.40 | 0.892 | 0.0311 | 0.0601 | 478 | 14,887 | 58 | 10.8% |
| 0.50 | 0.845 | 0.0447 | 0.0848 | 453 | 9,691 | 83 | 15.5% |

Test Set:

| Threshold | Recall | Precision | F1 | TP | FP | FN | Missed % |
|---|---|---|---|---|---|---|---|
| 0.05 | 0.979 | 0.0065 | 0.0129 | 467 | 71,619 | 10 | 2.1% |
| 0.10 | 0.956 | 0.0085 | 0.0168 | 456 | 53,440 | 21 | 4.4% |
| 0.15 | 0.935 | 0.0104 | 0.0206 | 446 | 42,383 | 31 | 6.5% |
| 0.20 | 0.920 | 0.0128 | 0.0252 | 439 | 33,885 | 38 | 8.0% |
| 0.30 | 0.878 | 0.0208 | 0.0407 | 419 | 19,717 | 58 | 12.2% |
| 0.40 | 0.797 | 0.0373 | 0.0712 | 380 | 9,818 | 97 | 20.3% |
| 0.50 | 0.736 | 0.0504 | 0.0943 | 351 | 6,618 | 126 | 26.4% |

The analysis reveals that lowering the threshold substantially increases recall at the cost of more false alarms, while precision remains low at all thresholds. The default threshold of 0.5 was retained for this baseline model; optimal threshold selection is deferred to the final deployed model.
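The sweep behind both tables reduces to a small loop over candidate thresholds (illustrative helper, exercised here on a toy label/probability pair):

```python
import numpy as np

def threshold_table(y_true, proba,
                    thresholds=(0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50)):
    rows = []
    for tau in thresholds:
        pred = proba >= tau
        tp = int(np.sum(pred & (y_true == 1)))
        fp = int(np.sum(pred & (y_true == 0)))
        fn = int(np.sum(~pred & (y_true == 1)))
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        rows.append((tau, recall, precision, tp, fp, fn))
    return rows

rows = threshold_table(np.array([1, 0, 0, 1]),
                       np.array([0.6, 0.4, 0.1, 0.2]))
```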

"figure number 7"

4.7 Feature Coefficients – What the Model Learned

The top 20 features by absolute coefficient magnitude reveal the atmospheric signals the model relies on:

| Rank | Feature | Coefficient | Interpretation |
|---|---|---|---|
| 1 | RH_roll_max_6 | +0.628 | Peak humidity in last 1h: higher = more likely extreme |
| 2 | WS_roll_max_36 | +0.619 | Peak wind speed in last 6h: higher = more likely extreme |
| 3 | RH_roll_max_18 | +0.530 | Peak humidity in last 3h: higher = more likely extreme |
| 4 | RH_roll_std_6 | +0.498 | Humidity fluctuation in last 1h: more volatile = more likely |
| 5 | RH_lag_6 | +0.478 | Humidity 1 hour ago: higher = more likely extreme |
| 6 | TD_roll_max_36 | -0.469 | Peak temperature in last 6h: higher = less likely (hot = dry) |
| 7 | WS_roll_max_18 | +0.466 | Peak wind speed in last 3h: higher = more likely |
| 8 | RH_roll_mean_6 | +0.446 | Average humidity in last 1h |
| 9 | RH_roll_max_36 | +0.443 | Peak humidity in last 6h |
| 10 | WD_sin_lag_1 | +0.437 | Wind direction component 10 min ago |
| 11 | RH_roll_mean_18 | +0.436 | Average humidity in last 3h |
| 12 | TD_roll_max_18 | -0.430 | Peak temperature in last 3h: higher = less likely |
| 13 | RH | +0.427 | Current humidity: higher = more likely |
| 14 | RH_lag_12 | +0.405 | Humidity 2 hours ago |
| 15 | WD_cos_lag_18 | -0.386 | Wind direction component 3 hours ago |
| 16 | RH_roll_min_6 | +0.385 | Minimum humidity in last 1h |
| 17 | TD_roll_max_6 | -0.374 | Peak temperature in last 1h: higher = less likely |
| 18 | RH_lag_3 | +0.370 | Humidity 30 minutes ago |
| 19 | TD_roll_std_36 | -0.347 | Temperature variability in last 6h |
| 20 | TDmax | -0.347 | Daily maximum temperature: higher = less likely |

Model intercept (bias): -2.0022 (the model starts with a strong prior toward “no extreme event”, consistent with the 0.33% base rate).
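Extracting such a ranking from a fitted model is a short helper over `model.coef_` (sketch with toy numbers; in the real pipeline `feature_names` would be the 91 engineered column names):

```python
import numpy as np
import pandas as pd

def top_coefficients(coef, feature_names, k=20):
    """Return the k weights with the largest absolute magnitude."""
    s = pd.Series(coef, index=feature_names)
    return s.reindex(s.abs().sort_values(ascending=False).index).head(k)

# Toy stand-in for model.coef_[0] and the engineered feature names
top = top_coefficients(np.array([0.427, -0.469, 0.628]),
                       ["RH", "TD_roll_max_36", "RH_roll_max_6"], k=2)
```

Note that ranking by absolute value keeps large negative weights (like the temperature features above) in the list.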

"figure number 8"

Physical interpretation: The model has learned a coherent meteorological story:

  • Humidity dominance: 11 of the top 20 features are humidity-related. Rising, high, and volatile humidity signals moisture-laden air masses.

  • Wind signals: Strong winds in recent hours indicate incoming storm systems.

  • Temperature as negative signal: High temperatures push predictions away from extreme rain, consistent with hot, dry conditions being incompatible with convective precipitation.

  • Temporal pattern: Features spanning 10 minutes to 6 hours capture the evolution of pre-storm atmospheric conditions.


5. Discussion

5.1 Why We Observed This Performance on the Test Set

The Logistic Regression model achieved strong discriminative ability on the validation set (ROC-AUC = 0.9548, Recall = 84.5%) but exhibited a notable decline on the test set (ROC-AUC = 0.9216, Recall = 73.6%). Several factors explain this performance gap.

Temporal distribution shift. The training data spans 2005–2020 and the test set covers 2023–2025. Extreme rainfall events in the test period may exhibit different atmospheric signatures than those in the training period. Climate variability, shifts in storm patterns, or changes in the frequency of convective versus frontal precipitation systems could all contribute to this distributional mismatch. The validation period (2021–2023) sits temporally closer to the training data and likely shares more similar weather patterns, explaining its stronger performance.

Linearity constraint. Logistic Regression models a linear decision boundary in feature space. In practice, extreme rainfall is likely governed by non-linear interactions between atmospheric variables – for instance, the combination of high humidity AND strong winds AND dropping temperature may be far more predictive than any of these signals individually. A linear model cannot capture such multiplicative interactions; it can only learn additive contributions from each feature independently. This fundamental limitation places a ceiling on the achievable performance.

Extreme class imbalance. With only 0.33% of rows belonging to the positive class, the model faces an inherently difficult classification task. The precision remained below 5% at the default threshold across both validation and test sets, indicating that the model fires a large number of false alarms for every true detection. This is not unique to our model – it reflects the difficulty of distinguishing genuine pre-storm atmospheric signatures from conditions that appear similar but do not produce extreme rainfall. The threshold sensitivity analysis (Section 4.6) confirmed that precision remains low at all operating points, suggesting this is a structural challenge rather than a threshold selection issue.

What the model gets right. Despite these limitations, the PR-AUC of 0.0949 on the test set represents a 22.6x improvement over a random classifier (baseline = 0.0042). The ROC-AUC of 0.9216 confirms that the model achieves strong overall separation between the two classes. The model successfully learns that rising humidity, strong winds, and low temperatures precede extreme events – a physically coherent signal that generalizes, even if imperfectly, to unseen data.

5.2 What This Project Successfully Demonstrated

A single ground station is sufficient for meaningful prediction. Without radar, satellite imagery, or numerical weather model output, a simple linear model using only local meteorological observations achieved 73.6% recall on completely unseen data. This demonstrates that ground-level atmospheric conditions carry substantial predictive information about imminent extreme rainfall.

Feature engineering transforms raw observations into predictive signals. Starting from 7 raw meteorological variables, the lag and rolling-window feature engineering pipeline produced 91 features that capture the temporal evolution of atmospheric conditions over the preceding 6 hours. This transformation was critical – the model’s top features (RH_roll_max_6, WS_roll_max_36) are engineered features, not raw measurements.

A leakage-free temporal evaluation pipeline. The strict chronological split with 6-hour purge gaps, training-only scaler fitting, and removal of the raw Rain column at $t=0$ ensures that all reported metrics reflect genuine out-of-sample predictive skill. No future information leaks into the model during training or evaluation.

Model interpretability reveals physically meaningful patterns. The learned coefficients tell a coherent meteorological story: humidity saturation, strong winds, and low temperatures precede extreme rainfall, while high temperatures and stable conditions suppress it. This interpretability provides confidence that the model has learned genuine atmospheric physics rather than statistical artifacts.

A strong baseline for future work. The Logistic Regression model establishes a performance floor that more complex models (Random Forest, gradient-boosted trees, neural networks) must exceed to justify their added complexity. Any model that cannot outperform this linear baseline on the same temporal split does not capture meaningful non-linear structure in the data.


6. Conclusion

This study investigated whether a machine learning model can predict extreme rainfall events (≥ 2.3 mm/10min) within the next hour using only ground-level meteorological observations from a single station. A Logistic Regression classifier was trained on 15 years of 10-minute resolution data (2005–2020) and evaluated on a held-out test period (2023–2025) with strict temporal separation.

The model achieved a test set recall of 73.6%, correctly identifying 351 out of 477 extreme events, with an ROC-AUC of 0.9216 and a PR-AUC of 0.0949 (22.6x above random baseline). Feature coefficient analysis revealed that the model relies primarily on humidity dynamics (11 of the top 20 features), wind intensity, and temperature as a negative indicator – signals consistent with established meteorological understanding of convective precipitation.

Limitations. The model’s precision remains low (5.0%), generating approximately 19 false alarms for every correctly detected event. This is a consequence of the extreme class imbalance (0.33% positive rate) combined with the linear model’s inability to capture non-linear feature interactions. Additionally, the 10.9% recall drop from validation to test suggests sensitivity to temporal distribution shift.

Future directions. Three avenues may improve upon this baseline: (1) non-linear models such as Random Forest or XGBoost, which can capture feature interactions (e.g., high humidity AND strong winds simultaneously); (2) threshold optimization tailored to the operational cost structure of false alarms versus missed events; and (3) incorporation of additional data sources such as radar reflectivity or neighboring station observations to provide spatial context that a single station cannot offer.

This post is licensed under CC BY 4.0 by the author.