Forecasting & Decisions

Status: 🔴 Planned (Phase 5–6)

Two separate problems

This chapter covers two distinct machine learning problems:

Problem	Type	Model	Input	Output
Should I water the plants?	Binary classification	Random Forest	Local sensor data (current + history)	Yes / No + confidence
Short-term weather forecast	Time-series regression	LSTM	Historical measurements + external data	T, H, P, rain probability at t+1h to t+6h

These are independent models, trained and deployed separately.

Important

Both models require data before they can be trained. Do not start ML work until you have at least one month, and preferably three, of local sensor data (Phase 3 output). The forecasting model additionally benefits from external historical data collected from nearby weather stations.

Problem 1 — Plant watering decision

Why Random Forest?

Random Forest is the right tool here for several reasons:

Interpretable: feature importance tells you exactly which variables drive the decision — useful for debugging and learning
Works with small datasets: you don’t need thousands of samples; 200–500 labeled observations are enough to get started
No feature scaling required: handles temperature, rain, and percentage values naturally
Robust: handles missing values (with imputation) and doesn’t overfit easily
Fast inference: predictions take milliseconds on the Odroid C4

Deep learning would be overkill and harder to interpret for this binary decision.

Feature engineering

The watering model receives a feature vector computed from recent history, not raw instantaneous readings.

Feature	Computation	Rationale
`rain_24h`	Sum of rain over last 24h (mm)	Did it just rain?
`rain_48h`	Sum of rain over last 48h (mm)	Soil drainage lag
`rain_72h`	Sum of rain over last 72h (mm)	Deeper saturation
`soil_moisture_now`	Latest soil moisture reading (%)	Direct ground truth
`soil_moisture_24h_avg`	24h rolling average (%)	Trend vs instantaneous
`temp_max_24h`	Max temperature last 24h (°C)	Evapotranspiration proxy
`temp_avg_24h`	Mean temperature last 24h (°C)	Baseline heat load
`humidity_avg_24h`	Mean relative humidity (%)	Evaporation rate
`light_avg_24h`	Mean lux last 24h	Solar evaporation driver
`wind_avg_24h`	Mean wind speed m/s (Phase 3+)	Evapotranspiration factor
`day_of_year`	1–366	Seasonal baseline correction
`days_since_last_watering`	Counter	Avoids over/under-watering cycles
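
The table above could be turned into code roughly as follows. This is a minimal sketch, assuming hourly readings in a pandas DataFrame with a DatetimeIndex; the column names (`rain_mm`, `soil_pct`, `temp_c`, `humidity_pct`, `lux`) and the `watering_features` helper are hypothetical, and `wind_avg_24h` is left out because wind data only arrives in Phase 3+.

```python
import numpy as np
import pandas as pd


def watering_features(hourly: pd.DataFrame, last_watering: pd.Timestamp) -> dict:
    """Build one feature vector from hourly readings ending at the latest row.

    Column names (rain_mm, soil_pct, temp_c, humidity_pct, lux) are
    hypothetical; adapt them to your own schema.
    """
    now = hourly.index[-1]

    def last(hours: int) -> pd.DataFrame:
        # Rows in the half-open interval (now - hours, now]
        return hourly.loc[hourly.index > now - pd.Timedelta(hours=hours)]

    day = last(24)
    return {
        "rain_24h": day["rain_mm"].sum(),
        "rain_48h": last(48)["rain_mm"].sum(),
        "rain_72h": last(72)["rain_mm"].sum(),
        "soil_moisture_now": hourly["soil_pct"].iloc[-1],
        "soil_moisture_24h_avg": day["soil_pct"].mean(),
        "temp_max_24h": day["temp_c"].max(),
        "temp_avg_24h": day["temp_c"].mean(),
        "humidity_avg_24h": day["humidity_pct"].mean(),
        "light_avg_24h": day["lux"].mean(),
        "day_of_year": now.dayofyear,
        "days_since_last_watering": (now - last_watering).days,
    }


# Synthetic example: 72 hourly rows with 1 mm of rain per hour
idx = pd.date_range("2025-06-01 00:00", periods=72, freq="h")
demo = pd.DataFrame(
    {"rain_mm": 1.0, "soil_pct": 40.0, "temp_c": np.arange(72.0),
     "humidity_pct": 60.0, "lux": 500.0},
    index=idx,
)
feats = watering_features(demo, last_watering=pd.Timestamp("2025-06-01 00:00"))
```

Computing every feature relative to the latest timestamp keeps the training and inference paths identical: the same function can be called on historical slices to build the dataset and on live data at prediction time.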

Labeling strategy

Phase 5a — manual labels: You water the plants and record it as a “should have watered” event. Over a few weeks, you build a labeled dataset.

Phase 5b — semi-automated: Once a soil moisture baseline is established, use threshold rules to auto-generate labels:

soil_pct < 35% AND rain_48h < 5 mm → label as water
soil_pct > 60% OR rain_24h > 10 mm → label as don’t water
Otherwise: ambiguous, skip for training
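
The threshold rules can be written as a small labeling function. This is a sketch only: `auto_label` is a hypothetical helper name, and the thresholds are the ones quoted in the rules, which you would tune to your own soil and plants.

```python
from typing import Optional


def auto_label(soil_pct: float, rain_24h: float, rain_48h: float) -> Optional[int]:
    """Return 1 (water), 0 (don't water), or None (ambiguous, skip)."""
    # Dry soil and little recent rain: the plants need water
    if soil_pct < 35 and rain_48h < 5:
        return 1
    # Wet soil or heavy rain in the last day: no watering needed
    if soil_pct > 60 or rain_24h > 10:
        return 0
    # Everything in between is left unlabeled
    return None
```

The two rules cannot both fire, because more than 10 mm of rain in 24 hours implies more than 5 mm in 48 hours. Rows that return `None` are dropped before training rather than guessed.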

This hybrid approach lets the Random Forest learn subtle seasonal and contextual patterns that the threshold rules miss.

Implementation

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import joblib
import pandas as pd

# Load labeled dataset
df = pd.read_csv("labeled_watering.csv")
features = ["rain_24h", "rain_48h", "soil_moisture_now", "temp_max_24h",
            "humidity_avg_24h", "light_avg_24h", "day_of_year",
            "days_since_last_watering"]
X = df[features]
y = df["should_water"]  # 0 or 1

# Train
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(f"F1 cross-val: {scores.mean():.2f} ± {scores.std():.2f}")

clf.fit(X, y)
joblib.dump(clf, "watering_model.pkl")

# Feature importance (key for learning)
import matplotlib.pyplot as plt
pd.Series(clf.feature_importances_, index=features).sort_values().plot.barh()
plt.title("Watering model — feature importance")
plt.tight_layout()
plt.savefig("feature_importance.png")

The trained model is loaded by the FastAPI server and called on demand (GET /api/predict/watering).
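
The "Yes / No + confidence" output can be derived from `predict_proba`. The sketch below assumes a hypothetical `predict_watering` helper that an endpoint handler might call; it trains a tiny model on synthetic data with a shortened feature list purely so the example is self-contained. In the real server the model would come from `joblib.load("watering_model.pkl")` instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["rain_24h", "soil_moisture_now"]  # shortened list for the demo


def predict_watering(clf, feature_row: dict) -> dict:
    """Return the decision and the probability of the predicted class."""
    X = pd.DataFrame([feature_row], columns=FEATURES)
    proba = clf.predict_proba(X)[0]          # one probability per class
    best = int(np.argmax(proba))
    return {
        "should_water": bool(clf.classes_[best] == 1),
        "confidence": float(proba[best]),
    }


# Tiny synthetic training set: dry soil and no rain means "water"
train = pd.DataFrame({
    "rain_24h":          [0, 0, 1, 12, 15, 20, 0, 14],
    "soil_moisture_now": [20, 25, 30, 70, 65, 80, 22, 75],
})
labels = [1, 1, 1, 0, 0, 0, 1, 0]
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(train, labels)

result = predict_watering(clf, {"rain_24h": 0, "soil_moisture_now": 21})
```

For a Random Forest the reported confidence is the averaged class probability across trees, so it is a rough indicator rather than a calibrated probability.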

Problem 2 — Short-term weather forecasting

Scope and horizon

Target: predict temperature, humidity, pressure, and rain probability at t+1h, t+3h, and t+6h using recent local measurements.

This is a realistic horizon for local sensor data. Beyond 6–12 hours, local data alone adds little over public forecasts, so use OpenMeteo or a similar service for longer horizons.

Why not start from scratch?

Training a time-series forecasting model requires months of data with seasonal coverage. You don’t have that in Phase 1 or 2. The solution: use data from nearby public weather stations to train an initial model, then fine-tune it on your local data using transfer learning.

Phase A — external data collection

Sources for historical weather data near your location:

Source	Coverage	Format	Notes
OpenMeteo API	Global, free	JSON	No API key, hourly historical, excellent
Météo-France open data	France	CSV/NetCDF	High quality, synoptic network
Copernicus ERA5	Global, reanalysis	NetCDF	0.25° resolution, any date

Recommended starting point: OpenMeteo (free, no registration, easy Python client).

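import openmeteo_requests
import pandas as pd

om = openmeteo_requests.Client()
hourly_vars = ["temperature_2m", "relative_humidity_2m",
               "pressure_msl", "precipitation",
               "wind_speed_10m", "shortwave_radiation"]
params = {
    "latitude": 48.85,   # Your location
    "longitude": 2.35,
    "start_date": "2023-01-01",
    "end_date": "2025-12-31",
    "hourly": hourly_vars,
}
responses = om.weather_api("https://archive-api.open-meteo.com/v1/archive", params=params)
hourly = responses[0].Hourly()

# Variables are returned in the same order as requested in hourly_vars
df = pd.DataFrame({name: hourly.Variables(i).ValuesAsNumpy()
                   for i, name in enumerate(hourly_vars)})
# Hourly timestamps (UTC) rebuilt from the response's start, end and step
df.index = pd.date_range(
    start=pd.to_datetime(hourly.Time(), unit="s", utc=True),
    end=pd.to_datetime(hourly.TimeEnd(), unit="s", utc=True),
    freq=pd.Timedelta(seconds=hourly.Interval()),
    inclusive="left",
)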

Collect 2–3 years of hourly data. This gives ~17 000–26 000 samples — enough to train a solid LSTM baseline.

Phase B — model architecture (LSTM)

A Long Short-Term Memory (LSTM) network is well-suited for multivariate time-series forecasting with this data volume.

Input: sliding window of the last 24 hours of observations (24 timesteps × N features)

Output: values at t+1h, t+3h, t+6h
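
A sketch of how such windows might be cut from a continuous hourly array with NumPy. The `make_windows` helper and the choice of the first four columns as targets are assumptions for illustration, and the data is assumed to have no gaps in the hourly sequence.

```python
import numpy as np


def make_windows(series: np.ndarray, target_cols, window: int = 24,
                 horizons=(1, 3, 6)):
    """Slice a (time, features) array into LSTM inputs and multi-horizon targets.

    X[i] holds rows i .. i+window-1; y[i] holds the target columns at
    1, 3 and 6 steps after the last row of that window.
    """
    max_h = max(horizons)
    n = len(series) - window - max_h + 1   # number of complete samples
    X = np.stack([series[i:i + window] for i in range(n)])
    y = np.stack([
        np.stack([series[i + window - 1 + h, target_cols] for h in horizons])
        for i in range(n)
    ])
    return X, y


# Synthetic check: 100 hourly rows, 6 features, first 4 columns are targets
data = np.arange(100 * 6, dtype=np.float32).reshape(100, 6)
X, y = make_windows(data, target_cols=[0, 1, 2, 3])
```

Here `X` has shape (samples, 24, features) and `y` has shape (samples, 3, 4), which matches the input and output described above.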

flowchart LR
    A["24h history\n(24 × N features)"] --> B["LSTM\n2 layers, 64 units"]
    B --> C["Dense output\n(3 horizons × 4 targets)"]
    C --> D["T, H, P, Rain_prob\nat t+1h, t+3h, t+6h"]
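
The diagram could map onto a PyTorch module along these lines. This is a sketch under assumptions: the class name `WeatherLSTM` and the attribute names `lstm1`, `lstm2`, and `head` are illustrative, and the two LSTM layers are kept as separate modules only so that one of them can later be frozen on its own. The demo uses 6 input features; set `n_features` to whatever your final feature set contains.

```python
import torch
import torch.nn as nn


class WeatherLSTM(nn.Module):
    """Two stacked LSTM layers followed by a dense multi-horizon head."""

    def __init__(self, n_features: int, hidden: int = 64,
                 n_horizons: int = 3, n_targets: int = 4):
        super().__init__()
        # Separate modules so that one layer can be frozen independently
        self.lstm1 = nn.LSTM(n_features, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_horizons * n_targets)
        self.n_horizons, self.n_targets = n_horizons, n_targets

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm1(x)            # (batch, timesteps, hidden)
        out, _ = self.lstm2(out)          # (batch, timesteps, hidden)
        last = out[:, -1, :]              # output at the final timestep
        return self.head(last).view(-1, self.n_horizons, self.n_targets)


model = WeatherLSTM(n_features=6)
pred = model(torch.zeros(8, 24, 6))       # batch of 8 windows
n_params = sum(p.numel() for p in model.parameters())
```

The head here emits raw values for all four targets. The rain-probability output would additionally need a sigmoid and a suitable loss, which this sketch does not cover.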

Feature set for forecasting:

Feature	Horizon relevance
Temperature (°C)	Direct
Relative humidity (%)	Direct
Pressure (hPa)	Direct; basis for the trend features
Pressure trend (Δ over 3h, 6h)	Derived — very informative
Precipitation (mm/h)	Direct
Wind speed (m/s)	Evaporation, storm approach
Solar radiation / lux	Diurnal cycle
Hour of day (sin/cos encoded)	Captures the daily cycle without a jump at midnight
Day of year (sin/cos encoded)	Seasonal baseline

Pressure trend (the change in pressure over the last few hours) is usually the most informative single predictor for 1–6h weather changes. Falling pressure often precedes rain.
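
The derived features in the table (pressure trend and sin/cos time encodings) might be computed like this. The sketch assumes an hourly DataFrame with a DatetimeIndex and a hypothetical `pressure_hpa` column; the output column names are illustrative as well.

```python
import numpy as np
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add pressure trends and cyclic time encodings to an hourly DataFrame.

    Assumes a DatetimeIndex at 1-hour spacing and a hypothetical
    `pressure_hpa` column.
    """
    out = df.copy()
    # Pressure change over the previous 3 and 6 hourly steps (hPa)
    out["dp_3h"] = out["pressure_hpa"].diff(3)
    out["dp_6h"] = out["pressure_hpa"].diff(6)
    # Map hour and day of year onto a circle so 23:00 sits next to 00:00
    hour_angle = 2 * np.pi * out.index.hour.to_numpy() / 24
    doy_angle = 2 * np.pi * out.index.dayofyear.to_numpy() / 365.25
    out["hour_sin"] = np.sin(hour_angle)
    out["hour_cos"] = np.cos(hour_angle)
    out["doy_sin"] = np.sin(doy_angle)
    out["doy_cos"] = np.cos(doy_angle)
    return out


# Synthetic example: pressure falling by 1 hPa per hour
idx = pd.date_range("2025-01-01 00:00", periods=12, freq="h")
demo = pd.DataFrame({"pressure_hpa": 1020.0 - np.arange(12)}, index=idx)
feats = add_derived_features(demo)
```

The first rows of `dp_3h` and `dp_6h` are NaN because there is no earlier reading to subtract, so they should be dropped before building training windows.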

Phase C — transfer learning on local data

Once you have 3–6 months of local station data:

Take the pre-trained LSTM (weights from external data)
Freeze the first LSTM layer (learned general weather dynamics)
Fine-tune the second LSTM layer + output head on your local data
Evaluate local vs external predictions — quantify the improvement

This approach works because general atmospheric dynamics (pressure changes, temperature cycles) are largely shared between nearby stations and your site. What differs is mainly the local bias.

import torch
import torch.nn as nn

# `model` is the LSTM pre-trained on external data. This assumes its two
# LSTM layers are separate modules named `lstm1` and `lstm2`.

# Freeze first LSTM layer
for param in model.lstm1.parameters():
    param.requires_grad = False

# Fine-tune on local data (smaller learning rate)
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-4  # smaller LR for fine-tuning
)

Deployment on Odroid C4

Task	Feasibility
LSTM inference (small model <100k params)	✅ <100 ms per prediction
LSTM training from scratch	🟡 Hours — prefer laptop
Transfer learning fine-tuning	✅ Minutes on Odroid C4 with CPU
TFLite / ONNX export + inference	✅ Recommended for production

Workflow: train on laptop → export to ONNX or TFLite → deploy to Odroid C4 for serving.

Baseline comparison

Before trusting the LSTM, always compare it against simple baselines:

Baseline	Description
Persistence	“Value at t+h = value now”; surprisingly hard to beat for T at t+1h
Pressure trend rule	P falling > 3 hPa/3h → rain likely (classical barometry)
Linear regression on Δ	Linear model on last 6h gradient
OpenMeteo public API	Best available for your location

The LSTM should outperform these baselines on 3h+ horizons to be worth deploying.
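
The first two baselines are simple enough to sketch in a few lines of NumPy. The helper names below are hypothetical, the data is synthetic, and the 3 hPa over 3 hours threshold is the one from the table.

```python
import numpy as np


def persistence_forecast(values: np.ndarray, horizon: int) -> np.ndarray:
    """Predict value[t + horizon] as value[t]; one forecast per valid t."""
    return values[:-horizon]


def mae(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean absolute error between forecasts and observations."""
    return float(np.mean(np.abs(pred - truth)))


def pressure_rule(pressure: np.ndarray, drop_hpa: float = 3.0,
                  steps: int = 3) -> np.ndarray:
    """Flag 'rain likely' where pressure fell by more than drop_hpa over `steps` hours."""
    change = pressure[steps:] - pressure[:-steps]
    return change < -drop_hpa


# Synthetic hourly temperature rising 0.5 °C per hour
temp = 10.0 + 0.5 * np.arange(48)
mae_1h = mae(persistence_forecast(temp, 1), temp[1:])
mae_6h = mae(persistence_forecast(temp, 6), temp[6:])

# Synthetic hourly pressure readings (hPa)
pressure = np.array([1015.0, 1014.0, 1012.5, 1011.0, 1010.5, 1010.0])
rain_likely = pressure_rule(pressure)
```

Scoring the LSTM with the same `mae` function on the same held-out period gives a like-for-like comparison per horizon.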

What NOT to do

Temptation	Why to avoid
Start ML before collecting data	No data = no model. Collect first.
Use deep learning for watering decision	Random Forest is better: interpretable, needs less data
Train the forecast model only on local data	Too little data early on. Use external data first.
Train on Odroid C4 from scratch	Slow. Train on laptop, deploy to Odroid.
Skip baselines	You won’t know if your model is actually useful.