Detection Methods
Temporal Detection Methods
Temporal methods analyze a single station's behavior over time to detect anomalies.
ARIMA (AutoRegressive Integrated Moving Average)
Best For: Complex trends, seasonal patterns, weather forecasting
How It Works:
ARIMA models the time series as a combination of:
- AR (AutoRegressive): Current value depends on past values
- I (Integrated): Differencing to make the series stationary
- MA (Moving Average): Current value depends on past forecast errors
Parameters: (p, d, q) where:
- p: Number of lag observations
- d: Degree of differencing
- q: Size of the moving average window
Implementation:
```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Fit model on historical data
model = ARIMA(historical_data, order=(1, 1, 1))
fitted = model.fit()

# Forecast the next value and its 95% confidence interval
forecast = fitted.get_forecast(steps=1)
lower, upper = np.asarray(forecast.conf_int(alpha=0.05))[0]

# Flag the reading if the actual value falls outside the interval
is_anomalous = actual < lower or actual > upper
```
Advantages:
- Captures complex patterns
- Provides confidence intervals
- Adapts to trends
Disadvantages:
- Computationally expensive
- Requires sufficient historical data (> 30 points)
- May fail on very noisy data
Recommended Settings:
- Window size: 6 hours (36 data points at 10-min intervals)
- Order: (1, 1, 1) for most weather variables
- Confidence level: 95%
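As a rough sketch of these settings in use, the check can be wrapped into a single function over a rolling 36-point window. The function name `arima_anomaly` and the synthetic example data are illustrative, not part of any existing API:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_anomaly(history, actual, order=(1, 1, 1), alpha=0.05):
    """Flag `actual` if it falls outside the (1 - alpha) forecast
    interval. `history` holds the most recent readings, e.g. 36
    points for a 6-hour window at 10-minute intervals."""
    if len(history) < 30:           # too little data to fit reliably
        return False
    fitted = ARIMA(history, order=order).fit()
    lower, upper = np.asarray(fitted.get_forecast(steps=1).conf_int(alpha=alpha))[0]
    return not (lower <= actual <= upper)

# A 5-degree jump after a nearly flat series should be flagged
history = 20 + 0.1 * np.random.randn(36)
print(arima_anomaly(history, actual=25.0))   # True
```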
3-Sigma (Z-Score)
Best For: Quick outlier detection, normally distributed data
How It Works:
Calculates how many standard deviations the current value is from the mean:
\[ Z = \frac{X - \mu}{\sigma} \]
Where:
- \(X\) = Current value
- \(\mu\) = Mean of historical data
- \(\sigma\) = Standard deviation
Implementation:
```python
import numpy as np

mean = np.mean(historical_data)
std = np.std(historical_data)

# Guard against a flat series (zero standard deviation)
z_score = (actual_value - mean) / std if std > 0 else 0.0

is_anomalous = abs(z_score) > 3
```
Advantages:
- Extremely fast
- Simple to understand
- Works well for stable data
Disadvantages:
- Assumes normal distribution
- Sensitive to outliers in historical data
- Doesn't capture trends
Recommended Settings:
- Threshold: 3 (captures 99.7% of normal data)
- Window size: 6 hours minimum
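A minimal sketch of the same check as a reusable function, with a guard for a flat series; the helper name and sample readings are illustrative:

```python
import numpy as np

def zscore_anomaly(history, value, threshold=3.0):
    """True if `value` lies more than `threshold` standard
    deviations from the mean of `history`."""
    mean, std = np.mean(history), np.std(history)
    if std == 0:                    # flat series: no basis for a z-score
        return False
    return abs(value - mean) / std > threshold

history = [21.0, 21.2, 20.9, 21.1, 21.0, 20.8]   # stable temperatures
print(zscore_anomaly(history, 27.5))             # True: far beyond 3 sigma
```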
MAD (Median Absolute Deviation)
Best For: Robust detection, data with outliers, stable baselines
How It Works:
Uses median instead of mean for robustness:
\[ \text{MAD} = \text{median}(|X_i - \text{median}(X)|) \]
\[ \text{Score} = \frac{|X - \text{median}(X)|}{\text{MAD}} \]
Implementation:
```python
import numpy as np

median = np.median(historical_data)
mad = np.median(np.abs(np.asarray(historical_data) - median))

# Guard against a flat series (MAD of zero would divide by zero)
score = abs(actual_value - median) / mad if mad > 0 else 0.0

is_anomalous = score > 3.5
```
Advantages:
- Robust to outliers
- No assumption of normal distribution
- Stable over time
Disadvantages:
- Can be too sensitive to changes
- Doesn't handle trends well
- Breaks down when data is flat (MAD of zero leaves the score undefined)
Recommended Settings:
- Threshold: 3.5
- Use when baseline is stable (e.g., barometric pressure)
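Sketched as a function and applied to a stable-baseline example like barometric pressure; the helper name and values are invented for illustration:

```python
import numpy as np

def mad_anomaly(history, value, threshold=3.5):
    """MAD-based outlier test with a guard for flat data."""
    median = np.median(history)
    mad = np.median(np.abs(np.asarray(history) - median))
    if mad == 0:                     # flat baseline: cannot score
        return False
    return abs(value - median) / mad > threshold

pressure = [1013.2, 1013.1, 1013.3, 1013.2, 1013.4, 1013.2]  # hPa
print(mad_anomaly(pressure, 1002.0))   # True: abrupt pressure drop
```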
IQR (Interquartile Range)
Best For: Exploratory analysis, boxplot-style outlier detection
How It Works:
Uses quartiles to define normal range:
\[ \text{IQR} = Q_3 - Q_1 \]
Outliers are values outside:
\[ [Q_1 - 1.5 \times \text{IQR},\; Q_3 + 1.5 \times \text{IQR}] \]
Implementation:
```python
import numpy as np

q1 = np.percentile(historical_data, 25)
q3 = np.percentile(historical_data, 75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_anomalous = actual_value < lower_bound or actual_value > upper_bound
```
Advantages:
- Intuitive (boxplot logic)
- Robust to outliers
- No distribution assumptions
Disadvantages:
- Static threshold
- Doesn't capture trends
- Less sensitive than other methods
Recommended Settings:
- Multiplier: 1.5 (standard) or 3.0 (conservative)
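A small sketch showing both multiplier choices; the function name and wind readings are illustrative:

```python
import numpy as np

def iqr_anomaly(history, value, multiplier=1.5):
    """Boxplot-style fence test; multiplier=3.0 gives the
    conservative variant."""
    q1, q3 = np.percentile(history, [25, 75])
    iqr = q3 - q1
    return value < q1 - multiplier * iqr or value > q3 + multiplier * iqr

wind = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]   # m/s
print(iqr_anomaly(wind, 7.0))                     # True with the 1.5 default
print(iqr_anomaly(wind, 7.0, multiplier=3.0))     # False: inside the wider fence
```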
Isolation Forest
Best For: Multidimensional patterns, complex anomalies
How It Works:
Machine learning algorithm that isolates anomalies by randomly partitioning data:
- Anomalies are easier to isolate (fewer splits needed)
- Normal points are harder to isolate (more splits needed)
Implementation:
```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Train on historical data (all variables)
features = ['temp_out', 'out_hum', 'wind_speed', 'bar', 'rain']
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(historical_data[features])

# Predict on the current reading (-1 = outlier, 1 = inlier)
current = pd.DataFrame([[temp, hum, wind, bar, rain]], columns=features)
is_anomalous = model.predict(current)[0] == -1
```
Advantages:
- Finds subtle patterns
- Handles multiple variables simultaneously
- No assumption of distribution
Disadvantages:
- Black box (hard to interpret)
- Requires all variables present
- Sensitive to contamination parameter
Recommended Settings:
- Contamination: 0.1 (expect 10% anomalies)
- Random state: 42 (for reproducibility)
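One mitigation for the black-box concern: `score_samples` (a standard scikit-learn method) returns a continuous anomaly score that can rank alerts rather than just flag them. A sketch on synthetic data; the DataFrame and readings are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

features = ['temp_out', 'out_hum', 'wind_speed', 'bar', 'rain']

# Synthetic stand-in for 500 historical readings
rng = np.random.default_rng(42)
history = pd.DataFrame(
    rng.normal([20, 60, 3, 1013, 0], [2, 5, 1, 2, 0.1], size=(500, 5)),
    columns=features)

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(history)

# An extreme reading: hot, dry, windy, low pressure
current = pd.DataFrame([[35.0, 10.0, 15.0, 990.0, 0.0]], columns=features)
print(model.predict(current)[0] == -1)   # True: flagged as an outlier
print(model.score_samples(current)[0])   # lower score = more anomalous
```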
STL (Seasonal-Trend Decomposition)
Best For: Data with strong seasonality (daily/weekly cycles)
How It Works:
Decomposes time series into:
- Trend: Long-term progression
- Seasonal: Repeating patterns
- Residual: Remaining variation
Anomalies are detected in the residual component.
Implementation:
```python
import numpy as np
from statsmodels.tsa.seasonal import STL

# Decompose the series: period=144 is one day at 10-minute intervals;
# seasonal=13 is the (odd-valued) seasonal smoother length
stl = STL(historical_data, period=144, seasonal=13)
result = stl.fit()

# Detect outliers in the residuals using 3-sigma
residuals = np.asarray(result.resid)
threshold = 3 * np.std(residuals)

is_anomalous = abs(residuals[-1]) > threshold
```
Advantages:
- Handles seasonality well
- Separates trend from noise
- Interpretable components
Disadvantages:
- Requires sufficient data (> 2 cycles)
- Computationally expensive
- Fixed seasonal period
Recommended Settings:
- Period: 144 (one day at 10-minute intervals); seasonal smoother: 13
- Use for temperature (daily cycle)
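A sketch with the two-cycle data requirement enforced up front; the function name and synthetic series are illustrative:

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

def stl_anomaly(history, period=144, seasonal=13, n_sigma=3.0):
    """STL residual test; refuses to run without two full cycles."""
    if len(history) < 2 * period:
        raise ValueError("STL needs at least 2 full seasonal cycles")
    resid = np.asarray(STL(history, period=period, seasonal=seasonal).fit().resid)
    return abs(resid[-1]) > n_sigma * np.std(resid)

# Three days of a synthetic daily temperature cycle with a final spike
t = np.arange(3 * 144)
temps = 15 + 5 * np.sin(2 * np.pi * t / 144) + 0.2 * np.random.randn(len(t))
temps[-1] += 6
print(stl_anomaly(temps))   # True
```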
LOF (Local Outlier Factor)
Best For: Density-based outlier detection, clustered data
How It Works:
Measures local deviation of density compared to neighbors:
- High LOF: Point is in a sparse region (outlier)
- Low LOF: Point is in a dense region (normal)
Implementation:
```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# LOF only scores the data it was fit on, so append the current
# value and refit on every check
series = np.append(historical_data, actual_value).reshape(-1, 1)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
prediction = lof.fit_predict(series)

is_anomalous = prediction[-1] == -1
```
Advantages:
- Finds local outliers
- Handles varying densities
- No global threshold needed
Disadvantages:
- Sensitive to k parameter
- Computationally expensive
- Requires retraining for each prediction
Recommended Settings:
- n_neighbors: 20
- contamination: 0.1
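If the per-reading retraining is too costly, scikit-learn also provides a novelty mode: `LocalOutlierFactor(novelty=True)` is fit once on history assumed to be clean, after which `predict` scores new points without refitting. A brief sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Fit once on history assumed to be anomaly-free
history = 20 + 0.5 * np.random.randn(200)
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(history.reshape(-1, 1))

# Score new readings without refitting (-1 = outlier, 1 = normal)
print(lof.predict([[20.3]])[0])   # 1
print(lof.predict([[28.0]])[0])   # -1
```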
Spatial Verification Methods
Pearson Correlation (Default)
Purpose: Measure trend consistency between suspect and neighbors
Formula:
\[ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \]
Where:
- \(X\) = Suspect station time series
- \(Y\) = Neighbor station time series
Implementation:
```python
from scipy.stats import pearsonr

# Get time series for the same window
suspect_series = get_series(suspect_station, window_hours=6)
neighbor_series = get_series(neighbor_station, window_hours=6)

# Compute correlation
correlation, p_value = pearsonr(suspect_series, neighbor_series)

# Classify based on correlation strength
if correlation > 0.6:
    classification = "weather_event"
elif correlation < 0.3:
    classification = "device_failure"
else:
    classification = "suspected"
```
Thresholds:
| Correlation | Interpretation | Action |
|---|---|---|
| > 0.6 | Strong positive correlation | Weather event (ignore) |
| 0.3 to 0.6 | Weak correlation | Suspected (manual review) |
| < 0.3 | No correlation | Device failure (alert) |
Advantages:
- Measures trend, not absolute values
- Robust to different base temperatures
- Interpretable
Disadvantages:
- Requires sufficient data points (> 10)
- Can be affected by missing data
- Assumes linear relationship
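To cope with the caveats above (missing data, short windows), the check might be wrapped as below. `verify_with_neighbor` and its `min_points` parameter are our invention; returning `None` signals the caller to use the distance-based fallback described next:

```python
import numpy as np
from scipy.stats import pearsonr

def verify_with_neighbor(suspect_series, neighbor_series, min_points=10):
    """Classify by trend agreement with one neighbor, or return
    None when too few overlapping points remain."""
    s = np.asarray(suspect_series, dtype=float)
    n = np.asarray(neighbor_series, dtype=float)
    mask = ~(np.isnan(s) | np.isnan(n))   # drop timestamps missing on either side
    if mask.sum() <= min_points:
        return None
    r, _ = pearsonr(s[mask], n[mask])
    if r > 0.6:
        return "weather_event"
    if r < 0.3:
        return "device_failure"
    return "suspected"
```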
Distance-Based (Fallback)
Purpose: Static comparison when correlation fails
How It Works:
Compares current values directly using MAD:
\[ \text{deviation} = \frac{|X_{\text{suspect}} - \text{median}(X_{\text{neighbors}})|}{\text{MAD}(X_{\text{neighbors}})} \]
Implementation:
```python
import numpy as np

# Get current values from all neighbors
neighbor_values = np.array([get_current_value(n) for n in neighbors])

median = np.median(neighbor_values)
mad = np.median(np.abs(neighbor_values - median))

# Guard against identical neighbor readings (MAD of zero)
deviation = abs(suspect_value - median) / mad if mad > 0 else 0.0

classification = "device_failure" if deviation > 3 else "weather_event"
```
Use Cases:
- Not enough historical data for correlation
- All neighbors have missing data
- Correlation computation fails
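Putting the two spatial checks together, one plausible routing is to try correlation first and drop to the distance test when it cannot be computed. A self-contained sketch; the function and its argument layout are hypothetical:

```python
import numpy as np
from scipy.stats import pearsonr

def classify_anomaly(suspect_series, neighbor_series,
                     suspect_value, neighbor_values):
    """Correlation when possible, MAD distance check otherwise."""
    try:
        if len(suspect_series) <= 10:
            raise ValueError("too few points for a stable correlation")
        r, _ = pearsonr(suspect_series, neighbor_series)
    except (ValueError, TypeError):
        # Fallback: static comparison against current neighbor values
        values = np.asarray(neighbor_values, dtype=float)
        median = np.median(values)
        mad = np.median(np.abs(values - median))
        if mad == 0:
            return "suspected"               # identical neighbors: no signal
        deviation = abs(suspect_value - median) / mad
        return "device_failure" if deviation > 3 else "weather_event"
    if r > 0.6:
        return "weather_event"
    return "device_failure" if r < 0.3 else "suspected"
```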
Method Comparison
Performance Comparison
| Method | CPU Time | Memory | Accuracy | False Positives |
|---|---|---|---|---|
| ARIMA | High | Medium | ★★★★★ | Low |
| 3-Sigma | Low | Low | ★★★☆☆ | Medium |
| MAD | Low | Low | ★★★★☆ | High |
| IQR | Low | Low | ★★★☆☆ | Medium |
| Isolation Forest | Medium | Medium | ★★★★☆ | Low |
| STL | High | Medium | ★★★★☆ | Medium |
| LOF | High | High | ★★★☆☆ | Medium |
Recommended Use Cases
Temperature
- Primary: ARIMA (captures daily cycles)
- Backup: STL (if strong seasonality)
- Quick Check: 3-Sigma

Humidity
- Primary: ARIMA
- Backup: MAD (robust to spikes)
- Quick Check: IQR

Wind Speed
- Primary: MAD (very volatile)
- Backup: ARIMA
- Quick Check: IQR

Barometric Pressure
- Primary: 3-Sigma (stable baseline)
- Backup: ARIMA
- Quick Check: MAD

Rain
- Primary: IQR (sparse data, many zeros)
- Backup: Isolation Forest
- Quick Check: Simple threshold (> 100 mm/h)
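One way to encode this mapping in configuration; the dictionary below is a hypothetical sketch reusing the variable names from the Isolation Forest example:

```python
# Hypothetical detector configuration per variable
DETECTOR_CONFIG = {
    "temp_out":   {"primary": "arima",  "backup": "stl",              "quick": "3sigma"},
    "out_hum":    {"primary": "arima",  "backup": "mad",              "quick": "iqr"},
    "wind_speed": {"primary": "mad",    "backup": "arima",            "quick": "iqr"},
    "bar":        {"primary": "3sigma", "backup": "arima",            "quick": "mad"},
    "rain":       {"primary": "iqr",    "backup": "isolation_forest", "quick": "threshold_100mmh"},
}
```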
Tuning Guidelines
- Start with ARIMA: Best overall performance
- If too slow: Switch to 3-Sigma or MAD
- If too many false positives: Raise the threshold (less sensitive) or switch to IQR
- If missing real anomalies: Increase window size or use Isolation Forest
- Always enable spatial verification: Reduces false positives by 80%
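To make the last guideline concrete, here is a minimal, self-contained sketch of the overall flow: a temporal check (3-sigma, chosen here only for brevity) whose alerts are gated by spatial correlation with neighbors. All data and names are synthetic:

```python
import numpy as np
from scipy.stats import pearsonr

def check_reading(suspect_history, suspect_value, neighbor_histories):
    """Temporal detection first; spatial verification second."""
    mean, std = np.mean(suspect_history), np.std(suspect_history)
    if std == 0 or abs(suspect_value - mean) / std <= 3:
        return "normal"                      # no temporal anomaly
    # Temporal anomaly: does any neighbor show the same trend?
    best = max(pearsonr(suspect_history, nh)[0] for nh in neighbor_histories)
    if best > 0.6:
        return "weather_event"               # neighbors moved the same way
    return "device_failure" if best < 0.3 else "suspected"

# Toy example: a spike the (counter-trending) neighbors do not share
t = np.linspace(0, 1, 36)
suspect = 20 + t + 0.1 * np.random.randn(36)
neighbors = [20.5 - t + 0.1 * np.random.randn(36) for _ in range(3)]
print(check_reading(suspect, 30.0, neighbors))   # device_failure
```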