Detection Methods
Temporal Detection Methods
Temporal methods analyze a single station's behavior over time to detect anomalies.
ARIMA (AutoRegressive Integrated Moving Average)
Best For: Complex trends, seasonal patterns, weather forecasting
How It Works:
ARIMA models the time series as a combination of:
- AR (AutoRegressive): Current value depends on past values
- I (Integrated): Differencing to make the series stationary
- MA (Moving Average): Current value depends on past forecast errors
Parameters: (p, d, q) where:
- p: Number of lag observations
- d: Degree of differencing
- q: Size of the moving average window
Implementation:
```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Fit model on historical data
model = ARIMA(historical_data, order=(1, 1, 1))
fitted = model.fit()

# Forecast the next value and its 95% confidence interval
forecast = fitted.get_forecast(steps=1)
lower, upper = np.asarray(forecast.conf_int(alpha=0.05))[0]

# Flag the reading if the actual value falls outside the interval
is_anomalous = actual < lower or actual > upper
```
Advantages:
- Captures complex patterns
- Provides confidence intervals
- Adapts to trends
Disadvantages:
- Computationally expensive
- Requires sufficient historical data (> 30 points)
- May fail on very noisy data
Recommended Settings:
- Window size: 6 hours (36 data points at 10-min intervals)
- Order: (1, 1, 1) for most weather variables
- Confidence level: 95%
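As a rough sketch of these settings in use, the check can be wrapped into a single function over a rolling 36-point window. The function name `arima_anomaly` and the synthetic example data are illustrative, not part of any existing API:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_anomaly(history, actual, order=(1, 1, 1), alpha=0.05):
    """Flag `actual` if it falls outside the (1 - alpha) forecast
    interval. `history` holds the most recent readings, e.g. 36
    points for a 6-hour window at 10-minute intervals."""
    if len(history) < 30:           # too little data to fit reliably
        return False
    fitted = ARIMA(history, order=order).fit()
    lower, upper = np.asarray(fitted.get_forecast(steps=1).conf_int(alpha=alpha))[0]
    return not (lower <= actual <= upper)

# A 5-degree jump after a nearly flat series should be flagged
history = 20 + 0.1 * np.random.randn(36)
print(arima_anomaly(history, actual=25.0))   # True
```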
3-Sigma (Z-Score)
Best For: Quick outlier detection, normally distributed data
How It Works:
Calculates how many standard deviations the current value is from the mean:
\[ Z = \frac{X - \mu}{\sigma} \]
Where:
- \(X\) = Current value
- \(\mu\) = Mean of historical data
- \(\sigma\) = Standard deviation
Implementation:
```python
import numpy as np

mean = np.mean(historical_data)
std = np.std(historical_data)

# Guard against a flat series (zero standard deviation)
z_score = (actual_value - mean) / std if std > 0 else 0.0

is_anomalous = abs(z_score) > 3
```
Advantages:
- Extremely fast
- Simple to understand
- Works well for stable data
Disadvantages:
- Assumes normal distribution
- Sensitive to outliers in historical data
- Doesn't capture trends
Recommended Settings:
- Threshold: 3 (captures 99.7% of normal data)
- Window size: 6 hours minimum
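A minimal sketch of the same check as a reusable function, with a guard for a flat series; the helper name and sample readings are illustrative:

```python
import numpy as np

def zscore_anomaly(history, value, threshold=3.0):
    """True if `value` lies more than `threshold` standard
    deviations from the mean of `history`."""
    mean, std = np.mean(history), np.std(history)
    if std == 0:                    # flat series: no basis for a z-score
        return False
    return abs(value - mean) / std > threshold

history = [21.0, 21.2, 20.9, 21.1, 21.0, 20.8]   # stable temperatures
print(zscore_anomaly(history, 27.5))             # True: far beyond 3 sigma
```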
MAD (Median Absolute Deviation)
Best For: Robust detection, data with outliers, stable baselines
How It Works:
Uses median instead of mean for robustness:
\[ \text{MAD} = \text{median}(|X_i - \text{median}(X)|) \]
\[ \text{Score} = \frac{|X - \text{median}(X)|}{\text{MAD}} \]
Implementation:
```python
import numpy as np

median = np.median(historical_data)
mad = np.median(np.abs(np.asarray(historical_data) - median))

# Guard against a flat series (MAD of zero would divide by zero)
score = abs(actual_value - median) / mad if mad > 0 else 0.0

is_anomalous = score > 3.5
```
Advantages:
- Robust to outliers
- No assumption of normal distribution
- Stable over time
Disadvantages:
- Can be too sensitive to changes
- Doesn't handle trends well
- Breaks down when data is flat (MAD of zero leaves the score undefined)
Recommended Settings:
- Threshold: 3.5
- Use when baseline is stable (e.g., barometric pressure)
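Sketched as a function and applied to a stable-baseline example like barometric pressure; the helper name and values are invented for illustration:

```python
import numpy as np

def mad_anomaly(history, value, threshold=3.5):
    """MAD-based outlier test with a guard for flat data."""
    median = np.median(history)
    mad = np.median(np.abs(np.asarray(history) - median))
    if mad == 0:                     # flat baseline: cannot score
        return False
    return abs(value - median) / mad > threshold

pressure = [1013.2, 1013.1, 1013.3, 1013.2, 1013.4, 1013.2]  # hPa
print(mad_anomaly(pressure, 1002.0))   # True: abrupt pressure drop
```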
IQR (Interquartile Range)
Best For: Exploratory analysis, boxplot-style outlier detection
How It Works:
Uses quartiles to define normal range:
\[ \text{IQR} = Q_3 - Q_1 \]
Outliers are values outside:
\[ [Q_1 - 1.5 \times \text{IQR},\; Q_3 + 1.5 \times \text{IQR}] \]
Implementation:
```python
import numpy as np

q1 = np.percentile(historical_data, 25)
q3 = np.percentile(historical_data, 75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_anomalous = actual_value < lower_bound or actual_value > upper_bound
```
Advantages:
- Intuitive (boxplot logic)
- Robust to outliers
- No distribution assumptions
Disadvantages:
- Static threshold
- Doesn't capture trends
- Less sensitive than other methods
Recommended Settings:
- Multiplier: 1.5 (standard) or 3.0 (conservative)
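A small sketch showing both multiplier choices; the function name and wind readings are illustrative:

```python
import numpy as np

def iqr_anomaly(history, value, multiplier=1.5):
    """Boxplot-style fence test; multiplier=3.0 gives the
    conservative variant."""
    q1, q3 = np.percentile(history, [25, 75])
    iqr = q3 - q1
    return value < q1 - multiplier * iqr or value > q3 + multiplier * iqr

wind = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]   # m/s
print(iqr_anomaly(wind, 7.0))                     # True with the 1.5 default
print(iqr_anomaly(wind, 7.0, multiplier=3.0))     # False: inside the wider fence
```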
Isolation Forest
Best For: Multidimensional patterns, complex anomalies
How It Works:
Machine learning algorithm that isolates anomalies by randomly partitioning data:
- Anomalies are easier to isolate (fewer splits needed)
- Normal points are harder to isolate (more splits needed)
Implementation:
```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Train on historical data (all variables)
features = ['temp_out', 'out_hum', 'wind_speed', 'bar', 'rain']
model = IsolationForest(contamination=0.1, random_state=42)
model.fit(historical_data[features])

# Predict on the current reading (-1 = outlier, 1 = inlier)
current = pd.DataFrame([[temp, hum, wind, bar, rain]], columns=features)
is_anomalous = model.predict(current)[0] == -1
```
Advantages:
- Finds subtle patterns
- Handles multiple variables simultaneously
- No assumption of distribution
Disadvantages:
- Black box (hard to interpret)
- Requires all variables present
- Sensitive to contamination parameter
Recommended Settings:
- Contamination: 0.1 (expect 10% anomalies)
- Random state: 42 (for reproducibility)
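One mitigation for the black-box concern: `score_samples` (a standard scikit-learn method) returns a continuous anomaly score that can rank alerts rather than just flag them. A sketch on synthetic data; the DataFrame and readings are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

features = ['temp_out', 'out_hum', 'wind_speed', 'bar', 'rain']

# Synthetic stand-in for 500 historical readings
rng = np.random.default_rng(42)
history = pd.DataFrame(
    rng.normal([20, 60, 3, 1013, 0], [2, 5, 1, 2, 0.1], size=(500, 5)),
    columns=features)

model = IsolationForest(contamination=0.1, random_state=42)
model.fit(history)

# An extreme reading: hot, dry, windy, low pressure
current = pd.DataFrame([[35.0, 10.0, 15.0, 990.0, 0.0]], columns=features)
print(model.predict(current)[0] == -1)   # True: flagged as an outlier
print(model.score_samples(current)[0])   # lower score = more anomalous
```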
STL (Seasonal-Trend Decomposition)
Best For: Data with strong seasonality (daily/weekly cycles)
How It Works:
Decomposes time series into:
- Trend: Long-term progression
- Seasonal: Repeating patterns
- Residual: Remaining variation
Anomalies are detected in the residual component.
Implementation:
```python
import numpy as np
from statsmodels.tsa.seasonal import STL

# Decompose the series: period=144 is one day at 10-minute intervals;
# seasonal=13 is the (odd-valued) seasonal smoother length
stl = STL(historical_data, period=144, seasonal=13)
result = stl.fit()

# Detect outliers in the residuals using 3-sigma
residuals = np.asarray(result.resid)
threshold = 3 * np.std(residuals)

is_anomalous = abs(residuals[-1]) > threshold
```
Advantages:
- Handles seasonality well
- Separates trend from noise
- Interpretable components
Disadvantages:
- Requires sufficient data (> 2 cycles)
- Computationally expensive
- Fixed seasonal period
Recommended Settings:
- Period: 144 (one day at 10-minute intervals); seasonal smoother: 13
- Use for temperature (daily cycle)
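A sketch with the two-cycle data requirement enforced up front; the function name and synthetic series are illustrative:

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

def stl_anomaly(history, period=144, seasonal=13, n_sigma=3.0):
    """STL residual test; refuses to run without two full cycles."""
    if len(history) < 2 * period:
        raise ValueError("STL needs at least 2 full seasonal cycles")
    resid = np.asarray(STL(history, period=period, seasonal=seasonal).fit().resid)
    return abs(resid[-1]) > n_sigma * np.std(resid)

# Three days of a synthetic daily temperature cycle with a final spike
t = np.arange(3 * 144)
temps = 15 + 5 * np.sin(2 * np.pi * t / 144) + 0.2 * np.random.randn(len(t))
temps[-1] += 6
print(stl_anomaly(temps))   # True
```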
LOF (Local Outlier Factor)
Best For: Density-based outlier detection, clustered data
How It Works:
Measures local deviation of density compared to neighbors:
- High LOF: Point is in a sparse region (outlier)
- Low LOF: Point is in a dense region (normal)
Implementation:
```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# LOF only scores the data it was fit on, so append the current
# value and refit on every check
series = np.append(historical_data, actual_value).reshape(-1, 1)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
prediction = lof.fit_predict(series)

is_anomalous = prediction[-1] == -1
```
Advantages:
- Finds local outliers
- Handles varying densities
- No global threshold needed
Disadvantages:
- Sensitive to k parameter
- Computationally expensive
- Requires retraining for each prediction
Recommended Settings:
- n_neighbors: 20
- contamination: 0.1
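If the per-reading retraining is too costly, scikit-learn also provides a novelty mode: `LocalOutlierFactor(novelty=True)` is fit once on history assumed to be clean, after which `predict` scores new points without refitting. A brief sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Fit once on history assumed to be anomaly-free
history = 20 + 0.5 * np.random.randn(200)
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(history.reshape(-1, 1))

# Score new readings without refitting (-1 = outlier, 1 = normal)
print(lof.predict([[20.3]])[0])   # 1
print(lof.predict([[28.0]])[0])   # -1
```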
Spatial Verification Methods
Pearson Correlation (Default)
Purpose: Measure trend consistency between suspect and neighbors
Formula:
\[ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \]
Where:
- \(X\) = Suspect station time series
- \(Y\) = Neighbor station time series
Implementation:
```python
from scipy.stats import pearsonr

# Get time series for the same window
suspect_series = get_series(suspect_station, window_hours=6)
neighbor_series = get_series(neighbor_station, window_hours=6)

# Compute correlation
correlation, p_value = pearsonr(suspect_series, neighbor_series)

# Classify based on correlation strength
if correlation > 0.6:
    classification = "weather_event"
elif correlation < 0.3:
    classification = "device_failure"
else:
    classification = "suspected"
```
Thresholds:
| Correlation | Interpretation | Action |
|---|---|---|
| > 0.6 | Strong positive correlation | Weather event (ignore) |
| 0.3 to 0.6 | Weak correlation | Suspected (manual review) |
| < 0.3 | No correlation | Device failure (alert) |
Advantages:
- Measures trend, not absolute values
- Robust to different base temperatures
- Interpretable
Disadvantages:
- Requires sufficient data points (> 10)
- Can be affected by missing data
- Assumes linear relationship
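To cope with the caveats above (missing data, short windows), the check might be wrapped as below. `verify_with_neighbor` and its `min_points` parameter are our invention; returning `None` signals the caller to use the distance-based fallback described next:

```python
import numpy as np
from scipy.stats import pearsonr

def verify_with_neighbor(suspect_series, neighbor_series, min_points=10):
    """Classify by trend agreement with one neighbor, or return
    None when too few overlapping points remain."""
    s = np.asarray(suspect_series, dtype=float)
    n = np.asarray(neighbor_series, dtype=float)
    mask = ~(np.isnan(s) | np.isnan(n))   # drop timestamps missing on either side
    if mask.sum() <= min_points:
        return None
    r, _ = pearsonr(s[mask], n[mask])
    if r > 0.6:
        return "weather_event"
    if r < 0.3:
        return "device_failure"
    return "suspected"
```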
Distance-Based (Fallback)
Purpose: Static comparison when correlation fails
How It Works:
Compares current values directly using MAD:
\[ \text{deviation} = \frac{|X_{\text{suspect}} - \text{median}(X_{\text{neighbors}})|}{\text{MAD}(X_{\text{neighbors}})} \]
Implementation:
```python
import numpy as np

# Get current values from all neighbors
neighbor_values = np.array([get_current_value(n) for n in neighbors])

median = np.median(neighbor_values)
mad = np.median(np.abs(neighbor_values - median))

# Guard against identical neighbor readings (MAD of zero)
deviation = abs(suspect_value - median) / mad if mad > 0 else 0.0

classification = "device_failure" if deviation > 3 else "weather_event"
```
Use Cases:
- Not enough historical data for correlation
- All neighbors have missing data
- Correlation computation fails
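Putting the two spatial checks together, one plausible routing is to try correlation first and drop to the distance test when it cannot be computed. A self-contained sketch; the function and its argument layout are hypothetical:

```python
import numpy as np
from scipy.stats import pearsonr

def classify_anomaly(suspect_series, neighbor_series,
                     suspect_value, neighbor_values):
    """Correlation when possible, MAD distance check otherwise."""
    try:
        if len(suspect_series) <= 10:
            raise ValueError("too few points for a stable correlation")
        r, _ = pearsonr(suspect_series, neighbor_series)
    except (ValueError, TypeError):
        # Fallback: static comparison against current neighbor values
        values = np.asarray(neighbor_values, dtype=float)
        median = np.median(values)
        mad = np.median(np.abs(values - median))
        if mad == 0:
            return "suspected"               # identical neighbors: no signal
        deviation = abs(suspect_value - median) / mad
        return "device_failure" if deviation > 3 else "weather_event"
    if r > 0.6:
        return "weather_event"
    return "device_failure" if r < 0.3 else "suspected"
```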
Method Comparison
Performance Comparison
| Method | CPU Time | Memory | Accuracy | False Positives |
|---|---|---|---|---|
| ARIMA | High | Medium | ★★★★★ | Low |
| 3-Sigma | Low | Low | ★★★☆☆ | Medium |
| MAD | Low | Low | ★★★★☆ | High |
| IQR | Low | Low | ★★★☆☆ | Medium |
| Isolation Forest | Medium | Medium | ★★★★☆ | Low |
| STL | High | Medium | ★★★★☆ | Medium |
| LOF | High | High | ★★★☆☆ | Medium |
Recommended Use Cases
Temperature
- Primary: ARIMA (captures daily cycles)
- Backup: STL (if strong seasonality)
- Quick Check: 3-Sigma

Humidity
- Primary: ARIMA
- Backup: MAD (robust to spikes)
- Quick Check: IQR

Wind Speed
- Primary: MAD (very volatile)
- Backup: ARIMA
- Quick Check: IQR

Barometric Pressure
- Primary: 3-Sigma (stable baseline)
- Backup: ARIMA
- Quick Check: MAD

Rain
- Primary: IQR (sparse data, many zeros)
- Backup: Isolation Forest
- Quick Check: Simple threshold (> 100 mm/h)
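One way to encode this mapping in configuration; the dictionary below is a hypothetical sketch reusing the variable names from the Isolation Forest example:

```python
# Hypothetical detector configuration per variable
DETECTOR_CONFIG = {
    "temp_out":   {"primary": "arima",  "backup": "stl",              "quick": "3sigma"},
    "out_hum":    {"primary": "arima",  "backup": "mad",              "quick": "iqr"},
    "wind_speed": {"primary": "mad",    "backup": "arima",            "quick": "iqr"},
    "bar":        {"primary": "3sigma", "backup": "arima",            "quick": "mad"},
    "rain":       {"primary": "iqr",    "backup": "isolation_forest", "quick": "threshold_100mmh"},
}
```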
Tuning Guidelines
- Start with ARIMA: Best overall performance
- If too slow: Switch to 3-Sigma or MAD
- If too many false positives: Raise the threshold (less sensitive) or switch to IQR
- If missing real anomalies: Increase window size or use Isolation Forest
- Always enable spatial verification: Reduces false positives by 80%
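To make the last guideline concrete, here is a minimal, self-contained sketch of the overall flow: a temporal check (3-sigma, chosen here only for brevity) whose alerts are gated by spatial correlation with neighbors. All data and names are synthetic:

```python
import numpy as np
from scipy.stats import pearsonr

def check_reading(suspect_history, suspect_value, neighbor_histories):
    """Temporal detection first; spatial verification second."""
    mean, std = np.mean(suspect_history), np.std(suspect_history)
    if std == 0 or abs(suspect_value - mean) / std <= 3:
        return "normal"                      # no temporal anomaly
    # Temporal anomaly: does any neighbor show the same trend?
    best = max(pearsonr(suspect_history, nh)[0] for nh in neighbor_histories)
    if best > 0.6:
        return "weather_event"               # neighbors moved the same way
    return "device_failure" if best < 0.3 else "suspected"

# Toy example: a spike the (counter-trending) neighbors do not share
t = np.linspace(0, 1, 36)
suspect = 20 + t + 0.1 * np.random.randn(36)
neighbors = [20.5 - t + 0.1 * np.random.randn(36) for _ in range(3)]
print(check_reading(suspect, 30.0, neighbors))   # device_failure
```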