How Does AI Detect Anomalies?

I’m working on a small monitoring project and want to use AI to flag unusual behavior in data, like weird traffic spikes or suspicious user actions. I’ve read about anomaly detection, but most guides stay very high level and don’t explain what’s really going on under the hood. Can someone break down, in practical terms, how AI models detect anomalies, what data they need, and how to avoid lots of false positives?

Short version: anomaly detection = learn “normal”, then flag whatever does not fit.

For a small monitoring project, I’d keep it simple and go in layers.

  1. Start with dumb but solid baselines
    These catch most of the obviously weird traffic.

• Moving average + std dev
Track each metric per time window, for example requests per minute.
Keep a rolling mean μ and std dev σ over the window.
Flag if value > μ + kσ or value < μ − kσ (k around 3).
Pros: simple, explainable.
Cons: fails on trends and seasonality.

• Percentile thresholds
Learn low and high percentiles from history, say 1st and 99th.
If traffic goes above the 99th percentile, flag it.
Works well for response time, latency, payload size.
(Both rules are sketched right after this list.)
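
A minimal sketch of both rules with pandas, assuming per-minute counts; the synthetic series, window length, and k are placeholders for your real data:

import numpy as np
import pandas as pd

# Synthetic per-minute request counts; swap in your real series.
rng = np.random.default_rng(0)
series = pd.Series(rng.poisson(100, size=1440))  # one day of minutes

# Rolling mean + std dev rule: flag values more than k sigma from the mean.
k, window = 3, 60
mu = series.rolling(window).mean()
sigma = series.rolling(window).std()
z_flags = (series - mu).abs() > k * sigma

# Percentile rule: learn bounds from history, flag values outside them.
low, high = series.quantile(0.01), series.quantile(0.99)
pct_flags = (series < low) | (series > high)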

  2. Handle seasonality and trends
    If your traffic has patterns, you need models that know “Monday 10am vs Sunday 3am”.

Options (both sketched after this list):

• STL or Prophet style logic
Decompose time series into trend + seasonality + residual.
Anomaly is large residual.
Many libs wrap this:

  • Python: Kats, Prophet, statsmodels.
    You feed timestamps + values. It predicts expected value + interval.
    If actual is far outside interval, flag.

• Exponential smoothing / EWMA
Keep an exponentially weighted mean:
m_t = α x_t + (1 − α) m_{t−1}
Large deviation from m_t means anomaly.
Good for simple load metrics.
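
A minimal sketch of both ideas, assuming statsmodels is installed; the synthetic hourly series and period=24 are placeholders for your real data and season length:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic hourly series with a daily cycle; swap in your real data.
rng = np.random.default_rng(0)
t = np.arange(24 * 14)  # two weeks of hourly points
series = pd.Series(100 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 3, t.size))

# STL: decompose into trend + seasonality + residual; large residuals are anomalies.
res = STL(series, period=24).fit()
stl_flags = res.resid.abs() > 3 * res.resid.std()

# EWMA: m_t = alpha * x_t + (1 - alpha) * m_{t-1}; flag big deviations from m_t.
m = series.ewm(alpha=0.1).mean()
deviation = series - m
ewma_flags = deviation.abs() > 3 * deviation.std()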

  3. Unsupervised methods for “weird user behavior”
    Here you have feature vectors, not just counts.

Build per event or per session features like:
• Requests per minute.
• Distinct endpoints hit.
• Average request interval.
• Geo distance between IPs.
• User agent diversity.
• Failed login rate.

Then feed these to an unsupervised model.

Common ones (the last two are sketched after this list):

• Isolation Forest
Idea: isolate points by random splits.
Anomalies isolate in fewer splits.
Good for medium feature sets.
Example in scikit-learn (full sketch in section 8):

  • train IsolationForest on normal data.
  • score_samples on new data.
  • pick threshold from historical scores.

• One Class SVM
Learns a boundary around normal data.
Flags points outside.
Tends to be sensitive to hyperparams.
Not great for high dimensional or huge data.

• LOF (Local Outlier Factor)
Compares density of a point to its neighbors.
If density is much lower, mark outlier.
Works when “weird” means far from neighbors.
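
Minimal sketches of those two with scikit-learn; Isolation Forest gets its own code in section 8. The synthetic data, nu, and n_neighbors are placeholders:

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

# Mostly-normal feature vectors; swap in your real session features.
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(1000, 6))
X_new = rng.normal(0, 1, size=(100, 6))

# One Class SVM: learns a boundary around the training data.
ocsvm = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale").fit(X_train)
svm_flags = ocsvm.predict(X_new) == -1  # -1 means outside the boundary

# LOF with novelty=True so it can score unseen points.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
lof_flags = lof.predict(X_new) == -1  # -1 means locally sparse, i.e. outlier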

  4. Sequence based methods for actions over time
    If you care about things like “login, then delete 100 items, then change password”.

Simpler approach, before deep learning:

• Markov style patterns
Model probability of transitions between actions.
For each user, sequence of events like:
LOGIN → VIEW → VIEW → UPDATE → LOGOUT.
Learn typical transitions and counts.
If a transition or sequence has tiny historical probability, flag it.
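
A tiny sketch of the Markov idea, assuming events are plain strings; the example sessions and the probability floor are made up:

from collections import Counter, defaultdict

# Historical sessions: lists of action names; swap in your real logs.
sessions = [
    ["LOGIN", "VIEW", "VIEW", "UPDATE", "LOGOUT"],
    ["LOGIN", "VIEW", "LOGOUT"],
    ["LOGIN", "VIEW", "UPDATE", "LOGOUT"],
]

# Count transitions, then turn counts into probabilities.
counts = defaultdict(Counter)
for s in sessions:
    for a, b in zip(s, s[1:]):
        counts[a][b] += 1
probs = {a: {b: n / sum(c.values()) for b, n in c.items()} for a, c in counts.items()}

def session_prob(session, floor=1e-4):
    # Product of transition probabilities; unseen transitions get a tiny floor.
    p = 1.0
    for a, b in zip(session, session[1:]):
        p *= probs.get(a, {}).get(b, floor)
    return p

# Flag sequences whose historical probability is tiny.
suspicious = session_prob(["LOGIN", "DELETE", "DELETE", "DELETE"]) < 1e-6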

Deep learning style, if you want something more AI-ish:

• LSTM / GRU autoencoder on event sequences
Encode event sequence to a vector, decode back.
Train on normal sessions.
High reconstruction error means anomalous sequence.
Costs more time and effort, needs more data.

• Transformer based sequence models
Similar idea, more complex.
Overkill for a small side project in most cases.

  5. Simple autoencoder for tabular features
    Good starter “AI” anomaly model.

Pipeline:

• Gather normal data only, or at least mostly normal.
• Normalize features, for example StandardScaler.
• Train small autoencoder: input → bottleneck → output.
• Loss = MSE between input and output.
• After training, compute reconstruction error distribution on a validation set.
• Pick threshold like 99th percentile of error.
• At run time, compute error for each new sample.
If error above threshold, flag anomaly.

Works well when normal behavior lies on a lower dimensional manifold and anomalies deviate.
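
A minimal sketch using scikit-learn's MLPRegressor as a tiny autoencoder, so you don't need a deep learning framework; the layer sizes and 99th percentile threshold are just starting points:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Mostly-normal training data; swap in your real feature matrix.
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(2000, 10))
X_val = rng.normal(0, 1, size=(500, 10))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)

# Autoencoder: input -> bottleneck -> output, trained to reproduce its input.
ae = MLPRegressor(hidden_layer_sizes=(6, 3, 6), max_iter=2000, random_state=0)
ae.fit(X_train_s, X_train_s)

# Reconstruction error distribution on a validation set sets the threshold.
errors = np.mean((ae.predict(X_val_s) - X_val_s) ** 2, axis=1)
threshold = np.percentile(errors, 99)

def is_anomalous(x_new):
    x_s = scaler.transform(x_new)
    err = np.mean((ae.predict(x_s) - x_s) ** 2, axis=1)
    return err > threshold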

  6. Labeling and evaluation
    Even if you do unsupervised stuff, you need to test detection quality.

• Keep a small labeled set of incidents, like real spikes, real abuse sessions.
• Replay them through your detector.
• Measure recall: the fraction of incidents detected.
• Measure precision: the fraction of alerts that were real.
If you get spammy alerts, people will ignore them.
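
A minimal sketch, assuming you have per-window booleans for alerts and ground truth:

import numpy as np

# Per-window booleans: did the detector alert, and was it a real incident?
alerts = np.array([True, False, True, True, False])
incidents = np.array([True, False, False, True, False])

true_positives = np.sum(alerts & incidents)
recall = true_positives / max(np.sum(incidents), 1)   # incidents detected
precision = true_positives / max(np.sum(alerts), 1)   # alerts that were real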

  7. Practical tips for your project

• Start metrics-first
Begin with a few core metrics:

  • Requests per second per endpoint.
  • Error rate.
  • Auth failures per user or IP.
  • Latency percentiles.
    Apply moving average + std or percentile logic.
    Add seasonal method later if needed.

• Use features with real meaning
For suspicious users, build 5 to 15 features.
Example:

  • login_count_1h
  • failed_login_ratio_1h
  • distinct_ips_24h
  • distinct_countries_24h
  • write_ops_ratio
    Then feed to Isolation Forest or autoencoder.

• Feedback loop
Log all alerts and whether you consider them real or false.
After some weeks, use that to adjust thresholds or retrain.

• Alert hygiene
Combine rules and ML.
Example:

  • Absolute rules: more than 50 failed logins in 1 minute is always high priority.
  • ML: anomaly score for more subtle stuff.
    Often hybrid is more reliable than pure “AI”.

  8. Simple code sketch in Python for metrics

Example with scikit-learn Isolation Forest for per user features:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# X_train: rows are users or sessions, cols are features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = IsolationForest(contamination=0.01, n_estimators=200, random_state=0)
model.fit(X_train_scaled)

# For new data
X_new_scaled = scaler.transform(X_new)
scores = model.decision_function(X_new_scaled)

# Smaller score means more anomalous.
threshold = np.percentile(scores, 1)
alerts = scores < threshold
For metric time series, look at libraries like:
• statsmodels SARIMA or Holt Winters.
• Kats or Prophet if you want a higher level API.

If you share the data shape you have, like “per minute counts only” or “per user event logs”, you get more targeted suggestions.

@himmelsjager already covered the “core toolbox”, so I’ll avoid rehashing Isolation Forest, autoencoders, etc. Let me zoom in on how to actually wire this into a monitoring setup, and where I slightly disagree.


1. Think in alerts, not just “anomalies”

Most anomaly tutorials stop at “here’s an outlier score”. In monitoring, that’s useless unless you turn it into:

  • a stable alert
  • with a clear reason
  • with some hysteresis so it doesn’t flap on and off

Pattern that works well:

  1. Compute a continuous anomaly score per metric / user / session.
  2. Smooth the score (like a small rolling median).
  3. Alert only when:
    • score > threshold
    • for at least N consecutive windows
    • and min gap between alerts is M minutes

That alone kills a ton of noisy “blips”.
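
A minimal sketch of that pattern as a small stateful gate; the threshold, N, and M values are placeholders you'd tune:

import time

class AlertGate:
    """Alert only after N consecutive anomalous windows, with an M-minute gap between alerts."""

    def __init__(self, threshold, n_consecutive=3, min_gap_minutes=30):
        self.threshold = threshold
        self.n = n_consecutive
        self.gap = min_gap_minutes * 60
        self.streak = 0
        self.last_alert = 0.0

    def update(self, smoothed_score, now=None):
        now = time.time() if now is None else now
        # Count consecutive over-threshold windows; reset on any normal window.
        self.streak = self.streak + 1 if smoothed_score > self.threshold else 0
        if self.streak >= self.n and now - self.last_alert >= self.gap:
            self.last_alert = now
            return True  # fire an alert
        return False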


2. Use change detection as a first-class tool

This is where I slightly disagree with focusing only on “learn normal”. For traffic spikes, you often care about sudden change more than absolute weirdness.

Look up:

  • CUSUM (CUmulative SUM) change detection
  • BOCPD (Bayesian Online Change Point Detection)

Idea: instead of asking “is this value rare?”, you ask “did the distribution just shift?”

Concrete example:

  • Metric: requests_per_minute
  • Maintain rolling baseline mean μ, variance σ²
  • Track cumulative deviation from baseline
  • When cumulative deviation crosses a threshold, raise a “regime change” alert

This picks up slow but meaningful drifts that a plain σ-based rule might miss or normalize away.
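
A minimal one-sided CUSUM sketch for upward shifts; the slack k and decision threshold h (both in units of the baseline σ) are tuning knobs:

import numpy as np

def cusum_upward(values, mu, sigma, k=0.5, h=5.0):
    """Flag indices where the cumulative positive deviation crosses h * sigma."""
    s, flags = 0.0, []
    for i, x in enumerate(values):
        # Accumulate deviation beyond a slack of k * sigma; never go below 0.
        s = max(0.0, s + (x - mu - k * sigma))
        if s > h * sigma:
            flags.append(i)   # regime change alert
            s = 0.0           # reset after alerting
    return flags

rng = np.random.default_rng(0)
baseline = rng.normal(100, 5, 500)
shifted = rng.normal(110, 5, 100)          # a sustained upward shift
alerts = cusum_upward(np.concatenate([baseline, shifted]), mu=100, sigma=5)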


3. Multi-metric anomalies beat single-metric ones

In practice, “weird spike” is only scary when multiple things move together.

Example for web service:

  • requests_per_sec
  • error_rate
  • p95_latency
  • CPU / memory

Each of these can spike on its own fairly often. Jointly, the pattern

  • requests_per_sec ↑
  • error_rate ↑
  • p95_latency ↑

is a strong “something broke” signal.

Simple trick without fancy ML:

  1. Normalize each metric to z-scores:
    z = (x − μ) / σ
  2. Combine a few into a single score, e.g.
    score = sqrt(z_req² + z_err² + z_lat²)
  3. Alert on the combined score instead of each one separately.

This is cheap and often beats over-complicated unsupervised models in monitoring because it reduces alert fatigue.
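
A minimal sketch of the combined score; all numbers are placeholders:

import numpy as np

def zscore(x, mu, sigma):
    return (x - mu) / sigma

# Current values vs rolling baselines per metric (placeholder numbers).
z_req = zscore(450, mu=300, sigma=40)      # requests_per_sec
z_err = zscore(0.09, mu=0.02, sigma=0.01)  # error_rate
z_lat = zscore(900, mu=400, sigma=120)     # p95_latency

# One combined score to threshold, instead of three separate alerts.
score = np.sqrt(z_req**2 + z_err**2 + z_lat**2)
alert = score > 8.0  # threshold tuned on history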


4. For “suspicious user behavior”: keep it explainable

Unpopular opinion: for security-ish stuff, black-box anomaly models suck when you have to explain “why was this user blocked”.

Instead of jumping straight to autoencoders / deep stuff:

  1. Build a few interpretable risk features, similar to what @himmelsjager said but with an explicit “why” baked in, for example:
    • login_fail_ratio_10m
    • unique_countries_24h
    • burst_writes_5m (writes per minute / median writes per minute)
    • night_activity_ratio per user local time
  2. Start with:
    • a simple linear model
    • or even a weighted score:
      risk = 3 * login_fail_ratio + 2 * unique_countries_24h + 4 * burst_writes_5m

Then, optionally add a fancier anomaly model on top of those features to get a second layer of detection.

Reason: when someone asks “why did the system flag this session?”, you can show:

  • 92% of user activity normally occurs 8am–8pm
  • This session: 95% actions at 3–4am
  • Writes 20x higher than baseline

Way easier than “uh, the latent embedding was 2.3σ from the normal manifold”.


5. Use “weak labels” and feedback, not pure unsupervised

Another place I disagree slightly with the pure-unsupervised vibe: you almost always have at least weak labeling:

  • incidents you know were bad
  • deploys / releases
  • known attack waves
  • maintenance windows

Exploit that:

  1. Tag historical periods with rough labels:
    • normal
    • degraded
    • attack
    • maintenance
  2. Train your anomaly model on “normal” only, but:
    • use the labeled bad periods to set thresholds based on what you actually want to catch
    • maybe fit a simple supervised model that learns “which anomaly patterns correlate with real incidents”

This is how you avoid a classic failure: mathematically beautiful anomaly scores that never line up with real outages.


6. Don’t forget context: tags & dimensions

Almost everyone underestimates how much value you can get from just slicing metrics smartly.

For traffic spikes:

  • Instead of one metric “requests_per_sec”, track:
    • by endpoint
    • by status code
    • by user / tenant / region

Then two layers:

  1. Local anomaly:
    For each endpoint / user / region, run a super-simple baseline (moving average, z-score, whatever).
  2. Global prioritization:
    When an anomaly triggers, also look at:
    • how important that endpoint / user is
    • whether similar dimensions also look weird

Simple example: if only one low-traffic endpoint spikes, that might be “meh”. If you see a spike of 500s across multiple key endpoints at once, that gets a higher severity.

You can do that with some small aggregation logic, no heavyweight AI needed.


7. Implementation sketch that’s actually deployable

Pipeline that tends to work for a small project:

  1. Ingest layer

    • All events / metrics go into something like Kafka, Kinesis, or just a DB if tiny.
  2. Feature & baseline layer (per minute)

    • Aggregate per metric and per “entity” (user, IP, endpoint):
      • counts, error ratios, etc.
    • Maintain rolling baselines in memory or Redis:
      • mean, std, EWMA, last hour percentiles
  3. Scoring layer

    • For metrics: combined z-score / CUSUM / change-point per series.
    • For users: risk score based on interpretable features.
    • Optional: one unsupervised model on top of those features for extra “AI flavor”.
  4. Alerting & feedback

    • Store:
      • score time series
      • which alerts fired
      • manual labels: true_incident / false_alarm
    • Periodically:
      • recompute thresholds to keep false positives at an acceptable rate
      • retire useless features

This is boring, but it works and scales a lot better than dropping a complex model in a notebook and hoping.


8. What I’d literally do for your use case

Given “small monitoring project” + “weird traffic spikes” + “suspicious user actions”:

  • Traffic:

    • per-minute metrics per endpoint:
      • requests_per_min
      • error_rate
      • p95_latency
    • z-score + CUSUM + multi-metric combined score
    • very light seasonal correction (like per-hour-of-day averages) if you see clear patterns
  • Users:

    • 8–12 features per user per 5 or 15 minutes
    • start with a manual risk score & some hard rules
    • later, if you want, train an Isolation Forest or autoencoder on those features as a second layer

If you share roughly what data you already log (columns + frequency), people can suggest a very concrete feature set and scoring scheme.

You already got solid “how” from @chasseurdetoiles and @himmelsjager. I’d zoom out and talk about where AI actually helps in anomaly detection, and where plain stats are better.


1. AI is mostly about the score, not the rule

Under the hood, almost every anomaly system is:

  1. Map raw data to features.
  2. Compute an anomaly score from those features.
  3. Convert score to alert with thresholds and rules.

The “AI” part lives almost entirely in step 2. For a small monitoring project, you can keep 1 and 3 very manual and transparent, and just experiment with smarter scoring in the middle.

I slightly disagree with jumping to Isolation Forest or autoencoders first. For monitoring, I would:

  • start with very interpretable scores
  • only add heavier models when you have a clear failure mode of the simple ones

Example: for “weird user actions”, instead of “feed raw events into LSTM”, first define domain-aware scores like:

  • burstiness = current_requests_per_min / median_requests_per_min_last_24h
  • geo_jump_km = distance(last_ip, current_ip)
  • access_novelty = fraction_of_endpoints_never_seen_for_this_user

Then, if you want “AI flavor”, learn a small model that maps these 5–15 scores to a single risk score.
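
A sketch of those three scores, assuming you already resolve IPs to lat/lon; the haversine helper and function names are just for illustration:

import math
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

def burstiness(current_rpm, rpm_history_24h):
    # current_requests_per_min / median requests_per_min over the last 24h.
    return current_rpm / max(np.median(rpm_history_24h), 1e-9)

def access_novelty(session_endpoints, known_endpoints):
    # Fraction of endpoints in this session never seen before for this user.
    seen = set(session_endpoints)
    new = [e for e in seen if e not in known_endpoints]
    return len(new) / max(len(seen), 1)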


2. The part no one likes to hear: you need a budget for false alarms

Both replies focused on techniques; I’d add an operational constraint: pick an alert budget first.

Example:

  • “I can tolerate at most 5 anomaly alerts per day.”

Then tune models and thresholds to hit that budget. Concretely:

  1. Run your candidate detector on a month of historical data.
  2. Sort scores from most anomalous to least.
  3. Pick a threshold that would have produced roughly 5 alerts/day in that history.

This way you avoid the classic outcome where models are “accurate” in theory but unusable in practice. It also gives you a direct way to compare methods from @chasseurdetoiles vs @himmelsjager: whichever meets your alert budget with higher recall on known bad periods wins.
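
A minimal sketch of the budget-based threshold, assuming one score per minute over about a month of history:

import numpy as np

# One anomaly score per 1-minute window over ~30 days of history.
rng = np.random.default_rng(0)
scores = rng.normal(0, 1, 30 * 24 * 60)

budget_per_day = 5
days = len(scores) / (24 * 60)
target_alerts = max(int(budget_per_day * days), 1)

# Threshold = score of the target_alerts-th most anomalous window in history.
threshold = np.sort(scores)[-target_alerts]
alerts_per_day = np.mean(scores >= threshold) * 24 * 60  # sanity check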


3. Forget “best algorithm”: focus on coverage of anomaly types

Traffic and behavior anomalies are not one thing:

  • level shifts
  • spikes
  • drifts
  • pattern breaks in sequences
  • outlying individuals versus group anomalies

Instead of trying to find one magic detector, define 3–5 “anomaly stories” you care about, for example:

  1. “Sudden 10x traffic spike with rising errors.”
  2. “Slow degradation of latency over hours.”
  3. “User suddenly acts like a different user population.”
  4. “Endpoint that is normally cold becomes very hot.”

Then pick the minimal tool per story:

  • spikes and shifts → simple z-score or CUSUM
  • drifts → EWMA or change point detection
  • user acting like a different group → clustering or density-based methods on your features
  • cold to hot endpoint → simple ratio: current_rate / median_past_week_same_hour

AI comes in naturally when you have many dimensions and want “user acts unlike users like them” patterns. That is where something like an autoencoder or Isolation Forest actually solves a real pain, not just decorates the system.


4. How AI actually “learns normal” in practice

Conceptually, for your use case:

  • Tabular or metric behavior
    AI learns a compact representation of “typical” points and assigns a distance or isolation score. Anomalies are those far from the learned cluster or boundary.

  • Sequences of actions
    AI learns the probability of events or transitions. Anomalies are low probability subsequences relative to what it saw during training.

Two concrete mental models:

  1. Clustering view

    • You have a cloud of points (sessions / 5-minute user windows).
    • Algorithm finds dense areas (clusters).
    • Points in very sparse regions get high anomaly score.
  2. Predictive view

    • For timeseries or sequences you learn: “given the past, what is likely next?”
    • If actual next value is consistently surprising, it is anomalous.

You do not need heavy deep learning at your scale. Even a simple k-means + distance to nearest centroid can behave like a basic anomaly score for “weird user behavior” when your features are decent.
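
A minimal sketch of that clustering view with scikit-learn; n_clusters and the 99th percentile cutoff are placeholders:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Feature vectors per session / 5-minute user window; swap in real data.
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(2000, 8))

scaler = StandardScaler()
X_s = scaler.fit_transform(X_train)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_s)

def anomaly_score(X_new):
    # Distance to the nearest centroid: far from every dense area = anomalous.
    d = km.transform(scaler.transform(X_new))  # distances to each centroid
    return d.min(axis=1)

threshold = np.percentile(anomaly_score(X_train), 99)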


5. Where I’d not use AI in your setup

Some disagreement with the more “core toolbox” angle: there are areas where adding ML is mostly downside for a small project:

  • Static thresholds that are obviously correct
    Example: “> 50 failed logins per minute from one IP” or “> 95% error rate”. No model beats human common sense here.

  • Ultra low volume metrics
    If an endpoint sees 3 hits per day, AI cannot learn a curve. Just use a rule like “if this suddenly gets > X hits per hour, flag”.

  • Compliance or audit-heavy decisions
    If you need strong explainability, stick to rules or linear models. Autoencoders and one-class SVMs are painful to justify.

So treat AI as auxiliary to rules and simple stats, not a replacement.


6. About the mysteriously named product ’

Since you mentioned ’, think of it as a candidate “AI anomaly layer” you could drop into the middle of the pipeline:

  • You still own feature engineering and alert logic.
  • ’ would focus on mapping features to anomaly scores.

Pros for integrating ’ in a project like yours:

  • Gives you a standard place to run different anomaly models without rewriting the whole monitoring stack.
  • Lets you experiment with multiple algorithms behind a common interface so you can compare their impact on your alert budget.
  • Can improve readability of your code and documentation because “feature builder → ’ → alert rules” is easier to reason about than a ball of glued scripts.

Cons:

  • Adds another dependency and layer of abstraction, which might be overkill if your stack is tiny.
  • If ’ hides too much logic, tuning thresholds and understanding why it flagged something can become harder.
  • For really simple “first version” monitoring, it could slow you down compared to a plain numpy / pandas / scikit-learn setup.

Compared to what @chasseurdetoiles and @himmelsjager sketched, ’ should sit beside those approaches, not replace them. They showed concrete recipes; ’ would be more of a framework that orchestrates those recipes.


7. Minimal concrete plan for you

If you want something you can actually ship soon:

  1. Traffic spikes & health

    • Pick 5 to 10 key metrics (RPS, error rate, p95 latency, auth failures).
    • For each, keep rolling mean and std, plus maybe per-hour-of-day averages for light seasonality.
    • Build a combined score like sqrt(z_rps² + z_err² + z_lat²) and alert when it exceeds a tuned threshold for N minutes.
  2. Suspicious user behavior

    • Aggregate per user per 5 or 15 minutes:
      • total requests, write ratio, failed login ratio, distinct IPs, distinct endpoints.
    • Start with a hand-crafted risk score: weighted sum of those.
    • Once that works, plug the same features into a simple anomaly model, whether you roll your own or route them through something like ’ for cleaner handling.
  3. Feedback

    • Log all alerts, quickly tag them “useful / useless”.
    • Re-tune thresholds every week based on that log, not on abstract metrics.

That gives you a practical system that uses AI where it actually adds value, instead of chasing fancy algorithms before you have real incidents to catch.