Adversarial ML Attack Patterns Mapped to MITRE ATLAS — A Defender’s Reference | Colorful White

In 2025 I published a peer-reviewed paper on data-free black-box adversarial attacks against image classifiers. The paper sits firmly in the offensive ML literature: it shows that with only API-level query access to a target model and no access to its training data, an attacker can train a GAN-based substitute model and use it to craft transfer adversarial examples that succeed against deployed services (Microsoft Azure included) with a success rate above 78%.

A reasonable question for any defender reading that paper is: what would you do about it? This post is the answer I should have included in the original work but didn’t — a step-by-step mapping of the paper’s attack chain onto the MITRE ATLAS framework (v5.6.0 at time of writing), with one concrete detection idea for each step.

ATLAS is MITRE’s adversarial threat-landscape catalogue for AI systems — the ATT&CK equivalent for ML. It assigns a unique AML.TXXXX ID to each known offensive technique against ML, and groups them under tactics like AI Model Access, AI Attack Staging, and Impact. As of v5.6.0 it contains 170 techniques across 15 tactics. The mapping below uses the verbatim IDs.

The five-stage attack chain

The paper describes a complete attack pipeline. Reframed in defender language:

#	Step	What the attacker does	ATLAS technique
1	Probe the target	Treat the victim model as a black box; issue inference queries through its public API	AML.T0040 — AI Model Inference API Access
2	Build a surrogate	Train a substitute model that mimics the target’s decision boundary — without using any of the target’s training data	AML.T0005.001 — Train Proxy via Replication
3	Generate adversarial input on the surrogate	Apply FGSM / BIM / PGD against the substitute (white-box on a model that is the target’s shadow)	AML.T0043 — Craft Adversarial Data
4	Verify	Sanity-check that the perturbed input flips the substitute’s prediction	AML.T0042 — Verify Attack
5	Transfer to the target	Send the adversarial input to the real target’s API, induce a confident misclassification	AML.T0015 — Evade AI Model

The paper’s specific contribution is step 2 — using a conditional GAN with label-information conditioning to generate the substitute model’s training samples directly from random noise, rather than scraping or stealing data resembling the target’s training distribution. This collapses the data-acquisition prerequisite that most prior black-box attacks required, and it is the part that should worry defenders the most. The attacker no longer needs to know what your model was trained on.

Detection idea for each step

Step 1 — AML.T0040: AI Model Inference API Access

The attacker needs a lot of queries to train a useful substitute. The paper reports needing 35–60% fewer queries than DaST and MAZE — but “fewer” here is still in the tens to hundreds of thousands range. That volume is the defender’s primary signal.

Detection idea (operational):

Alert when (queries_per_principal_per_hour > P95 baseline × 5)
AND (input_diversity > P95 baseline)

input_diversity here is the entropy of the histogram of L2 distances between consecutive queries. Legitimate users (in a typical SaaS ML deployment) tend to query semantically related inputs (e.g., consecutive frames of a video, related search images), so their distances concentrate in a few bins. A substitute-training attacker instead explores the input space systematically, and the resulting distance histogram looks much flatter than normal traffic.
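A minimal sketch of the input_diversity metric and the alert rule, in plain Python. The bin count, the max_distance bound used to fix the bin edges, and the function names are illustrative assumptions, not values from the paper; inputs are assumed to be flattened to equal-length float sequences.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two flattened inputs of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def input_diversity(queries: Sequence[Sequence[float]], bins: int = 16,
                    max_distance: float = 1.0) -> float:
    """Shannon entropy (nats) of the histogram of consecutive-query L2 distances.

    max_distance is an assumed upper bound that fixes the bin edges so that
    histograms from different principals are comparable.
    """
    if len(queries) < 2:
        return 0.0
    counts = [0] * bins
    for prev, cur in zip(queries, queries[1:]):
        d = min(l2_distance(prev, cur), max_distance)
        idx = min(int(d / max_distance * bins), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c)


def should_alert(queries_per_hour: float, diversity: float,
                 rate_p95: float, diversity_p95: float) -> bool:
    """Alert rule from the text: rate above 5x its P95 baseline AND diversity above its P95."""
    return queries_per_hour > 5 * rate_p95 and diversity > diversity_p95
```

A stream of near-identical consecutive inputs puts every distance in one bin and scores 0; a stream that spreads its distances across many bins approaches log(bins).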

This is exactly the signal that the paper accidentally documents in its own ablation tables: the substitute training converges faster when the attacker samples diverse points, not when they cluster around any single class. Defenders should monitor for what attackers optimize for.

Step 2 — AML.T0005.001: Train Proxy via Replication

This step happens entirely on the attacker’s infrastructure. There is no direct observability from the defender’s side. However, two indirect signatures exist:

Detection idea 1 (input-side):

The paper’s GAN-based sample generator produces synthetic images, and these images carry detectable artifacts: GAN generators typically leave a signature in the high-frequency part of the spectrum that distinguishes their output from natural images. Run a lightweight classifier (for example, a ResNet-18 the defender trains on real-vs-synthetic images) over the input stream. Flag principals whose query stream is dominated by GAN-generated inputs.

For each principal:
    last_N_queries = circular_buffer(1024 inputs)
    p_synthetic = mean(real_vs_synthetic_classifier(last_N_queries))
    if p_synthetic > 0.6:
        alert

A real user’s image upload stream sits at p_synthetic ≈ 0.02. An attacker training a substitute model often spikes to p_synthetic ≈ 0.7+.
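The per-principal buffer in the pseudocode above can be sketched with a bounded deque. The real-vs-synthetic classifier is passed in as score_fn, a hypothetical callable returning a probability in [0, 1]; min_samples is an assumption I added so that a principal is not flagged on its first few queries.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Sequence

WINDOW = 1024      # last-N buffer size, from the pseudocode
THRESHOLD = 0.6    # alert threshold, from the pseudocode


class SyntheticInputMonitor:
    """Tracks, per principal, the mean synthetic score over the last `window` inputs."""

    def __init__(self, score_fn: Callable[[Sequence[float]], float],
                 window: int = WINDOW, threshold: float = THRESHOLD,
                 min_samples: int = 64) -> None:
        self.score_fn = score_fn          # hypothetical real-vs-synthetic classifier
        self.threshold = threshold
        self.min_samples = min_samples    # assumption: warm-up before alerting
        self.scores: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=window)  # oldest scores fall out automatically
        )

    def observe(self, principal: str, image: Sequence[float]) -> bool:
        """Score one input and return True if this principal should be flagged."""
        buf = self.scores[principal]
        buf.append(self.score_fn(image))
        if len(buf) < self.min_samples:
            return False
        return sum(buf) / len(buf) > self.threshold
```

Storing only the scores, rather than the raw inputs, keeps the per-principal memory cost at one float per query.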

Detection idea 2 (output-side):

Substitute training works by aligning the surrogate’s decision boundary with the target’s. The attacker therefore preferentially queries inputs that straddle the decision boundary. On these inputs the target returns high-entropy softmax distributions, with probability mass split across two or more classes. Track per-principal:

softmax_entropy_p95 = 95th percentile of entropy(model.output) over queries

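A baseline web user has softmax_entropy_p95 close to 0 (most queries are confidently classified). A boundary-probing attacker sees a much higher softmax_entropy_p95: around log 2 for inputs that sit between two classes, and approaching log(num_classes) when the outputs are near-uniform.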
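A small sketch of the softmax_entropy_p95 statistic. The nearest-rank percentile definition and the function names are my assumptions; a production system would more likely compute this in its metrics backend.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of one softmax output vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile; q is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def softmax_entropy_p95(outputs: Sequence[Sequence[float]]) -> float:
    """95th percentile of output entropy over one principal's queries."""
    return percentile([entropy(p) for p in outputs], 95)
```

Because the upper bound depends on the number of classes, dividing the result by log(num_classes) gives a value in [0, 1] that is easier to threshold across models.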

Step 3 — AML.T0043: Craft Adversarial Data

This step is also attacker-side, but the outputs of this step eventually appear at your API as step 5 queries. The crafting algorithm (FGSM / BIM / PGD) leaves specific perturbation patterns:

Algorithm	Perturbation signature
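FGSM	Single sign-of-gradient step; every perturbed pixel moves by exactly ±ε (before clipping to the valid pixel range)
BIM	Iterated FGSM with small steps and clipping to the ε-ball; perturbations cluster at the ε boundary
PGD	Iterated sign-of-gradient steps from a random start inside the ε-ball, projected back after each step; perturbations look “natural” but feature-flatter than untouched inputs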
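To make the first two rows of the table concrete, here is a toy sketch of the three algorithms against a linear logistic model. The model, weights, and helper names are illustrative assumptions and have nothing to do with the paper's substitute networks; the point is only to show where the ε-boundary clustering comes from.

```python
import math
import random
from typing import List, Sequence


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def clip(v: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, v))


def linear_score(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def loss_gradient(w: Sequence[float], x: Sequence[float], y: int) -> List[float]:
    """Gradient of the logistic loss with respect to the input x."""
    p = 1.0 / (1.0 + math.exp(-linear_score(w, x)))
    return [(p - y) * wi for wi in w]


def fgsm(w, x, y, eps):
    """One sign-of-gradient step of size eps, then clip to valid pixels."""
    g = loss_gradient(w, x, y)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]


def _iterate(w, x, y, eps, alpha, steps, start):
    """Repeated small sign steps, projected back into the eps-ball around x."""
    adv = list(start)
    for _ in range(steps):
        g = loss_gradient(w, adv, y)
        adv = [clip(clip(ai + alpha * sign(gi), xi - eps, xi + eps), 0.0, 1.0)
               for ai, gi, xi in zip(adv, g, x)]
    return adv


def bim(w, x, y, eps, alpha, steps):
    """BIM: iterate starting from the clean input."""
    return _iterate(w, x, y, eps, alpha, steps, start=x)


def pgd(w, x, y, eps, alpha, steps, rng: random.Random):
    """PGD: same iteration, but from a random point inside the eps-ball."""
    start = [clip(xi + rng.uniform(-eps, eps), 0.0, 1.0) for xi in x]
    return _iterate(w, x, y, eps, alpha, steps, start=start)
```

On this toy model, fgsm and bim both return inputs whose per-coordinate distance from x equals eps, which is the clustering the table describes.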

Detection idea:

Train an adversarial-input detector (a small auxiliary network) on triplets (clean image, FGSM perturbation, PGD perturbation). At inference time, run the detector in parallel with the production model. Adversarial examples crafted by FGSM/BIM/PGD against a substitute trained on the same architecture as production carry a detectable family of artifacts even after transfer: they are not random noise, because they encode gradient direction in the target’s local geometry. The paper itself reports a 6–10 percentage-point boost in transfer attack success across CIFAR-100, CIFAR-10, SVHN, FMNIST, and MNIST. A gain that consistent suggests a structural family signature, and a structural signature is something a detector can be trained to recognize.

Step 4 — AML.T0042: Verify Attack

Verification is attacker-side and produces no direct observability. Skip.

Step 5 — AML.T0015: Evade AI Model

This is the last point at which the defender can catch the chain. By the time an adversarial input arrives at the target’s API, the attacker has already invested days of GPU time. But there are signal-rich features at this exact point:

Detection idea 1 — confidence-margin anomaly:

Adversarial examples generally cause the target to produce a confident-but-wrong prediction. From the defender’s side, this looks like a sudden change in the margin distribution:

margin = max(softmax) - second_max(softmax)

Adversarial inputs from a transfer attack often land in a different tail of the margin distribution than legitimate inputs. Specifically, confident-wrong adversarial predictions tend to fall on classes the legitimate user almost never reaches, and their margins look statistically different from those of legitimate confident-correct predictions. Per-class margin baselines catch this. The paper, again, accidentally documents the signal: its 78% attack success rate against Microsoft Azure is a 78% rate of confidently wrong predictions, which is itself measurable from the API logs on the operator side.
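One possible reading of "per-class margin baselines", sketched in plain Python: fit a mean and standard deviation of the margin for each predicted class on trusted traffic, then flag inputs whose margin is far from that class's baseline. The z-score test, the threshold of 3, and the choice to flag classes never seen in the baseline are all my assumptions.

```python
import math
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def margin(probs: Sequence[float]) -> float:
    """max(softmax) - second_max(softmax), as defined in the text."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def predicted_class(probs: Sequence[float]) -> int:
    return max(range(len(probs)), key=lambda i: probs[i])


def fit_class_baselines(history: Iterable[Sequence[float]]) -> Dict[int, Tuple[float, float]]:
    """Per predicted class: (mean margin, population std) over trusted traffic."""
    by_class = defaultdict(list)
    for probs in history:
        by_class[predicted_class(probs)].append(margin(probs))
    return {c: (statistics.mean(m), statistics.pstdev(m)) for c, m in by_class.items()}


def is_margin_anomalous(probs: Sequence[float],
                        baselines: Dict[int, Tuple[float, float]],
                        z_threshold: float = 3.0) -> bool:
    cls = predicted_class(probs)
    if cls not in baselines:
        return True  # assumption: a class never reached by trusted traffic is suspicious
    mean, std = baselines[cls]
    if std == 0:
        return not math.isclose(margin(probs), mean)
    return abs(margin(probs) - mean) / std > z_threshold
```

A single flagged input is weak evidence; aggregating the flag rate per principal over a window is likely to be more useful than alerting on each hit.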

Detection idea 2 — ensemble disagreement:

Run the input through 2–3 architecturally diverse models (e.g., a CNN and a ViT) and measure disagreement. Adversarial examples crafted against any single model architecture transfer imperfectly to architecturally distant models. The paper acknowledges this in its own discussion section: their attack works best when the substitute and target share architectural family. Ensemble disagreement is exactly the inverse of that finding turned into a defender control.
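A minimal sketch of the disagreement score. Each model is represented as a hypothetical callable that maps an input to a predicted label; the pairwise-disagreement definition and the 0.5 threshold are assumptions chosen for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def disagreement(labels: Sequence[int]) -> float:
    """Fraction of model pairs whose predicted labels differ (0.0 means unanimous)."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def ensemble_check(x: Sequence[float],
                   models: Sequence[Callable[[Sequence[float]], int]],
                   threshold: float = 0.5) -> Tuple[List[int], float, bool]:
    """Run every model on x and flag the input if disagreement reaches the threshold."""
    labels = [model(x) for model in models]
    score = disagreement(labels)
    return labels, score, score >= threshold
```

With three models, a single dissenting model yields a score of 2/3, so the default threshold flags any non-unanimous vote; a larger ensemble would allow a finer threshold.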

Detection idea 3 — input pre-processing:

Apply JPEG re-compression or bit-depth squeezing to the input before classification; randomized smoothing is a heavier option, since it classifies many noise-perturbed copies of each input. The cheap transforms remove many pixel-level perturbations while leaving semantically meaningful content largely intact. A robust deployment runs both the original and the squeezed input, and alerts when their predictions disagree.
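The "run both and compare" control can be sketched for bit-depth squeezing as follows. Pixels are assumed to be floats in [0, 1], predict is a hypothetical callable wrapping the production model, and the default of 4 bits is an illustrative choice, not a recommendation from the paper.

```python
from typing import Callable, List, Sequence


def squeeze_bit_depth(pixels: Sequence[float], bits: int) -> List[float]:
    """Quantize pixel values in [0, 1] down to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]


def squeeze_disagrees(pixels: Sequence[float],
                      predict: Callable[[Sequence[float]], int],
                      bits: int = 4) -> bool:
    """True when the prediction changes after squeezing, which is the alert condition."""
    return predict(pixels) != predict(squeeze_bit_depth(pixels, bits))
```

Quantization snaps each pixel to the nearest level, so a perturbation smaller than half a quantization step is often erased, while an input that sits comfortably inside its class keeps the same label.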

The mapping in one table

For the defenders who scroll to this section first:

ATLAS technique	What the paper does	Cheapest detection
AML.T0040 Inference API Access	High-volume queries to target API	Per-principal rate + input-diversity P95 cap
AML.T0005.001 Train Proxy via Replication	GAN-generated training samples	Real-vs-synthetic input classifier; softmax-entropy P95 monitoring
AML.T0043 Craft Adversarial Data	FGSM / BIM / PGD on the substitute	Auxiliary adversarial-input detector network
AML.T0042 Verify Attack	(attacker-side, no signal)	—
AML.T0015 Evade AI Model	Submit transferred adversarial input	Confidence-margin distribution monitoring + ensemble disagreement + input pre-processing
AML.T0031 Erode AI Model Integrity	Cumulative effect on production accuracy	Confusion-matrix drift monitoring against a held-out clean set
AML.T0034 Cost Harvesting	API quota consumed on misclassifications	Cost-per-correct-prediction baseline + alert on regression

The last two rows (T0031 and T0034) are not described in my paper as attack goals but are the downstream business impact of any adversarial campaign reaching production scale. Operators of paid inference APIs (think: cloud computer-vision services, content-moderation pipelines) should monitor them as KPIs anyway; if they degrade simultaneously with the upstream signals, the chain above is the likely cause.
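As a rough sketch of the cost KPI from the last table row: the number of correct predictions has to come from somewhere with ground truth, so I assume here that it is measured on a labeled audit sample. The function names and the 20% tolerance are hypothetical.

```python
import math
from typing import Sequence


def count_correct(predictions: Sequence[int], labels: Sequence[int]) -> int:
    """Number of predictions matching the labels of an audited sample."""
    return sum(p == t for p, t in zip(predictions, labels))


def cost_per_correct(total_cost: float, correct: int) -> float:
    """Inference spend divided by correct predictions; infinite if none were correct."""
    return math.inf if correct == 0 else total_cost / correct


def cost_regressed(baseline: float, current: float, tolerance: float = 0.2) -> bool:
    """Alert when the current cost per correct prediction exceeds baseline by the tolerance."""
    return current > baseline * (1 + tolerance)
```

Tracking this alongside the upstream per-principal signals makes it easier to tell an adversarial campaign from ordinary data drift, since only the former should move both at once.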

Why this matters for the academic literature

Most published adversarial ML papers — mine included — stop at “we achieved X% attack success on benchmark Y.” That’s the right scope for an algorithms paper, but it leaves a gap. The defenders who actually deploy these models inherit the conclusion (“you are vulnerable”) without the operational playbook (“here is what to monitor on Monday”). ATLAS is the canonical place to write that playbook — every adversarial ML paper could close with a single table that maps its threat model onto ATLAS techniques and lists at least one detection idea per technique.

This post is that closing table for my paper. The same exercise applied to other recent black-box attack papers (DaST, MAZE, the Knockoff Nets line) would yield a complementary set of detection ideas. I will likely publish a survey-style follow-up that does exactly this when time permits — the inputs are the techniques’ AML.TXXXX IDs, and the outputs are concrete monitoring queries any MLOps team can drop into their existing observability stack.

If you publish ML offense, close with the defender’s table. It costs you an afternoon and it changes who can act on your work.

The paper is at DOI: 10.3778/j.issn.1002-8331.2311-0227. The ATLAS v5.6.0 raw data used for the technique IDs in this post lives at mitre-atlas/atlas-data. The detection-engineering ruleset for N-day CVEs is at sigma-detection-rules; the same workflow applied to ML adversaries would land in a sibling repository when the auxiliary-detector models are trained and validated — that work is on the roadmap.