Quantifying Nationality-Based Compliance Bias in LLMs

The Question

Do large language models comply differently with identical requests depending on the perceived nationality of the requester? The intuitive answer is "probably yes" — but quantifying how much, which models, and where in the network this happens is a different challenge entirely.

This was the core question I investigated at USF's Language GRASP Lab under Dr. Gene Kim, culminating in a presentation at the 2026 AI+X Symposium.

Methodology

We ran large-scale behavioral experiments on open-source models in the Qwen 2.5 family. The experimental design: take a fixed set of requests, modify only the nationality signal embedded in the prompt (e.g., through user-stated location, name conventions, or cultural framing), and measure compliance rates across the modified prompt variants.
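To make the design concrete, here is a minimal sketch of how prompt variants and per-group compliance rates could be computed. The request text, the frame wording, the group labels, and the function names are all hypothetical placeholders, not the prompts or code used in the study, and deciding whether a response counts as "complied" is assumed to happen elsewhere.

```python
from collections import defaultdict

# Hypothetical request and frames, for illustration only.
REQUESTS = ["Explain how to appeal a parking fine."]
NATIONALITY_FRAMES = {
    "baseline": "",
    "country_a": "I am writing from Country A. ",
    "country_b": "I am writing from Country B. ",
}

def build_variants(requests, frames):
    """Pair every request with every frame; only the frame text differs."""
    return [
        {"request_id": i, "group": group, "prompt": prefix + request}
        for i, request in enumerate(requests)
        for group, prefix in frames.items()
    ]

def compliance_rates(records):
    """Map (group, complied) pairs to a {group: compliance rate} dict."""
    totals = defaultdict(int)
    complied = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        complied[group] += int(ok)
    return {group: complied[group] / totals[group] for group in totals}
```

Keeping the request text byte-identical across groups is the point of the design: any difference in compliance rate can then be attributed to the frame rather than to the request.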

Mechanistic interpretability tools let us go further than behavioral analysis. Using neuron probing and integrated gradients, we identified which layers and attention heads are most sensitive to nationality-related tokens — and found that the effect is localized rather than distributed uniformly across the network.
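For readers unfamiliar with integrated gradients, the sketch below shows the core computation on a toy model with an analytic gradient. It is not our pipeline: the toy logistic model, the function names, and the midpoint-rule approximation are assumptions chosen to keep the example self-contained. In practice the gradient would come from an autodiff framework applied to the language model's token embeddings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_toy_model(weights):
    """Return (f, grad) for the toy model f(x) = sigmoid(w . x)."""
    def f(x):
        return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

    def grad(x):
        s = f(x)
        return [s * (1.0 - s) * w for w in weights]

    return f, grad

def integrated_gradients(grad, x, baseline, steps=200):
    """Approximate IG_i = (x_i - b_i) * integral over a in [0, 1] of
    dF/dx_i evaluated at b + a * (x - b), using a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]
```

A useful sanity check is the completeness property from Sundararajan et al.: the attributions should sum to approximately f(x) minus f(baseline). If they do not, the number of steps is probably too low.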

What We Found

The short version: yes, there are systematic differences, and they are not small. Certain request categories show compliance rate swings of 15–30 percentage points depending on nationality framing. The effect is stronger in earlier model layers and correlates with specific attention heads that appear to track "user context" broadly.
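A "swing" here means the gap between the highest and lowest compliance rate across nationality framings for one request category. The sketch below shows one way to compute it, along with a standard two-proportion z-test for a pair of groups. The function names are hypothetical, and the z-test is an illustrative choice rather than the exact analysis from the study.

```python
import math

def swing_pp(rates):
    """Largest gap between group compliance rates, in percentage points."""
    values = list(rates.values())
    return (max(values) - min(values)) * 100.0

def two_proportion_z(complied_1, n_1, complied_2, n_2):
    """Two-sided two-proportion z-test with a pooled standard error.

    Returns (z, p_value). If the pooled rate is exactly 0 or 1, the
    standard error is zero and the test is undefined, so (0.0, 1.0)
    is returned.
    """
    p1 = complied_1 / n_1
    p2 = complied_2 / n_2
    pooled = (complied_1 + complied_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_1 + 1.0 / n_2))
    if se == 0.0:
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

With many categories and many group pairs, a multiple-comparisons correction would be needed before reading much into any single p-value.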

What's especially interesting is the directionality: the bias doesn't uniformly favor any single nationality. Instead, the models appear to have learned rough, stereotyped compliance priors for each perceived cultural context, plausibly shaped by the distribution of their training data.

Implications

Mechanistic localization matters because it opens the door to targeted interventions, such as fine-tuning or activation steering on specific components, rather than full retraining. It also raises questions about auditing requirements: a behavioral benchmark that tests only a few phrasings can miss the effect entirely, because it is prompt-dependent and disappears under certain phrasings.
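As a rough illustration of what a targeted activation intervention can look like, the sketch below estimates a direction as the difference of mean activations between two prompt groups and then projects that direction out of a hidden state. This is a generic technique shown on plain Python lists, not an intervention we have validated. The function names are hypothetical, and in a real model the vectors would be activations captured at a chosen layer or head.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(acts_a, acts_b):
    """Candidate direction: mean activation of group A minus group B."""
    return [a - b for a, b in zip(mean_vector(acts_a), mean_vector(acts_b))]

def project_out(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        return list(hidden)  # no direction to remove
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]
```

Whether removing a single direction actually reduces the behavioral gap without harming other capabilities is an empirical question, and one that would have to be tested per model.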

I presented these findings at the 2026 AI+X Symposium. A written version of the work is in progress.

Reading List

If you're interested in this area, these papers shaped my thinking:

Elhage et al., A Mathematical Framework for Transformer Circuits (Anthropic, 2021)
Sundararajan et al., Axiomatic Attribution for Deep Networks (ICML 2017)
Röttger et al., Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models (2024)