The Washington Post tested major chatbot models using political questions designed by researchers at Dartmouth and Stanford and found sharp divergence across systems. Per the Post, ChatGPT returned left-only arguments in 80% of queries and right-only arguments in 3%, while Gemini offered both-sides responses in about 93% of cases. Claude returned left-leaning answers 43% of the time and balanced responses the remaining 57%. Grok gave more right-leaning responses than the other models tested (33% right-only, 40% left-only), making it the only chatbot where right-leaning positions appeared frequently. DeepSeek came in at 70% left-only and 7% right-only. Dartmouth researcher Sean Westwood told the Post these tools are not presenting "a truly neutral representation of really nuanced policy debates, on average." Google said Gemini is designed for balanced responses; Anthropic spokesperson Michael Aciman said Claude is trained to "treat different political viewpoints equally and test extensively for bias before every model launch," per the Post. The Post and subsequent coverage both note that the test does not show chatbots change how people vote.
The Washington Post ran a comparative test of major chatbot models using political questions designed by researchers at Dartmouth College and Stanford University. The analysis examined outputs from ChatGPT (OpenAI), Gemini (Google), Grok (xAI), Claude (Anthropic), DeepSeek, and Arya (Gab). Per the Post's published results: ChatGPT returned exclusively left-leaning arguments in 80% of queries and exclusively right-leaning arguments in only 3%; Gemini was the clear outlier, offering both sides in roughly 93% of responses with only 7% left-only; Claude returned left-leaning answers 43% of the time and balanced responses the remaining 57%; Grok provided right-leaning responses in 33% of cases and left-leaning responses in 40%, the largest right-leaning share of the group; DeepSeek came in at 70% left-only, 7% right-only, and 23% both. The Post and subsequent coverage both note that the test does not demonstrate that chatbots alter voting behavior.
Industry reporting frames this as an output-level measurement, not a probe of model internals. The Post's method sampled short-answer outputs across policy topics, capturing framing and argument selection rather than document-level citation behavior or information retrieval quality. For practitioners, evaluations of this kind expose gaps not visible in standard benchmark scores: prompt sensitivity, answer framing, and the mix of normative versus factual content in short answers. The model-by-model variation also suggests that RLHF tuning, safety layers, and instruction design affect political framing in ways that differ substantially across providers, although an output-level test cannot attribute the differences to any one of them.
The Post's findings illustrate that nominally similar conversational interfaces can systematically differ in the balance of perspectives they present. Dartmouth researcher Sean Westwood told the Post these tools are not presenting "a truly neutral representation of really nuanced policy debates, on average." Companies pushed back: Google said Gemini "is designed to provide balanced responses that don't favor any political ideology." Anthropic spokesperson Michael Aciman said Claude is trained to "treat different political viewpoints equally and test extensively for bias before every model launch," per the Post. OpenAI, SpaceX, DeepSeek, and Gab did not respond to the Post's requests for comment, per Mediaite.
For practitioners deploying conversational agents, the logical next controls include: reproducible evaluation datasets measuring ideological framing; documentation from providers about safety and alignment testing for political content; and published methodologies showing how model sampling, instruction tuning, and safety layers affect answer balance. Teams responsible for public-facing Q&A should treat these results as motivation to add targeted, reproducible framing checks into release and monitoring pipelines.
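As a rough illustration of what such a framing check might look like, the Python sketch below tallies per-response labels into shares and applies a simple pass/fail rule. It is not the Post's or the researchers' method. The label names (`left_only`, `right_only`, `both`), the function names, and the thresholds `min_both` and `max_gap` are all hypothetical, and the sketch assumes each response has already been labeled by human raters or a separate classifier.

```python
from collections import Counter

# Hypothetical label set; a real evaluation would define its own rubric.
LABELS = ("left_only", "right_only", "both")


def framing_shares(labels):
    """Return the share of each framing label among per-response labels."""
    if not labels:
        raise ValueError("no labels to summarize")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    # Counter returns 0 for labels that never occur.
    return {name: counts[name] / total for name in LABELS}


def passes_balance_check(shares, min_both=0.5, max_gap=0.2):
    """Apply an illustrative release gate to a dict of framing shares.

    Passes only if at least `min_both` of responses present both sides
    and the left-only and right-only shares differ by at most `max_gap`.
    Both thresholds are placeholder assumptions, not recommended values.
    """
    gap = abs(shares["left_only"] - shares["right_only"])
    return shares["both"] >= min_both and gap <= max_gap


if __name__ == "__main__":
    # Made-up labels for ten responses, for demonstration only.
    sample = ["left_only"] * 4 + ["right_only"] * 1 + ["both"] * 5
    shares = framing_shares(sample)
    print(shares)
    print("passes:", passes_balance_check(shares))
```

Running the same labeled question set against each release candidate, and logging the shares over time, would make shifts in framing visible in the same way the Post's percentages do across providers. The harder design problem is the labeling rubric itself, which this sketch leaves out.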
The Washington Post analysis provides model-specific percentages across ChatGPT, Gemini, Claude, Grok, DeepSeek, and Arya, making it a concrete output-level evaluation with direct relevance for practitioners auditing conversational AI for framing bias. It is a single-publication test designed with academic researchers rather than a peer-reviewed study, so its findings are informative but its methodological authority is limited. The score reflects solid practitioner relevance for AI deployment and alignment teams without reaching the threshold of a formal research landmark.
News on Let's Data Science is compiled from multiple public sources with editorial oversight. See our Editorial Standards and Corrections Policy.
