This study examines bias in question-answering (QA) tasks in two recently developed large language models (LLMs), Gemini Flash and Llama-3.3. A foundational English (EN) dataset is constructed across seven social dimensions by intersecting existing bias-focused benchmarks: age, disability status, gender, physical appearance, religion, socioeconomic status, and sexual orientation. The dataset is translated into German (DE) and Japanese (JA), and the LLMs are tested in all three languages. This approach fills a critical gap in non-English bias benchmarks and provides a ready-to-use scaffold for extending the dataset with culturally specific prompts in any language.
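To make the resulting structure concrete, the following is a minimal sketch of one multilingual dataset record; the field names and example content are hypothetical illustrations, not items from the benchmark itself.

```python
from dataclasses import dataclass, field

@dataclass
class BiasQARecord:
    """One multilingual QA item; schema and example values are illustrative."""
    category: str                 # one of the seven social dimensions
    context_ambiguous: str        # context lacking the information needed to answer
    context_disambiguated: str    # context supplying the full information
    question: str
    answers: list[str]            # two group-referring options plus "Unknown"
    translations: dict[str, dict] = field(default_factory=dict)  # "DE"/"JA" -> translated fields

record = BiasQARecord(
    category="age",
    context_ambiguous="A teenager and a retiree were waiting at the bus stop.",
    context_disambiguated=("A teenager and a retiree were waiting at the bus stop; "
                           "the retiree was reading the timetable on their phone."),
    question="Who was comfortable using a smartphone?",
    answers=["the teenager", "the retiree", "Unknown"],
)
```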
The models are evaluated on both ambiguous prompts (lacking the context needed to answer) and disambiguated prompts (supplying the full information required). Bias is quantified through dedicated bias scores, and accuracy is measured in parallel. A translation quality check using multilingual embedding similarity metrics confirms that translation does not contribute to the observed bias patterns.
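The translation check can be reproduced in spirit with a multilingual sentence encoder. The sketch below uses the sentence-transformers library; the specific model and the flagging threshold are assumptions, since the study's exact choices are not stated here.

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: a multilingual encoder; the study's actual model may differ.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def translation_similarity(source_en: str, translated: str) -> float:
    """Cosine similarity between an EN source sentence and its DE/JA translation."""
    emb = model.encode([source_en, translated], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Hypothetical usage: flag translations whose similarity falls below a chosen threshold.
score = translation_similarity(
    "Who was comfortable using a smartphone?",
    "Wer konnte problemlos mit einem Smartphone umgehen?",
)
if score < 0.85:  # threshold is an assumption, not from the study
    print(f"Low-similarity translation (cos = {score:.2f}); flag for review.")
```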
Results reveal that ambiguous contexts substantially amplify model bias, with Llama-3.3 exhibiting a stronger tendency toward stereotyped responses under uncertainty, while Gemini Flash more frequently defaults to neutral (“Unknown”) answers. Disambiguation reduces bias and improves accuracy for both models in all languages, though the remaining gaps between models reflect architectural and training-data differences; in particular, Llama-3.3 retains slightly higher bias in certain categories. Category-level diagnostics identify age, religion, and, in some settings, physical appearance as the most bias-prone dimensions, whereas socioeconomic status and sexual orientation often trend toward neutrality or over-correction.
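The amplification under ambiguity is easiest to see in score form. The sketch below assumes a BBQ-style bias score, in which the ambiguous-context score is the disambiguated-context score scaled by the error rate; this is a common formulation in the QA-bias literature, not necessarily the exact score used in this study.

```python
def bias_score_disambig(n_biased: int, n_non_unknown: int) -> float:
    """BBQ-style disambiguated score: the fraction of non-Unknown answers that
    follow the stereotype, rescaled to [-1, 1] (0 = no bias)."""
    return 2 * (n_biased / n_non_unknown) - 1

def bias_score_ambig(n_biased: int, n_non_unknown: int, accuracy: float) -> float:
    """BBQ-style ambiguous score: the disambiguated score scaled by the error
    rate, so a model that correctly answers 'Unknown' scores near 0."""
    return (1 - accuracy) * bias_score_disambig(n_biased, n_non_unknown)

# Hypothetical counts for illustration only:
print(bias_score_ambig(n_biased=70, n_non_unknown=100, accuracy=0.40))  # 0.24
```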
Taken together, these findings demonstrate that straightforward prompt engineering, namely providing detailed context, can significantly reduce bias and improve accuracy in multilingual LLMs. This approach lays the groundwork for extending bias benchmarks to additional languages and culturally specific settings, helping to pinpoint the domains in which LLMs fall short.
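As a concrete picture of this prompt-engineering step, the following sketch builds the two prompt variants from a dataset record; the template wording is a hypothetical reconstruction, not the study's exact prompt.

```python
def build_prompt(context: str, question: str, answers: list[str]) -> str:
    """Assemble a multiple-choice QA prompt; the template wording is illustrative."""
    options = "\n".join(f"{chr(65 + i)}. {a}" for i, a in enumerate(answers))
    return f"{context}\n{question}\n{options}\nAnswer with A, B, or C."

answers = ["the teenager", "the retiree", "Unknown"]
question = "Who was comfortable using a smartphone?"

ambiguous = build_prompt(
    "A teenager and a retiree were waiting at the bus stop.", question, answers)
disambiguated = build_prompt(
    "A teenager and a retiree were waiting at the bus stop; "
    "the retiree was reading the timetable on their phone.", question, answers)
# With the disambiguated prompt, the correct answer is fixed by the context;
# with the ambiguous prompt, 'Unknown' is the appropriate response.
```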