Identifying Accessibility Data Gaps in CodeGen Models

October 21, 2025

Cartoon robot standing on the edge of a cliff, looking across a gap toward another cliff with a checkered flag labeled "Accessible" against a bright yellow background.

Rather than relying on anecdotal evidence or cherry-picked examples, I built a systematic approach to evaluate how well LLMs — starting with GPT-4 — generate accessible HTML. The methodology is straightforward but comprehensive: I created a Python testing framework that sent carefully crafted prompts to Azure OpenAI’s GPT 4 model, collected the generated HTML responses, and then manually analyzed these responses for accessibility compliance.

Source: Identifying Accessibility Data Gaps in CodeGen Models :: Aaron Gustafson