Ph.D. Thesis Defense
Xiao Jing
(Faculty advisor: Professor Dimitri Mavris)
"Development of Aviation Bench: A Comprehensive Benchmark Framework for NLP Models in Aviation Safety"
Wednesday, July 15
2:00 - 3:00 p.m.
Weber, CoVE
Abstract:
Aviation safety increasingly relies on natural language processing (NLP) techniques to extract insights from the growing volume of incident and accident narratives. Despite significant advances in both data collection and modeling capabilities, a critical gap remains in the standardized evaluation of NLP systems in this safety-critical domain. Current NLP research on aviation safety is fragmented: isolated studies target different datasets and tasks, making overall progress difficult to measure. Additionally, critical aviation safety categories are often underrepresented in existing corpora, leading to class imbalance and reduced model robustness on rare but high-stakes scenarios.
Aviation safety NLP applications are currently evaluated with inconsistent metrics and datasets, often focusing on common scenarios while giving limited attention to underrepresented safety-critical cases. Technical challenges include the domain-specific terminology of aviation safety narratives, the complex causal relationships they describe, and the need for reliable performance on safety-critical edge cases. Moreover, annotations for complex tasks such as causal reasoning are scarce because labeling is costly and requires domain expertise.
Based on these observations, the research objective of this dissertation is to develop AviationBench, a comprehensive benchmark framework for evaluating NLP models in aviation safety. This framework standardizes evaluation across multiple safety-relevant language understanding tasks, with particular attention to underrepresented aviation safety categories, complex causal reasoning, and annotation scarcity. To achieve this objective, the work is organized into three research areas. The first research area focuses on constructing and validating a multi-task aviation safety NLP benchmark across four core tasks: multi-label classification, causal chain extraction, question answering, and named entity recognition. The second research area focuses on knowledge-graph-grounded large language model (LLM) generation and evaluation for addressing data imbalance in aviation safety reports. The third research area focuses on scalable structured annotation methods, including LLM-based teacher-student distillation and knowledge-guided label generation, for complex aviation safety NLP tasks.
Together, these research areas provide a unified methodology for benchmark construction, targeted data augmentation, and scalable structured annotation in aviation safety NLP. The resulting framework is evaluated through task-level and benchmark-level experiments, including multi-task model assessment, knowledge-grounded synthetic data evaluation, and structured annotation case studies. These evaluations demonstrate the ability of AviationBench to support more consistent, scalable, and safety-relevant assessment of NLP models for aviation safety applications.
Committee:
Dr. Dimitri Mavris (advisor), School of Aerospace Engineering
Dr. Duen Horng Chau, School of Computational Science and Engineering
Dr. Xiuwei Zhang, School of Computational Science and Engineering
Dr. Kuen-Da Lin, School of International Affairs
Dr. Mayank Bendarkar, Zoox, Inc.