Southern Association for Vascular Surgery


Can It Pass the Test? ChatGPT Performance on the Vascular Education and Self-Assessment Program (VESAP)
Quang Le, Kedar Laviniga, Michael F Amendola
VA Medical Center/VCU, Richmond, VA

Introduction:
The Vascular Education and Self-Assessment Program (VESAP) is a continuing education and self-assessment tool accepted by the Society for Vascular Surgery (SVS). Large language models (LLMs) are generative artificial intelligence (AI) tools with notable prior performance on standardized medical tests. Our goal was to assess the ability of a readily available LLM (GPT-3.5-Turbo; OpenAI, CA) to answer VESAP questions, in hopes of gaining insight into the use of AI in surgical education.
Methods:
The SVS Self-Assessment Committee was petitioned for, and granted, access to VESAP (4th Edition; VESAP4) in April 2023. VESAP4 materials (non-imaging questions; n=385) were submitted to GPT-3.5-Turbo (GPT) for processing via an Application Programming Interface (API). Two independent reviewers examined the AI-generated responses for accuracy and content and compared them against the provided answer keys. API requests were submitted in triplicate to evaluate consistency. Data are reported as mean and standard deviation (SD).
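A minimal sketch of this triplicated submission protocol, assuming the current OpenAI Python client; the function name, prompt handling, and data structures are illustrative assumptions, not the authors' exact pipeline:

# Sketch of the triplicated API submission described above (assumed
# OpenAI Python client; prompt format is illustrative, not the study's).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_in_triplicate(question_text: str) -> list[str]:
    """Submit one non-imaging VESAP-style question three times."""
    replies = []
    for _ in range(3):  # triplicate requests to gauge consistency
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": question_text}],
        )
        replies.append(response.choices[0].message.content)
    return replies

Each question's three replies can then be scored against the answer key by the independent reviewers, as described above.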
Results:
The 385 questions were divided into ten domains of vascular surgery knowledge. GPT provided the correct answer to 49.4% (SD: 0.5) of questions, and 77.8% (SD: 0.8) of correct responses were consistent across all three queries. GPT performed best on radiation safety questions, answering 54.4% (SD: 2.1) correctly; in contrast, it answered only 39.0% (SD: 3.1) of dialysis access questions correctly. Among incorrectly answered questions, the most common cause of inaccuracy was retrieval of false information or failure to retrieve important facts (59.6%, SD: 0.6). Other causes were poor reasoning (23.3%, SD: 1.6), poor comprehension of the question (2.7%, SD: 1.3), and question-stem ambiguity (2.9%, SD: 1.1). In 17.0% (SD: 1.5) of incorrectly answered questions, the cause could not be identified because the LLM offered no answer explanation.
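A minimal sketch of how the reported per-run accuracy and its spread across the three runs could be computed; the scoring helper and toy data below are hypothetical, and only the triplicate design and summary statistics come from the Methods:

# Hypothetical scoring step: per-run accuracy, then mean and SD across runs.
from statistics import mean, stdev

def run_accuracy(chosen: list[str], key: list[str]) -> float:
    """Percent of questions a single run answered correctly."""
    correct = sum(c == k for c, k in zip(chosen, key))
    return 100 * correct / len(key)

# Toy example with a 4-question key and three runs (illustrative only).
key = ["A", "C", "B", "D"]
runs = [["A", "C", "D", "D"], ["A", "B", "B", "A"], ["A", "C", "B", "A"]]
accuracies = [run_accuracy(r, key) for r in runs]
print(f"{mean(accuracies):.1f}% (SD: {stdev(accuracies):.1f})")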
Conclusions:
Given VESAP's complexity, GPT's performance on this test is encouraging, with surprising consistency across repeated queries. However, a roughly 50% accuracy rate is inadequate for high-level surgical education. Moreover, approximately 60% of incorrect answers involved missing or false information, which is particularly harmful and limits current application. Another limitation is the inability to process imaging, a vital component of vascular surgery. Newer LLM approaches with expanded capabilities may address these shortcomings and yield more consistent and accurate responses in the vascular surgery knowledge domain.