Abstract (EN):
Objectives: Published systematic reviews display heterogeneous methodological quality, which can impact decision-making. Large language models (LLMs) can support the assessment of the methodological quality of systematic reviews and make it more efficient, aiding the incorporation of their evidence into guideline recommendations. We aimed to develop an LLM-based tool to support the assessment of the methodological quality of systematic reviews.
Methods: We assessed the performance of eight LLMs in evaluating the methodological quality of systematic reviews. Specifically, we provided 100 systematic reviews to eight LLMs (five base models and three fine-tuned models), which evaluated their methodological quality based on a validated 27-item tool (Reported Methodological Quality, ReMarQ). The fine-tuned models had been trained on a separate sample of 300 manually assessed systematic reviews. We compared the answers provided by the LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient, and F1-score for this comparison.
Results: The best-performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%-100%]; mean kappa coefficient = 0.90 [95% CI = 0.71-1.00]; mean F1-score = 0.91 [95% CI = 0.83-1.00]). This model displayed an accuracy >80% and a kappa coefficient >0.60 for all individual items. When this LLM assessed the same set of systematic reviews 60 times, answers to 18 of the 27 items were always consistent (i.e., were always the same), and only 11% of the assessed systematic reviews showed any inconsistency.
Conclusion: Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.
(c) 2025 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
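For illustration only, the sketch below shows how per-item agreement between an LLM and human reviewers could be quantified with the metrics named in the Methods (accuracy, Cohen's kappa, F1-score) for dichotomous ReMarQ items. This is not the authors' code; the data and item are hypothetical, and scikit-learn is assumed.

```python
# Minimal sketch (hypothetical data, not the study's code) of computing the
# agreement metrics named in the abstract for one dichotomous ReMarQ item.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Hypothetical per-review answers for a single item (1 = "yes", 0 = "no").
human_answers = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
llm_answers = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(human_answers, llm_answers)   # proportion of matching answers
kappa = cohen_kappa_score(human_answers, llm_answers)   # chance-corrected agreement
f1 = f1_score(human_answers, llm_answers)               # harmonic mean of precision and recall

print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}, F1 = {f1:.2f}")
```

In the study, such metrics were reported per item and summarized as means with 95% confidence intervals across the 27 items.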
Language:
English
Type (Professor's evaluation):
Scientific
No. of pages:
13