Analyses of research repositories are estimating the rates of hallucinated citations in research papers. Credit: patpitchaya/iStock via Getty
The problem of artificial intelligence models ‘hallucinating’ non-existent citations has recently shot to prominence. Now a team of researchers has sifted through 2.5 million papers and preprints to provide the best assessment of their prevalence yet.
Their audit encompassed 111 million references in papers and preprints listed in major repositories, including arXiv, bioRxiv, the Social Science Research Network (SSRN) and PubMed Central, and found 146,932 hallucinated citations in material published in 2025 alone.
The analysis also suggests that the prevalence of hallucinated citations depends on the area of research. SSRN, a preprint server for social sciences research, had the highest rate of hallucinated citations at nearly 2%, almost five times higher than any other major repository.
“We were really amazed by the overall magnitude and dynamics of the whole body of hallucinated citations,” says Yian Yin, assistant professor of information science at Cornell University in Ithaca, New York state, and a co-author of the study.
The analysis was posted on the preprint server arXiv (ref. 1) and has not been peer-reviewed.
Bibliographic hallucinations
Yin and his colleagues were prompted to investigate the scale of the problem after spotting some references to unfamiliar work, supposedly authored by researchers they knew. “I know these authors,” says Yin, “and I’m 90% sure they don’t have a paper on that.”
To quantify the scale of the problem, the researchers extracted reference titles from millions of manuscripts and checked them against Semantic Scholar, OpenAlex and Google Scholar. References that could not be matched were flagged if an LLM judged them to be intended as academic sources. Because bibliographic errors have always existed, the researchers only counted faulty references appearing in material published after 2022, the year in which ChatGPT, the first LLM chatbot to be widely available to the public, was launched.
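The matching step described above can be illustrated with a minimal sketch. It is not the study's pipeline: the in-memory list `index_titles` stands in for lookups against Semantic Scholar, OpenAlex and Google Scholar, the `looks_academic` field stands in for the LLM's judgement, and the 0.9 similarity threshold and all function and field names are assumptions made purely for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def best_match_score(title: str, index_titles: list[str]) -> float:
    """Return the highest similarity (0 to 1) between a title and any indexed title."""
    query = normalize(title)
    best = 0.0
    for candidate in index_titles:
        score = SequenceMatcher(None, query, normalize(candidate)).ratio()
        best = max(best, score)
    return best


def flag_unmatched(references, index_titles, threshold=0.9, last_excluded_year=2022):
    """Return titles of references that cannot be matched in the index.

    Each reference is a dict with hypothetical keys:
      title          - the cited title as extracted from the manuscript
      paper_year     - publication year of the citing manuscript
      looks_academic - stand-in for the LLM's judgement that the
                       reference was intended as an academic source
    """
    flagged = []
    for ref in references:
        if ref["paper_year"] <= last_excluded_year:
            continue  # only material published after 2022 is counted
        if not ref["looks_academic"]:
            continue  # skip reports, websites and other non-academic sources
        if best_match_score(ref["title"], index_titles) < threshold:
            flagged.append(ref["title"])
    return flagged
```

A real audit at this scale would query the bibliographic databases rather than scan a list, and the threshold would need tuning so that ordinary typos in genuine references are not counted as hallucinations.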
The analysis found that the rates of hallucinated citations varied between repositories. SSRN ranked first, with 1.91% of citations in studies posted there by August 2025 deemed to be hallucinations. ArXiv, a physical sciences repository, ranked second, with 0.39% of its citations incorrect or referring to non-existent papers or researchers.
The PubMed Central biomedical-science database had a rate of 0.27% hallucinated citations in peer-reviewed publications. BioRxiv, a preprint server specializing in biological sciences, had a rate of 0.21%.
Hallucinated citations are more prevalent in work authored by researchers with little pre-2022 publication history. When fake citations occur, they disproportionately credit already established, highly cited, typically male authors, the study found.