
A collaborative project that crosses multiple SRI labs and divisions is refining an application designed to take the drudgery out of scientific literature reviews.
Generative AI is already transforming the way software engineering teams create code. Soon, it may also transform the day-to-day work of scientific researchers. A new SRI project aims to create an AI assistant that will provide rapid insights to researchers engaged in scientific literature reviews.
Yunye Gong, the principal investigator for SRI’s SARA (Scientific Assistant with Reasoning Ability) project, is leading a team of SRI researchers intent on filling a significant gap in off-the-shelf large language models (LLMs). While models like GPT-4 show impressive abilities to process textual information, they struggle when it comes to mathematical reasoning.
“It’s a well-known issue,” Gong comments. “In one failure case that we identified with GPT-4o, it thought that 9.11 was greater than 9.8 and was very confident about it.”
Because general-purpose LLMs often misunderstand analytical equations and numeric tables, they struggle to accurately answer complex questions about scientific papers. That’s a problem that needs to be solved before researchers can trust AI tools to aid with rapid analysis of individual papers or larger fields of study. To address the problem, Gong engaged with fellow researchers in SRI’s Information and Computing Sciences division, Advanced Technologies and Systems Division, and Education division to explore solutions.
Teaching scientific reasoning to AI
Development of the SARA tool began with data curation: collecting a broad and complex set of questions (ranging from fact-checking to comparative analysis and application of reported results) that researchers might pose about scientific literature. This dataset proved critical for testing and validating the tool’s design.
“What kind of analysis do scientists want to do when they are reviewing papers? We need to think carefully about that question as we build this new tool.” — Yunye Gong
Gong and her team then zeroed in on automated self-reflection as a way to improve the scientific reasoning of an LLM. SARA asks its base LLM to critique its initial response to a user query and, based on that self-reflection, improve the final response. This technique alone results in a 9% improvement in equation and table comprehension accuracy above the baseline GPT-4o model.
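As a rough illustration of that critique-and-revise loop, the sketch below asks a model for a draft answer, then for a critique of that draft, and then for a revised final answer. It is a minimal sketch of the general self-reflection pattern, not SARA’s implementation; the prompts, model name, and use of the openai client are assumptions.

```python
# Minimal sketch of an automated self-reflection loop (illustrative only;
# prompts, model name, and structure are assumptions, not SARA's design).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(messages):
    """Send a chat request and return the text of the first reply."""
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

def answer_with_reflection(paper_excerpt: str, question: str) -> str:
    # 1) Initial answer grounded in the paper excerpt.
    draft = ask([
        {"role": "system", "content": "Answer questions about the given excerpt."},
        {"role": "user", "content": f"Excerpt:\n{paper_excerpt}\n\nQuestion: {question}"},
    ])
    # 2) Ask the model to critique its own draft, focusing on math and tables.
    critique = ask([
        {"role": "user", "content": (
            "Critique the answer below for numerical and logical errors, "
            f"given the excerpt.\n\nExcerpt:\n{paper_excerpt}\n\n"
            f"Question: {question}\n\nAnswer: {draft}")},
    ])
    # 3) Revise the draft in light of the critique.
    return ask([
        {"role": "user", "content": (
            f"Question: {question}\n\nDraft answer: {draft}\n\n"
            f"Critique: {critique}\n\nWrite an improved final answer.")},
    ])
```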
Additionally, SARA can call a curated set of external tools and functions to further improve the accuracy of its responses. An automated code generator has been particularly helpful in steering the overall tool toward improved mathematical comprehension. Instead of relying on the LLM alone, SARA can construct and refine Python scripts that numerically reproduce the calculation underlying a scientific reasoning task (such as evaluating a reported equation or quantitatively comparing two tables). These calculations give the model mathematically accurate context, and the technique achieved a 14% improvement in mathematical reasoning above the baseline GPT-4o model.
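The sketch below illustrates the general idea of using generated code as a verification step: the model emits a small Python script that recomputes a reported quantity, the script is executed, and the numeric result is fed back as trusted context. The helper names and prompts are hypothetical (the `ask` callable stands in for an LLM call like the one sketched above), and model-generated code would need sandboxed execution in any real deployment; this is not SARA’s actual pipeline.

```python
# Illustrative sketch of using generated Python to verify a reported equation
# (hypothetical helper names; not SARA's actual pipeline).
import contextlib
import io

def run_generated_script(script: str) -> str:
    """Execute a small generated script and capture what it prints."""
    buffer = io.StringIO()
    namespace = {}
    with contextlib.redirect_stdout(buffer):
        exec(script, namespace)  # sandbox this in any real deployment
    return buffer.getvalue().strip()

def verify_reported_value(ask, equation: str, reported_value: str) -> str:
    # 1) Ask the LLM for a standalone script that recomputes the quantity.
    script = ask(
        "Write a self-contained Python script that computes the value of "
        f"this expression and prints only the number: {equation}"
    )
    # 2) Run the script to obtain a mathematically accurate reference value.
    computed = run_generated_script(script)
    # 3) Hand the numeric result back to the LLM as trusted context.
    return ask(
        f"The paper reports {reported_value} for {equation}. "
        f"A verified computation gives {computed}. "
        "State whether the reported value is consistent and explain briefly."
    )
```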
Testing a scientific research tool
Of course, scientific research tools will only be useful if actual scientists find the tools amenable to their workflows. “What kind of analysis do scientists want to do when they are reviewing papers? We need to think carefully about that question as we build this new tool,” Gong observes.
Fortunately, with researchers working across many cutting-edge disciplines, SRI turns out to be the ideal laboratory for testing a platform like SARA. Gong began with a small internal experiment, providing the tool to a group of materials scientists to explore its efficiency, accuracy, and usability.
SRI scientists were split into three groups: one reviewing a paper manually, one reviewing with help from the baseline GPT-4o model, and one reviewing with assistance from SARA. The group equipped with SARA earned the best score on a timed quiz that required detailed review of a single scientific paper. Because participants did not know which assistant they were using, the test also provided initial data on user preference: 75% of the researchers exposed to SARA expressed interest in continuing to use it regularly, while only 25% reported wanting to continue using the baseline GPT-4o model for literature reviews.
Building toward deployment
While these results are preliminary and have not been peer reviewed, they suggest that the SARA team’s techniques are well on their way to delivering a tool trustworthy enough for real-world research.
This year, the team is focusing on building in retrieval-augmented generation (RAG, a technique widely known to reduce GenAI hallucinations by leveraging external knowledge bases) and uncertainty quantification (which will provide users with “confidence scores” for each response).
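For context, a bare-bones version of the RAG pattern looks something like the sketch below: embed the question, retrieve the most similar passages from a local collection, and fold them into the prompt. The `embed` and `ask` callables and the cosine-similarity ranking are assumptions for illustration, not the architecture the SARA team plans to build.

```python
# Generic retrieval-augmented generation sketch (illustrative only; the
# embedding model, similarity measure, and helpers are assumptions).
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(embed, question: str, passages: list[str], k: int = 3) -> list[str]:
    """Return the k passages most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(embed(p), q_vec), reverse=True)
    return ranked[:k]

def answer_with_rag(ask, embed, question: str, passages: list[str]) -> str:
    """Fold the retrieved passages into the prompt as grounding context."""
    context = "\n\n".join(retrieve(embed, question, passages))
    return ask(
        "Answer using only the context below, and say so if the context "
        f"is insufficient.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
```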
“With the baseline GPT-4o model, without domain expertise, it’s hard to tell whether it’s giving you the correct answer or not. Even if it’s completely incorrect, it still talks in a very confident tone,” Gong notes. The confidence score will give researchers a better sense of when and where they might need to double-check the tool’s outputs.
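One simple way to produce such a confidence score is to sample the model several times and measure how often the answers agree, as in the toy sketch below. This self-consistency heuristic is only an illustration; it is not necessarily the uncertainty-quantification method the SARA team will adopt.

```python
# Toy confidence score via self-consistency: sample several answers and
# report how often the most common one appears (a generic heuristic,
# not necessarily SARA's planned approach).
from collections import Counter

def confidence_score(ask, question: str, n_samples: int = 5):
    """Return (majority_answer, fraction of samples that agree with it)."""
    answers = [ask(question).strip() for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n_samples

# A score near 1.0 means repeated samples agree; a low score flags a
# response the researcher may want to double-check against the paper.
```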
Gong expects that these interventions, along with other avenues the team is discussing, will continue to improve the tool’s consistency while unlocking new capabilities that further accelerate the work of future scientists.
Learn more about our innovations in AI or contact us.