Despite more than three decades of research, Low Energy Nuclear Reactions (LENR) are still not comprehensively understood. The many papers, articles, presentations, and reports on the subject are vital but often difficult to collate. Thousands of pieces of literature have been published over the years, and given the significance of LENR research, it is crucial that this material be readily accessible across the globe. As a supplement to some of our prior efforts, this project introduces a chatbot that draws on our compiled LENR data repository to answer questions and other queries, using the LLM's ability to generate text given relevant context. More about our contributions follows below.
Our primary data source was Jed Rothwell's extensive LENR bibliography, available at https://lenr-canr.org/, which comprises more than 4,000 entries. Each entry contains essential metadata, including title, author(s), publication year, publication source, abstract, and PDF links, and about 2,000 documents are directly accessible via LENR-CANR. The bibliography covers literature from the 1980s to recent times. We extracted the majority of the PDFs automatically and gathered roughly 1,000 documents manually.
The PDFs were parsed into XML files with an open-source machine learning tool to better segregate and organize the documents. To ease ingestion and optimize document retrieval, we programmatically converted these XML files into JSON at the paragraph level. Some documents required further manual pre-processing to address problems such as combined publications that had to be separated, missing paragraphs, and garbled text.
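The XML-to-JSON step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: it assumes the parser emits paragraphs as `<p>` elements (possibly namespaced), and the function name `xml_to_paragraphs`, the sample XML, and the output field names are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}p' down to 'p'."""
    return tag.rsplit("}", 1)[-1]


def xml_to_paragraphs(xml_text, doc_id):
    """Return one dict per non-empty <p> element, in document order."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for elem in root.iter():
        if local_name(elem.tag) != "p":
            continue
        # itertext() also collects text inside nested inline elements;
        # split/join collapses runs of whitespace left over from the PDF.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            paragraphs.append(
                {"doc_id": doc_id, "index": len(paragraphs), "text": text}
            )
    return paragraphs


# Hypothetical input; the real XML schema depends on the parsing tool.
sample = """<doc xmlns="http://example.org/ns">
  <body>
    <p>First paragraph with <hi>inline</hi> markup.</p>
    <p>   </p>
    <p>Second paragraph.</p>
  </body>
</doc>"""

records = xml_to_paragraphs(sample, doc_id="example-001")
print(json.dumps(records, indent=2))
```

Empty paragraphs are skipped here, which is one simple way to reduce noise before retrieval; documents with missing or garbled paragraphs would still need the manual clean-up described above.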
The final step is the LLM. After testing several open-source models of various parameter sizes, we found that a large-scale model was needed to truly aid the LENR research community and reach the intended performance level. We therefore used GPT-4, a state-of-the-art model. We hosted our processed documents with ChatFast so that they could serve as the chatbot's knowledge base. Given the token/input limits of the LLM, each paragraph was treated as a separate training entity for the model, accompanied by additional metadata.
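One way to picture the per-paragraph entries is the sketch below. It is an illustration under stated assumptions rather than the project's pipeline: the `DocumentMeta` class, the `build_entries` helper, the field names, and the placeholder values are hypothetical, and the metadata fields simply mirror those listed for the bibliography (title, authors, year, source). It does not show how ChatFast ingests the data.

```python
from dataclasses import dataclass, asdict


@dataclass
class DocumentMeta:
    """Bibliographic metadata shared by every paragraph of one document."""
    title: str
    authors: list
    year: int
    source: str


def build_entries(meta, paragraphs):
    """Pair each paragraph with its document's metadata as a separate entry."""
    meta_dict = asdict(meta)
    return [
        {**meta_dict, "paragraph_index": i, "text": text}
        for i, text in enumerate(paragraphs)
    ]


# Placeholder values only; these do not describe a real publication.
meta = DocumentMeta(
    title="Example title",
    authors=["A. Author"],
    year=2000,
    source="Example Journal",
)
entries = build_entries(meta, ["First paragraph.", "Second paragraph."])
for entry in entries:
    print(entry)
```

Keeping the metadata on every entry means a single retrieved paragraph still carries enough context to identify the document it came from.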
The chatbot and this webpage are part of a research project led by Prof. David Nagel of George Washington University and Prof. Anasse Bari of New York University, and sponsored by the Anthropocene Institute. Broadly, the project aims to develop AI and predictive analytics tools to support the commercialization of LENR.
The data was compiled, and the chatbot was designed and developed, by Yvonne Vu, Tanya Pushkin Garg, Sneha Singh, and Suryavardan Suresh.
Special thanks are extended to Emos Ker, Charles Wang, Adelina Simpson, Gurmehr Sohi, Saiteja Siddana, and Dongjoo Lee for their hard work on LENR data collection. All of these individuals are members of the Predictive Analytics and AI group at New York University's Courant Institute, under the leadership of Prof. Anasse Bari.
This project was built with support from ChatFast.