LENR ChatBot

About

Despite more than three decades of research, Low Energy Nuclear Reactions (LENR) are still not comprehensively understood. The vast array of papers, articles, presentations and reports is vital but often difficult to collate. Thousands of pieces of literature have been published over the years, and given the significance of LENR research, it is crucial that this data be readily available and accessible across the globe. As a supplement to some of our prior efforts, this project introduces a chatbot that draws on our compiled repository of LENR data to answer questions and other queries, using the LLM's ability to generate answers given relevant context. More about our contributions below.

Data Collection

Our primary data source was Jed Rothwell's extensive LENR bibliography, available at https://lenr-canr.org/, comprising more than 4,743 entries. Each entry contains essential metadata, including title, author(s), publication year, publication source, abstract and PDF link. The bibliography covers literature from the 1980s to recent times. Of these entries, 2,174 documents were directly accessible via LENR-CANR, and we extracted those PDFs automatically; we gathered another 1,250 documents manually.

Data Processing

The PDFs were parsed into TEI XML files using GROBID (GeneRation Of BIbliographic Data), an open-source machine learning library. For easier ingestion of the data and to optimize document retrieval, we programmatically converted all TEI XML files into the target JSON format. Some documents required further manual pre-processing to address issues such as combined publications that had to be separated by hand, missing paragraphs and garbled text.
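
The TEI-to-JSON conversion described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it uses only Python's standard library, the function name `tei_to_record` and the output fields (`title`, `authors`, `paragraphs`) are hypothetical, and the element paths assume the usual layout of GROBID full-text output.

```python
import json
import xml.etree.ElementTree as ET

# TEI namespace used in GROBID output.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Tiny hand-written sample standing in for a real GROBID output file.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title level="a" type="main">Sample LENR Paper</title></titleStmt>
    <sourceDesc><biblStruct><analytic>
      <author><persName><forename type="first">Jane</forename><surname>Doe</surname></persName></author>
    </analytic></biblStruct></sourceDesc>
  </fileDesc></teiHeader>
  <text><body><div>
    <head>Introduction</head>
    <p>First paragraph.</p>
    <p>Second paragraph with a citation <ref type="bibr">[1]</ref>.</p>
  </div></body></text>
</TEI>"""


def _text(element, sep=""):
    """Return all text inside an element with whitespace collapsed."""
    if element is None:
        return ""
    return " ".join(sep.join(element.itertext()).split())


def tei_to_record(tei_xml):
    """Convert one TEI XML string into a plain dict (hypothetical schema)."""
    root = ET.fromstring(tei_xml)

    # Read title and authors from the header only, so that authors of
    # cited references in the bibliography are not picked up.
    title, authors = "", []
    header = root.find("tei:teiHeader", NS)
    if header is not None:
        title = _text(header.find(".//tei:titleStmt/tei:title", NS))
        authors = [
            _text(pers, sep=" ")
            for pers in header.findall(
                ".//tei:analytic/tei:author/tei:persName", NS
            )
        ]

    # Collect the body paragraphs in document order, skipping empty ones.
    paragraphs = []
    body = root.find(".//tei:body", NS)
    if body is not None:
        paragraphs = [_text(p) for p in body.iter(f"{{{TEI}}}p")]
        paragraphs = [p for p in paragraphs if p]

    return {"title": title, "authors": authors, "paragraphs": paragraphs}


if __name__ == "__main__":
    print(json.dumps(tei_to_record(SAMPLE_TEI), indent=2))
```

Documents with missing paragraphs or garbled text would still need the manual pre-processing mentioned above; a parser like this only restructures what GROBID was able to extract.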

Leveraging LLMs

The final step was integrating the LLM. After testing open-source models of various parameter sizes, we found that a large-scale model was needed to truly aid the LENR research community and reach the intended level of performance. With the support of ChatFast, we were able to use the state-of-the-art GPT-4 model. ChatFast also provided a hosted database that we could use as the chatbot's knowledge base. To stay within the token and input limits of the LLM while keeping each entry coherent, we segmented the JSON for each document into paragraphs. Each paragraph was treated as a separate training entity for the model, accompanied by metadata such as title, author, year and a link to the paper.
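
As a rough sketch of the segmentation step, the snippet below turns one parsed document into per-paragraph entries that each carry the document's metadata. The function names, the field names and the `max_words` cap are all assumptions made for illustration: the word count is only a crude stand-in for a real token limit, and splitting overlong paragraphs is our addition, not something the project describes.

```python
def split_by_words(text, max_words):
    """Split text into pieces of at most max_words words each."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def document_to_entries(doc, max_words=300):
    """Build one knowledge-base entry per paragraph (hypothetical schema).

    Paragraphs longer than max_words are split further so that every
    entry stays under the assumed size limit.
    """
    entries = []
    for index, paragraph in enumerate(doc.get("paragraphs", [])):
        for piece in split_by_words(paragraph, max_words):
            entries.append({
                "text": piece,
                "title": doc.get("title", ""),
                "authors": doc.get("authors", []),
                "year": doc.get("year"),
                "link": doc.get("link", ""),
                # Position of the source paragraph within the document.
                "paragraph_index": index,
            })
    return entries
```

Repeating the metadata on every entry means that any retrieved paragraph can be traced back to its paper without a second lookup, at the cost of some duplication in the knowledge base.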

The chatbot and this webpage are part of a research project led by Prof. David Nagel of George Washington University and Prof. Anasse Bari of New York University, and sponsored by the Anthropocene Institute. Broadly, the project aims to develop AI and predictive analytics tools to support the commercialization of LENR.

The data was compiled, and the chatbot was designed and developed, by Yvonne Vu, Tanya Pushkin Garg, Sneha Singh and Suryavardan Suresh. Special thanks are extended to Emos Ker, Charles Wang, Adelina Simpson, Gurmehr Sohi, Saiteja Siddana, and Dongjoo Lee for their hard work on LENR data collection. All of these individuals are members of the Predictive Analytics and AI group at New York University's Courant Institute, under the leadership of Prof. Anasse Bari. This project was built with support from ChatFast.