Comparison of molecular structure elucidation to solving a crossword puzzle. Just as crossword clues provide hints for fitting words into a grid, spectroscopic data such as NMR, IR, and mass spectrometry offer complementary clues about a molecule’s structure. Integrating these diverse clues leads to a complete and consistent picture of the molecule, similar to how words fit together in a puzzle.
Artificial intelligence (AI) is revolutionizing chemistry, with significant impacts on industrial chemical engineering, drug discovery, and education. Large language models (LLMs) have successfully addressed predictive tasks such as molecular property prediction, reaction prediction, and experiment automation. Here, we introduce molecular structure elucidation, a task that presents a new challenge for AI. This task requires integrating diverse spectroscopic data, iterative hypothesis testing, and deep chemical reasoning to determine a molecule’s structure. Much like solving a complex crossword puzzle, it involves piecing together clues to form a coherent solution. The Figure highlights this analogy, illustrating the similarities in strategy and complexity between molecular structure elucidation and solving a crossword puzzle.
In this work, we present a novel approach to molecular structure elucidation, adapting the task for LLMs to explore their potential in chemical research. Our primary contribution is the MolPuzzle dataset, comprising 234 complex structure elucidation challenges that involve multimodal data such as IR, MASS, H-NMR, and C-NMR spectra, as well as molecular formulas. Each instance requires LLMs to navigate three key sub-tasks: molecule understanding, spectrum interpretation, and molecule construction.
We tested 11 state-of-the-art LLMs, including GPT-4o and Claude-3-opus, alongside human benchmarks. Key findings include: (1) GPT-4o outperforms the other models but still falls well short of human performance, with only 1.4% of its answers exactly matching the ground truth; (2) LLMs struggle particularly with spectrum interpretation and molecule construction.
In summary, our contributions are twofold: (1) a new reasoning challenge for the AI community focused on complex problem-solving in chemistry; and (2) new AI tools for the chemistry community, showcasing LLMs’ potential to accelerate molecular structure elucidation and inspire interdisciplinary collaboration.
The MolPuzzle benchmark is designed to test the reasoning capabilities of LLMs in molecular structure elucidation. The dataset contains 200 instances of structure elucidation challenges that simulate real-world chemistry tasks. Each instance in MolPuzzle involves three interlinked sub-tasks, one per stage: molecule understanding (Stage 1), spectrum interpretation (Stage 2), and molecule construction (Stage 3).
In total, MolPuzzle includes 23,678 data examples collected across all stages.
We first evaluated a variety of LLMs on the individual tasks in each stage, including GPT-4o, GPT-3.5-turbo, Claude-3-opus, Gemini-pro, Llama-3-8B-Instruct, Vicuna-13B-v1.5, and Mistral-7B-Instruct-v0.3, as well as multimodal LLMs such as Gemini-pro-vision, LLava-Llama-3-8B, Qwen-VL-Chat, and InstructBlip-Vicuna-7B/13B.
Stage 1 (Molecule Understanding) Tasks
Method | SI | ARI | FGI | SDC |
---|---|---|---|---|
GPT-4o | 1.000±0.000 | 0.943±0.016 | 0.934±0.005 | 0.667±0.003 |
GPT-3.5-turbo | 0.451±0.025 | 0.816±0.017 | 0.826±0.075 | 0.500±0.099 |
Claude-3-opus | 0.361±0.009 | 0.988±0.015 | 0.934±0.001 | 0.856±0.016 |
Llama3 | 0.228±0.043 | 0.696±0.051 | 0.521±0.003 | 0.000±0.000 |
Human | 1.000±0.000 | 1.000±0.000 | 0.890±0.259 | 0.851±0.342 |
Stage 2 (Spectrum Interpretation) Tasks
Method | IR Interpretation | MASS Interpretation | H-NMR Interpretation | C-NMR Interpretation |
---|---|---|---|---|
GPT-4o | 0.656±0.052 | 0.609±0.042 | 0.618±0.026 | 0.639±0.010 |
LLava | 0.256±0.026 | 0.101±0.021 | 0.118±0.008 | 0.254±0.015 |
Human | 0.753±0.221 | 0.730±0.110 | 0.764±0.169 | 0.769±0.101 |
Stage 3 (Molecule Construction) Tasks
Method | H-NMR Elucidation | C-NMR Elucidation |
---|---|---|
GPT-4o | 0.524±0.021 | 0.506±0.037 |
Llama3 | 0.341±0.015 | 0.352±0.017 |
Human | 0.867±0.230 | 0.730±0.220 |
Table 1: F1 scores (↑) on the individual QA tasks in the three stages. The best LLM results are in bold font.
Tasks in stage 1 are SI: Saturation Identification, ARI: Aromatic Ring Identification, FGI: Functional Group Identification, and SDC: Saturation Degree Calculation.
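The SDC task has a standard closed form. As a minimal sketch, assuming "saturation degree" refers to the usual degree of unsaturation (rings plus pi bonds) computed from the molecular formula, the calculation looks like this. The function names and the simple formula parser are illustrative and not part of MolPuzzle.

```python
import re


def parse_formula(formula: str) -> dict[str, int]:
    """Count atoms in a flat molecular formula such as 'C6H6' or 'C2H5NO2'.

    Illustrative helper: it does not handle parentheses, charges, or isotopes.
    """
    counts: dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def degree_of_unsaturation(formula: str) -> float:
    """Rings plus pi bonds: (2C + 2 + N - H - X) / 2, where X counts halogens."""
    atoms = parse_formula(formula)
    c = atoms.get("C", 0)
    n = atoms.get("N", 0)
    h = atoms.get("H", 0)
    x = sum(atoms.get(halogen, 0) for halogen in ("F", "Cl", "Br", "I"))
    # Divalent atoms such as O and S do not change the count.
    return (2 * c + 2 + n - h - x) / 2


def is_saturated(formula: str) -> bool:
    """A molecule with no rings or pi bonds has a degree of unsaturation of zero."""
    return degree_of_unsaturation(formula) == 0
```

For benzene (C6H6) the formula gives (12 + 2 - 6) / 2 = 4, matching one ring plus three double bonds. Under the same assumption, a saturation check like `is_saturated` relates to the SI task.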
For solving entire molecule puzzles, the evaluation is limited to the three most advanced multimodal LLMs, GPT-4o, Claude-3-opus, and Gemini-pro, because Stage 2 involves spectrum image analysis.
Method | Acc. (↑) | Levenshtein (↓) | Validity (↑) | MACCS FTS (↑) | RDK FTS (↑) | Morgan FTS (↑) |
---|---|---|---|---|---|---|
GPT-4o | 0.014±0.004 | 11.653±0.013 | 1.000±0.000 | 0.431±0.009 | 0.293±0.013 | 0.232±0.007 |
Claude-3-opus | 0.013±0.008 | 12.680±0.086 | 1.000±0.000 | 0.383±0.050 | 0.264±0.040 | 0.241±0.037 |
Gemini-pro | 0.000±0.000 | 12.711±0.196 | 1.000±0.000 | 0.340±0.017 | 0.208±0.002 | 0.171±0.007 |
Human | 0.667±0.447 | 1.332±2.111 | 1.000±0.000 | 0.985±0.022 | 0.795±0.317 | 0.810±0.135 |
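Two of the metric families in this table are easy to illustrate. The sketch below assumes that the Levenshtein column is an edit distance between string representations of the predicted and reference molecules (for example SMILES), and that FTS stands for fingerprint Tanimoto similarity. The text here does not spell out either detail. Fingerprints are represented as plain sets of on-bit indices; in practice they would come from a cheminformatics toolkit, which this sketch deliberately avoids.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by the union of on-bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen for this sketch only
    return len(bits_a & bits_b) / len(bits_a | bits_b)


# Hypothetical inputs: ethanol vs. ethylamine strings, and made-up bit sets.
distance = levenshtein("CCO", "CCN")
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
```

Here `distance` is 1 (one substitution) and `similarity` is 0.5 (two shared bits out of four). A string distance depends on how the strings are written, so a comparison like this is only meaningful if both molecules are normalized to the same canonical form first.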
Error in solving the molecule puzzle
The Figure presents case studies that illustrate the iterative steps involved in Stage 3, showcasing the most common error made by GPT-4o: the accumulation of errors across iterative steps, which can lead to catastrophic failures. Note that this stage focuses on selecting the correct fragments and assembling them step by step to form the final molecular structure. We find that GPT-4o can initially succeed in picking the correct fragment when the structure is comparatively simple. However, as the process progresses, it fails to select structures that satisfy all the requirements indicated by the NMR data.
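A back-of-the-envelope calculation shows why errors in an iterative procedure compound so quickly. The sketch below assumes, purely for illustration, that each assembly step is correct independently with the same probability. The value 0.9 is hypothetical and is not a measured result from this benchmark.

```python
def chance_all_steps_correct(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


# Hypothetical per-step accuracy of 0.9 over increasingly long assemblies.
for steps in (1, 3, 5, 8):
    print(steps, round(chance_all_steps_correct(0.9, steps), 3))
```

Under this simplified model, even a step that is right nine times out of ten leaves well under half of eight-step assemblies fully correct, and a single wrong fragment early on invalidates everything built on top of it.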