Chemically Aware AI for Organic Synthesis

May 4, 2023

Introduction

Hype around AI keeps shaking public opinion and exciting our minds. Will it make our lives easier? Will it take our jobs? Is the rise of the machines just around the corner? As the power of AI increases, so do the concerns about its responsible use. For instance, a group of researchers reported in 2022 that deep learning generative models used in drug discovery, where predicted toxicity normally steers the model away from harmful compounds, can be easily tweaked to design novel chemical warfare agents. But let’s not drift away. Here are some things to consider:

AI has the potential to improve our lives by automating repetitive tasks and providing more personalized experiences.
Artificial general intelligence is not yet realized. Thus, creativity and deep reasoning are currently outside AI’s scope.
While AI may replace some jobs, it also has the potential to create new jobs in applied fields such as data science and machine learning.
It’s important to approach AI development and implementation with caution, ensuring that ethical considerations such as privacy and bias are taken into account.

In this essay, we want to address a specific field of chemistry and how AI can help change it for the better.

Organic synthesis plays a crucial role in the discovery and development of new pharmaceuticals and materials used in various industries. The process involves designing and executing chemical reactions to create complex molecular structures.

While chemists have traditionally carried out these reactions manually, recent advances in AI have enabled chemically aware systems that can automate and optimize parts of the synthesis process.

Future progress in this field relies on the development of explainable ‘chemistry aware’ methods. These methods should reveal the molecular mechanisms behind compound activity, suggest better compounds, and optimize synthetic routes using chemical knowledge.

So How Exactly Does Chemically Aware AI Work?

Much like large language models, chemical AI digests representations of chemical structures and reactions to identify patterns and predict outcomes based on known data. There are several ways to represent them, and choosing among them is the subject of fields such as cheminformatics. They include string-encoded structures (SMILES) and so-called molecular fingerprints, typically bit vectors that encode which structural features a molecule contains. The output of a deep learning model depends on the purpose it was built for. It might be a classification (toxic or non-toxic), a numerical value (solubility or potency), or a newly generated structure.
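
To make the idea of a fingerprint concrete, here is a minimal sketch in Python. It is a toy, not a real cheminformatics fingerprint: it hashes raw substrings of a SMILES string instead of chemically meaningful substructures, and the function names, the 2048-bit size, and the substring length are all our own assumptions. It only illustrates the principle that similar structures share many "on" bits.

```python
import hashlib


def toy_fingerprint(smiles, n_bits=2048, max_len=3):
    """Hash every substring of length 1..max_len to a bit position (toy scheme)."""
    on_bits = set()
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            digest = hashlib.md5(fragment.encode()).hexdigest()
            on_bits.add(int(digest, 16) % n_bits)
    return on_bits


def tanimoto(a, b):
    """Tanimoto similarity: shared bits divided by all bits set in either set."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# SMILES strings for ethanol, 1-propanol and benzene
ethanol = toy_fingerprint("CCO")
propanol = toy_fingerprint("CCCO")
benzene = toy_fingerprint("c1ccccc1")

print("ethanol vs propanol:", tanimoto(ethanol, propanol))
print("ethanol vs benzene: ", tanimoto(ethanol, benzene))
```

Production fingerprints are computed on the molecular graph, so they do not depend on how a particular SMILES string happens to be written; this sketch does.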

However, the most interesting application, where AI really shines, is the direct translation of one structure or reaction into another, much like machine translation between natural languages. Recent advances let us feed reactants and reagents to a model as input and get the predicted products as output! Another innovative approach converts discrete molecular structures into a continuous chemical space and back. In effect, the model learns its own chemical language, one we don’t directly understand. It also means that when we ask the model to modify a given molecule, it can come up with structures that are non-obvious or even hard to imagine for human chemists.

Machine Learning: The Key to AI’s Success in Organic Synthesis

Machine learning plays a pivotal role in enhancing the accuracy of chemically aware AI. It relies on large datasets of chemical reactions and their outcomes to train the AI system. The more diverse and comprehensive the training data, the better the AI can predict and optimize chemical reactions.

As the AI system encounters new reactions and compounds, it continually updates its knowledge base through a process called iterative learning. This allows the chemically aware AI to fine-tune its predictions and suggestions, leading to more accurate results over time.

Another essential aspect of machine learning in chemically aware AI is transfer learning. This allows the AI system to apply knowledge gained from one set of chemical reactions to other, related reactions. By leveraging the similarities between different chemical processes, transfer learning enables the AI to generalize its predictions and recommendations across a wide range of organic synthesis scenarios. This is especially important since a comprehensive, reliable and systematic body of chemical knowledge is yet to be realized.

This capability helps chemically aware AI to handle new reaction types or complex molecules more effectively, making it a valuable tool in the discovery and development of novel pharmaceuticals and materials.

Where do we stand now?

Figure: the DMTA cycle (Design, Make, Test, Analyze) and the role of AI in drug discovery.

Applied chemistry fields, such as drug discovery or materials science, rely on the design-make-test-analyse (DMTA) cycle to realise new products. Recent computational advances have turned AI and ML into a Swiss army knife that can facilitate every step of the DMTA cycle. Synthetic chemistry deals with the make phase, where failures are inevitable, and data-driven approaches can both speed it up and reduce the number of failed attempts. Computer-aided synthesis planning (CASP) consists of three principal tasks:

Retrosynthesis, divided into the subproblems of generating single-step suggestions and then applying them recursively to build multi-step routes;
Recommending conditions for successful forward reactions to make suggestions actionable;
Forward reaction prediction to validate proposed synthetic steps.

Let’s discuss AI opportunities in each of them.

Retrosynthetic planning software can be categorised into two major types: those that use expert-encoded rules or heuristics and those that learn how to generate recommendations. Many retrosynthetic methods rely on reaction templates, which are reaction rules that can be stored in a SMARTS or SMIRKS format. The algorithmic extraction of templates from a reaction dataset involves identifying the reaction center, atoms adjacent to the reaction center, and generalized functional groups involved in the reaction. Automated pipelines for extracting reaction templates allow for facile (re)training on proprietary data sets, but the templates they produce can differ from those an expert would encode.

Machine-learning-based approaches have focused on learning which templates provide the most strategic disconnections. The single-step retrosynthetic capabilities can be extended to full route design by using a tree search. Different implementations have been investigated, including depth-first, best-first, proof-number, and Monte Carlo tree search. A retrosynthetic search is terminated once it reaches precursors that can be purchased. Other stop criteria, such as the number of occurrences in the literature or chemical logic, can also be used. However, the ability to identify a pathway does not guarantee its chemical feasibility. The best method of validation would be to perform the chemistry in the lab, but this is prohibitively expensive to undertake for every generated route and is not a scalable way to validate new retrosynthesis planning methods.
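
The recursive idea can be sketched in a few lines of Python. This is a toy depth-first search under strong assumptions: the `SINGLE_STEP` table stands in for a learned single-step model, the compound names and the `PURCHASABLE` set are hypothetical, and real planners rank suggestions and use smarter search strategies than plain depth-first.

```python
# Hypothetical single-step suggestions: target -> candidate sets of precursors.
SINGLE_STEP = {
    "amide_product": [("acid_chloride", "amine"), ("carboxylic_acid", "amine")],
    "acid_chloride": [("carboxylic_acid",)],
    "amine": [("nitro_compound",)],
}

# Hypothetical catalogue of building blocks that can be bought.
PURCHASABLE = {"carboxylic_acid", "nitro_compound"}


def find_route(target, max_depth=5):
    """Return a list of (product, precursors) steps, or None if no route is found."""
    if target in PURCHASABLE:
        return []  # stop criterion: nothing left to make
    if max_depth == 0:
        return None  # give up on routes that are too long
    for precursors in SINGLE_STEP.get(target, []):
        route = [(target, precursors)]
        for precursor in precursors:
            sub_route = find_route(precursor, max_depth - 1)
            if sub_route is None:
                break  # this disconnection fails, try the next one
            route.extend(sub_route)
        else:
            return route  # every precursor was resolved
    return None


for product, precursors in find_route("amide_product"):
    print(product, "<=", " + ".join(precursors))
```

Note that the search only tells us a route exists on paper. As the text says, whether each step actually works still has to be checked by other means.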

The process of planning a retrosynthetic route is only one aspect of a full CASP system. To make it actionable, chemists must propose a set of reaction conditions that can achieve the desired transformation. Finding the optimal or acceptable set of conditions for a reaction can require time-consuming empirical screening even if similar reactions are described in the literature. Machine-learning models for condition recommendation can infer suitable conditions if trained on historical condition data. However, the lack of high-quality data hampers progress in developing such models.

Data-driven approaches have demonstrated the ability to suggest conditions both for specific reaction classes and for diverse reaction sets. In practice, condition recommendation models would likely be developed to suit the needs of a particular area of chemistry. AI techniques can also accelerate the empirical optimization of reaction conditions. Model-based techniques construct a surrogate model of reaction performance as a function of reaction conditions, and various search strategies can be layered on top of these models to select the next set of conditions to try and to refine the model. Machine-learning-based models have the potential to provide better estimates of performance and uncertainty to accelerate the search.
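
Here is a rough Python sketch of such a model-based loop. Everything in it is an assumption made for illustration: `run_experiment` is a made-up yield curve standing in for a real lab measurement, the surrogate is a crude inverse-distance-weighted average rather than a proper statistical model, and the "uncertainty" is simply the distance to the nearest tested condition.

```python
def run_experiment(temperature_c):
    """Stand-in for a lab measurement: a made-up yield curve peaking at 80 °C."""
    return max(0.0, 90.0 - 0.02 * (temperature_c - 80.0) ** 2)


def surrogate(temperature_c, observations):
    """Estimate yield from past data and report the distance to the nearest data point."""
    weights = [(1.0 / (abs(temperature_c - t) + 1e-9), y) for t, y in observations.items()]
    total = sum(w for w, _ in weights)
    estimate = sum(w * y for w, y in weights) / total
    nearest = min(abs(temperature_c - t) for t in observations)
    return estimate, nearest


def optimize(candidates, seeds, budget, exploration=0.5):
    """Repeatedly test the candidate with the best estimate plus exploration bonus."""
    observations = {t: run_experiment(t) for t in seeds}
    for _ in range(budget):
        untested = [t for t in candidates if t not in observations]
        if not untested:
            break

        def acquisition(t):
            estimate, nearest = surrogate(t, observations)
            return estimate + exploration * nearest

        next_t = max(untested, key=acquisition)
        observations[next_t] = run_experiment(next_t)  # "run" it and refine the model
    return observations


results = optimize(candidates=list(range(20, 130, 10)), seeds=[20, 120], budget=4)
best_t = max(results, key=results.get)
print("tested:", sorted(results), "best so far:", best_t, results[best_t])
```

The `exploration` weight controls the trade-off: a higher value favours conditions far from anything tested so far, a lower value favours conditions close to the current best.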

The third key task of CASP is to ensure that recommendations obtained through algorithmic synthesis design are robust and actionable by anticipating potential reaction products. Machine learning methods for reaction prediction include inferring reaction rules from a predefined list, using graph convolutional neural networks, and predicting product SMILES with sequence-to-sequence models. Forward-reaction predictors can also be used for side-product prediction, which helps identify potentially harmful or difficult-to-separate intermediates. Many reactions can lead to multiple regio- or stereoisomeric compounds, making information about a reaction’s selectivity and possible side products crucial for prioritizing syntheses and structure assignment. These models will be indispensable for the consideration and design of purification strategies once they can make quantitative predictions.

Reaction prediction also has applications in make-on-demand virtual libraries and hit expansion in drug discovery. An automated pipeline can enumerate all combinations of available starting materials that could be substituted into a route, and a forward predictor can score which combinations are likely to lead to successful reactions. This allows for a rapid assessment of the accessible chemical space surrounding the target. This capability is closely related to integrating the goals of diversity-oriented synthesis into CASP.
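
A toy version of such a pipeline might look like the Python sketch below. The building blocks, their "reactivity" numbers, and the scoring function are all invented for illustration; a real pipeline would call a trained forward-reaction model instead of `toy_forward_score`.

```python
from itertools import product

# Hypothetical building blocks with made-up reactivity values between 0 and 1.
ACIDS = {"benzoic acid": 0.9, "pivalic acid": 0.4}
AMINES = {"aniline": 0.6, "benzylamine": 0.9, "diisopropylamine": 0.2}


def toy_forward_score(acid, amine):
    """Placeholder for a trained forward predictor: multiply the made-up reactivities."""
    return ACIDS[acid] * AMINES[amine]


def enumerate_library(threshold=0.5):
    """Score every acid/amine pair and keep the likely successes, best first."""
    scored = [(toy_forward_score(a, b), a, b) for a, b in product(ACIDS, AMINES)]
    return sorted((row for row in scored if row[0] >= threshold), reverse=True)


for score, acid, amine in enumerate_library():
    print(f"{acid} + {amine}: {score:.2f}")
```

Lowering `threshold` widens the virtual library at the cost of including couplings the predictor considers unlikely to work.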

What holds us back?

AI application in organic synthesis has its limits (at least for now):

Insufficient or biased training data, which can affect the accuracy and reliability of AI predictions. Predicting rare or unexpected chemical reactions is difficult because so few examples are available for training. Negative results are rarely, if ever, published. Published research and datasets, in turn, suffer from what is now referred to as the ‘reproducibility crisis’. Crucial details about a synthetic procedure may be omitted or intentionally altered to hinder competitors. AI has no means of verifying such details in advance.
Difficulty in representing and predicting the behavior of large, complex molecules. Larger molecular entities often do not behave as the sum of the fragments they are composed of but acquire new and unexpected properties. This means that deep learning models trained on simpler molecules may fail on more complicated structures. Bottom-up approaches like quantum chemistry calculations or molecular dynamics simulations could be an avenue to overcome this hurdle.
The need for explainable AI that can provide transparent insights into the reasoning behind its decisions. Many deep learning models are effectively black boxes. We know that large language models can make up answers without any real understanding of the question. The lack of supporting evidence or a reasoning chain is a big concern for a rigorous scientific approach to research. So the experimental validation of AI predictions should not be underestimated.
The knowledge gap between chemists and computer scientists, creating communication barriers. There are two opposing trends in scientific development. The first is the increasing specialization of research fields as the techniques and instruments employed become more complicated. The second is that the quantity and quality of the data generated encourage chemists and biologists to acquire programming and data science skills. No doubt the design of ergonomic AI products will ease their adoption by science professionals.
Ensuring the security and privacy of sensitive chemical data used in training and applying chemically aware AI systems. Robust data protection measures must be in place. This includes secure storage and transmission of data, as well as strict access controls to prevent unauthorized use or manipulation of the data. Additionally, regulatory frameworks must be developed to ensure that chemically aware AI is used safely and responsibly in research and development.
The risk of overreliance on AI and the potential for reduced creativity in the synthesis process. There’s a fine line between automating tedious, laborious tasks and delegating everything you don’t want to work through to your AI assistant. Human creativity is what creates value and breakthrough discoveries. AI is here to help and facilitate, not to substitute for it.

Outlook

As AI-based CASP tools become more widely used, they are increasingly integrated into medicinal chemistry workflows. They have the potential to increase the efficiency, speed, and accuracy of reaction prediction and optimisation, reduce costs and waste in the synthesis process, generate novel compounds, and improve drug design overall. Despite the current limitations, ongoing research and development continue to improve the accuracy and scope of chemically aware AI. With responsible use, chemically aware AI has the potential to make a significant impact on the future of drug discovery and development.

Chemical data is complex and high-dimensional, making it challenging for traditional machine learning algorithms to work effectively. New approaches that can better handle this complexity should be explored. Another area that needs improvement is the integration of chemically aware AI with other technologies, such as robotics and automation. By combining chemically aware AI with robotics, researchers can create systems that automatically perform chemical experiments and analyze the results. Implementing this approach will likely require suitable hardware and analytical tools that can generate more accurate and comprehensive chemical data. In turn, high-quality data is indispensable for making accurate predictions and classifications. Progress in the field critically depends on our ability to integrate public knowledge with in-house data in a way that both protects intellectual property and benefits society.

In the last few months, we’ve gotten accustomed to reaching out to our trusty chatbots in case of writer’s block or to compose a polite reply to an email that we don’t feel is worthy of our own time. In the near future, chemists should be able to ask an AI research assistant to analyse given data, suggest which compound to synthesise next, and propose how to make it. At HMND we believe that the advent of chemically aware AI stands to change the way chemists work, opening new opportunities for research and discovery.