Every chemistry-related research project needs to answer two essential questions: “What compound to design?” and “How to synthesize it?”.
Answering the first question is fundamentally difficult, because we never know whether our compound design is good enough until we evaluate the actual product. As a matter of fact, we aren’t sure whether there is a solution at all. All we can do is repeat the design-synthesize-test cycle, hoping to come up with something fit for the purpose.
As for the second question, regarding synthesis, everything seems to be much simpler. Once we have designed our compound, all the power of organic synthesis is here to serve us. This might be straightforward for basic structures and established workflows, but what if we dare to synthesize this beast?
As opposed to synthesis, retrosynthetic analysis aims to break down complex molecules into simpler components to discover the most efficient routes to get the desired product. This type of analysis has become increasingly important as the need for faster and more efficient drug discovery has grown. You can bet it requires a great deal of comprehensive knowledge, experience, creativity and serendipity to practice this art. Fortunately, the development of software designed to facilitate retrosynthetic analysis has made it easier than ever before.
Back in the 1990s, it took the world’s best chemists months to devise a retrosynthesis of paclitaxel. In 2016, Chematica (now Synthia) did it in just 7 seconds.
Software designed to assist retrosynthesis relies on algorithms to generate synthetic pathways. This approach is incredibly powerful, as it allows users to explore chemical space more quickly than any human being could. Generally, the algorithms enumerate plausible routes of synthesis and then weigh them in a second step to identify the optimal ones. Sounds much like a game of chess, right? Given that computers already beat humans at chess and Go, it looks safe to assume that they might excel at retrosynthesis as well. Sounds logical? Maybe. Let’s dig in a little more.
Navigating known chemical space
Historically, the first and most straightforward approach is to break some bonds in the target molecule and ask the literature what transformations could join these pieces together. One ‘little’ complication is that the pile of literature to search through is enormous, and with each synthetic step we trace back towards commercially available raw materials, the number of possible variants grows exponentially. If the target compound takes 5 steps to obtain, the number of synthetic routes to take into account is on average around 10^16.
Early software solutions for this problem worked like chemical calculators able to process large amounts of information. This requires a comprehensive reaction database with a chemically appropriate data structure, elaborate algorithms and a good deal of computing power.
Popular proprietary services are SciFinder from the American Chemical Society and Reaxys Synthesis Planner (Elsevier). These platforms combine retrosynthetic analysis with a user-friendly interface and easy access to a range of chemical databases. They also offer automated patent search tools, making it easy to find relevant information.
Merck’s Synthia (formerly Chematica) is one of the most advanced solutions of this kind. The software integrates automated pathway search based on manually curated reaction rules with an intelligent scoring function, enabling chemists to define the criteria for an optimal strategy.
Among the open source projects, ASKCOS, initially developed by MIT under the DARPA Make-It initiative, deserves a mention. An interesting open source solution for synthetic biology and the discovery of metabolic pathways is RetroPath 2.0.
Knowledge-based approaches work decently with relatively simple structures that take few synthetic steps to make, which is good enough to speed up routine work. However, this approach fails with non-trivial molecules. Machine-querying a chemical reaction database is fast, but not ingenious enough to properly apply the tricks and judgements that chemists skilled in the art might propose. And most importantly, this type of search is by definition unable to propose any novel synthetic strategy.
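To get a feel for how quickly the numbers explode, here is a back-of-the-envelope sketch. It assumes that every retrosynthetic step offers the same number of candidate disconnections; the value of 1,500 options per step is purely an illustrative assumption, chosen because it lands near the 10^16 figure after 5 steps, and is not taken from any database.

```python
def count_routes(branching_factor: int, steps: int) -> int:
    """Number of linear routes if every step offers the same number of options."""
    return branching_factor ** steps


# Assumed value for illustration only: ~1,500 candidate disconnections per step.
ASSUMED_OPTIONS_PER_STEP = 1500

for steps in range(1, 6):
    routes = count_routes(ASSUMED_OPTIONS_PER_STEP, steps)
    print(f"{steps} step(s): about {routes:.1e} routes")
```

Each extra step multiplies the search space by the branching factor, which is why brute-force enumeration stops being practical after only a handful of steps.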
Augmenting chemical reality with AI
A paradigm shift happened when reaction databases were converted to reaction graphs and fed to neural networks. Searching for connections became computationally more efficient, and AI is better suited to fuzzy scoring than predefined algorithms are. This is quite important considering the many factors that contribute to the viability of a proposed strategy (cost of starting materials, reaction yields, reliability of reagents, labor intensity, and so on).
A recent open source tool for retrosynthetic planning of this kind is AiZynthFinder, built with RDKit and TensorFlow. It uses a Monte Carlo tree search algorithm, guided by a library of known reaction templates, to break target molecules down into precursors that can be purchased.
Finally, with computer-assisted synthetic planning based on published methods maturing, we’ll soon be facing the next big challenge. As you might have already guessed, it is the discovery of new reaction conditions, types and mechanisms which have never been reported. In other words, a qualitative transition from a retrosynthetic search engine to a truly intelligent chemical mind.
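To make the trade-off between these factors concrete, here is a minimal sketch of a hand-weighted route score. Everything in it is hypothetical: the `Route` class, the weights and the numbers are made up for illustration, and a learned model would replace the fixed weights with something fitted to data.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Route:
    name: str
    step_yields: list[float]  # fractional yield of each step, e.g. 0.9 for 90%
    material_cost: float      # cost of starting materials, arbitrary units


def overall_yield(route: Route) -> float:
    """Overall yield of a linear route is the product of its step yields."""
    return prod(route.step_yields)


def score(route: Route, cost_weight: float = 0.01, step_penalty: float = 0.05) -> float:
    """Higher is better: reward yield, penalize cost and the number of steps."""
    return (overall_yield(route)
            - cost_weight * route.material_cost
            - step_penalty * len(route.step_yields))


routes = [
    Route("short but expensive", [0.6, 0.7], material_cost=30.0),
    Route("long but cheap", [0.9, 0.9, 0.85, 0.9], material_cost=5.0),
]
best = max(routes, key=score)
print(best.name)
```

Changing the weights changes the winner, which is the point: what counts as the optimal route depends on what the chemist cares about.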
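The core idea, expanding a target recursively until every leaf is purchasable, can be sketched with a toy example. Note the assumptions: this is a plain depth-first search rather than Monte Carlo tree search, it does not use AiZynthFinder’s API, and `TEMPLATES`, `STOCK` and `find_route` are hypothetical names, with compound-class labels standing in for real structures.

```python
# Hypothetical toy data: each product maps to alternative sets of precursors.
TEMPLATES = {
    "amide": [("acid", "amine")],
    "acid": [("ester",), ("nitrile",)],
    "amine": [("nitro",)],
}
# Hypothetical catalogue of purchasable building blocks.
STOCK = {"ester", "nitro"}


def find_route(target, depth=0, max_depth=5):
    """Return a list of (product, precursors) steps, or None if no route exists."""
    if target in STOCK:
        return []  # purchasable: nothing left to make
    if depth >= max_depth:
        return None  # give up on overly long routes
    for precursors in TEMPLATES.get(target, []):
        steps = [(target, precursors)]
        for precursor in precursors:
            sub_route = find_route(precursor, depth + 1, max_depth)
            if sub_route is None:
                break  # this disconnection fails, try the next template
            steps.extend(sub_route)
        else:
            return steps  # every precursor was resolved down to stock
    return None


for product, precursors in find_route("amide"):
    print(product, "<=", " + ".join(precursors))
```

A real planner replaces the lookup table with reaction templates applied to molecular graphs, and replaces the exhaustive loop with a guided search that decides which branch deserves attention first.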
Towards intelligent retrosynthesis
The first steps in this direction involved the selection or interpolation of appropriate reaction conditions using statistical and machine learning techniques. This is a two-step process in which the initial choice of conditions depends on a limited set of predefined reaction rules. Therefore, it suffers from the same limitations that we pointed out earlier.
The most advanced current technology here is the sequence-to-sequence (seq2seq) model. Initially designed for machine translation of natural languages, it uses a multilayered Long Short-Term Memory (LSTM) network to encode the input molecules, written as SMILES strings, into an abstract vector and then decode that vector into the SMILES of the product(s). In effect, it translates reagents into products, deciphering the rules of chemical grammar on its own. Fascinating! Reaction prediction success rates of up to 80% have been reported. Mistakes still happen, though: the model might produce invalid SMILES or infeasible reactions, much like any other generative model (hello, ChatGPT!).
The way to inject sanity and reason into this chemical mind is to discover new chemistry with quantum mechanics calculations, one of the most rigorous, accurate and scientifically sound foundations we have for understanding nature. Successful attempts of this kind date back to the 1980s, when the IGOR program was shown to predict unexpected and novel transformations, but the approach has seen little development since (except for limited attempts to integrate a quantum chemistry module into Chematica and the QCaRA method by the Maeda group). Given the tremendous progress in computing power and mathematics, the time looks ripe to marry quantum chemistry with retrosynthesis.
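Before a model of this kind can “translate” anything, the SMILES string has to be cut into tokens, much like a sentence is cut into words. Here is a minimal sketch of that step. The regular expression is a deliberately simplified assumption, not a full SMILES grammar: it recognizes bracket atoms, the two-letter halogens Cl and Br, and two-digit ring closures, and treats every other character as its own token.

```python
import re

# Simplified token pattern (an assumption, not a complete SMILES grammar):
# bracket atoms, Br, Cl, two-digit ring closures (%NN), then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into the tokens a sequence model would consume."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocabulary(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer index, in sorted order."""
    return {token: index for index, token in enumerate(sorted(set(tokens)))}


tokens = tokenize("CC(=O)Cl")  # acetyl chloride
vocabulary = build_vocabulary(tokens)
encoded = [vocabulary[token] for token in tokens]
print(tokens)
print(encoded)
```

The integer sequence is what the encoder actually sees. Nothing at this level guarantees that a decoded sequence is a valid molecule, which is why such models can emit invalid SMILES.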
Where do we go from here?
Humans have long been the creative force behind finding solutions to immensely difficult retrosynthesis problems. With the literature rapidly growing to an unmanageable size, computer algorithms emerged to assist in identifying reaction pathways. The development of software for retrosynthetic analysis has made it easier and more accessible than ever before. We have seen computational approaches evolve from manually curated reaction rules to automatically extracted ones, and from algorithmic workflows to machine learning. Many computer-assisted retrosynthesis tools are available, but there is no gold standard yet.
Although comprehensive reaction databases and elaborate learning algorithms have been developed, retrosynthetic software is still in its infancy. To use the language analogy from earlier, it has learned how to read and write, but mastering a writer’s craft is still a task for the future. Regioselectivity, enantioselectivity, protecting groups and optimal general strategy remain particularly challenging issues for current platforms. Many researchers suggest that future efforts should integrate expert knowledge with chemically aware AI to take the best of both worlds.
At HMND we believe that the automated synthetic labs of the future will likely be assisted by software that facilitates the design and scoring of synthetic strategies. Imagine a system that could analyze and classify thousands of potential synthetic routes, recognize those that can be executed using automated processes, evaluate the availability of the necessary raw materials and initiate the automated procedure on an appropriate platform. With the right software solution, you would be able to solve your chemical problems more quickly and efficiently, enabling faster drug discovery.