Based on Transformers, our new Enformer architecture advances genetic research by improving the ability to predict how DNA sequence influences gene expression.
When the Human Genome Project succeeded in mapping the DNA sequence of the human genome, the international research community were excited by the opportunity to better understand the genetic instructions that influence human health and development. DNA carries the genetic information that determines everything from eye colour to susceptibility to certain diseases and disorders. The roughly 20,000 sections of DNA in the human body known as genes contain instructions about the amino acid sequence of proteins, which perform numerous essential functions in our cells. Yet these genes make up less than 2% of the genome. The remaining base pairs — which account for 98% of the 3 billion “letters” in the genome — are called “non-coding” and contain less well-understood instructions about when and where genes should be produced or expressed in the human body. At DeepMind, we believe that AI can unlock a deeper understanding of such complex domains, accelerating scientific progress and offering potential benefits to human health.
Today Nature Methods published “Effective gene expression prediction from sequence by integrating long-range interactions” (first shared as a preprint on bioRxiv), in which we — in collaboration with our Alphabet colleagues at Calico — introduce a neural network architecture called Enformer that led to greatly increased accuracy in predicting gene expression from DNA sequence. To advance further study of gene regulation and causal factors in diseases, we also made our model and its initial predictions of common genetic variants openly available here.
Previous work on gene expression has typically used convolutional neural networks as fundamental building blocks, but their limitations in modelling the influence of distal enhancers on gene expression have hindered their accuracy and application. Our initial explorations relied on Basenji2, which could predict regulatory activity from relatively long DNA sequences of 40,000 base pairs. Motivated by this work and the knowledge that regulatory DNA elements can influence expression at greater distances, we saw the need for a fundamental architectural change to capture long sequences.
To continue reading this article, click here.