For over seven decades, scientists have been trying to crack the code of proteins, the machinery of life, through proteomics research, the biological field dedicated to describing and cataloging proteins and their functions in living things. In the process, scientists have collected valuable information about some of life’s most fundamental processes and how various stresses, mutations, or other changes can impact health.
However, there can be too much of a good thing. As technologies have made protein sequencing and cataloging cheaper and faster, medical professionals and researchers have rapidly increased the size and scope of protein databases. With terabytes of information to sift through (one terabyte is one thousand gigabytes), those needing quick answers now struggle to find information that is pertinent and helpful to a specific diagnosis, treatment plan, or research goal.
To address this problem, researchers at the Technical University of Munich (TUM) partnered with staff at the Leibniz Supercomputing Centre (LRZ) to use high-performance computing (HPC) and a machine learning approach to help structure and organize these extremely valuable databases.
“What we’ve developed is certainly already useful for computer-savvy researchers. Over time, we hope this method will also enable medical professionals and researchers who don’t have access to computer clusters to search through these databases faster,” said Michael Heinzinger, member of the bioinformatics lab led by Prof. Burkhard Rost at TUM (Rostlab). “For example, when a patient comes to a hospital in an emergency, doctors may need to know what protein mutations that person has and what potential effect those mutations have so that they can better assess potential impacts on treatment options. Bringing this work to the commercial level is our goal.”
Early returns are promising, and the team has already made protein databases more accessible. The team recently published its results in BMC Bioinformatics.
The language of life
Proteins are molecules that act as the unsung heroes of living things. Whether they are helping replicate DNA, automating processes like blinking and breathing, or processing and reacting to external stimuli, these molecules underpin most of the things that make us sentient. Proteins are long chains of amino acids, one of the fundamental building blocks of life. Much like how words connect to form sentences, paragraphs, or whole novels, the order of these amino acids determines the role a protein will play in the body, helping convey its “meaning” to the body’s complex molecular inner workings.
The analogy between how grammar and syntax give a sentence its meaning and how an amino acid sequence gives a protein its function served as an inspiration behind the team’s work. Specifically, the team applied an algorithm originally developed for studying language to its protein database work: ELMo (Embeddings from Language Models), a natural language processing (NLP) algorithm for analyzing word use and variation in different contexts.
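To make the language analogy concrete, the short Python sketch below treats a protein sequence as a "sentence" whose "words" are individual amino acids, which is the kind of input a language model expects. It is only an illustration: the one-token-per-residue scheme, the "X" placeholder, and the names `tokenize` and `encode` are assumptions of this sketch, not details taken from the team's code.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNKNOWN = "X"  # placeholder for any other symbol (an assumption of this sketch)

# Map each token to an integer id; id 0 is reserved for the unknown token.
VOCAB = {UNKNOWN: 0, **{aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}}


def tokenize(sequence):
    """Split a protein sequence into one token per residue."""
    return [aa if aa in VOCAB else UNKNOWN for aa in sequence.upper()]


def encode(sequence):
    """Convert a protein sequence into the integer ids a model would consume."""
    return [VOCAB[token] for token in tokenize(sequence)]


if __name__ == "__main__":
    # A made-up fragment, used only to show the shape of the output.
    print(tokenize("MKTAYIAK"))
    print(encode("MKTAYIAK"))
```

In this picture, a trained language model would read the token ids in order and learn which residues tend to appear in which contexts, much as it would for words in text.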
Ahmed Elnaggar, another member of the Rostlab, came across this algorithm in the context of machine learning and deep learning research, and brought it to the attention of the biology-focused members of the lab. “I am working on these problems with a machine learning and deep learning focus, but we have many interdisciplinary discussions in the lab,” he said. “I brought this to Michael’s attention, and we started this collaboration. This kind of work would not be doable if people from the purely computational or the biological sides of this research worked alone on these problems.”
This interdisciplinary collaboration led researchers to use ELMo to organize proteins by their similarities and to begin shining light on a dark corner of proteomics research: the dark proteome. Essentially, the dark proteome emerged from the rapid expansion of protein sequencing, which left many proteins known only by their sequences.
“What happens often times is that we have no information other than the sequence information itself,” Heinzinger said. “More than half of a terabyte of data is dedicated to just the amino acid sequences comprising proteins, but that doesn’t mean we understand their origin, functions, or any other context. It is like having the words of an ancient language without grammar or syntax to organize them into something meaningful.”
To shed light on this ancient language, the team trained ELMo to detect recurring patterns within millions of protein sequences. Ultimately, this compresses protein databases into a more organized, computer-readable format. Using the pre-trained ELMo model allows researchers to draw conclusions about their protein of interest without the need to search the hundreds of millions of proteins stored in today’s databases. This shifts the hardware requirements from the clusters needed to search databases to the ordinary laptop needed to run the pre-trained model.
Working with LRZ’s Big Data and Artificial Intelligence team, the researchers scaled their application using the SuperMUC supercomputer at LRZ and the centre’s NVIDIA DGX-1 cluster. In the process, they not only achieved a three-fold speedup but also gained valuable experience scaling their application from a single node to hundreds of nodes, allowing them to take advantage of larger GPU-centric architectures.
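The general idea of working with compact vector representations instead of raw database searches can be sketched in a few lines of Python. This is a toy under stated assumptions, not the team's method: the `embed` function here is a hypothetical stand-in that simply counts amino acid frequencies, whereas the real approach uses a trained ELMo model, and the reference sequences and labels are made up for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Hypothetical stand-in for a trained model: amino acid frequency vector."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_label(query, reference):
    """Return the label of the reference embedding most similar to the query."""
    q = embed(query)
    return max(reference, key=lambda item: cosine(q, item[0]))[1]


# Made-up toy sequences and labels, embedded once ahead of time.
REFERENCE = [
    (embed(seq), label)
    for seq, label in [
        ("LLLVVVIIIAAALLL", "hydrophobic-rich"),
        ("DDEEKKRRDDEEKK", "charged-rich"),
    ]
]

if __name__ == "__main__":
    print(nearest_label("LLVVIIAALL", REFERENCE))
```

The design point is that the expensive step, turning sequences into vectors, happens once. Afterwards, comparing a new protein against the stored vectors is cheap arithmetic that fits comfortably on a laptop.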
“Training large deep learning models on large datasets requires tremendous amount of computing power and storage, which are usually not directly available at universities’ chairs,” Elnaggar said. “That’s when LRZ shines because it provides the computing power and storage required by researchers for training large deep learning models or for simulations. Simply put, without having access to the supercomputers on LRZ, our research would not have been possible.”
Accelerating and advancing
Having proven its concept and method, the team now looks forward to continuing to scale the application in order to further compress and organize protein databases. The team still faces the large task of filtering out “noise,” or uncorrelated, disorganized data, in these databases, but with access to increasingly powerful GPU-powered clusters and HPC resources, it is in a good position to bring this work to the people who need it most.
“The concept here was to make a representation of protein sequences in a space that makes sense computationally, and that is the gist of our recent paper,” said team member Christian Dallago. “But the bigger point here is to shift this research to the people who need it and might just have a laptop. We are trying to democratize the use of these machine learning algorithms for people who might not have access to a supercomputer or powerful cluster, be it research institutions in the developing world or hospitals and other medical facilities who don’t have a lot of resources.”
Related publication: Heinzinger et al., BMC Bioinformatics 20, 723. DOI: 10.1186/s12859-019-3220-8