Will We Manage to Store the World’s Genetic Data?

HMND
6 min read · Mar 12, 2023


Population genetics has been a trend in genetic research for a while now, with researchers pouring more and more effort into studying whole populations to draw conclusions and find correlations. It is well known that a single human genome is already a large dataset. So what happens when we study not the genomes of individuals but the genomes of entire populations?

Let’s build some basic intuition about how much hard-drive space genetic data can require. We start with the karyotype, a.k.a. the human chromosome set, which comprises 46 chromosomes: we get 23 from the mother and 23 from the father, and each maternal chromosome has a paternal counterpart; such pairs are called homologous. Homologous chromosomes carry the same genes, such as the genes for eye color or Rh factor, but each chromosome carries its own variant of a gene: one may hold the variant for brown eyes and the other for blue, one the variant for a positive Rh factor and the other for a negative one.

A human genome of 23 decoded chromosomes contains about 3 billion symbols, which at one byte per symbol comes to roughly 3 GB. If we want to sequence all gene variants across all 46 chromosomes in order to capture everything about an individual, we naturally have to double that figure, arriving at a whopping 6 GB of data.
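As a quick sanity check on those figures, here is the back-of-envelope arithmetic, assuming a hypothetical plain-text encoding of one byte per letter:

```python
# Back-of-envelope genome size, assuming one ASCII byte per base.
BASES_PER_HAPLOID_GENOME = 3e9   # ~3 billion letters across 23 chromosomes
BYTES_PER_BASE = 1               # one plain-text character per base

haploid_bytes = BASES_PER_HAPLOID_GENOME * BYTES_PER_BASE
print(f"{haploid_bytes / 1e9:.0f} GB")      # 3 GB for one chromosome set
print(f"{haploid_bytes * 2 / 1e9:.0f} GB")  # 6 GB for all 46 chromosomes
```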

But luckily, the genetic code contains not only unique sequences of symbols but also many repetitions and copies of genes, so in theory the genome can be compressed to about 750 MB with tools like Genozip or PetaSuite.
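That roughly fourfold shrinkage is easy to build intuition for even before repeats come into play: an alphabet of only four letters needs just 2 bits per base instead of the 8 bits of an ASCII character. Below is a minimal packing sketch; it is not how Genozip or PetaSuite actually work (they additionally exploit repeats and reference-based encoding), just an illustration of the headroom:

```python
# Pack a DNA string into 2 bits per base, i.e. 4 bases per byte.
# A naive illustration of why 3 GB of ASCII bases fits in ~750 MB.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

packed = pack("GATTACA")
print(len(packed))  # 2 bytes for 7 bases, versus 7 bytes as ASCII
```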

Moreover, a huge part of our genome simply does nothing, or at least we don’t yet understand what it does. According to various theories, this “useless” genetic information is pure ballast from our ancestors, locked-in mutations, a reserve for DNA repair in case of damage, and so on.

With that being said, our working genome can eventually be squeezed into about 30 MB of data. These 30 MB contain the data that determines all our traits and characteristics, as well as hereditary diseases.

Obviously, it is much easier and faster to work with a 30 MB file than with a text of 3 billion characters. But this is just theory. The fact remains that to obtain a decoded human genome written out as a single line of 3 billion symbols, about 600 GB of primary data are needed, because sequencers read the genome as a huge pile of short, overlapping fragments that must cover every position many times over.

A primer on DNA sequencing

Methods for deciphering the genetic code are based on cutting DNA into small fragments, which are then decoded individually.

Image copyright: APOLLO INSTITUTE. https://apollo-institute.org/sanger-sequencing/

The figure above is a schematic of the Sanger sequencing method. Under laboratory conditions, the DNA of interest is copied: a special enzyme builds new DNA strands using the strand present in the working solution as a template. Replication consumes nucleotides (the letters of the genetic code), which are also in the solution. A small fraction of those nucleotides are modified so that DNA synthesis stops immediately after one of them is incorporated into the growing chain. These modified nucleotides also carry a fluorescent label that identifies the base at which synthesis terminated.

After many such partial copies have been synthesised, the fragments are separated by gel electrophoresis; shorter, lighter fragments migrate further than long, heavy ones. Finally, step-wise laser detection converts the chain of fluorescent signals into a digitised sequence of genetic code. The result is a set of puzzle pieces (reads) from which the sequenced DNA is assembled.
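A toy model may help make the readout step concrete. Assume each terminated fragment is characterised by its length and by the fluorescent label of its final base (the fragment data below is invented for illustration); sorting by length, which is what electrophoresis does physically, recovers the original sequence:

```python
# Toy Sanger readout: each synthesis run stops at a labelled terminator,
# so a fragment of length n tells us the base at position n.
# Electrophoresis sorts fragments by length; reading the labels in that
# order reconstructs the template sequence.

# (length, fluorescent label of the terminating base) - hypothetical data
fragments = [(3, "T"), (1, "G"), (4, "T"), (2, "A"), (6, "C"), (5, "A"), (7, "A")]

sequence = "".join(base for _, base in sorted(fragments))
print(sequence)  # GATTACA
```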

At the next stage, the reads must be aligned with the help of special software and the complete decoded genome assembled from their overlaps. The number of reads, and hence the amount of memory required to store them, depends directly on their length: the longer the reads, the fewer of them are needed.
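To make the assembly idea concrete, here is a greedy toy assembler that repeatedly merges the pair of reads with the longest suffix-prefix overlap. Real aligners and assemblers use far more sophisticated indexing and error models; this sketch assumes short, error-free reads and is purely illustrative:

```python
# Greedy toy assembly: repeatedly merge the pair of reads with the
# longest suffix-prefix overlap until a single contig remains.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(reads: list[str]) -> str:
    reads = reads[:]
    while len(reads) > 1:
        # Find the best-overlapping ordered pair of distinct reads.
        k, a, b = max(
            ((overlap(a, b), a, b) for a in reads for b in reads if a != b),
            key=lambda t: t[0],
        )
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[k:])  # merge, keeping the overlap only once
    return reads[0]

# Hypothetical error-free reads cut from one underlying sequence.
print(assemble(["GATTAC", "TTACAG", "ACAGAT"]))  # GATTACAGAT
```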

Modern sequencing methods used alongside the Sanger method produce reads ranging from several hundred letters to about a million. How convenient and fast the specialised software is to work with depends directly on the alignment algorithm used and on how competently the interface is designed. There are many options on the market, each with its own advantages and disadvantages. At HMND, our team has a clear vision for building such products, combining the work of engineers and scientists to meet the needs of the modern consumer.

So, why bother?

Taking into account the technical capabilities of modern human genome sequencing and the average of 600 GB of primary data per person, a population genetic study of 1,000 people needs 600 TB just to store the primary data. Meanwhile, consumer genetic tests that reveal some of a patient’s predispositions and diseases are gaining popularity. Such tests examine only part of the genome and require incomparably less storage, but their defining feature is mass scale, and the number of people who decide to undergo them grows year over year.

DNA is only the storage medium for hereditary information: some genes can work more actively than others, and some can be completely switched off throughout life. Using DNA, the global store of our hereditary information, as a template, an intermediary molecule called mRNA is synthesised.

It is mRNA that indicates the activity of a particular gene, and the totality of a cell’s mRNA is called the transcriptome. The human genome contains about 20,000 genes, but thanks to the mechanism of alternative splicing, the number of distinct mRNAs is always greater than the total number of genes. A gene consists of coding regions (exons) and non-coding regions (introns). In alternative splicing, the introns are excised from the mRNA and the exons are selectively retained in various combinations. This mechanism makes it possible to obtain different proteins from a single gene, proteins that sometimes perform opposite functions in the body.
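A toy enumeration shows how splicing multiplies transcript counts. If each exon of a gene could independently be kept or skipped (a deliberate oversimplification: real splicing is tightly regulated, and the exon sequences below are invented), even a four-exon gene yields 15 possible mRNAs:

```python
from itertools import combinations

# Hypothetical gene of 4 exons (introns already removed).
exons = ["AUGGCU", "GAAUUC", "CCGGAU", "UAAGCU"]

# Enumerate every non-empty, in-order combination of retained exons.
isoforms = [
    "".join(kept)
    for r in range(1, len(exons) + 1)
    for kept in combinations(exons, r)
]
print(len(isoforms))  # 15 possible mRNAs from a single 4-exon gene
```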

For some questions, a transcriptome analysis is carried out as well. Such an analysis produces an even larger volume of stored genetic data but gives better insight into the state of the organism at a given moment. Since the transcriptome differs between the cells of different tissues, a full transcriptome analysis of one person multiplies the storage requirements again. There are predictions that by 2025 the amount of sequenced genetic data could reach 40 exabytes, exceeding all the content on YouTube.

Let’s assume that at some point every person in a population of 8 billion humans sequences their genome. That would produce at least 4.8 zettabytes of data, and that’s primary data only. A zettabyte is 10²¹ bytes, a 1 followed by 21 zeroes: quite a large number to process.
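The arithmetic behind that estimate, as a quick sanity check:

```python
# Back-of-envelope: primary sequencing data for the whole population.
PRIMARY_BYTES_PER_PERSON = 600e9   # ~600 GB of raw reads per genome
POPULATION = 8e9                   # ~8 billion people

total_bytes = PRIMARY_BYTES_PER_PERSON * POPULATION
print(f"{total_bytes:.1e} bytes")              # 4.8e+21 bytes
print(f"{total_bytes / 1e21:.1f} zettabytes")  # 4.8 ZB
```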

4.8 zettabytes written out in bytes is a 22-digit number.
Will humanity manage to process 4.8 zettabytes of data?

In addition to the issue of physically storing genetic data for mass research, there is the question of protecting it: if compromised, it could leak a person’s complete genetic information, paving the way for a whole new generation of malicious use.

Now, assuming we eventually manage to store these humongous volumes of genetic data, will we be able to store them securely?
