The Computer Science behind DNA Sequencing

by Alex Cabral
figures by Sean Wilson

In 2003, with the completion of the Human Genome Project, the entire human genome was sequenced for the first time. The sequencing cost nearly $1 billion and took 13 years to complete. Today, the human genome can be sequenced for about $1,000 in less than two days. Industry leaders hope to bring that cost down to just $100 within the next ten years. Of course, the scientific knowledge gained from the Human Genome Project helped propel DNA sequencing technology to its current state, but another major factor in this progress has been the advancement of Computer Science and Engineering.

Increased Storage and Speed

One of the biggest differences between computers today and those in 2003 is the amount of storage space and processing speed they have. In fact, these improvements helped launch a completely different type of DNA sequencing technique than the one used only two decades ago. Sanger Sequencing, the method used in the Human Genome Project, could only read small fragments of DNA at a time. These overlapping reads were put together to produce longer strands and finally assemble the entire genome.
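To make the idea of stitching overlapping reads together concrete, here is a minimal Python sketch. It is a toy greedy approach under strong assumptions (error-free reads, a made-up minimum overlap of 3 bases, hypothetical function names), not the assembly software actually used in the Human Genome Project.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> str:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    fragments = list(reads)
    while len(fragments) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        # Join the two fragments, keeping the shared bases only once.
        merged = fragments[best_i] + fragments[best_j][best_size:]
        fragments = [f for k, f in enumerate(fragments) if k not in (best_i, best_j)]
        fragments.append(merged)
    # Return the longest fragment that could be built.
    return max(fragments, key=len)


# Hypothetical reads invented for illustration.
reads = ["ATGCGT", "CGTACC", "ACCGGA"]
print(greedy_assemble(reads))
```

Real reads contain errors and repeated regions, so production assemblers rely on more robust graph-based methods; the sketch only shows why overlapping fragments can be combined into a longer strand.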

Next-Generation Sequencing (NGS), the technique widely used today, works by running many micro-scale reactions in parallel. As a result, NGS systems output about 15,000 times as much data per day as a Sanger Sequencer (Figure 1). Thanks to advances in processing power, scientists can sequence an entire genome on an NGS machine in just days, compared to the years it took on a Sanger Sequencer. This increase in data generation necessitated a comparable increase in storage capacity, as an NGS machine can produce 1 terabyte of data in a single day. Because of upgrades in hard drive capacity and RAM, NGS sequencers can store the terabytes of information generated from genome sequencing, a feat that would otherwise be impossible. Without the rapid computational advances of the past decade, it is unlikely that DNA sequencing would have the speed and low cost that it does today.

*Figure 1:* Sanger vs. Next-Generation Sequencing. A Next-Generation Sequencer can produce about 15,000 times as much data per day as a Sanger Sequencer.

Computational Biology

Once a genome is sequenced, it must be interpreted. Specifically, billions of data points have to be analyzed to detect variations (i.e., mutations) within the genome. Finding mutations can help identify the causes of, and eventually cures for, a number of diseases. Completing this task by hand would be infeasible, but with the rise of computational biology, researchers and doctors are able to find mutations of interest in a matter of days. Computational biologists use pattern-matching algorithms, mathematical models, image processing, and other techniques to summarize and derive meaning from the sequencing data. In addition, many computational biologists run simulations to predict how certain biological systems will react in different environments. For example, a simulation might predict how cancer cells react to different drug treatments, and in turn help find a cure. The hope is that these computational models and simulations may ultimately lead to the discovery of new treatments for a number of diseases.
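As a rough illustration of what detecting variations can mean computationally, the Python sketch below compares a sample sequence against a reference and reports single-base substitutions. It assumes the two sequences are already aligned and equal in length, and the function name and sequences are hypothetical. Real variant-calling pipelines also handle insertions, deletions, and sequencing errors.

```python
def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Positions are 1-based. Both sequences are assumed to be pre-aligned.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical sequences invented for illustration.
reference = "ATGCGTACCGGA"
sample = "ATGCGAACCGGT"
for position, ref_base, sample_base in find_substitutions(reference, sample):
    print(f"position {position}: {ref_base} -> {sample_base}")
```

Scanning one short sequence this way is trivial; the computational challenge described above comes from repeating this kind of comparison across billions of positions and many individuals.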

Crowdsourcing

A major implication of the advancements in sequencing technology is the rise of at-home genetic testing kits, such as 23andMe. Although such tests do not sequence the entire genome, their success has been enabled by the low cost of NGS and the availability of modern sequencing technologies. Many people use these kits to find out more about their ancestry and potential health risks written in their DNA. Researchers, however, have found a different opportunity in services like 23andMe.

With a wide array of users and access to a number of genetic variants, 23andMe encourages its customers to participate in various research projects. Many of these projects involve a series of survey questions covering specific traits that range from the slightly whimsical ‘can you do a side split?’ to the more serious family history of an illness or disease. Data scientists and researchers then use the survey results to cluster similar individuals together based on the mutations they find in the sequencing data. These results can be used to build better prediction models of different traits and provide more accurate and beneficial information, such as potential health risks, to customers. For example, users who indicate having a family history of breast cancer would all have their DNA sequences compared to predict segments or mutations of DNA that correlate with the disease.
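The comparison described above can be sketched in a few lines of Python. This is a simplified illustration, not the method 23andMe uses: the participants, the variant labels, and the function name are all made up, and a real study would apply statistical tests and account for confounding factors such as ancestry.

```python
from collections import Counter


def variant_enrichment(participants: list[tuple[bool, set[str]]]) -> dict[str, float]:
    """For each variant, return its frequency among participants who report a
    family history minus its frequency among those who do not."""
    history_counts: Counter = Counter()
    no_history_counts: Counter = Counter()
    n_history = n_no_history = 0

    for has_family_history, variants in participants:
        if has_family_history:
            n_history += 1
            history_counts.update(variants)
        else:
            n_no_history += 1
            no_history_counts.update(variants)

    enrichment = {}
    for variant in set(history_counts) | set(no_history_counts):
        history_freq = history_counts[variant] / n_history if n_history else 0.0
        no_history_freq = no_history_counts[variant] / n_no_history if n_no_history else 0.0
        enrichment[variant] = history_freq - no_history_freq
    return enrichment


# Hypothetical survey answers paired with hypothetical variant labels.
participants = [
    (True, {"variant_1", "variant_2"}),
    (True, {"variant_1"}),
    (False, {"variant_2"}),
    (False, set()),
]
for variant, difference in sorted(variant_enrichment(participants).items()):
    print(variant, difference)
```

A variant with a large positive difference appears more often in the family-history group and would be a candidate for closer study, which is the intuition behind correlating mutations with a disease.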

Databases and Cloud Storage

With the rise of crowdsourced genetic information came the development of various online databases and cloud storage platforms that allow this wealth of data to be accessed and analyzed. Modern databases and cloud storage platforms provide enough storage space for an estimated 1 in 25 Americans’ DNA data to be stored for repeated reference. The storage capabilities, backups, and universal access provided by cloud-based databases allow global teams to work together to solve some of the toughest problems we face in medicine today.

Law enforcement officials have also found value in DNA databases. In April 2018, in the US, the infamous Golden State Killer was captured via genetic information that one of his distant relatives shared on a genealogy website called GEDMatch. In fact, many US states and other countries collect DNA data from crime scenes and people arrested for serious crimes to build a forensic DNA database of their own. Although DNA databases have undoubtedly helped catch criminals who should be behind bars, their use has raised many questions about privacy.

The Pitfalls – Privacy and Security

As with any data stored in the cloud, there is an inherent risk to DNA data that has been sequenced and stored in a public database. Although most people are not serial killers, there is still growing concern about the privacy and security issues surrounding DNA databases. Public databases can be accessed by anyone, so combining the nearly 1 million DNA profiles on GEDMatch with public ancestry records could help someone find genealogical information about more than half of the US population.

Even with private databases such as 23andMe, there is still the potential for data to be stolen. The implications of such a breach are vast, and some people worry that they could be framed for a crime or even cloned unwillingly if their genetic information fell into the wrong hands. The sensitive, personal nature of DNA poses formidable security and privacy challenges that can only be addressed with the help of computer science experts.

The Future – DNA Storage

Technological advances have clearly facilitated significant advances in our understanding of DNA, but is it possible that the converse may also be true? With the rise of big data, global internet users now collectively generate over 2.5 exabytes (1 exabyte = 1 billion gigabytes) of data each day. Researchers have been working to find new ways to store this vast amount of data, leading some to turn to a surprising place: our DNA. One of the most amazing things about DNA is its potential to store and encode information in its sequence of bases, prompting large-scale projects such as Microsoft’s DNA storage initiative that aim to use DNA as a medium for data storage. In 2016, researchers from Microsoft and the University of Washington successfully stored a record-breaking 200 megabytes of data by using synthetic DNA, artificial genes that are created in a laboratory, as a storage medium (Figure 2). Now, the researchers are working to store even more data and figure out ways to access it quickly, in a push for synthetic DNA to change the future of data storage.
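To give a sense of how digital data could be written as a DNA sequence, here is a minimal Python sketch that maps every two bits to one of the four bases. The mapping (A = 00, C = 01, G = 10, T = 11) and the function names are assumptions chosen for illustration. This is not the encoding used by the Microsoft and University of Washington researchers, whose scheme would also need to deal with error correction and the practical limits of synthesizing DNA.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def encode(data: bytes) -> str:
    """Turn each byte into four bases, two bits per base."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


def decode(strand: str) -> bytes:
    """Reverse encode(): read four bases at a time and rebuild each byte."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    decoded = bytearray()
    for start in range(0, len(strand), 4):
        byte = 0
        for base in strand[start:start + 4]:
            byte = (byte << 2) | BASES.index(base)
        decoded.append(byte)
    return bytes(decoded)


strand = encode(b"Hi")
print(strand)
print(decode(strand))
```

At two bits per base, one byte needs four bases, so the information density comes from how little physical space each base occupies rather than from any compression.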

*Figure 2:* Synthetic DNA as a medium for data storage. DNA storage could allow data to be stored much more densely in a small amount of space. Pictured on the left is DNA, not drawn to scale. It can store 1,250 terabytes per cubic millimeter, compared to the 1.25 gigabytes per cubic millimeter shown for the flash drive (in the middle) and hard drive (to the right). For the same unit of physical space, this means that DNA could potentially store 1,000,000 times the amount of data.

Conclusion

The strong ties between DNA and Computer Science have taken the biological and medical fields to places that were unimaginable just 20 years ago. This relationship is pushing us toward a world where everyone may have personalized, predictive, and preventative medicine. Furthermore, these ties have propelled Computer Science forward, leading to better, faster algorithms, more secure storage methods, and now synthetic DNA as a storage medium. I am excited to see where this relationship leads in the future, and I am certain that 20 years from now, we will again see things we could not even imagine today.

Alex Cabral is a first-year PhD student in the Computer Science Department at the John A. Paulson School of Engineering and Applied Sciences, where she focuses on human-computer interaction.

Sean Wilson is a fifth-year graduate student in the Department of Molecular and Cellular Biology at Harvard University.

For more information:

To read about the social and cultural implications of widespread whole genome sequencing, check out this Bloomberg article
For more information on whole genome sequencing and its associated costs, see this article from the National Institutes of Health
To learn about the technique used by 23andMe to analyze customers’ DNA, see this explanation
To learn about the privacy implications of genome sequencing, check out this piece from the Los Angeles Times
For more information about the implications of whole genome sequencing in human disease, check out this Guardian article
To learn about how DNA sequencing could impact law enforcement, check out this piece from NOVA
To read about how DNA sequencing facilitated the capture of the Golden State Killer, see this New York Times piece
For more information about synthetic DNA storage, see this Wired article

This article is part of our SITN20 series, written to celebrate the 20th anniversary of SITN by commemorating the most notable scientific advances of the last two decades. Check out our other SITN20 pieces!


Science in the News

Opening the lines of communication between research scientists and the wider community
