by Rebecca Fine
figures by Elayne Fivenson
The Human Genome Project, one of the most ambitious scientific projects ever undertaken, achieved a monumental goal: sequencing the entire human genome. Since its completion in 2003, this project has laid the groundwork for thousands of scientific studies associating genes with human diseases.
DNA and the genome: a primer
First, let’s talk a little bit about terminology. DNA is a molecule that carries genetic information. It is made up of four types of smaller molecules, referred to as “bases”: adenine (A), thymine (T), cytosine (C), and guanine (G). The order of these bases provides instructions for assembling the essential building blocks of life. A gene is a segment of DNA that contains instructions for one of these building blocks, such as a single protein. A genome, in contrast, is a complete set of DNA instructions, including all of a person’s genes. In humans, the genome consists of 3 billion bases. All humans share about 99.9% of this genome, and the remainder is variable (and 0.1% of 3 billion is still 3 million bases – nothing to sneeze at!). A spot in the genome that can differ between people (e.g., where some people have an A and others have a G) is called a single nucleotide polymorphism, or SNP (Figure 1). The version of a SNP a person has is called their genotype, and these small genetic differences are part of what makes people unique.
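To make the idea of a SNP a little more concrete, here is a minimal Python sketch that compares two short, already-aligned stretches of DNA and reports where they differ. The sequences and the function name `find_snps` are invented for illustration. Real SNPs are defined across many people, not just two. The last lines simply redo the 0.1% arithmetic from the paragraph above.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (already aligned)")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Two made-up sequences that differ at one position (A in one, G in the other).
person_1 = "ATCGGATTACA"
person_2 = "ATCGGGTTACA"
snps = find_snps(person_1, person_2)
print(snps)  # positions are counted from 0

# 0.1% of the genome is 1 base in every 1,000.
GENOME_SIZE = 3_000_000_000
variable_bases = GENOME_SIZE // 1000
print(f"{variable_bases:,} variable bases")
```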
The Human Genome Project: decoding our DNA, one base at a time
The Human Genome Project (HGP), which began in 1990, was a massive international effort carried out by twenty research centers and universities in six countries. The primary goal of this project was to determine the order of all 3 billion bases in the entire human genome; this process is called sequencing. You can think of sequencing as assembling a puzzle. First, scientists collect a biological sample, such as saliva or blood. Then, they make lots of copies of the DNA in the sample and break those copies into many smaller, overlapping pieces (Figure 2). The order of bases, or the sequence, in each of those pieces can then be determined by a series of chemical reactions. The DNA must be broken into pieces prior to sequencing because these reactions can only read short DNA strands, typically fewer than 1,000 bases.
Next, the challenge is to assemble the pieces of this “puzzle” into the correct order, using the overlapping sequences from each piece as a guide. This is a difficult computational problem, especially for a genome containing 3 billion bases! Scientists also knew this problem would be even more complicated in humans than in other organisms because the human genome contains many highly repetitive sequences (e.g., patterns such as AGAGAGA or TTTTTTT). Using overlaps to guide reconstruction of the genome is especially challenging in these types of regions – imagine a puzzle in which many of the pieces are shaped almost identically.
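The puzzle analogy can be sketched in a few lines of Python. This is a toy "greedy" approach (repeatedly join the two pieces that overlap the most), not the method the HGP actually used. The reads, the function names, and the minimum overlap of 3 bases are all assumptions chosen for illustration.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left; the remaining pieces stay separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three made-up overlapping reads, given out of order.
assembled = greedy_assemble(["CCTTAGA", "ATGGCGTA", "CGTACCTT"])
print(assembled)
```

With reads this tidy, each piece has exactly one sensible neighbor. In a repetitive region such as AGAGAGA, many different overlaps look equally good, so a simple strategy like this one can easily join pieces in the wrong order.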
Another major goal of the project was to determine how many genes we actually have in our genome. Previous estimates ranged widely, with some scientists believing there might be up to 100,000 genes. The HGP found that, in fact, humans have only about 20,000-25,000 genes (current estimates put the number at the lower end of that range). This number was quite a surprise to many scientists – many other organisms, such as rice and water fleas, actually have many more genes than we do! This was an important lesson for genetics: the complexity of an organism is not necessarily correlated with how many genes it has.
Genome-wide association studies: how does genetics relate to common diseases?
The Human Genome Project made it possible to ask and address new types of scientific questions. One important example is determining which SNPs increase or decrease risk for a given disease (recall that SNPs are positions in the genome where the base can differ between people). Before the HGP, scientists who wanted to answer this type of question could only realistically focus on a few small regions of the genome at a time. Now, it would theoretically be possible to sequence many people with and without a disease and systematically test each base in the genome, asking: is one version of a SNP more common in people who have the disease? This type of study design is called a genome-wide association study (GWAS) (Figure 3). One of the important considerations for GWAS is cost, as sequencing the entire genome is still far too expensive to perform on large numbers of people. Therefore, scientists often use a cheaper approach: selecting hundreds of thousands of known SNPs ahead of time and testing each individual’s genotype at only those SNPs.
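The core question of a GWAS ("is one version of a SNP more common in people who have the disease?") can be illustrated with a simple statistical test. The sketch below is a simplified stand-in rather than a real GWAS pipeline: it runs a chi-square test on a 2x2 table of allele counts for each SNP. The SNP names and counts are made up, and `allele_chi_square` is a hypothetical helper written for this example.

```python
import math


def allele_chi_square(case_counts, control_counts):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Each argument is (count of version 1, count of version 2) of the SNP.
    Returns (chi-square statistic, p-value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty row or column: nothing to test
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, the tail probability has this closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: (people with the disease, people without the disease).
snp_counts = {
    "snp_1": ((60, 40), (40, 60)),  # version 1 is more common in cases
    "snp_2": ((50, 50), (50, 50)),  # no difference between groups
}
results = {
    name: allele_chi_square(cases, controls)
    for name, (cases, controls) in snp_counts.items()
}
for name, (chi2, p_value) in results.items():
    print(f"{name}: chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Because a real study repeats a test like this at hundreds of thousands of SNPs, some small p-values will turn up by chance alone. In practice, a SNP needs a far smaller p-value than it would in a one-off test before it is counted as associated with a disease.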
Scientists had previously been fairly successful at identifying the genes that cause many rare and severe diseases, such as cystic fibrosis and sickle-cell anemia. For these types of diseases, a single genetic change with an extremely strong effect could often be pinpointed (though it’s important to note that gene discovery does not immediately translate into therapeutic drug development – it is only the first step of a long and complex process). It seemed natural to hope that GWAS would prove similarly effective at determining the genetic basis for more common diseases, such as heart disease, diabetes, inflammatory bowel disease, and schizophrenia. In the first years of GWAS, however, it became apparent that matters would not be so simple: the findings suggested that a very large number of genes – for some traits, easily into the hundreds or perhaps even thousands – might have effects on a given disease. Moreover, the effect of each individual SNP tended to be very small (for example, a given SNP that affects risk for obesity is usually associated with a difference of only a fraction of a pound in body weight).
This conceptual discovery has been an important advance in our understanding of human biology. In the context of drug development, this finding means that targeting a single gene with a drug may not cure all people with a particular disease; scientists are working to use information gained from GWAS to develop and improve therapeutic treatments.
GWAS in the present: where are we now?
Fast-forward sixteen years from the completion of the HGP, and genomics has moved at a speed no one could have predicted. In recent years, one of the most significant developments in human genetics has been a resource called the UK Biobank. This is a massive dataset consisting of genotype information (which can be used for GWAS) from about 500,000 human volunteers. Each participant also provides a veritable treasure trove of health data, ranging from basic information such as height and weight to dietary questionnaires and disease status (a total of over 2,400 traits!). This resource has revolutionized genomics, not only because of the huge sample size and detailed medical information, but also because the data is freely accessible to any scientist who applies to use it. As a result, the genetic analysis of the UK Biobank data has essentially been crowdsourced to scientists all over the world. The impact of this is clear from the numbers – since UK Biobank’s initial release in 2015, almost 600 papers have analyzed it, with countless new studies on the way.
Modern genomics is a triumph of collaborative science and shows how much there is to gain with large-scale, collective projects. Eighteen years ago, we didn’t even have the complete human genome sequence. Now, we have a resource containing genetic data from about 500,000 people that is open to any scientist who applies, on top of the millions of other people from whom genetic information has been collected for other studies. And GWAS is only one example of the type of research enabled by the HGP; there are countless other scientific fields that have sprung up in its wake. To name just a few other ongoing efforts, researchers have developed tests for genetic diseases, created a large catalogue of genetic abnormalities observed in many different types of cancers, studied DNA of ancient hominids to better understand human evolution, and developed ever-improving and ever-cheaper sequencing methods; for a great example of human genetics directly contributing to therapeutic development, check out this story about PCSK9 and cholesterol. All of these endeavors have helped us better understand human biology and have improved medical research. The Human Genome Project set in motion genomic research on a scale that would have been hard to imagine in 2001, and the field shows no sign of slowing down anytime soon.
Rebecca Fine is a fifth-year graduate student in the Biological and Biomedical Sciences PhD program at Harvard Medical School, where she studies human genetics. You can follow her on Twitter at @rebeccasfine.
Elayne Fivenson is a second-year PhD student in the Biological and Biomedical Sciences program at Harvard Medical School, where she is studying the genetics and biochemistry of the bacterial cell envelope.
For more information:
- For more information on the HGP, see this description from Nature Education
- To learn more about how DNA sequencing made the HGP possible, check out this article from Nature Education
- To learn about the studies facilitated by the UK Biobank, check out this piece from Science
This article is part of our SITN20 series, written to celebrate the 20th anniversary of SITN by commemorating the most notable scientific advances of the last two decades. Check out our other SITN20 pieces!