Of the trillions of cells that compose our body, from neurons that relay signals throughout the brain to immune cells that help defend our bodies from constant external assault, almost every one contains the same 3 billion DNA base pairs that make up the human genome – the entirety of our genetic material. It is remarkable that each of the over 200 cell types in the body interprets this identical information very differently in order to perform the functions necessary to keep us alive. This demonstrates that we need to look beyond the sequence of DNA itself in order to understand how an organism and its cells function.

Studying the Genome as a Whole

So how do we start to understand the genome as a whole? In 2000, the Human Genome Project announced the first working draft of the human genome sequence, and the finished sequence followed in 2003 [1]. The DNA that makes up all genomes is built from four related chemical building blocks called nucleotides, each distinguished by its base: adenine (A), guanine (G), cytosine (C), or thymine (T). A sequence of DNA is a string of these nucleotides (usually just called “bases,” or “base pairs” when counting the paired strands of the double helix) that are chemically attached to each other, such as AGATTCAG, which is “read out” linearly. Experimental methods to determine the sequence of DNA, along with help from some powerful computers, ultimately gave scientists a sequence full of A’s, G’s, C’s, and T’s that was 3 billion letters long. At the time, researchers thought they knew enough about how DNA worked to search for the functional units of the genome, otherwise known as genes. A gene is a string of DNA that encodes the information necessary to make a protein, which then goes on to perform some function within our cells.
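To make the idea of DNA as a linear string of letters concrete, here is a minimal Python sketch that stores the example sequence from the text, AGATTCAG, as an ordinary string and counts its bases. The function name `base_counts` is only an illustration and does not come from any genomics library.

```python
from collections import Counter

# The example sequence from the text, stored as a plain string of bases.
sequence = "AGATTCAG"

def base_counts(seq):
    """Count how often each base (A, C, G, T) occurs in a DNA string."""
    counts = Counter(seq.upper())
    # Report all four bases, including any that do not appear at all.
    return {base: counts.get(base, 0) for base in "ACGT"}

print(len(sequence), base_counts(sequence))
```

A real genome is handled in the same spirit, only with a string that is about 3 billion letters long instead of eight.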

After the Human Genome Project, scientists found that there were around 20,000 genes within the genome, a number that some researchers had already predicted.  Remarkably, these genes comprise only about 1-2% of the 3 billion base pairs of DNA [].  This means that anywhere from 98-99% of our entire genome must be doing something other than coding for proteins – scientists call this non-coding DNA.  Imagine being given multiple volumes of encyclopedias that contained a coherent sentence in English every 100 pages, where the rest of the space contained a smattering of uninterpretable random letters and characters.  You would probably start to wonder why all those random letters and characters were there in the first place, which is the exact problem that has plagued scientists for decades.
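To put those percentages in absolute terms, the short Python sketch below works through the arithmetic using only the round figures quoted above (a 3-billion-base-pair genome and a coding fraction of 1-2%). The function name is hypothetical, and the numbers are rough estimates rather than measured values.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate human genome size quoted in the text

def coding_and_noncoding(genome_bp, coding_fraction):
    """Split a genome size into (coding, non-coding) base pairs."""
    coding = round(genome_bp * coding_fraction)
    return coding, genome_bp - coding

# The text gives a range of 1-2% for protein-coding DNA.
for fraction in (0.01, 0.02):
    coding, noncoding = coding_and_noncoding(GENOME_SIZE_BP, fraction)
    print(f"{fraction:.0%} coding: {coding:,} coding bp, {noncoding:,} non-coding bp")
```

Under these assumptions, protein-coding DNA amounts to roughly 30 to 60 million base pairs, leaving well over 2.9 billion base pairs of non-coding sequence.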

Why is so much of our genome not being used to code for protein? Does this extra DNA serve any functional purpose? To get an idea of whether we need all of this extra DNA, we can look at closely related species that have wildly varying genome sizes. For instance, the genus Allium, which includes onions, shallots, and garlic, has genome sizes ranging anywhere from 10 to 20 billion base pairs. It is very unlikely that such a large amount of extra DNA would be useful in one species and not in its genetic cousin, which suggests that much of the genome is not useful [9]. Furthermore, these genomes are much larger than the human genome, which indicates either that an onion is far more complex than a human or, more likely, that the size of a genome says little about how complex the organism is or how it functions.

Which Parts of the Genome Are Functional?

Due to amazing technological advances in sequencing DNA and in using computers to help analyze the resulting sequences (collectively known as bioinformatics), large-scale projects similar to the Human Genome Project have begun to unravel the complexity and size of the human genome. One particular project, ENCODE, or the Encyclopedia Of DNA Elements, set out to find the function of the entirety of the human genome [2, 3]. In other words, while the Human Genome Project set out to read the blueprints of human life, the goal of ENCODE was to find out which parts of those blueprints actually do something functional. A group of labs from around the world work on the ENCODE project, which started in 2003 and is funded by the National Human Genome Research Institute. Just this month, the consortium published its main results in over 30 scientific journal articles, and the work has received a significant amount of attention from the media [4].

Figure 1. The 46 chromosomes (top) that compose the entire human genome.  Each chromosome (middle) is a long, continuous stretch of DNA sprinkled with genes that encode the information necessary to make a protein.  Genes only make up a small percentage of the genome, and the rest is composed of intergenic regions (bottom) that do not code for proteins.  These are the regions that ENCODE is most interested in studying. (Image Credit: Wikimedia Commons; User – Plociam)

To better appreciate the goal of ENCODE, it is first helpful to understand what we mean by “functional.”  Remember that genes encode the information necessary to make proteins, which are the molecules that perform functions in the cell.  How much protein a given gene ultimately produces, or whether it is allowed to make any at all, is determined by its gene expression.  In the case of the genome, any non-protein-coding sequence that is functional would presumably have some effect on how a gene is expressed; that is to say, a functional sequence in some way regulates how much protein is made from a given coding DNA sequence.  It is the difference in the composition of proteins that helps give a cell its identity.  Since every cell contains the exact same DNA and genome, it is therefore the levels of gene expression that determine whether a cell will be a neuron, skin, or even an immune cell.

Whereas the Human Genome Project primarily used the technique of DNA sequencing to read out the human genome, actually assigning roles to and characterizing the function of these DNA bases requires a much broader range of experimental techniques.  The ENCODE project used six approaches to help assign functions to particular sequences within the genome. These approaches included, among others, sequencing RNA, a molecule similar to and made from DNA that carries instructions for making proteins, and identifying regions of DNA that could be chemically modified or bound by proteins []. Researchers picked these methods because they each give clues as to whether a given sequence is functional (i.e., whether it influences gene expression).  If the cell is expending energy to make RNA from DNA, then it is likely being used for something.  Additionally, proteins that bind to DNA influence whether a gene is expressed, and chemical modifications of DNA can also prevent or enhance gene expression.

Each of these approaches can identify sequences within the genome that have some sort of biochemical activity, and to add to the usefulness of this project, the labs applied these techniques in multiple cell types in order to account for natural variability. So what did they ultimately find? Using the six approaches, the project was able to identify biochemical activity for 80% of the bases in the genome [10]. Although this does not necessarily mean that all of those predicted functional regions actually do serve a purpose, it strongly suggests that there is a biological role for much more than the 1-2% of our DNA that forms genes. Many scientists already suspected this, but with ENCODE, we now have a large, standardized data set that can be used by individual labs to probe these potentially functional areas. Likewise, because it was such a large project with strict quality controls, we can have greater confidence that the data are reproducible and reliable.

Usefulness and Controversy

Although the main benefits stemming from this project may not be realized for some years (similar to the Human Genome Project), at the moment there are already some areas where this enormous data set will be useful.  There are a host of diseases that seem to be associated with genetic mutations; however, many of the mutations that have been discovered are not within actual genes, which makes it difficult to understand what functional changes the mutations cause.  Using the data from the ENCODE project, researchers will be able to hone in on the disease-causing mutations more quickly, since they can now associate the mutations with functional sequences found in the ENCODE database.  By matching these two, researchers and doctors should be able to start understanding why a particular mutation causes a disease, which will help with the development of appropriate therapies.

Though the ENCODE project was a remarkable feat of scientific collaboration, there is still controversy surrounding the project [5, 6, 7]. Some scientists have voiced their concern that the money spent on this project (upwards of $200-300 million) could have been better spent on grants for individual researchers. Some biologists have also voiced their concerns regarding how the results of the project were presented to the public, both in terms of the hype surrounding the project and the results themselves. Because of the expense and complexity of these types of studies, it is important for scientists to present an impartial perspective. The need for careful presentation to the public was demonstrated by the hype surrounding a recent paper published by NASA scientists on bacteria that could use arsenic in a way that had never been observed before. The scientists announced that they had discovered something new and exciting, even to the point of calling a press conference, but the self-generated hype imploded once the findings were refuted [8]. As with any new large-scale project, both scientists and the public must be patient in assigning value until the true benefits of the project can be realized.

One other major criticism of the papers published by the ENCODE group focused on the meaning of the phrase “biological function.” In the main ENCODE journal paper, the authors stated that they had assigned a biological function to about 80% of the human genome [10]. As others have noted, just because a given DNA sequence binds protein or is associated with some chemical modification does not necessarily mean that it is functional or serves a useful role. Many protein binding events are random and inconsequential. It has also been known for some time that much of the non-coding “junk” DNA is not actually junk, so some researchers have called into question the novelty of the results of ENCODE. All of these concerns are certainly justified, and, in fact, the conversation surrounding the project demonstrates precisely how science is supposed to work.

It will most likely take years to fully understand how ENCODE has helped the scientific community. Nevertheless, this project has highlighted how important it is to study the genome as a whole, not only to understand why we have so much non-coding DNA within each and every cell, but also to inform us on topics that are relevant to the majority of people, notably how rare or multiple genetic mutations lead to the development of disease.

Jonathan Henninger is a graduate student in the Biological and Biomedical Sciences Program at Harvard University.

Further Information

Video – ENCODE’s lead coordinator Ewan Birney discusses the main goals of the project.

References

[1] Human Genome Project Homepage <http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml>

[2] ENCODE Homepage <http://www.genome.gov/10005107>

[3] ENCODE articles published in Nature <http://www.nature.com/encode/>

[4] “Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role,” by Gina Kolata, The New York Times <http://www.nytimes.com/2012/09/06/science/far-from-junk-dna-dark-matter-proves-crucial-to-health.html?pagewanted=all>

[5] reddit.com “Ask me Anything” with ENCODE project contributors <http://www.reddit.com/r/askscience/comments/znlk6/askscience_special_ama_we_are_the_encyclopedia_of/>

[6] “Blinded by Big Science: The lesson I learned from ENCODE is that projects like ENCODE are not a good idea,” by Michael Eisen <http://www.michaeleisen.org/blog/?p=1179>

[7] “ENCODE says what?” by Sean Eddy <http://selab.janelia.org/people/eddys/blog/?p=683>

[8] “New Science Papers Prove NASA Failed Big Time in Promoting Supposedly Earth-Shaking Discovery That Wasn’t,” by Matthew Herper <http://www.forbes.com/sites/matthewherper/2012/07/08/new-science-papers-prove-nasa-failed-big-time-in-promoting-supposedly-earth-shaking-discovery-that-wasnt/>

[9] “Evolution of genome size across some cultivated Allium species,” Ricroch et al., Genome 2005. <http://www.ncbi.nlm.nih.gov/pubmed/16121247>

[10] “An integrated encyclopedia of DNA elements in the human genome,” The ENCODE Project Consortium, Nature 2012. <http://www.nature.com/nature/journal/v489/n7414/full/nature11247.html>

7 thoughts on “The 99 Percent… of the Human Genome”

  1. Is a genome 23 Chromosomes or 46 Chromosomes? Aren’t there 3 billion base pairs (molecules) in 23 Chromosomes? So 46 Chromosomes would be twice as many base pairs. What was actually mapped – 23 Chromosomes, and X and a Y? Does a maternal Chromosome 01 map differently from a paternal Chromosome 01?

  2. I enjoyed the frank tone of your article. It was very informative. One small nit to pick: you cannot ‘hone in on something’ : hone means to sharpen as for example skills. The appropriate expression is ‘HOME in on’ .

      1. To hone in has another linked meaning which is the sharpening aspect linked to cutting and dividing down and down to get to the part that really matters in a particular situation as in “his intellect was razor sharp”.
