by Layla Siraj
figures by Rebecca Senft

Imagine if you could tell, through some combination of your environment and your genetics, what illnesses you might develop. This could give you the ability to either prevent these illnesses before they even happen or catch and treat the illnesses early enough to prevent long-lasting effects.

This reality is one step closer with the release of the UK Biobank, a study of approximately 500,000 volunteers in the United Kingdom that is revolutionary not only in its size but also in the number of different kinds of data available from each participant. These volunteers enrolled between 2006 and 2010 and will have their medical records followed for the next 30 years. Additionally, each participant will have their genome, or their complete set of DNA, sequenced. These data will help researchers make associations between certain areas of the genome and traits in health and disease, with the ultimate goal being to improve disease diagnosis and treatment.

The Genetics of Disease

The human genome, composed of long stretches of DNA, provides our cells with the instructions for making the components of life, from the muscles that allow our hearts to pump to the vessels that carry blood throughout our bodies. With great power, though, comes great responsibility: DNA is vital for life, but mutations, or changes to the normal DNA sequence, can also lead to the development of disease. While some diseases have well-established genetic causes, such as Fragile X Syndrome and Tay-Sachs disease, most diseases are incompletely understood, meaning that we do not know which gene or genes contribute to disease development. This is because the human genome consists of 3 billion chemical units, called bases, that create the code for life. Finding which of these bases, when mutated, contribute to a particular disease is difficult because of the sheer number of candidates. Moreover, it is not uncommon for many mutations to contribute to the development of a given disease, rather than a single mutation being completely causative in and of itself. So often, we have to find multiple needles in a haystack rather than just one.

To make matters more complicated, some mutations that correlate with disease may not actually be causative. For example, it may be true that 9 out of 10 people with disease X have mutation Y. It would be tempting, then, to claim that mutation Y contributes to the development of disease X. This, though, would be a very risky claim, because neighboring sections of DNA are commonly linked and inherited together. This means that, even though mutation Y is highly correlated with disease X, it may simply be located near the true cause of the disease, which has thus far eluded identification.
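The linkage argument can be made concrete with a small calculation. The sketch below is a toy model in Python, and every number in it is invented for illustration: a hypothetical causal mutation Z raises disease risk, while a nearby mutation Y has no effect of its own but is usually inherited together with Z.

```python
# Toy model of linkage. All frequencies and risks are made-up assumptions,
# not values from the UK Biobank or any real disease.
P_Z = 0.10               # fraction of people carrying the causal mutation Z
P_Y_GIVEN_Z = 0.95       # Y is usually inherited together with Z
P_Y_GIVEN_NOT_Z = 0.02   # Y rarely appears without Z
RISK_WITH_Z = 0.50       # disease risk for carriers of Z
RISK_WITHOUT_Z = 0.01    # baseline risk; Y itself changes nothing


def p_disease():
    """Overall probability of disease in the population."""
    return P_Z * RISK_WITH_Z + (1 - P_Z) * RISK_WITHOUT_Z


def p_y_overall():
    """Fraction of the whole population carrying the bystander mutation Y."""
    return P_Z * P_Y_GIVEN_Z + (1 - P_Z) * P_Y_GIVEN_NOT_Z


def p_y_given_disease():
    """Fraction of people with the disease who carry Y (Bayes' rule)."""
    # Once Z status is known, Y tells us nothing more about disease,
    # so the joint probability factorises within each Z group.
    p_disease_and_y = (P_Z * RISK_WITH_Z * P_Y_GIVEN_Z
                       + (1 - P_Z) * RISK_WITHOUT_Z * P_Y_GIVEN_NOT_Z)
    return p_disease_and_y / p_disease()


if __name__ == "__main__":
    print(f"Y carriers among patients:   {p_y_given_disease():.0%}")
    print(f"Y carriers in the population: {p_y_overall():.0%}")
```

Under these assumed numbers, roughly four in five patients carry Y even though only about one in nine people overall do, and Y plays no causal role at all. The association comes entirely from Y travelling with Z.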

All of this to say: it’s complicated. Taking one or a few human genomes, which are riddled with mutations and genetic alterations, and trying to elucidate disease connections is a messy, convoluted task. The solution to this problem? Numbers.

The UK Biobank Breakthrough

Genomic research is often limited by a lack of statistical power, meaning the ability to detect a real association when one exists, and low power usually comes from having too few samples. The more genomes we have to work with, the better we are able to pin down particular sequences that are highly associated with specific diseases. That’s where the UK Biobank comes in. This study is groundbreaking in terms of the number of genomes analyzed (Figure 1). The first human genome was sequenced in 2004 through the Human Genome Project. Since then, as the ease and accuracy of genome sequencing increased steadily, collections of human genome data began to appear. The current go-to database for human genomes is the 1000 Genomes Project, which, in its most recent phase released in 2014, includes 2,504 individuals from around the world. Other studies, like the GoNL study, have data on hundreds of family trios (two parents and a child). The UK Biobank includes the genomes of 500,000 people.
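To see why sample size matters so much, here is a minimal Python sketch of a power calculation. It assumes a simple comparison of how often a mutation appears in people with and without a disease, uses a textbook normal approximation, and applies the conventional genome-wide significance threshold of 5e-8. The function name and the example frequencies (12% versus 10%) are hypothetical and are not taken from the UK Biobank.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_cases, p_controls, n_per_group, alpha=5e-8):
    """Approximate power of a two-sided z-test comparing a mutation's
    frequency in cases versus controls (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # how extreme a result must be
    # Standard error of the difference between the two observed frequencies
    se = sqrt(p_cases * (1 - p_cases) / n_per_group
              + p_controls * (1 - p_controls) / n_per_group)
    shift = abs(p_cases - p_controls) / se  # true effect, in standard errors
    # Probability that the test statistic lands beyond either critical value
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Assumed frequencies: 12% of cases and 10% of controls carry the mutation
    for n in (1_000, 10_000, 100_000):
        power = power_two_proportions(0.12, 0.10, n)
        print(f"{n:>7} people per group -> power {power:.3f}")
```

With these assumed frequencies, a study of a thousand people per group would almost never detect the difference, while a study of a hundred thousand per group would almost always detect it. The effect being measured is identical in both cases; only the number of samples changes.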

Figure 1: Timeline and Size of Genome Databases. The Human Genome Project released the first sequence of a human genome in 2004. In 2012, the 1000 Genomes Project released 1,092 human genomes, with another 2,504 genomes released in 2014. In 2015, the UK Biobank released 150,000 genomes with another 500,000 genomes released in 2017.

Notably, due to the cost associated with sequencing the full genome, the UK Biobank samples are only partially sequenced, having been directly analyzed at 820,967 sites across the genome. However, because of the aforementioned linking of DNA that causes some sequences to be inherited together, millions of additional sites can be inferred. This type of statistical inference, called imputation, increases the number of sites on which we have information to 96 million. While still a small percentage of the human genome, which is on the order of 3 billion bases, this is more sites in more people than researchers have had access to before, which will greatly help in identifying disease-associated sites with adequate statistical power.
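The idea behind imputation can be sketched in a few lines of Python. This is a deliberately simplified toy, not the method used by the UK Biobank: real imputation relies on much more sophisticated probabilistic models and large reference panels. The reference sequences, the `?` placeholder for unmeasured sites, and the function names below are all made up for illustration.

```python
from collections import Counter

# Hypothetical reference panel of fully sequenced DNA stretches (made-up data)
REFERENCE_PANEL = ["ACGTA", "ACGTA", "TTGCA", "TTGCA", "TTACA"]


def impute(sample, panel, missing="?"):
    """Fill unmeasured sites using reference sequences that agree with the
    sample at every measured site (majority vote among the matches)."""
    typed = [i for i, base in enumerate(sample) if base != missing]
    matches = [ref for ref in panel if all(ref[i] == sample[i] for i in typed)]
    result = list(sample)
    for i, base in enumerate(sample):
        if base == missing and matches:
            # Most common base at this site among the matching references
            result[i] = Counter(ref[i] for ref in matches).most_common(1)[0][0]
    return "".join(result)  # sites stay "?" when no reference matches


def coverage_percent(sites, genome_size=3_000_000_000):
    """Share of a roughly 3-billion-base genome covered by `sites`."""
    return 100 * sites / genome_size


if __name__ == "__main__":
    # Only sites 0, 2 and 4 were measured; sites 1 and 3 are inferred
    print(impute("A?G?A", REFERENCE_PANEL))
    print(f"Measured directly: {coverage_percent(820_967):.3f}% of the genome")
    print(f"After imputation:  {coverage_percent(96_000_000):.1f}% of the genome")
```

The second function simply works through the article's arithmetic: 820,967 directly measured sites are a tiny fraction of a percent of 3 billion bases, while 96 million sites come to about 3.2 percent.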

Brains and Brawn

It’s more than just size that makes the UK Biobank such a groundbreaking resource. On top of its volume of samples, the database also contains measurements of over 2,000 different phenotypes (measured traits). When participants were enrolled between 2006 and 2010, they answered questionnaires, underwent physical testing and measurements, and gave blood, urine, and saliva samples (Figure 2). They partook in imaging, from x-rays to brain MRIs, at specialized and standardized centers across the UK. These tests and more allowed researchers to collect information on many phenotypes, ranging from bodily indicators, such as height and weight, to the presence or absence of diseases like different types of cancers or immune diseases. The data also include various physiological markers such as iron levels in the blood or certain protein levels in urine, as well as behavioral data such as how many times a person walks to work in a week and their average mood. The participants also consented to 30 years of follow-up by clinicians through their medical charts, so data can continue to be collected through a review of health records. Information like this, when cross-referenced with a person’s genetic sequence, helps to correlate genetic markers with different phenotypes, disease conditions, or biological processes that can affect our day-to-day lives.

Figure 2: The process of UK Biobank data collection. At the first visit, each participant undergoes a battery of imaging tests, physiological tests, questionnaires, and sample collection. DNA is extracted from one of those samples, amplified, and analyzed at hundreds of thousands of genetic markers.

Already, the first studies using the data from the UK Biobank have taken us closer to understanding how our genome contributes to health and disease. Researchers have identified new mutations that predispose an individual to blood cancer by either increasing the error rate in DNA replication or causing the cells to replicate faster than their neighbors. A second group of researchers used brain images from the UK Biobank, alongside the genomic data, to identify groups of genes important for determining the architecture of our brains. These genes fell into two groups: genes that play a role in brain development, and genes that are involved in iron transport. The brain development genes are linked to mental disorders such as schizophrenia and depression, and the iron transport genes are related to neurodegenerative disorders like Parkinson’s disease or Alzheimer’s disease. The connection with brain architecture could mean that differences detectable in brain imaging are early signs of these mental health disorders or neurodegenerative diseases, which could help with early diagnosis. This is an area where the long-term follow-up will be informative in testing this hypothesis.

The possibilities for genomic discovery, in short, are immense. With 96 million measured or imputed sites across the genome to work with, studies can be done to identify which areas of the genome are associated with particular diseases. Having so much phenotypic data on top of the genetic data is key for connecting both genetics and lifestyle to disease. Furthermore, in-depth resources that combine phenotypic and genetic markers from healthy individuals will help researchers learn about how diseases start, before they are detected. This will enable efforts in preventative medicine. The UK Biobank has the potential to revolutionize our understanding not only of how our genes contribute to disease, but also of our genome as a whole.


Layla Siraj is a fourth-year student in the Harvard-MIT MD-PhD program, an MD student in the Harvard-MIT Health Sciences and Technology program at HMS, and a rising second-year PhD student in the Biophysics department of Harvard’s Graduate School of Arts and Sciences. She is studying genomic regulation and epigenetics.

Rebecca Senft is a fifth-year Program in Neuroscience PhD student at Harvard University who studies the circuitry and function of serotonin neurons in the mouse.
