A staggering batch of over 30 papers published in Nature, Science, and other journals this month firmly rejects the idea that, apart from the 1% of the human genome that codes for proteins, most of our DNA is “junk” that has accumulated over time like some evolutionary flotsam and jetsam.

The papers, representing 10 years of work on the ENCODE (“Encyclopedia of DNA Elements”) project by hundreds of scientists from dozens of labs around the world, reveal that 80% of the human genome is biochemically active and serves some purpose, for example in regulating the expression of genes situated nearby.

The sequencing of the human genome helped find out which mutations in protein-coding genes can cause disease. Now a map of the non-coding regions will help find out how mutations in the regulatory elements lead to diseases like lupus and diabetes.

John A. Stamatoyannopoulos, associate professor of genome sciences and medicine at the University of Washington, led one of the teams that carried out the mapping and analysis. He told the press:

“Genes occupy only a tiny fraction of the genome, and most efforts to map the genetic causes of disease were frustrated by signals that pointed away from genes.”

“Now we know that these efforts were not in vain, and that the signals were in fact pointing to the genome’s ‘operating system’ – the instructions for which are hidden in millions of locations around the genome,” he added.

Ben Raphael is an associate professor of computer science at Brown University in the US, and his research interests include cancer genomics and applying mathematical methods to biological questions. He said the ENCODE findings should help us better understand human biology and how genomic variations can cause disease.

“The most exciting part is now we’re getting a whole genome annotation of functional elements,” said Raphael, who was not involved in the research.

“Every time you want to understand what a particular piece of the genome is doing, you can use the data from this project,” he added.

Altogether, the ENCODE scientists mapped more than 4 million regulatory regions in the human genome.

From genetic sequencing data for 140 types of cells, they identified thousands of DNA regions that help fine-tune gene activity and influence which genes are turned on and off in different kinds of cells.

They found that far from being junk, these non-coding units of DNA are busy doing things like offering landing sites for proteins that control gene activity, or serving as locations for chemical changes that control gene expression.

The ENCODE data and results are so complex that some of the journals have joined to make a portal of published information so readers can work through it in a systematic way.

You can enter the portal at the Nature ENCODE website.

The ENCODE scientists studied the chemical modifications of individual stretches of DNA that control which genetic regions will be active. The modifications, known collectively as the epigenome, differ from one cell type to another, and exert their control either directly on DNA or by altering the histone proteins that DNA wraps around.

To map the modifications, the teams collected many different kinds of data from different cell types. Some labs measured how accessible stretches of DNA were by cutting it into pieces with enzymes. Others measured modifications to DNA or histones.

One team of computational scientists was led by Manolis Kellis, an associate professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT) in the US.

Team member Anshul Kundaje, a research scientist in MIT’s Computational Biology Group, helped lead the work to analyze and integrate the massive amount of data that came out of the various labs. The team developed an almost fully automated system to do it:

“Given that we were getting more than 1,000 data sets, we had to figure out ways to automatically calibrate experiments,” said Kundaje.

The researchers found that 80% of the genome undergoes some kind of biochemical event that is significant, such as binding to proteins that control how often a neighbouring gene is used.

They also found the same regulatory region can play several roles, depending on which cell it is regulating.

To find links with common diseases and clinical traits, the teams analyzed genetic variants that had been linked to them previously in genome-wide association studies (GWAS).

GWAS compare genetic information between groups of people with and without a particular disease or trait. The last ten years or so have seen a wealth of such studies covering over 400 diseases and traits and involving hundreds of thousands of people all over the world.

But 95% of the time, the studies pointed to genetic variants that lay outside protein-coding regions of the genome.
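The case-versus-control comparison at the heart of a GWAS can be illustrated with a minimal sketch. It assumes a single variant, made-up allele counts, and a plain Pearson chi-square test on a 2x2 table; the function names are hypothetical, and real studies test very large numbers of variants with corrections for multiple testing and population structure that are not shown here.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


def p_value_one_df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """Odds of carrying the alternate allele in cases relative to controls."""
    return (case_alt * control_ref) / (case_ref * control_alt)


# Made-up counts for illustration only (not ENCODE or GWAS data).
counts = (60, 40, 40, 60)  # case_alt, case_ref, control_alt, control_ref
statistic = allele_chi_square(*counts)   # 8.0 for these counts
p_value = p_value_one_df(statistic)
ratio = odds_ratio(*counts)              # 2.25 for these counts
print(statistic, p_value, ratio)
```

A small p-value flags the variant as associated with the disease, but says nothing about whether it sits inside a gene, which is why so many GWAS signals needed a map of non-coding regions to be interpreted.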

When the ENCODE scientists examined the findings through the lens of their new maps of non-coding regulatory regions, they found that the previous GWAS efforts were far from wasted, because they discovered:

  • 76% of non-coding region variants linked to diseases are either within or tightly linked to regulatory DNA, suggesting many diseases are caused by alterations to when, where and how protein-coding genes are switched on rather than mutations of the genes themselves.
  • 88% of the regulatory regions that have DNA variants linked to diseases are active in the early stages of human fetal growth. While many of them are tied to diseases that occur in adulthood, the ENCODE discoveries suggest that what goes on in the genome’s circuitry before birth may affect the chances of getting a disease much later in life.
  • DNA mutations linked to specific diseases appear to occur in specific sections of short DNA codes that are read by proteins that regulate processes of those diseases or the organs or cells the diseases affect. For instance, mutations that occur with diabetes tend to be located in the DNA codes used by proteins that regulate sugar metabolism and insulin secretion. And mutations linked to autoimmune diseases like multiple sclerosis, asthma, and lupus tend to occur in DNA codes read specifically by proteins that regulate immune function.
  • Many diseases that one might regard as unrelated appear to share common regulatory circuits in the genome. These include diseases of the immune system, certain cancers, and a range of neuropsychiatric disorders.
  • Thousands of disease-linked variants from GWAS that had previously been ignored also became significant when examined through the lens of regulatory DNA. These variants are highly selectively localized within the regulatory DNA of disease-specific cell types.
  • A surprising finding was that the regulatory DNA maps may help pinpoint cell types that play a role in specific diseases, without needing to know how the disease works. For example, there are genetic variants linked to Crohn’s disease, a common type of inflammatory bowel disease, that the scientists found concentrated in the regulatory regions of two types of immune cell: the same types that previously took researchers decades to link to Crohn’s disease.
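The kind of lookup behind findings like these, checking whether disease-linked variants fall inside mapped regulatory DNA, can be sketched as a simple interval search. This is a minimal illustration, not the ENCODE pipeline: the coordinates are invented, the function names are hypothetical, regions are assumed to be non-overlapping half-open intervals, and the sketch covers only variants lying "within" a region, not those "tightly linked" to one.

```python
from bisect import bisect_right


def build_index(regions):
    """Group (chrom, start, end) regions by chromosome, sorted by start.

    Assumes half-open [start, end) intervals that do not overlap.
    """
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        ends = [e for _, e in intervals]
        index[chrom] = (starts, ends)
    return index


def in_regulatory_region(index, chrom, pos):
    """Return True if pos lies inside any indexed region on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last region whose start is <= pos, if there is one.
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


def fraction_in_regions(index, variants):
    """Share of (chrom, pos) variants that fall inside indexed regions."""
    hits = sum(in_regulatory_region(index, c, p) for c, p in variants)
    return hits / len(variants)


# Invented coordinates for illustration only.
regions = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 50, 80)]
variants = [("chr1", 150), ("chr1", 300), ("chr1", 649),
            ("chr2", 80), ("chr3", 10)]
index = build_index(regions)
print(fraction_in_regions(index, variants))  # 0.4: two of the five variants hit
```

Running the same lookup against region maps from different cell types, and comparing the fractions, is one way to see which cell types a disease's variants concentrate in.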

The ENCODE researchers also looked at the building blocks of DNA, the nucleotides A, T, C and G, and how they were conserved in the newly mapped regulatory regions. Conserved means they have not changed over long periods of evolution. The scientists can see this by analyzing variability in those DNA regions within and among species.
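As a rough illustration of what measuring conservation involves, the sketch below scores each position in a toy alignment by the share of sequences carrying the most common nucleotide. The sequences are invented and the function name is hypothetical; the sketch assumes equal-length, ungapped sequences, whereas real conservation analyses rely on evolutionary models across many genomes.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-position share of sequences that carry the most common base.

    Assumes every sequence has the same length and contains no gaps.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Invented sequences for illustration only, one per hypothetical species.
alignment = [
    "ACGTAC",
    "ACGTTC",
    "ACGAAC",
    "ACGTGC",
]
print(column_conservation(alignment))
# [1.0, 1.0, 1.0, 0.75, 0.5, 1.0]
```

Positions scoring 1.0 are identical in every sequence, the toy equivalent of a conserved nucleotide; lower scores mark positions that have varied.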

Kellis and colleagues recently published a paper that showed 5% of non-coding DNA is conserved across mammals.

In one of the ENCODE papers published online in Science on 5 September, Kellis and Lucas Ward, a postdoctoral researcher at MIT, reveal that another 4% of non-coding DNA is conserved in humans, suggesting those regions control traits that have evolved recently, some of which are unique to humans.

Many of the genes in the newly identified regulatory regions encode regulators that switch on other genes, as Ward explained:

“Genes involved in the nerve growth pathway and color vision, both of which have been hypothesized to be recent innovations in the primate lineage, are enriched in human-constrained elements in non-conserved regions.”

The scientists also found that the nucleotides most likely to be linked with disease when mutated were also the most conserved ones.

In their papers they show how mutations associated with autoimmune diseases like lupus and rheumatoid arthritis are situated in regions that are only active inside immune cells, while regions with variants linked to metabolic diseases are active only in liver cells.

The new studies effectively map a set of reference notes on common human genome functions.

In their next phase of work, the ENCODE teams want to find out how variations lead to disease, by personalizing the maps, as Kellis explains:

“…to basically ask how they vary naturally between individuals, by profiling different cell types from different people, and how their variation relates to human disease and complex human traits.”

Written by Catharine Paddock PhD