In the early 1990s, scientists set out to map the entire DNA sequence of the human genome.
The so-called Human Genome Project aimed to find genetic links to diseases and to understand the function and structure of various elements of the genome, such as which genes encode proteins and what factors regulate gene expression.
The initial results of the Human Genome Project predicted that there are 40,000 genes that can encode proteins, large molecules that are vital for the proper functioning of the body's tissues and organs.
However, as that project drew to a close in 2003, estimates for that number fell to around 20,000–25,000 protein-encoding genes.
Since then, scientists have been working to pin down the definitive proteome (that is, the full set of proteins that can be expressed by genes) and have been focusing on understanding how the expression of these proteins is altered in several diseases.
To this end, an international team of researchers led by Michael Tress, from the Spanish National Cancer Research Centre Bioinformatics Unit in Madrid, Spain, has now examined the genes considered protein-coding by the main proteome databases available.
Tress and colleagues published the results of their research in the journal Nucleic Acids Research. Federico Abascal, of the Wellcome Trust Sanger Institute in Hinxton, United Kingdom, is the first author of the paper.
At least 2,000 genes are 'pseudogenes'
Tress and team found that, of the 22,210 genes listed as protein-coding across the three reference databases, only 19,446 appeared in all three.
They then took a closer look at the remaining 2,764 genes, examining the experimental evidence and the information available from the annotations.
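The comparison described above comes down to simple set arithmetic. The short Python sketch below illustrates the idea. The database names and gene identifiers are invented placeholders, not the researchers' data or method, and only the two totals at the end are taken from the article.

```python
def compare_catalogs(catalogs):
    """Return (union, shared, disputed) gene sets for a dict of name -> set of gene IDs."""
    gene_sets = list(catalogs.values())
    union = set().union(*gene_sets)           # listed by at least one catalog
    shared = set.intersection(*gene_sets)     # listed by every catalog
    disputed = union - shared                 # listed by some catalogs but not all
    return union, shared, disputed


# Hypothetical toy catalogs; names and identifiers are made up for illustration.
catalogs = {
    "database_a": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "database_b": {"GENE1", "GENE2", "GENE3", "GENE5"},
    "database_c": {"GENE1", "GENE2", "GENE6"},
}
union, shared, disputed = compare_catalogs(catalogs)
print(len(union), len(shared), len(disputed))

# The same subtraction using the totals reported in the article.
total_listed = 22_210
in_all_three = 19_446
print(total_listed - in_all_three)  # 2764
```

In the toy data, the genes that appear in only one or two catalogs play the role of the 2,764 genes that the team went on to examine in detail.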
Evidence suggested that the majority of these genes were "noncoding genes or pseudogenes."
The scientists also found that an additional 1,470 genes, all of which were listed as protein-coding in the three collections, did not have the functional characteristics or the typical evolutionary pattern of protein-coding genes.
Therefore, the researchers "believe that the three reference databases currently overestimate the number of human coding genes by at least 2,000, complicating and adding noise to large-scale biomedical experiments."
"Determining which potential noncoding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects."
Directions for future research
Tress shares how the researchers are taking their findings further. "We have been able to analyze many of these genes in detail," he explains, "and more than 300 genes have already been reclassified as noncoding."
"Surprisingly," chimes in study co-author David Juan, "some of these unusual genes have been well studied and have more than 100 scientific publications based on the assumption that the gene produces a protein."
Because so much research rests on the assumption that these genes produce proteins, the results could have wide-reaching consequences for biomedicine. However, more research is needed.
"Our evidence," adds Abascal, "suggests that humans may only have 19,000 coding genes, but we still do not know which [those] 19,000 genes are."