Researchers have published a gapless sequence of the human genome. John Niklasson/Getty Images
  • Researchers belonging to the Telomere-to-Telomere (T2T) consortium have published the complete sequence of the human genome, filling in gaps present in previous versions.
  • Previously published sequences accounted for 92% of the human genome and were incomplete due to technological limitations.
  • The T2T consortium of researchers deployed advanced sequencing technologies to sequence the remaining 8% of the human genome, adding almost 200 million base pairs of new sequence.
  • The publication of the complete sequence will help scientists understand the role of the previously unsequenced regions in human development, evolution, and diseases.

Although the Human Genome Project announced the completion of the sequencing of the human genome in 2003, there were unsequenced regions in the genome due to technical limitations.

Scientists at the Telomere-to-Telomere (T2T) consortium have now sequenced the complete human genome, which includes 8% of the genome that was unsequenced until now.

The newly released genome, referred to as T2T-CHM13, includes gapless assemblies of all human chromosomes except Y and will serve as a reference genome. This means it will be a template against which researchers and clinicians can compare other genomes.

The T2T-CHM13 genome includes the sequence of almost 200 million base pairs that were missing in the previously used reference genome, GRCh38, published by the Genome Reference Consortium. Besides filling the gaps in the genome, the T2T-CHM13 genome also corrects errors present in GRCh38.

Dr. Karen Miga, a co-lead of the T2T consortium and professor at the University of California, Santa Cruz, told Medical News Today, “The availability of a complete genome sequence will advance our understanding of the most difficult-to-sequence and repeat-rich parts of the human genome.”

“In the future, when someone has their genome sequenced, researchers and clinicians will be able to identify all of the variants in their DNA and use that information to better guide their healthcare. Knowing the complete sequence of the human genome will provide a comprehensive framework for scientists to study human genomic variation, disease, and evolution.”

– Dr. Miga

The study describing the sequencing of the complete human genome appears in the journal Science. Five companion studies by T2T consortium scientists accompany the manuscript. In them, scientists are further investigating the structure and the function of the previously unsequenced regions of the genome.

During the preparation of the earlier drafts of the human genome, scientists used an approach involving the sequencing of a large number of short overlapping fragments of DNA covering the entire chromosome. These DNA fragments were then aligned based on their overlapping sequences, allowing the researchers to reconstruct the sequence of each chromosome.

The scientists adopted such an approach because the DNA sequencing technology available at that time was only capable of sequencing DNA fragments, or reads, around 500 base pairs long.
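The overlap-based reconstruction described above can be illustrated with a toy example. The following is a minimal sketch, assuming error-free reads and a short made-up sequence; the function names `overlap` and `greedy_assemble` are hypothetical, and the greedy strategy shown is a textbook simplification, not the method used by the Human Genome Project or the T2T consortium.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> str:
    """Repeatedly merge the two reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; the remaining pieces stay separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return max(reads, key=len)  # longest reconstructed piece


# Made-up 19-base "chromosome" and three overlapping reads taken from it.
toy_sequence = "ACGTTGCAAGCTTAGGCAT"
toy_reads = [toy_sequence[10:19], toy_sequence[0:8], toy_sequence[5:13]]
print(greedy_assemble(toy_reads) == toy_sequence)
```

The sketch works here because every overlap in the toy sequence is unique, so each read has only one possible neighbor.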

The genetic information carried by the DNA is present in the form of a specific sequence of four nitrogen bases: adenine (A), thymine (T), guanine (G), and cytosine (C). Certain regions of the genome consist of repetitive sequences, which include similar or identical copies of a specific DNA sequence.

These repetitive sequences can be present on either the same or different chromosomes. For instance, telomeres, the regions at each end of the chromosome, tend to consist of the sequence TTAGGG repeated multiple times over a stretch of 2,000 to 50,000 base pairs.

In regions of the genome containing repetitive sequences, the researchers were unable to reconstruct the sequence of the chromosomes because fragments from different copies of a repeat overlapped equally well with one another. Moreover, researchers were unable to determine the number of copies of such repetitive sequences present on the chromosomes.
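The copy-number problem can be shown with a small, idealized sketch. It assumes error-free reads, ignores read depth, and uses made-up flanking sequences around the telomeric unit TTAGGG; `short_reads` is a hypothetical helper. Two sequences that differ only in the number of repeat copies produce exactly the same set of short reads, so the reads alone cannot reveal which one was sequenced.

```python
def short_reads(sequence: str, read_len: int) -> set[str]:
    """Every distinct substring of length `read_len` (an idealized read set)."""
    return {sequence[i:i + read_len] for i in range(len(sequence) - read_len + 1)}


# Made-up unique sequence on either side of the repeat array.
LEFT_FLANK = "ACGTACCTGA"
RIGHT_FLANK = "CATGGATCCA"
UNIT = "TTAGGG"

ten_copies = LEFT_FLANK + UNIT * 10 + RIGHT_FLANK
twenty_copies = LEFT_FLANK + UNIT * 20 + RIGHT_FLANK

# With 12-base reads, both sequences yield the same distinct reads.
print(short_reads(ten_copies, 12) == short_reads(twenty_copies, 12))  # True
```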

Advances in technology have made it possible to sequence larger fragments of DNA. Current sequencing technologies are capable of sequencing DNA fragments ranging in length from a few kilobase pairs (1 kilobase pair is 1,000 base pairs) to over 100 kilobase pairs.

These technologies are useful for sequencing large DNA fragments with repetitive sequences but have a relatively high error rate. To ensure a high level of accuracy, the T2T consortium researchers combined these long-read sequencing technologies with a different sequencing technology that has a read length of 20 kilobase pairs and a low error rate.
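Why longer reads help with repeats can be sketched as follows: a read that covers an entire repeat array plus unique sequence on both sides allows the copies to be counted directly. This is a minimal illustration, assuming an error-free read and made-up flanking sequences; `count_repeat_units` is a hypothetical helper, and real long reads contain errors, so exact matching like this would not be sufficient in practice.

```python
import re
from typing import Optional


def count_repeat_units(read: str, left_flank: str, right_flank: str,
                       unit: str) -> Optional[int]:
    """Count tandem copies of `unit` between the two flanks.

    Returns None if the read does not span the whole array, that is,
    if it lacks either flank or the sequence between them is not a
    clean run of `unit`.
    """
    pattern = (re.escape(left_flank)
               + "((?:" + re.escape(unit) + ")+)"
               + re.escape(right_flank))
    match = re.search(pattern, read)
    if match is None:
        return None
    return len(match.group(1)) // len(unit)


# Made-up flanks around 20 copies of the telomeric unit TTAGGG.
left, right, unit = "ACGTACCTGA", "CATGGATCCA", "TTAGGG"
spanning_read = "GG" + left + unit * 20 + right + "TT"
partial_read = left + unit * 5  # ends inside the repeat array

print(count_repeat_units(spanning_read, left, right, unit))  # 20
print(count_repeat_units(partial_read, left, right, unit))   # None
```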

Individuals tend to show differences in the copy number or the orientation of repetitive DNA sequences, which can have health implications. The GRCh38 reference genome was generated using genetic material obtained from multiple different individuals and does not represent a complete set of chromosomes from a single individual.

To address this shortcoming, the T2T consortium researchers used a cell line called CHM13 derived from a complete hydatidiform mole. A complete hydatidiform mole is a form of non-viable pregnancy involving the formation of a mass of cells generally composed of two sets of identical chromosomes, including two X chromosomes, derived from the male parent.

The use of this cell line in the present study made it easier to sequence the genome and provided a complete sequence of a single set of chromosomes.

A major region of the chromosome with missing sequences in the GRCh38 genome was the centromere, which contains a large number of repeated DNA sequences.

The centromere is a constricted region of the chromosome that divides the chromosome into a short arm and a long arm. Centromeres play an important role in the segregation of chromosomes between the daughter cells during cell division.

Using the advanced sequencing technologies, the T2T consortium researchers were able to sequence the centromeres and regions surrounding the centromeres, which account for 6.2% of the entire genome.

In a companion study, T2T researchers led by Dr. Miga used the T2T-CHM13 genome to characterize DNA sequences in centromeres that interact with kinetochores, a protein complex that facilitates the separation of chromosomes during cell division. They were also able to gain insights into how these centromere DNA sequences might have evolved.

Moreover, using T2T-CHM13 as a reference, the researchers compared the centromere sequences of the X chromosomes of individuals with diverse genetic backgrounds. They found considerable variation in the DNA sequence of centromeres among these individuals, a finding that could help researchers understand the impact of this genetic variation on centromere function.

Dr. Steven Henikoff, a molecular biologist at Fred Hutchinson Cancer Center, told MNT, “Despite the central role [of centromeres] in biology, researchers still don’t know what it is about them that makes the DNA sequence that specifies a centromere so different from that of the rest of the chromosome.”

“Understanding the centromere as a unit is needed to fully understand errors in chromosome movement when cells divide, which is thought to be a driver in cancer and some other human diseases, including birth defects. So finishing the job of sequencing the human genome is important not only because it’s needed to fully understand a central problem in genetics, but also because of the importance of centromeres in human health and disease,” added Dr. Henikoff.

In addition to centromeres, the T2T-CHM13 genome also includes the sequences of the short arms of five chromosomes that were, to a large extent, unsequenced. These five chromosomes are acrocentric, with their short arms being disproportionately shorter than their long arms.

Besides containing repetitive sequences, there is a significant degree of similarity among the sequences of the short arms of the five acrocentric chromosomes, explaining the difficulty in sequencing these regions.

The short arms of acrocentric chromosomes encode ribosomal RNA molecules, which do not code for proteins but are components of ribosomes. Ribosomes are sites where protein synthesis occurs, highlighting the importance of sequencing these acrocentric chromosomes to understand the regulation of protein synthesis. In the present study, the researchers sequenced 9.9 megabase pairs of DNA that encode ribosomal RNA.

Dr. Brian McStay, a professor at the National University of Ireland, Galway, told MNT: “The short arms of the five human acrocentric chromosomes are key to building nucleoli, the largest structures present in the human nucleus. Nucleoli are the factories where ribosomes, the biological machines that manufacture proteins, are constructed. A complete sequence for these chromosome arms will kick-start a new era of research into how nucleoli function in normal, diseased, and aging human cells.”

The T2T consortium researchers also used over 3,000 genome samples from individuals across the globe and compared these genome samples with the T2T-CHM13 and the GRCh38 reference genomes. They identified a number of gene variants associated with human health and disease in the regions that were missing in the GRCh38 reference genome and were able to remove variants incorrectly identified by GRCh38.

Significantly, using T2T-CHM13 as the reference allowed variants of these medically relevant genes to be identified with 12-fold greater accuracy than with the GRCh38 genome. This included genes for a wide variety of conditions, including cancer, immune disorders, muscular dystrophy, and hearing loss.

However, more research is needed to identify additional variants of medically relevant genes in the previously unsequenced regions.

The study’s co-author, Dr. Justin Zook, a biomedical engineer at the National Institute of Standards and Technology, says:

“What we found is that this new reference improved accuracy across the board. So, regardless of what the ancestry of the individual was, whether they were African, Caucasian, or Asian, the new reference improved results for them.”

In companion studies, the T2T consortium researchers have also used computational methods to characterize the expression profile of genes in the previously unsequenced regions and how these genes may be regulated. Such efforts will further improve the understanding of the regulation of gene expression in these unsequenced regions in diverse populations and in various medical conditions.

Dr. Miga noted that the “T2T-CHM13 genome does not capture the full diversity of human genetic variation. To address this bias, the Human Pangenome Reference Consortium has joined with the T2T Consortium to build a collection of high quality reference genomes from diverse populations. This will be a critical focus in the upcoming years.”

Dr. Miga also added that the Y chromosome is not present in the CHM13 cell line and needs to be sequenced using cells from a different source.