Next-Generation Sequencing: Past, Present, and Future
Cutler Simpson
Summary
Sequencing tools have been revolutionary for genomic medicine. From cancer care to prenatal screening and rare disease diagnosis, these technologies have been invaluable. The history of sequencing began with Sanger sequencing in the 1970s, and the field has been evolving ever since, drawing on the latest bioinformatics, computational, and healthcare data to better provide for patients. Parallelization has allowed more data to be processed simultaneously, reducing the need for multiple tests and improving diagnostic capabilities. As this technology continues to progress, the goal of the advanced tooling will be to process more data at lower cost and with greater accuracy.
Background
Next-generation sequencing (NGS), also known as second-generation sequencing, is a phrase regularly used in genomic medicine to describe all massively parallel sequencing technology. This technology allows millions of DNA or RNA fragments to be sequenced simultaneously and several genes or gene regions to be analyzed with a single test, making these tools more broadly applicable than previous sequencing technology. The difference in speed is striking: sequencing the human genome with Sanger sequencing took over a decade, whereas NGS can deliver results in a single day [5]. Next-generation sequencing is used clinically in testing for germline and somatic mutations, targeted panels (typically the first line of testing for inherited disorders), whole-exome sequencing (WES), whole-genome sequencing (WGS), mitochondrial DNA sequencing, prenatal cell-free DNA screening, pharmacogenetics, liquid biopsy for cancer screening and for monitoring progression and relapse, and more [5].
First Generation
To understand why NGS instrumentation was necessary, it is helpful to understand the original, or first-generation, sequencing tools. The two primary first-generation DNA sequencing methods were Maxam-Gilbert chemical cleavage and Sanger dideoxy synthesis (Sanger sequencing) [5]. Maxam-Gilbert cleavage was based on the chemical modification of DNA and was used far less commonly. Sanger sequencing was developed in 1977 and used radioactively labeled dideoxynucleotides that could be detected on a sequencing gel. This method became the sequencing standard and is still appropriate for many applications, such as verifying plasmid constructs and PCR products [5]. The major flaws with Sanger sequencing are that it is relatively slow, especially by NGS standards, and low throughput; however, significant innovations, such as fluorescent rather than radioactive labels and software for interpreting sequences, have supported its continued use. Despite these improvements, the need to sequence larger genomes at higher throughput and lower cost led to the development of second-generation technology with improved processes.
Second Generation Workflow
The workflow of second-generation sequencing technology for these clinical applications follows the same template: DNA is first extracted, then prepared for the sequencer in the library preparation step, where it is broken into fragments and adaptors are added to the ends [5]. After this preparation comes target enrichment, and the enriched library is then sequenced. During sequencing, the first step is to immobilize each fragment so that it can be amplified to generate a signal large enough to be detected. The raw reads are then processed: they are mapped to a reference genome, the aligned reads are organized in a single BAM file, and any differences from the reference are reported in a variant call format (VCF) file [5]. Even with the mapping results organized in this way, interpreting the variants remains a complicated process. Recessive disorders are difficult to map, and there are fewer data from underrepresented communities, making the mapping less accurate [5]. Additionally, DNA sequencing may produce incidental findings that need to be reported back to the patient, which is difficult and requires genetic counseling for the patient to understand the implications of the results [5]. The final and most crucial step in the second-generation sequencing workflow is validating the results. Because NGS is subject to false positives, it is imperative to validate the process at each step of the workflow.
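As an illustration only, the secondary-analysis portion of this workflow (aligning reads into a BAM file and reporting differences in a VCF) can be sketched with common open-source command-line tools. The tools named below (bwa, samtools, bcftools) are real, but the file names are placeholders and the parameters are left at their defaults rather than clinically validated settings.

import subprocess

# Placeholder inputs: a reference genome and paired-end reads from a prepared library.
REF = "reference.fa"
R1 = "sample_R1.fastq"
R2 = "sample_R2.fastq"
BAM = "sample.sorted.bam"
VCF = "sample.vcf"

def run(cmd):
    # Run a shell pipeline; raise an error if the final command exits non-zero.
    subprocess.run(cmd, shell=True, check=True)

# 1. Index the reference so the aligner can map fragments against it.
run(f"bwa index {REF}")

# 2. Align the fragments and coordinate-sort the alignments into a BAM file.
run(f"bwa mem {REF} {R1} {R2} | samtools sort -o {BAM} -")
run(f"samtools index {BAM}")

# 3. Pile up bases at each reference position and call variants;
#    differences from the reference are written to the VCF.
run(f"bcftools mpileup -f {REF} {BAM} | bcftools call -mv -Ov -o {VCF}")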
Producers
There are currently two main producers of sequencers for second-generation tooling, Illumina and Ion Torrent, with Illumina being the dominant manufacturer. Illumina produces sequencers such as the HiSeq, MiSeq, and NextSeq machines, but these devices tend to perform poorly in regions rich in guanine and cytosine [5]. The HiSeq machine nonetheless broke barriers when it was first introduced by increasing sequencing capacity to astonishing levels, and the overall dominance of the Illumina line of products can be seen in the amount of data submitted to the Sequence Read Archive [5]. The Ion Torrent machines are not without fault either and tend to suffer from truncation errors [5]. The sequencers from these producers fall into two broad categories: sequencing by hybridization and sequencing by synthesis. Sequencing by hybridization was originally developed in the 1980s and has since been used primarily in technologies that rely on specific probes, such as identifying disease-related single nucleotide polymorphisms (SNPs) [5]. Sequencing by synthesis, on the other hand, is a further development of Sanger sequencing but without dideoxy terminators. This method relies on reads shorter than Sanger's that are run in parallel to reduce the cost per base pair. However, sequencing by synthesis tends to yield higher error rates than Sanger sequencing and depends on high sequence coverage to identify a consensus sequence and compensate for the higher error rate [5]. Despite the higher error rate, the sequencing-by-synthesis NGS strategy has several advantages over Sanger sequencing; most notably, it captures a broader spectrum of mutations [1]. Sanger sequencing is useful for identifying substitutions and small insertions or deletions, but NGS can be applied in an entirely unselective way, interrogating complete genomes and exomes to discover new mutations or disease-causing genes. Furthermore, NGS supports significantly more sensitive investigations, including analysis of fetal DNA from maternal blood and tracking of tumor levels via liquid biopsy, because these tools can identify variants present in only a fraction of the cells [1]. Despite all the benefits of next-generation sequencing, the current second wave of tooling still faces major limitations.
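To make the coverage point concrete, the toy sketch below (in Python, and not any manufacturer's actual base-calling algorithm) shows how a simple per-position majority vote across overlapping reads can recover the true sequence even though every individual read contains an error.

from collections import Counter

def consensus(aligned_reads):
    # Majority-vote consensus over reads already aligned to the same window.
    calls = []
    for bases in zip(*aligned_reads):
        calls.append(Counter(bases).most_common(1)[0][0])
    return "".join(calls)

# Five reads covering the same 8 bp window; each read carries one error,
# but the per-position majority still recovers the true sequence ACGTACGT.
reads = ["ACCTACGT", "ACGAACGT", "ACGTACCT", "TCGTACGT", "ACGTTCGT"]
print(consensus(reads))  # prints ACGTACGT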
Limitations
The primary limitation of and concern with all emerging technology is cost. Second-generation sequencing tools run reads in parallel to reduce the cost per base pair; however, the infrastructure still needs to be in place to process the sequencing. Several infrastructural considerations must be established to implement NGS clinically: counselors must be available to explain the risks of potential results before testing and the results themselves afterward, the necessary consent forms must be in place and understood by patients, and healthcare providers must understand which incidental findings should be reported back to the patient [5]. In addition to the human components, there needs to be enough computing capacity and storage to handle the large amounts of data produced by sequencing, and the human expertise must be present to analyze those data [1]. Even with all of this in place, validating NGS assays alone can cost between $50,000 and $70,000 [5]. Cost aside, clinical use of the results requires integration with patients’ electronic medical records, which historically has not been done successfully outside of academic centers focused on a specific subset of information [5].
Once the necessary steps have been taken to prepare for running a sequence, limitations of the sequencers themselves come into play. For example, the analytical sensitivity for single nucleotide variants is between 5% and 10% because of noise from PCR amplification, broader sequencing errors, and other systematic errors [5]. Systematic errors alone have been found to produce a 4% to 6% error rate, which may be either sequence-specific or tied to a particular location within the read [5]. The two main methods for improving these error rates are to use overlapping paired-end reads and to introduce a unique identifier (UID), a random nucleotide tag added to each DNA fragment, so that reads sharing a tag form a UID family that can be analyzed as a group [5]. Moreover, certain regions are more difficult to sequence, which also contributes to the error rate. Homologous regions, or areas with high sequence similarity, are notoriously difficult to sequence, and this problem is not unique to NGS [5]. Repetitive regions such as the poly-A tail are difficult for second-generation sequencers to handle properly, and most testing of these areas uses traditional Sanger sequencing [5]. NGS tools also struggle to sequence regions rich in guanine and cytosine because these regions appear to have higher background noise, resulting in lower-quality sequencing, especially on sequencers produced by Illumina [5]. Outside of these specific regions, second-generation sequencing technology tends to report high rates of false positives for structural rearrangements and copy number variations [5]. Finally, all of these data need to be stored so that they can be interpreted. The results of sequencing can be found in public, private, or lab-specific databases, but the ability to interpret the data is not as advanced as the technology that produces it. No database is entirely comprehensive or error-free, so results need to be validated before conclusions are drawn. Additionally, many databases lack data quality assurances and may be out of date or contain conflicting data that must be reconciled before the results are clinically useful [5]. To address some of these limitations, the next wave of technology, known as third-generation sequencing, has begun to be used in research settings.
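Returning to the UID approach described above, the following minimal Python sketch (illustrative only, not a specific platform's pipeline) groups reads that share the same random tag into a UID family and collapses each family to a consensus, which suppresses random PCR and sequencing errors.

from collections import defaultdict, Counter

def uid_family_consensus(tagged_reads):
    # tagged_reads: (uid, read_sequence) pairs; returns one consensus per UID family.
    families = defaultdict(list)
    for uid, seq in tagged_reads:
        families[uid].append(seq)
    consensuses = {}
    for uid, seqs in families.items():
        # Per-position majority vote within the family.
        consensuses[uid] = "".join(
            Counter(bases).most_common(1)[0][0] for bases in zip(*seqs)
        )
    return consensuses

# Two original fragments (tags AAT and GGC), each read three times; one read
# per family carries a random error that the family consensus removes.
reads = [
    ("AAT", "ACGTTT"), ("AAT", "ACGTTT"), ("AAT", "ACCTTT"),
    ("GGC", "TTGACA"), ("GGC", "TTGACA"), ("GGC", "TTCACA"),
]
print(uid_family_consensus(reads))  # {'AAT': 'ACGTTT', 'GGC': 'TTGACA'}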
Third Generation Sequencing
The Oxford Nanopore and PacBio SMRT (single-molecule real-time) tools have been tested in research settings and show promise for addressing problems in clinically relevant regions in the future. The PacBio SMRT technology has attracted interest because it allows very long fragments to be sequenced, and the real-time sequencing allows the rate of each nucleotide addition to be measured [1]. These long reads allow the third generation to overcome issues with repeat regions and do not require amplification, which helps reduce noise [1]. The main drawbacks of this tooling at the moment are its high price and low throughput combined with high error rates; however, running multiple sequences and aligning the separate individual reads overcomes these errors [1]. There is a great deal of promise in the third generation of sequencing tools, and as sequencing technology continues to advance, the goal will be to sequence more data, faster and more accurately, with less input sample required, all at a lower cost.
Clinical Applications
To illustrate the effectiveness and clinical necessity of NGS, one study analyzed SARS-CoV-2 in 2020 in an effort to monitor and counter the spread of human infectious diseases [2]. This study needed “ultra-rapid and cost-effective methods for the reconstruction of the genomic sequences of emerging pathogens…”, which would have been unimaginable just a few years ago [2]. NGS provided genomic sequencing data about SARS-CoV-2 that helped identify sites that could potentially adapt to human hosts [2], which is vital in the fight against COVID-19 and other infectious diseases. There is still work to be done on organizing large amounts of data; however, NGS methods made it possible to answer complex biological questions about the disease quickly.
Conclusions
Sequencing technology has come a long way since Sanger sequencing became a standard in the 1970s. The current wave of next-generation tools has vastly improved in areas where Sanger struggled but is still not perfect. Difficult-to-sequence regions as well as overall accuracy are significant areas of interest for improvement moving forward. The third-generation tools currently being developed are driven by a desire to obtain as much information as possible, at as low a cost as possible, with as little sample as possible, by generating longer reads while maintaining the massively parallel nature of second-generation technology. These tools have gone a long way toward revolutionizing genomic medicine, having been used for tumor screening, fetal monitoring, countering infectious diseases, and much more.
The future is bright for where this technology can take genomic medicine.
References
1. Behjati S, Tarpey PS. What is next generation sequencing? Arch Dis Child Educ Pract Ed. 2013;98:236–238.
2. Chiara M, et al. Next generation sequencing of SARS-CoV-2 genomes: challenges, applications and opportunities. Brief Bioinform. 2021;22(2):616–630. doi:10.1093/bib/bbaa297
3. Levy SE, Boone BE. Next-generation sequencing strategies. Cold Spring Harb Perspect Med. 2019;9(7):a025791. doi:10.1101/cshperspect.a025791
4. Slatko BE, Gardner AF, Ausubel FM. Overview of next-generation sequencing technologies. Curr Protoc Mol Biol. 2018;122:e59. doi:10.1002/cpmb.59
5. Yohe S, Thyagarajan B. Review of clinical next-generation sequencing. Arch Pathol Lab Med. 2017;141(11):1544–1557. doi:10.5858/arpa.2016-0501-RA