Volume 3, Issue 3 e690
PROTOCOL
Open Access

Profiling DNA Ligase Substrate Specificity with a Pacific Biosciences Single-Molecule Real-Time Sequencing Assay

Alexander T. Duckworth

University of Wisconsin-Madison, Madison, Wisconsin

Contribution: Methodology, Writing - original draft, Writing - review & editing

Katharina Bilotti

New England Biolabs, Ipswich, Massachusetts

Contribution: Investigation, Methodology, Writing - original draft, Writing - review & editing

Vladimir Potapov

New England Biolabs, Ipswich, Massachusetts

Contribution: Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing - original draft, Writing - review & editing

Gregory J. S. Lohman (Corresponding Author)

New England Biolabs, Ipswich, Massachusetts

Corresponding author: [email protected]

Contribution: Conceptualization, Investigation, Methodology, Supervision, Writing - original draft, Writing - review & editing

First published: 07 March 2023

Published in the Molecular Biology section

Abstract

DNA ligases catalyze the joining of breaks in nucleic acid backbones and are essential enzymes for in vivo genome replication and repair across all domains of life. These enzymes are also critically important to in vitro manipulation of DNA in applications such as cloning, sequencing, and molecular diagnostics. DNA ligases generally catalyze the formation of a phosphodiester bond between an adjacent 5′-phosphate and 3′-hydroxyl in DNA, but they exhibit different substrate structure preferences, sequence-dependent biases in reaction kinetics, and variable tolerance for mismatched base pairs. Information on substrate structure and sequence specificity can inform both biological roles and molecular biology applications of these enzymes. Given the high complexity of DNA sequence space, testing DNA ligase substrate specificity on individual nucleic acid sequences in parallel rapidly becomes impractical when a large sequence space is investigated. Here, we describe methods for investigating DNA ligase sequence bias and mismatch discrimination using Pacific Biosciences Single-Molecule Real-Time (PacBio SMRT) sequencing technology. Through its rolling-circle amplification methodology, SMRT sequencing can give multiple reads of the same insert. This feature permits high-quality top- and bottom-strand consensus sequences to be determined while preserving information on top-bottom strand mismatches that can be obfuscated or lost when using other sequencing methods. Thus, PacBio SMRT sequencing is uniquely suited to measuring substrate bias and enzyme fidelity through multiplexing a diverse set of sequences in a single reaction. The protocols describe substrate synthesis, library preparation, and data analysis methods suitable for measuring fidelity and bias of DNA ligases. 
The methods are easily adapted to different nucleic acid substrate structures and can be used to characterize many enzymes under a variety of reaction conditions and sequence contexts in a rapid and high-throughput manner. © 2023 New England Biolabs and The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1: Preparation of overhang DNA substrates for ligation

Basic Protocol 2: Preparation of ligation fidelity libraries

Support Protocol 1: Preparation of ligation libraries for PacBio Sequel II sequencing

Support Protocol 2: Loading and sequencing of a prepared library on the Sequel II instrument

Basic Protocol 3: Computational processing of ligase fidelity sequencing data

INTRODUCTION

Polynucleotide ligases play critical roles in genome replication and repair (Lehman, 1974; Shuman, 2009; Tomkinson, Vijayakumar, Pascal, & Ellenberger, 2006). Several DNA ligases are also critical reagents in molecular biology techniques. Protocols such as cloning, DNA assembly, DNA sequencing library preparation, and molecular diagnostics applications rely on high-efficiency and high-fidelity ligation. In contrast to enzymes with specific small-molecule substrates, defining the details of substrate specificity for enzymes that operate on nucleic acid substrates can be quite challenging. Although general features of the ligation substrate are easily noted (i.e., strand breaks in a DNA helix with an adjacent 3′-hydroxyl and 5′-phosphate group), the primary sequence, the presence of non-canonical base pairs or modified bases, the secondary structure, and other factors can influence reaction rates and outcomes (Lohman et al., 2015; Luo, Bergstrom, & Barany, 1996; Nakatani, Ezaki, Atomi, & Imanaka, 2002; Shuman, 1995; Sriskanda & Shuman, 1998; Wang, Lamarche, & Tsai, 2007; Wu & Wallace, 1989). Understanding the details of ligase substrate specificity is important for fully understanding both the biological roles of these enzymes and their optimal use in molecular biology protocols.

Enzyme bias (different preferences or outcomes based on different primary nucleic acid sequences) and fidelity (here defined as discrimination between Watson-Crick versus mismatched base pairs) are difficult to profile given the vast possible sequence diversity of nucleic acid substrates. For example, rates of action for DNA ligases can vary based on the precise structure of the substrate (e.g., DNA nick, cohesive ends, or blunt-ended substrates), the identity of the sugars (e.g., ribose, deoxyribose, or other backbones), the DNA sequence, and the presence and location of mismatched base pairs within the sequence (Bauer et al., 2017; Bullard & Bowater, 2006; Lohman et al., 2015). To study each possible variation in parallel would be tedious for even a relatively small number of sequence variants, and effectively impossible when considering the vast possible ranges of sequence contexts. Multiplexed assays utilizing sequencing-based readouts permit evaluation of enzyme action on far more substrate sequences than can be reasonably studied in parallel, but can present significant design challenges when using next-generation sequencing (NGS) methods such as Illumina sequencing. These methods depend on amplification as part of the sequencing method, which complicates strand pairing, can obfuscate or eliminate information on strand mismatches and base modifications, and can introduce additional errors during the sequencing process (Schmitt et al., 2012; Slatko, Gardner, & Ausubel, 2018).

Third-generation long-read sequencing methods such as Pacific Biosciences Single-Molecule Real-Time (PacBio SMRT) sequencing provide important advantages when seeking to profile enzyme activity on a broad range of nucleic acid sequence contexts in one experiment (Athanasopoulou, Boti, Adamopoulos, Skourou, & Scorilas, 2022; Potapov & Ong, 2017; Potapov, Ong, Langhorst, et al., 2018; Rhoads & Au, 2015; Roberts, Carneiro, & Schatz, 2013). SMRT sequencing is a true single-molecule sequencing method, with no preamplification of the library, permitting individual reaction products of a nucleic acid enzyme to be sequenced directly (Hu, Chitnis, Monos, & Dinh, 2021; Rhoads & Au, 2015). SMRT sequencing generates structures with hairpin ends, tying the two strands of the helix together in a single sequencing reaction. The sequencing polymerase reads this SMRTbell structure in rolling-circle fashion, providing multiple reads of each strand, which permits generation of high-quality consensus reads for both strands. This feature preserves the sequence of each strand, allowing identification of features such as mismatches, insertions, and deletions that would be difficult to extract using most other NGS sequencing platforms.
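The idea of per-strand consensus can be illustrated with a toy sketch. The example below is not the PacBio consensus algorithm; it assumes a simple column-wise majority vote over equal-length passes, and the pass sequences are invented for illustration. It shows how keeping separate top- and bottom-strand consensus sequences exposes a mismatch that would be lost if the strands were merged.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def majority_consensus(passes):
    """Column-wise majority vote over equal-length passes (toy stand-in for real consensus calling)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


def strand_mismatches(top, bottom):
    """1-based top-strand positions where the two strands are not Watson-Crick paired."""
    expected_top = revcomp(bottom)
    return [i + 1 for i, (a, b) in enumerate(zip(top, expected_top)) if a != b]


# Hypothetical passes: one top-strand pass carries a sequencing error at position 5.
top_passes = ["GGAGACC", "GGAGACC", "GGAGTCC"]
bottom_passes = ["GGTCTTC", "GGTCTTC", "GGTCTTC"]

top_consensus = majority_consensus(top_passes)
bottom_consensus = majority_consensus(bottom_passes)
mismatch_positions = strand_mismatches(top_consensus, bottom_consensus)
```

In this toy case the random error in a single pass is voted out, while the G:T mismatch present in the molecule itself is retained because it is supported by every pass of both strands.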

Here, we present detailed protocols for the high-throughput profiling of DNA ligase mismatch tolerance and substrate sequence preference through application of SMRT sequencing methods. In Strategic Planning, we discuss the design of dimerizable substrates for producing ligation libraries using degenerate bases on the ssDNA overhangs as well as an internal synthesis bias control barcode. Basic Protocol 1 describes the preparation of these substrates from purchased precursor oligonucleotides. Basic Protocol 2 describes the construction of Pacific Biosciences SMRT sequencing-compatible DNA libraries by ligation of these substrates, allowing for variations in experimental conditions. Support Protocols 1 and 2 describe the loading and sequencing of these libraries on the Sequel II instrument. Basic Protocol 3 covers data analysis pipelines used for extraction of fidelity and bias data. An example data set from the ligation of a 5′ four-base overhang substrate by T4 DNA ligase is used to illustrate the protocols (Potapov, Ong, Kucera, et al., 2018).

STRATEGIC PLANNING

Design of DNA Substrates

The most important part of preparing to apply this method is considering the substrate design. The features of DNA substrates required for this assay are (1) degenerate base regions that will form the overhangs, (2) the SMRTbell adaptor sequence required for PacBio sequencing, (3) a Type IIS restriction enzyme recognition sequence that will be placed so that cutting will generate the desired end structure, and (4) an internal degenerate sequence used to assess oligo synthesis biases (Fig. 1A and Table 1). Type IIS restriction enzymes are used due to their ability to cut outside the recognition sequence, allowing for generation of overhangs of any sequence.

Figure 1.
Workflow for profiling ligase substrate specificity. (A) Substrate preparation. In step 1, a single-stranded precursor oligo is extended using DNA polymerase I to create a double-stranded hairpin. In step 2, the substrate is digested with the type IIS restriction enzyme BsaI-HFv2 to create a 5′ four-nucleotide overhang one nucleotide away from its binding site. The substrate pool contains overhangs of every possible four-nucleotide sequence composition. In step 3, T4 DNA ligase is added. Successful ligation of both strands results in dimerization to a fully closed product that is a substrate for PacBio SMRT sequencing. Finally, in step 4, digestion with ExoIII (a dsDNA-specific exonuclease) and ExoVII (a ssDNA-specific exonuclease) removes any remaining starting material and incomplete ligation products, leaving only fully ligated substrate available for PacBio sequencing. The resulting DNA from each step in the substrate preparation workflow is visualized on a bioanalyzer gel image, which includes the DNA 1000 kit molecular weight ladder. (B) PacBio library preparation and sequencing involves binding of the PacBio primer followed by rolling-circle amplification to generate a long concatemer of alternating insert and SMRTbell adapter sequences for each ligation product. Consensus sequences are built for the top and bottom strands, allowing for information about the overhang sequence to be extracted for both strands in each ligation event.
Table 1. Oligonucleotide Sequences
Oligonucleotide | Sequence (a) | MW (g/mol) (b) | Extinction coefficient (L/mol/cm) (c)
Precursor | TCACGTNNNNGGAGACCTGCGATCCAGTGCGCCGTCCATTGATCAACGNNNNNNCAA ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAG | 30,635.3 | 916,150
Extended | TCACGTNNNNGGAGACCTGCGATCCAGTGCGCCGTCCATTGATCAACGNNNNNNCAA ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAG ATTTGNNNNNNCGTTGATCAATGGACGGCGCACTGGATCGCAGGTCTCCNNNNACGTGA | 48,915.5 | 1,476,700
BsaI-HFv2 cut | pNNNNGGAGACCTGCGATCCAGTGCGCCGTCCATTGATCAACGNNNNNNCAA ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAG ATTTGNNNNNNCGTTGATCAATGGACGGCGCACTGGATCGCAGGTCTCC | 44,052.3 | 1,321,375
Sequencing insert | TTGNNNNNNCGTTGATCAATGGACGGCGCACTGGATCGCAGGTCTCCNNNNGGAGACCTGCGATCCAGTGCGCCGTCCATTGATCAACGNNNNNNCAA | 30,218.5 | 924,775
  • a The Type IIS restriction enzyme recognition site is indicated in bold. The SMRT adapter region is underlined. The internal synthesis control is indicated in italics. The expected sequencing insert length is 98 nt. The location of the four-base overhang is in position 48-51. The two internal synthesis control regions are in positions 4-9 and 90-95.
  • b A molecular weight of 309 g/mol was used for degenerate bases (N).
  • c The extinction coefficient was calculated using the IDT Oligo Analyzer, which utilizes the nearest neighbor method.
For this protocol, we use the precursor oligonucleotide sequence:
  • 5′-TCACGTNNNNGGAGACCTGCGATCCAGTGCGCCGTCCATTGATCAACGNNNNNNCAAATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAG-3′

At the 5′ end, the sequence contains six defined nucleotides followed by four degenerate bases. This is followed by a Type IIS recognition sequence (BsaI-HFv2; 5′-GAGACC-3′ as read on this strand, the reverse complement of 5′-GGTCTC-3′) oriented to produce a 5′ four-base overhang composed of the degenerate bases upon cutting with the restriction enzyme. We found that a crucial factor in the generation of substrates with an even representation of all sequences in this degenerate region was the use of a custom phosphoramidite ratio during oligo synthesis of 31% A, 33% C, 17% G, and 19% T. This ratio yielded a large reduction in synthesis bias compared with either even phosphoramidite mixtures or standard mixes intended to produce an unbiased incorporation ratio. Enough lead time should be left to design, order, and receive the oligonucleotide before planning to begin the experiments.

The six nucleotides that are 5′ of the Type IIS cut site (TCACGT) permit clear visualization of successful cleavage in the cutting step (Fig. 1A, lanes 1 and 2), but the specific sequence here is not critical. After the Type IIS binding site are 31 nucleotides of constant but arbitrary sequence (TGCGATCCAGTGCGCCGTCCATTGATCAACG). After this is an additional six-nucleotide degenerate base sequence used as an internal synthesis control to quantify degenerate nucleobase synthesis bias. Finally, the SMRTbell sequence is included at the 3′-end of the oligo. While standard PacBio library preparation protocols rely on ligation of SMRTbell adaptors to linear DNA libraries, this protocol includes the SMRTbell in the ordered DNA sequence. This ensures that strands cannot be separated during substrate preparation or any step of library ligation, preventing loss of any top strand/bottom strand paired information. In the design provided, the SMRTbell region can self-anneal to form a looped secondary structure that provides a primer-template junction to be used in Basic Protocol 1 to convert the precursor to dsDNA by a DNA polymerase (Fig. 1A, step 1). Once extended, the substrates can be cut with restriction enzyme BsaI-HFv2 to generate ssDNA overhangs (Fig. 1A, step 2).

The total length of the oligonucleotide is chosen such that, when the two substrates are ligated, the insert region between the two SMRTbells is 98 base pairs. This length is long enough to avoid sequencing reads being ignored by the sequencer as a primer dimer missing an insert, but short enough that the precursor oligo can be synthesized conveniently using phosphoramidite chemistry. While the 31-base internal constant region was an arbitrarily chosen sequence, it is important to note that parts of this sequence will be used in the bioinformatics pipeline for data processing.
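The relationship between the precursor, the extended hairpin, the BsaI-HFv2-cut substrate, and the 98-nt sequencing insert can be checked with a short sketch. The segment names below are our own labels, and the assumption that extension copies the first 59 nt of the precursor (the insert arm plus two adapter bases) is inferred from the sequences in Table 1 rather than stated in the protocol.

```python
import re

# Segments of the precursor oligo (labels are illustrative, sequences are from Table 1).
FLANK = "TCACGT"                                  # removed by BsaI-HFv2 cleavage
OVERHANG = "NNNN"                                 # degenerate four-base overhang
SITE = "GGAGACC"                                  # one spacer base + GAGACC
CONSTANT = "TGCGATCCAGTGCGCCGTCCATTGATCAACG"      # 31-nt constant region
CONTROL = "NNNNNN"                                # internal synthesis control
LINKER = "CAA"
ADAPTER = "ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAG"  # SMRTbell region

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement; N stays N."""
    return seq.translate(COMPLEMENT)[::-1]


arm5 = OVERHANG + SITE + CONSTANT + CONTROL + LINKER      # 51 nt, 5' arm after cutting
precursor = FLANK + arm5 + ADAPTER

# Assumption: the polymerase copies the 5' arm plus the first two adapter bases.
template_len = len(FLANK) + len(arm5) + 2
extended = precursor + revcomp(precursor[:template_len])

# Cutting removes TCACGT from the 5' end and NNNN + ACGTGA from the 3' end.
cut = extended[len(FLANK):-(len(OVERHANG) + len(FLANK))]

# Ligation joins the 3' arm of one hairpin to the 5' arm of another.
arm3 = cut[-(len(arm5) - len(OVERHANG)):]                 # 47 nt, TTG...GGTCTCC
insert = arm3 + arm5

# 1-based (start, end) coordinates of each degenerate run in the insert.
n_runs = [(m.start() + 1, m.end()) for m in re.finditer(r"N+", insert)]
```

If an alternate substrate is designed, the same construction gives the coordinates of the overhang and synthesis-control regions that the analysis in Basic Protocol 3 would need to use.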

If designing alternate sequences, the user could change the restriction enzyme binding site to use a different enzyme and/or produce a different overhang structure. When doing so, care should be taken to ensure that the constant sequence regions do not include any additional recognition sites for the chosen restriction enzyme. Note that the degenerate bases could potentially create an additional restriction enzyme recognition site if flanking bases are not chosen carefully to prevent this possibility. Finally, if varying the sequence described here, note that the sequence used in the bioinformatic pipeline in Basic Protocol 3 must be changed to match the user-provided sequence.
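A candidate design can be screened for unintended recognition sites with a short script. The sketch below is one possible approach, not part of the published pipeline: it treats N as matching any base, scans both orientations of the site, and reports how many degenerate positions each potential match relies on. The function name and return format are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement; N stays N."""
    return seq.translate(COMPLEMENT)[::-1]


def potential_sites(seq, site="GGTCTC"):
    """List (start, strand, n_degenerate) for every window that could match the site.

    start is 0-based; strand is '+' for the site as given and '-' for its
    reverse complement; n_degenerate counts the N positions in the window.
    """
    hits = []
    for strand, pattern in (("+", site), ("-", revcomp(site))):
        for start in range(len(seq) - len(pattern) + 1):
            window = seq[start:start + len(pattern)]
            if all(w == "N" or w == p for w, p in zip(window, pattern)):
                hits.append((start, strand, window.count("N")))
    return sorted(hits)


precursor = (
    "TCACGTNNNNGGAGACCTGCGATCCAGTGCGCCGTCCATTGATCAACGNNNNNNCAA"
    "ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAG"
)
hits = potential_sites(precursor)
fixed_hits = [h for h in hits if h[2] == 0]       # sites present in every molecule
degenerate_hits = [h for h in hits if h[2] > 0]   # sites present only for some N combinations
```

Hits with no degenerate positions occur in every molecule and should be limited to the intended site. Hits that depend on k degenerate positions arise in roughly 1 in 4^k molecules under uniform base incorporation, so windows anchored by several fixed flanking bases deserve the most attention.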

Many NGS methods benefit from pooling samples for sequencing, also known as multiplexing, to increase throughput and lower cost. With the increased sequencing depth available from updated PacBio platforms (beginning with the Sequel and Sequel II instruments), it is now possible to sequence multiple experiments in a single sequencing run/SMRT Cell and still achieve the required read depth for data analysis. While traditional PacBio library prep multiplexing methods rely on the ligation of barcoded SMRTbell adapters, the SMRTbell adapters in this workflow are incorporated during oligo synthesis, and therefore any desired barcodes must also be incorporated in the precursor oligo design and synthesis. Multiplexing can also be achieved by using a series of substrates that contain a unique region within the arbitrary constant sequence for each desired experiment. This allows the libraries from multiple experiments to be sequenced together and the data from different experiments to be separated bioinformatically post-sequencing by changing the sequence used in Basic Protocol 3 to match each substrate in turn.
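Bioinformatic separation of pooled experiments can be as simple as matching each consensus read against the experiment-specific constant region. The sketch below is a minimal illustration: the tag sequences, experiment names, and reads are invented, and exact substring matching is assumed rather than the error-tolerant matching a production pipeline might use.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def assign_experiment(read, tags):
    """Return the single experiment whose tag (either orientation) occurs in the read, else None."""
    matches = [name for name, tag in tags.items() if tag in read or revcomp(tag) in read]
    return matches[0] if len(matches) == 1 else None


# Hypothetical experiment-specific tags placed within the constant region.
tags = {"exp_A": "TGCGATCC", "exp_B": "TGCCTTGA"}

reads = ["AAAATGCGATCCAAAA", "CCTCAAGGCACC", "ACGTACGT"]
assignments = [assign_experiment(read, tags) for read in reads]
```

Reads that match no tag, or more than one, are left unassigned so that chimeric or low-quality reads do not contaminate any single experiment.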

Selecting Ligation Reaction Conditions

The example data set presented here uses 1.75 μM T4 DNA ligase to ligate a 100 nM (4.4 ng/µl) 5′ four-base overhang DNA library for 1 hr at room temperature. This protocol can be easily altered to use different ligases, different reaction temperatures or timescales, various DNA substrate libraries, etc., allowing the effects of these variables on ligation outcomes to be observed. Reaction conditions may depend on the ligase and end structure to be studied and the specific research questions being investigated (Bauer et al., 2017; Bilotti et al., 2022).

Determining Necessary Sequencing Depth

For the example data presented here, there are 256 overhangs giving rise to 32,896 unique possible sequence pairings. However, most events in the studied example result from the ligation of fully Watson-Crick overhang pairings or those with only one or two mismatches. In this case, we find that 100,000 reads post-filtering (see Basic Protocol 3) are sufficient to identify all single-base mismatches with at least 90% correlation between replicates (see Supporting Information for Table S1, Figure S1, and supplementary files) (Bujang & Adnan, 2016). As of the date of writing, a typical PacBio Sequel II run can provide several million high-quality reads, each with multiple passes around the insert sequence, which is more than is needed for the evaluation of the substrate used in the presented protocol. If one wishes to multiplex libraries in a sequencing run, alter the substrate to vary more positions, or accurately quantify rare events, the sequencing depth needed to provide enough data to evaluate the outcomes must be considered.
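The size of the pairing space quoted above can be reproduced by direct enumeration. The sketch below counts unordered overhang pairings and classifies them by the number of mismatched positions when the two overhangs are annealed; the helper names are our own.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a, b):
    """Number of non-Watson-Crick positions when overhang a anneals to overhang b."""
    return sum(x != y for x, y in zip(a, revcomp(b)))


overhangs = ["".join(bases) for bases in product("ACGT", repeat=4)]
pairings = list(combinations_with_replacement(overhangs, 2))

# Distribution of pairings by mismatch count (0 = fully Watson-Crick).
by_mismatch = Counter(mismatches(a, b) for a, b in pairings)
```

Only a small fraction of the 32,896 pairings are fully Watson-Crick or carry a single mismatch, which is consistent with a modest read count being sufficient when most ligation events fall into those classes.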

Planning Sequencing Runs

An important consideration is whether the researcher generating the libraries will be loading the sequencer themselves (see Support Protocols 1 and 2) or sending the sample out to a core facility. If the former, all reagents and sequencing consumables should be on hand before beginning the library binding. If the latter, the sequencing facility must be made aware of any modifications to standard protocols, particularly the need to skip any bead purification/sizing step and the need to load the libraries at a higher concentration than recommended in the standard protocol. We consistently find that loading these libraries at 3× the manufacturer's recommended loading amount (450 vs. 150 pM) gives an optimal number of reads.

Basic Protocol 1: PREPARATION OF OVERHANG DNA SUBSTRATES FOR LIGATION

Here, we discuss the preparation of DNA substrates to be used in ligation fidelity experiments using the PacBio Sequel II instrument. This protocol describes the generation of a hairpin substrate with degenerate four-nucleotide overhangs using the Type IIS restriction enzyme BsaI-HFv2. First, a single-stranded precursor oligo is converted to dsDNA by a DNA polymerase (Fig. 1A, step 1 and gel lane 1). After cleanup and analysis of the extended dsDNA oligo, it is cut with BsaI-HFv2 to generate a 5′ four-nucleotide overhang. Cutting the dsDNA substrate leaves the desired ssDNA overhang (Fig. 1A, step 2 and gel lane 2). For more details on precursor oligo design, see Strategic Planning.

Materials

  • 10× NEBuffer 2 (New England Biolabs, cat. no. B7002S)
  • 100 μM (3.06 g/L) precursor oligo (see recipe)
  • 5000 U/ml Klenow fragment (3′→5′ exo−) (New England Biolabs, cat. no. M0212S)
  • 100 U/ml yeast inorganic pyrophosphatase (New England Biolabs, cat. no. M2403S)
  • 10 mM dNTPs (New England Biolabs, cat. no. N0447S)
  • Milli-Q water (Sigma-Aldrich, cat. no. W4502)
  • 0.5 M EDTA (Thermo Fisher, cat. no. 15575020)
  • Monarch PCR & DNA Cleanup Kit (New England Biolabs, cat. no. T1030L)
  • Monarch plasmid miniprep columns (20 μg, New England Biolabs, cat. no. T1017L)
  • 200 proof ethanol (Sigma-Aldrich, cat. no. E7023)
  • 10× rCutSmart buffer (New England Biolabs, cat. no. B6004S)
  • 20,000 U/ml BsaI-HFv2 (New England Biolabs, cat. no. R3733S)

  • PCR strip tubes (Millipore, cat. no. 11667009001)
  • Thermocycler (Bio-Rad T100 Thermal Cycler, cat. no. 1861096) or heat block
  • 1.5-ml LoBind tubes (Eppendorf, cat. no. 022431021)
  • 2100 Bioanalyzer (Agilent, cat. no. G2939BA)
  • Agilent DNA 1000 kit (Agilent, cat. no. 5067-1504)

Extend precursor oligo to generate dsDNA

1. Prepare the following extension reaction in PCR tubes and mix by gentle pipetting:

  • 1.5 μl 10× NEBuffer 2
  • 3 μl 100 μM precursor oligo
  • 3 μl 5000 U/ml Klenow fragment
  • 3 μl 100 U/ml yeast inorganic pyrophosphatase
  • 1.5 μl dNTPs (10 mM each)
  • 3 μl Milli-Q water

2. Incubate 1 hr at 37°C.

This will generate ∼15 μg (∼0.31 nmol) of dsDNA.

3. Add 1 μl of 0.5 M EDTA (31 mM in the resulting 16-μl mixture) and 34 μl Milli-Q water to bring the reaction volume to 50 μl (10 mM EDTA final).

EDTA stops the polymerase activity through binding of the Mg²⁺ in the buffer.
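The quantities quoted for the extension reaction follow from the molecular weights in Table 1. The sketch below works through the conversions; the function names are illustrative, and complete conversion of precursor to extended product is assumed.

```python
# Molecular weights from Table 1 (g/mol).
MW = {"precursor": 30635.3, "extended": 48915.5}


def um_to_ng_per_ul(conc_um, mw):
    """Convert a micromolar concentration to ng/ul (numerically equal to mg/L)."""
    return conc_um * mw / 1000.0


def pmol_to_ug(amount_pmol, mw):
    """Convert pmol to micrograms (pmol x g/mol = pg; 1e6 pg = 1 ug)."""
    return amount_pmol * mw / 1e6


# 100 uM precursor stock expressed as a mass concentration.
precursor_ng_per_ul = um_to_ng_per_ul(100.0, MW["precursor"])

# 3 ul of 100 uM stock (ul x uM = pmol), assumed to be fully extended.
oligo_pmol = 3.0 * 100.0
extended_ug = pmol_to_ug(oligo_pmol, MW["extended"])

# EDTA concentration (mM) after adding 1 ul of 500 mM stock, before and after dilution.
edta_mm_in_16_ul = 1.0 * 500.0 / 16.0
edta_mm_in_50_ul = 1.0 * 500.0 / 50.0
```

The same two conversion helpers can be reused to translate Bioanalyzer readings in ng/µl into molar concentrations for the later steps.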

Perform cleanup and bioanalysis

These steps describe the cleanup of the reaction using the Monarch PCR & DNA Cleanup Kit following the Oligonucleotide Cleanup Protocol. It is important to note that although we use the buffers from this kit, we use 20-μg Monarch Plasmid Miniprep Columns rather than the 5-μg columns in the kit due to the large amount of DNA generated in this protocol.

4. Add 100 μl Binding Buffer from the kit plus 300 μl ethanol and mix well by gentle pipetting.

5. Add the entire diluted reaction mixture to the top of a 20-μg Monarch miniprep column.

6. Centrifuge 1 min at 16,000 × g and discard the flowthrough.

7. Add 500 μl Wash Buffer to the column.

8. Centrifuge 1 min at 16,000 × g and discard the flowthrough.

9. Repeat steps 7-8.

10. Centrifuge once more for 1 min at 16,000 × g.

A final centrifugation removes residual ethanol from the column, permitting maximum recovery in the next step.

11. Transfer column to a clean 1.5-ml LoBind tube.

12. Add 30 μl Elution Buffer to the column and incubate 10 min at room temperature.

13. Centrifuge 1 min at 16,000 × g to elute the purified dsDNA extended hairpin (Table 1).

14. Analyze the size, purity, and concentration of the extended precursor oligo using a Bioanalyzer 2100 and accompanying DNA 1000 Kit following the manufacturer's instructions.

We generally load 1 μl of a 1:10 dilution on the Bioanalyzer. For Bioanalyzer visualization of the extended dsDNA oligo, see Figure 1A (gel lane 1). The concentration of the extended precursor stock is typically 2-4 μM (100-200 ng/µl). The oligo is stable for at least 3 years when stored at −20°C.

If a Bioanalyzer is not available, purity and concentration can be checked using an agarose gel and Nanodrop instrument (Armstrong & Schulz, 2015).

Digest with BsaI-HFv2 to create degenerate overhangs

15. Prepare the following digestion reaction in PCR strip tubes and mix by gentle pipetting:

  • 10 μl 10× rCutSmart buffer
  • 15 μl 20,000 U/ml BsaI-HFv2
  • 29 μl 3 μM extended precursor oligo
  • 46 μl Milli-Q water

If concentration of extended precursor oligo is not 3 μM, adjust the volumes of extended precursor and Milli-Q water so the final concentration in the digestion reaction is ∼1 μM (50 ng/μl) in a 100-μl reaction.

16. Incubate 1 hr at 37°C.

17. Stop reaction by adding 5 μl of 0.5 M EDTA (final 25 mM).

18. Add 200 μl Binding Buffer plus 600 μl ethanol and mix well by gentle pipetting.

19. Repeat steps 5-14 to purify, concentrate, and analyze the dsDNA product.

The purified dsDNA hairpin now contains the randomized overhang for ligation fidelity and bias analysis. Typical recovery is 50%, with expected 30 µl of a solution at 1.5 μM (66.1 ng/μl) in dsDNA library precursor. See Figure 1A, gel lane 2 for Bioanalyzer visualization of the pure product.
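When the extended precursor stock is not at 3 μM, the digestion volumes in step 15 can be rescaled so that the same amount of DNA is delivered. The sketch below assumes the target is the 87 pmol supplied by the listed recipe (29 μl of 3 μM) and keeps the buffer and enzyme volumes fixed; the function names and the 50% recovery default are taken from or modeled on the figures above, not prescribed by the protocol.

```python
def digest_volumes(stock_um, target_pmol=87.0, total_ul=100.0, buffer_ul=10.0, enzyme_ul=15.0):
    """Return (dna_ul, water_ul) delivering target_pmol of DNA in a total_ul digestion."""
    dna_ul = target_pmol / stock_um          # pmol / uM = ul
    water_ul = total_ul - buffer_ul - enzyme_ul - dna_ul
    if water_ul < 0:
        raise ValueError("stock too dilute to fit the target amount in the reaction volume")
    return round(dna_ul, 1), round(water_ul, 1)


def expected_product_um(input_pmol=87.0, recovery=0.5, elution_ul=30.0):
    """Expected concentration (uM) of cut substrate after cleanup."""
    return input_pmol * recovery / elution_ul


standard = digest_volumes(3.0)       # the recipe as listed in step 15
dilute_stock = digest_volumes(2.0)   # a stock at the low end of the typical range
product_um = expected_product_um()
```

A stock below roughly 1.2 μM cannot supply 87 pmol within the 75 μl left after buffer and enzyme, in which case the function raises an error and the reaction would need to be scaled instead.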

Basic Protocol 2: PREPARATION OF LIGATION FIDELITY LIBRARIES

Below, we describe the protocol used to ligate the 5′ four-base overhang library generated in Basic Protocol 1. First, the library is ligated using T4 DNA ligase (Fig. 1A, step 3 and gel lane 3). Next, exonuclease treatment is used to remove potentially confounding DNA sequences that have not been ligated (Fig. 1A, step 4 and gel lane 4). This ensures that any unligated or partially ligated (on one strand only) DNA is removed from the library. This will also remove any uncut or unextended oligos from previous steps. DNA substrates that have been ligated on both strands will be closed and protected from the exonucleases. At the end of this protocol, the ligation library is ready for PacBio sequencing (see Support Protocols 1 and 2).

Additional Materials (also see Basic Protocol 1)

  • BsaI-HFv2-cut ligation substrate (5′ four-base overhang, typically ∼1.5 μM, ∼66.1 ng/μl; see Basic Protocol 1)
  • 2,000,000 U/ml (35 μM) T4 DNA Ligase and 10× reaction buffer (New England Biolabs, cat. nos. M0202M and B0202S)
  • 10× Standard Taq Reaction Buffer (New England Biolabs, cat. no. B9014S)
  • 100 U/μl exonuclease III (New England Biolabs, cat. no. M0206S)
  • 10 U/μl exonuclease VII (New England Biolabs, cat. no. M0379S)

Ligate substrates

1. Prepare the following ligation reaction in PCR tubes and mix by gentle pipetting:

  • 5 μl 10× T4 DNA Ligation Reaction Buffer
  • 2.5 μl 35 μM T4 DNA Ligase
  • 3.3 μl 1.5 μM BsaI-HFv2-cut ligation substrate (Basic Protocol 1)
  • 39.2 μl Milli-Q water

2. Incubate 1 hr at 25°C.

3. Stop reaction by adding 2.5 μl of 0.5 M EDTA (25 mM final).

4. Purify, concentrate, and analyze the product dsDNA as described (see Basic Protocol 1, steps 4-14), but use the 5-μg columns provided with the kit and change the elution volume to 15 μl.

The Monarch PCR & DNA Cleanup 5-μg spin columns should be used in this purification step rather than the 20-μg Monarch Miniprep columns as there should only be ∼0.33 μg (∼3.75 pmol) ligated DNA in the sample. We also elute with 15 μl elution buffer to increase the concentration. We typically load 1 µl of the purified ligation reaction on the Bioanalyzer. See Figure 1A, gel lane 3 for Bioanalyzer visualization of the purified ligation reaction.
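The final concentrations in the ligation reaction of step 1 can be checked, and the substrate volume adjusted for a different stock concentration, with a few lines of arithmetic. The sketch below is illustrative; the helper names are our own and the 1200 nM stock is a made-up example.

```python
def final_concentration(stock_conc, volume_ul, total_ul):
    """Concentration after dilution, in the same units as stock_conc."""
    return stock_conc * volume_ul / total_ul


def substrate_volume_ul(stock_nm, target_nm=100.0, total_ul=50.0):
    """Volume of substrate stock needed to reach target_nm in the ligation reaction."""
    return target_nm * total_ul / stock_nm


# Ligation mix from step 1 (volumes in ul).
mix = {"10x buffer": 5.0, "T4 DNA ligase": 2.5, "substrate": 3.3, "water": 39.2}
total_ul = round(sum(mix.values()), 1)

ligase_um = final_concentration(35.0, mix["T4 DNA ligase"], total_ul)     # stock is 35 uM
substrate_nm = final_concentration(1500.0, mix["substrate"], total_ul)    # stock is 1.5 uM
ligase_to_substrate = ligase_um * 1000.0 / substrate_nm

# Hypothetical substrate prep recovered at 1.2 uM instead of 1.5 uM.
adjusted_ul = substrate_volume_ul(1200.0)
```

When the substrate volume changes, the water volume should be reduced by the same amount so that the reaction remains at 50 μl.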

Treat with exonucleases

5. Prepare the following exonuclease digestion reaction in PCR tubes and mix by gentle pipetting:

  • 5 μl 10× Standard Taq Reaction Buffer
  • 14 μl ligated DNA
  • 0.5 μl 100 U/μl ExoIII
  • 0.5 μl 10 U/μl ExoVII
  • 30 μl Milli-Q water

6. Incubate 1 hr at 37°C.

7. Immediately purify, concentrate, and analyze the product dsDNA as described (see Basic Protocol 1, steps 4-14), but use the 5-μg columns provided with the kit and change the elution volume to 15 μl.

The exonuclease treatment removes any DNA that has not been ligated on both strands to produce a closed SMRTbell sequencing substrate, including unligated starting material. The ligated, exonuclease-treated library is typically recovered at ∼30-100 nM (∼3-10 ng/µl).

8. Dilute library to 1 ng/µl in Milli-Q water for sequencing sample preparation.

Support Protocol 1: PREPARATION OF LIGATION LIBRARIES FOR PACBIO SEQUEL II SEQUENCING

Below, we describe the protocol for sequencing the ligation library on the PacBio Sequel II system (Fig. 1B). Briefly, this protocol includes annealing a sequencing primer to the SMRTbell adapter followed by binding of the polymerase to the primer/SMRTbell junction. We also describe steps taken to maximize reads on the relatively small DNA substrate we are using.

Materials

  • Sequencing Primer v4 (Pacific Biosciences, cat. no. 101-654-600)
  • 1× Elution Buffer (Pacific Biosciences, cat. no. 100-159-800)
  • 1 ng/μl ligation library (see Basic Protocol 2)
  • 10× Primer Buffer v2 (Pacific Biosciences, cat. no. 001-560-849)
  • Sequel II Binding Kit 2.1 (Pacific Biosciences, cat. no. 101-843-000), including:
    • Sequel II Polymerase 2.1
    • Sequel Binding Buffer
    • Sequel dNTPs
    • DTT
    • Sequel Additive

  • PacBio SMRT Link 10 software
  • PCR tubes
  • Thermocycler (Bio-Rad T100 Thermal Cycler, cat. no. 1861096)

Calculate sample setup using SMRT Link

PacBio has created a software tool called SMRT Link (https://www.pacb.com/support/software-downloads/) that contains a “Sample Setup” calculator to help calculate the proper volumes of reagents to add during library preparation. In general, we follow the recommendations of this calculator, with two exceptions: we omit the AMPure PB bead cleanup step, and we load a larger quantity of DNA on the SMRT Cells than the tool recommends. Below is a brief protocol describing the inputs for this calculator using SMRT Link version 10.2.

1. Navigate to the SMRT Link 10 portal.

2. Select “Sample Setup”.

3. Select “New calculation” and specify a sample name.

4. Under the “Application” tab, select “<3 kb amplicons”.

5. Enter the “Available Volume”, “Concentration”, and “Insert Size” in the respective fields.

We generally dilute prepared libraries to 1 ng/µl to avoid overconsumption of reagents and small pipetting volumes.

6. Leave the “Internal Control” field blank.

We do not usually add the Internal Control, but it may be helpful for troubleshooting.

7. Enter 100% for the “Cleanup Anticipated Yield”.

8. Enter 450 pM under “Specify Concentration on Plate”.

While the recommended value here is 40-150 pM, we have found that loading more DNA on the SMRT Cells leads to increased sequencing yield for these short-insert substrates.

9. Enter 1 under “Cells to Bind”.

This is the number of SMRT Cells to be used to sequence this sample. As each SMRT Cell has eight million wells, we have found that using one SMRT Cell per sample is sufficient for this application.

10. For “Sequencing Primer”, select the recommended v4 primer.

11. Under “Binding Kit”, select Sequel II Binding Kit 2.1.

This is important, as different binding kits are used for different insert lengths. It is crucial to use the Sequel II Binding Kit 2.1 for <3 kb amplicons.

Prepare ligation libraries for sequencing

Once the SMRT Link calculator has the proper inputs, it will determine the volumes of reagents to use for library preparation. Below is the protocol we use for preparation of our ligation fidelity libraries.

12. Add 1 μl Sequencing Primer v4 to 29 μl 1× Elution Buffer in a PCR tube.

13. Incubate 2 min at 80°C, then hold at 4°C to condition the primer.

14. Combine the following in a PCR tube to give a 57-μl reaction volume:

  • 3.7 μl ligation library sample
  • 6.8 μl conditioned primer
  • 11.4 μl 10× Primer Buffer v2
  • 35.1 μl H2O

15. Incubate 1 hr at 25°C, then hold at 4°C.

16. Dilute Sequel II Polymerase 2.1 10-fold with Sequel Binding Buffer in a PCR tube.

For example, add 1 µl polymerase to 9 µl buffer. The required amount of diluted polymerase may vary depending on the number and concentration of DNA libraries.

17. Combine the following in a PCR tube to give a 103.5-μl reaction volume:

  • 51.8 μl ligation library sample with annealed primer
  • 10.3 μl Sequel dNTPs
  • 10.3 μl DTT
  • 17.6 μl Sequel Binding Buffer
  • 3.4 μl diluted Sequel II polymerase
  • 10.1 μl H2O

18. Incubate 1 hr at 30°C, then hold at 4°C.

19. For the final loading dilution, add the following to the 103.5 μl polymerase- and primer-bound ligation library (total volume 116.2 μl, final DNA concentration ∼450 pM):

  • 11.5 μl DTT
  • 1.2 μl Sequel Additive

20. Store prepared library at 4°C for up to 24 hr until loading on the PacBio Sequel II instrument.
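The volumes listed in steps 14, 17, and 19 can be checked with simple arithmetic. The sketch below only sums the components as printed above; the variable names are ours, and the volumes reported by the SMRT Link calculator for your own sample take precedence over these values.

```python
# Component volumes (in µl) copied from steps 14, 17, and 19.
anneal = {"library": 3.7, "conditioned primer": 6.8,
          "10x Primer Buffer v2": 11.4, "H2O": 35.1}
binding = {"annealed library": 51.8, "dNTPs": 10.3, "DTT": 10.3,
           "Binding Buffer": 17.6, "diluted polymerase": 3.4, "H2O": 10.1}
loading_additions = {"DTT": 11.5, "Sequel Additive": 1.2}

# Round to one decimal place to hide floating-point noise.
anneal_total = round(sum(anneal.values()), 1)
binding_total = round(sum(binding.values()), 1)
final_total = round(binding_total + sum(loading_additions.values()), 1)

print(anneal_total, binding_total, final_total)
```

The three totals should agree with the 57-μl, 103.5-μl, and 116.2-μl volumes stated in the steps.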

Support Protocol 2: LOADING AND SEQUENCING A PREPARED LIBRARY ON THE SEQUEL II INSTRUMENT

Before setting up the Sequel II instrument, the sequencing parameters must be defined in the “Run Design” tab of the SMRT Link online portal. In this portal, you can import the samples that were previously created in the “Sample Setup” tab (they must be locked in that tab to be accessible here). This will autofill most of the required fields. Importantly, we do 10-hr runs with no pre-extension and use the Sequel II Sequencing Plate 2.0. The Run Design also requires an input for “Template Prep Kit”, which is not used in this protocol; we therefore enter the recommended “SMRTbell Express Template Prep Kit 2.0”. A separate entry should be created in the Run Design for each sample to be sequenced. Once the sequencing run has been designed, the instrument can be set up for sequencing as described below.

Materials

  • Prepared libraries with bound PacBio primer/polymerase (450 pM; see Support Protocol 1)
  • Sample Plate (Pacific Biosciences, cat. no. 000-448-888)
  • Sequel Sample Plate Foil (Pacific Biosciences, cat. no. 100-667-400)
  • ALPS 50 V-Manual Heat Sealer (Thermo Fisher, cat. no. AB-1443A)
  • Sequel II Sequencing Plate 2.0 (contains reagents for four SMRT Cells; Pacific Biosciences, cat. no. 101-820-200)
  • Sequel II System (Pacific Biosciences)
  • Sequel Pipette Tips (Pacific Biosciences, cat. no. 100-667-601)
  • SMRT Cell 8M tray (contains four SMRT Cells; Pacific Biosciences, cat. no. 101-389-001)
  • Sequel Mixing Plate (Pacific Biosciences, cat. no. 100-667-500)
  • SMRT Cell Oil (Pacific Biosciences, cat. no. 100-621-300)
  • Tube Septum (Pacific Biosciences, cat. no. 100-667-700)

1. Transfer all 116.2 μl of sample into well A1 of the sample plate. If more than one SMRT cell will be used, transfer additional samples to wells B1, C1, and D1 in order.

2. Seal plate with sample plate foil using the heat sealer according to manufacturer's instructions.

3. Thaw the sequencing plate at room temperature for 1 hr.

4. At the Sequel II instrument, load the sealed sample plate, thawed sequencing plate, pipette tips, SMRT Cell tray, and mixing plate in their indicated locations.

5. Remove the cap from a tube of SMRT Cell oil and replace it with a tube septum.

6. Place the oil in its indicated position with the red line on the tube aligned with the red line on the instrument.

7. Close the instrument and have it scan for all the materials.

8. Click “Select Existing Run” and select the designated sample name (see Support Protocol 1, step 3).

This will load the run details defined in Support Protocol 1 and start the protocol.

Basic Protocol 3: COMPUTATIONAL PROCESSING OF LIGASE FIDELITY SEQUENCING DATA

Following PacBio sequencing, the scripts described below are used to extract the overhang pairs in the ligated products from each DNA library. A detailed description of the ligase fidelity computational workflow, with all the necessary custom processing scripts, is available on GitHub at https://github.com/potapovneb/CP-LigaseFidelity. This protocol starts with the subreads.bam file from your run, which can be retrieved via the SMRT Link portal or requested from the sequencing service provider.

1. Run the split.py script on the subreads.bam file using the following command:

  • $ split.py --subread-len 98 --adapter-len 45 --outfile0 subreads.0.txt --outfile1 subreads.1.txt subreads.bam

The PacBio SMRT sequencing run produces a subreads.bam file containing individual sequencing subreads for the double-stranded ligation products. The script looks for the longest sequence ...-[subread]-[adapter]-[subread]-[adapter]-... such that both subread and adapter lengths are within expected ranges. The default SMRTbell adapter length is 45 nt, and the subread length for the library generated in Basic Protocol 2 is 98 nt. By default, 25% variation in expected subread and adapter lengths is allowed. The subread names are saved to two separate files: subreads.0.txt and subreads.1.txt. If a different substrate is used, then the correct subread length must be provided using the ‘--subread-len’ option. In our experiments, there are on average 25 subreads per strand.
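To make the length filter concrete, the following sketch shows one way a 25% tolerance window could be applied to subread and adapter lengths. It is an illustration only and is not taken from split.py; the function name and the choice of an inclusive bound are our assumptions.

```python
def length_ok(observed: int, expected: int, tolerance: float = 0.25) -> bool:
    """Return True if observed lies within +/- tolerance (a fraction) of expected.

    Hypothetical helper; split.py may round or bound the window differently.
    """
    return abs(observed - expected) <= tolerance * expected

SUBREAD_LEN = 98   # substrate from Basic Protocol 2
ADAPTER_LEN = 45   # default SMRTbell adapter length

# With these settings the accepted windows are 73.5-122.5 nt and 33.75-56.25 nt.
print(length_ok(90, SUBREAD_LEN))
print(length_ok(60, SUBREAD_LEN))
print(length_ok(45, ADAPTER_LEN))
```

Under this reading, a 90-nt subread passes, whereas a 60-nt subread is rejected.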

2. Use the samtools tool to extract subread sequences to two separate subreads.0.bam and subreads.1.bam files via the following command:

  • $ samtools view -N subreads.0.txt --output subreads.0.bam subreads.bam
  • $ samtools view -N subreads.1.txt --output subreads.1.bam subreads.bam

These commands separate the alternating top and bottom strand reads from each well (arbitrary strand designations) into separate files.

3. Run the PacBio ccs tool on each subreads file using the following commands:

  • $ ccs --min-passes=3 subreads.0.bam subreads_ccs.0.bam
  • $ ccs --min-passes=3 subreads.1.bam subreads_ccs.1.bam
  • $ samtools index subreads_ccs.0.bam
  • $ samtools index subreads_ccs.1.bam

This step builds consensus sequences for the two strands from each well and indexes the resulting files. While the individual PacBio subreads can be noisy, the consensus sequences built from multiple subreads are highly accurate. At least three subreads per consensus sequence are required.

4. Run the summarize_results.py script on the subreads_ccs.0.bam and the subreads_ccs.1.bam files using the following command:

  • $ summarize_results.py --left-bc 'TTG([ACGT]{6})CGT' --overhang 'TCC([ACGT]{4})GGA' --right-bc 'ACG([ACGT]{6})CAA' --num-passes 3 subreads_ccs.0.bam subreads_ccs.1.bam

This step extracts the ligated overhang and the synthesis control region barcode from each consensus sequence. For each consensus strand sequence, the script locates the left barcode region, overhang, and right barcode region using patterns provided in the command line. The script applies a number of filters: (a) overhang and barcode regions must strictly follow the expected patterns, (b) flanking bases in the opposite strands must match exactly, and (c) at least three passes are required for each strand.

This script produces nine output data files: 01_fragments.csv, 02_overhangs.csv, 03_barcodes.csv, 04_barcodes-counts.csv, 05_barcodes-percentages.csv, 06_matrix.csv, 07_fidelity.csv, 08_mismatch-e.csv, and 09_mismatch-m.csv. See Supporting Information for example files. See Understanding Results for a discussion of file formats and output data.

The input parameters in this step are defined by the substrate sequence used in Basic Protocols 1 and 2. For the substrate used in the presented protocol, the left synthesis control barcode region is six randomized bases flanked by TTG and CGT (TTGNNNNNNCGT). The corresponding command line input parameter is ‘TTG([ACGT]{6})CGT’. The ligation overhang region is four randomized bases flanked by TCC and GGA (TCCNNNNGGA) and the corresponding input is ‘TCC([ACGT]{4})GGA’. The right barcode region is six randomized bases flanked by ACG and CAA (ACGNNNNNNCAA) and the corresponding input is ‘ACG([ACGT]{6})CAA’. If a different substrate sequence is employed, it is critical to modify these inputs to match the sequence used.
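The three patterns can be tried out with Python's standard re module before running the full pipeline. The sketch below applies the same pattern strings to a short made-up sequence; the sequence and the helper function are hypothetical, and the additional filters applied by summarize_results.py (flanking-base agreement between strands, number of passes) are not reproduced here.

```python
import re

# Pattern strings as given on the command line in step 4.
LEFT_BC = re.compile(r"TTG([ACGT]{6})CGT")
OVERHANG = re.compile(r"TCC([ACGT]{4})GGA")
RIGHT_BC = re.compile(r"ACG([ACGT]{6})CAA")

def extract_regions(seq):
    """Return (left barcode, overhang, right barcode), or None if a pattern is absent."""
    hits = [pattern.search(seq) for pattern in (LEFT_BC, OVERHANG, RIGHT_BC)]
    if any(hit is None for hit in hits):
        return None
    return tuple(hit.group(1) for hit in hits)

# Made-up toy sequence; the real 98-nt product has longer spacer regions.
toy = "AATTGACGTACCGTGGTCCACCGGGACTACGTTGCATCAAGG"
print(extract_regions(toy))
```

If a different substrate is used, the same quick check can confirm that new patterns capture the intended randomized positions.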

5. Optional: Run the included plot_data.py script to visualize the data in four plots using the following command:

  • $ plot_data.py .

The plots generated are: (a) a frequency heat map of all ligation events (06_matrix.png), (b) a stacked bar plot showing the frequency of ligation products containing each overhang (07_fidelity.png), (c) the frequency of specific base pair mismatches by position (the edge position; 08_mismatch-e.png), and (d) the frequency of specific base pair mismatches by position (the middle position; 09_mismatch-m.png). These plots are visualized in Figures 2 and 3.

Figure 2. Assay results for ligation of randomized four-base overhangs by T4 DNA ligase. Top: Frequency heat map of all ligation events (log-scaled). Overhangs are listed alphabetically (AAAA, AAAC, AAAG, AAAT …TTTA, TTTC, TTTG, TTTT) left to right and bottom to top such that Watson-Crick pairings are shown on the diagonal. Bottom: Stacked bar plot showing the frequency of ligation products containing each overhang, corresponding to each column in the heat map. Fully Watson-Crick paired ligation results are indicated in orange and ligation products containing one or more mismatches are in grey.
Figure 3. Frequency of specific base pair mismatches by position. This figure was generated from the same data shown in Figure 2. The incidence of each possible mismatched base pair observed in the edge (N1 and N4) and middle (N2 and N3) positions of the overhang is reported.

REAGENTS AND SOLUTIONS

Oligonucleotide storage buffer, 10×

  • 10 ml Milli Q water (Sigma-Aldrich, cat. no. W4502)
  • 100 μl 1 M Tris, pH 7.5 (VWR, cat. no. 75800-958)
  • 2 μl 0.5 M EDTA (Thermo Fisher, cat. no. 15575020)
  • Store up to 5 years at −20°C

Precursor oligo, 100 μM

Order precursor oligo (see Table 1 for sequence) from Sigma-Aldrich with custom phosphoramidite ratios for degenerate base positions (31% A, 33% C, 17% G, 19% T). Order at a synthesis scale of 0.2 µmol, which results in enough prepared substrate for at least 100 ligation reactions under the conditions described here (50-µl reaction volume with substrate concentration of 100 nM). The oligo must be cartridge purified, verified by mass spectrometry, and received in lyophilized form. Resuspend the lyophilized oligo precursor in 1× oligonucleotide storage buffer at a final concentration of 100 μM. Store up to 3 years at −20°C.

COMMENTARY

Background Information

DNA ligase fidelity has been well studied in the context of nick ligation (Lohman et al., 2015; Luo et al., 1996; Nakatani et al., 2002; Shuman, 1995; Sriskanda & Shuman, 1998; Wang et al., 2007; Wu & Wallace, 1989). Most studies have investigated substrates one at a time in parallel reactions, generally varying the base pairing at positions closest to the ligation junction. These studies have revealed general features of ligation substrate preference, showing that ligases tolerate mismatches that minimally distort the helix or have multiple hydrogen bond pairings (Rossetti et al., 2015), particularly G:T mismatches that satisfy both preferences. Ligases typically also have a greater tolerance for mismatches at the base pair providing the 5′-phosphate (Lohman et al., 2015; Showalter et al., 2006; Wu & Wallace, 1989); protocols that use ligases for SNP detection such as the Ligase Detection Reaction or Padlock therefore recommend placing the base pair to be discriminated such that it is at the 3′-OH position (Cao, 2001; Nilsson et al., 1994). These general preferences differ in specifics from ligase to ligase, with different absolute fidelities and different relative mismatch preferences for each base pair documented in the DNA ligases studied.

Conversely, little had been reported on fidelity and bias during end joining prior to the development of this assay; most studies judging end structure preference similarly look at one substrate (end type) with a defined sequence per reaction (Bauer et al., 2017). General trends favor longer cohesive ends, but different efficiencies have been reported for short overhangs and blunt ends among the ligases capable of robust end joining, and little had been reported on the effects of sequence or mismatch tolerance in an end joining context.

Studying ligation substrates one structure and sequence at a time has some advantages over the highly multiplexed method described here. Standard and inexpensive methods of imaging products (such as gel electrophoresis) can be employed, and the relative simplicity of one-substrate systems allows for in-depth kinetic analysis to determine kcat, kcat/KM, and single turnover kinetic parameters for these substrates, potentially giving more detailed insight into mechanism and permitting quantitative comparison of rates. However, the limitations of investigating substrates one at a time or in small multiplexed pools are in time and the number of reactions required to evaluate even a small number of sequences for a single ligase.

With the current method, it becomes possible to evaluate the sequence preferences of ligation for many sequences in highly multiplexed libraries (Potapov, Ong, Kucera, et al., 2018; Potapov, Ong, Langhorst, et al., 2018). In the example data provided, we are able to profile the fidelity and sequence bias of end joining looking at every possible 5′ four-base overhang sequence in a single reaction. Varying ligase and/or reaction conditions has permitted analysis of how sequence preferences vary across these conditions in a comprehensive manner (Bilotti et al., 2022). With the read depth available on current Pacific Biosciences instruments, all four-base-pair overhang sequences can be analyzed at once, with all Watson-Crick and single-base mismatches observed, along with rarer observations of double mismatch ligation events. By running multiple cells or multiplexing multiple libraries in a single cell, data investigating the relative preferences of multiple ligases, end types, and sequence space can be generated in a short time.

SMRT sequencing is particularly well suited to this application. By using a small insert, the rolling-circle sequencing method generates many reads of each insert strand (Rhoads & Au, 2015). This permits generation of high-quality consensus reads for each strand of each individual ligation reaction product, permitting the identification of the exact substrate sequences that joined to produce the products and any mismatch positions formed. Nanopore-based single-molecule methods generally permit reading each strand only once, making it much more difficult to accurately identify the bases present at the varied positions in each reaction product (Hu et al., 2021; Slatko et al., 2018). Other NGS methods (Illumina) offer many more reads per sequencing run and are less expensive for any given level of read depth (Hu et al., 2021). However, the amplification intrinsic in these methods complicates the methods needed to pair strand information and confidently call short varied regions. Amplification can obscure mismatch pairing information, and strand separation means that MID tags or similar methods must be used to tie strand information together, an issue that is avoided in PacBio sequencing due to the loop adapters (Schmitt et al., 2012). Data analysis in the presented method is more straightforward, and the lack of amplification steps ensures more accurate quantitation of the relative frequency of each substrate sequence in the pool.

The ligase substrate specificity and mismatch discrimination data can give insight into the biochemical differences between ligases and potential insights into the biology of these enzymes (Bilotti et al., 2022). The protocol described here was used to interrogate a panel of ligases capable of efficient end joining. Mismatch tolerance was found to vary in both degree and kind between ligases. Further, the sequence preferences intrinsic to each DNA ligase were interrogated to provide insight into the mechanism of end-joining ligation. The data have also been applied to optimization of molecular biology protocols dependent on ligation, finding particular success in guiding high-complexity Golden Gate Assembly design (Potapov, Ong, Kucera et al., 2018; Pryor et al., 2020; Pryor, Potapov, Bilotti, Pokhrel, & Lohman, 2022). Golden Gate Assembly is dependent on the high-efficiency and high-accuracy ligation of multiple cohesive end fusion sites in a single reaction, and the comprehensive picture of ligation fidelity generated by this protocol allows selection of overhang sets that minimize potential side products.

Critical Parameters and Troubleshooting

Users should be aware of the modifications to the standard PacBio library preparation that this protocol requires, as detailed in Support Protocols 1 and 2, in particular the omission of the typical bead purification steps, which would remove the desired small-insert ligation library. Proper loading is also important. We found that loading at the recommended concentration of 40-150 pM resulted in very low P1 values, and that increasing loading to 450 pM was necessary to take full advantage of the chip capacity with this library design. Note that sequencing runs that produce fewer reads than theoretically possible can still be used in the data processing pipeline. Low loading does not affect the quality of the individual reads; it simply limits the total number of reads and may therefore require additional sequencing runs to obtain enough reads to confidently identify all significant reaction outcomes (see Strategic Planning and Supporting Information on needed read depth).

Finally, note that this protocol is written for the versions of the Pacific Biosciences binding chemistry (Sequel II Binding Kit 2.1), sequencer (Sequel II), and SMRT portal (SMRT Link version 10.2) used in our facility at the time of publication. Sequencing chemistry, associated protocols, and the sequencer itself change frequently, and the user should be aware of any changes needed to adapt this protocol to the latest generation of SMRT sequencing technology. We have used this method on multiple generations of Pacific Biosciences sequencers (Sequel and Sequel II; see Supporting Information, Figure S2) with no issues, although the binding and loading protocols differ according to the manufacturer's recommendations. See www.pacb.com for the latest Pacific Biosciences protocols and reagent kits.

Table 2 provides a list of common problems in the protocols along with possible causes and potential solutions.

Table 2. Troubleshooting Guide

| Problem | Possible cause | Solution |
|---|---|---|
| Inefficient extension of precursor oligo | Polymerase reaction conditions are not optimal | Check that correct concentrations of all components are used; vary amount of polymerase or dNTPs; vary incubation time |
| | Expired reagents | Use fresh reagents |
| | Input DNA incorrectly quantified | Re-quantify precursor oligo using Nanodrop |
| No cutting by restriction enzyme | Error in substrate design | Check location and sequence of restriction enzyme recognition site |
| Low amount of ligated product | Ligation conditions not optimal | Ensure correct amounts of DNA and enzyme in the reaction |
| | Reaction buffers expired | Buffers with ATP can expire after several freeze/thaw cycles. Use fresh ligation reaction buffer. |
| | Chosen ligation conditions do not yield appropriate amount of ligated library | Alter ligation reaction conditions using the described conditions as a control |
| | Ligation library incubated with exonucleases too long | Incubating the library with exonuclease for too long can cause digestion of the entire library. Reduce exonuclease incubation time. |
| Low yield from cleanup steps | Cleanup column overloaded | Check that column capacity is appropriate for the amount of DNA loaded (20-µg column for Basic Protocol 1, 5-µg column for Basic Protocol 2) |
| | Wrong protocol used | Use oligonucleotide cleanup protocol, not DNA cleanup and concentration protocol |
| Low P1 (<30%) after PacBio sequencing run [a] | If P0 is high (>60%), not enough DNA loaded onto sequencer | Ensure that 450 pM is loaded onto the SMRT Cells rather than the recommended 40-150 pM |
| | | Make sure the bead purification step is skipped during the sample preparation |
| | | Check that the ligation library concentration is 1 ng/µl by dsDNA-specific method (Qubit) |
| | | Ensure that correct binding kit is used (Sequel II Binding Kit 2.1 for short inserts) |
| | If P2 is high (>60%), too much DNA loaded onto sequencer | Reduce the amount of DNA loaded onto sequencer |
| | Reagents expired | Check that SMRT Cells, binding kits, and reagent plates are not expired |
| Low number of detected ligation events | Incorrect subread and/or adapter length for splitting top/bottom reads | Provide the correct expected subread and adapter lengths via the “--subread-len” and “--adapter-len” command line options |
| | Incorrect input parameters for detecting overhang and barcode regions | Follow the example (see Basic Protocol 3, step 4) to provide the correct input |

[a] See Understanding Results.

Understanding Results

Basic Protocol 1 starts with ssDNA precursor oligos and ends with a library of extended dsDNA substrates harboring randomized ssDNA overhangs. The success of each enzymatic manipulation of the DNA library can be readily monitored via gel electrophoresis or bioanalysis. In the first step, the ssDNA precursor oligos are converted to dsDNA. Because ssDNA is not detected by either ethidium bromide in an agarose gel or the dsDNA detection kit for the Bioanalyzer, the appearance of a band after the reaction is indicative of dsDNA formation (Fig. 1A, gel lane 1). Subsequent cutting of the dsDNA library with BsaI-HFv2 will produce the ssDNA overhang substrate library that can be distinguished from the uncut library based on its shorter size (Fig. 1A, gel lane 2).

In Basic Protocol 2, DNA ligase is added to the ssDNA overhang library. Unligated or partially ligated products are subsequently removed by exonuclease treatment. As described above, ligation of two ssDNA overhang substrates produces a closed dsDNA product that is twice the length of the unligated substrate library. Thus, the success of the reaction can be monitored by gel electrophoresis or bioanalysis, which will separate the unligated substrate from ligated product (Fig. 1A, gel lane 3). Success of the exonuclease treatment will be evident in the disappearance of unligated substrates (Fig. 1A, gel lane 4).

After sequencing on the PacBio instrument, the first metric of success is the P1 value. PacBio sequencing occurs on a chip containing 8 million sequencing wells for Sequel II. The wells (also called zero-mode waveguides) can each hold a single molecule of DNA and are individually monitored in real time for sequencing. At the conclusion of a sequencing run, the SMRT Link software provides a summary of the contents of these wells, indicating the percentage of wells classified as either P0 (empty well, no sequencing substrate present), P1 (one molecule of DNA present, contains usable sequencing data), or P2 (multiple molecules of DNA present, so data are not usable). Typical sequencing runs for these experiments will have P1 values ranging from 30% to 60%. Following the data processing pipeline detailed in Basic Protocol 3, the user can expect the final read count to be approximately 25%-33% of the original number of P1 reads. In the example data provided, the total number of polymerase reads before filtering was 513,036 and the final number of reads after Basic Protocol 3 was 124,805.
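The retained fraction for the example data can be computed directly from the two read counts given above. The short sketch below does this and adds a hypothetical planning helper (our own, not part of the pipeline) that estimates how many P1 reads are needed to reach a target number of final reads.

```python
import math

polymerase_reads = 513_036  # before filtering (example data in the text)
final_reads = 124_805       # after Basic Protocol 3 (example data in the text)

retained = final_reads / polymerase_reads
print(f"Retained after processing: {retained:.1%}")

def p1_reads_needed(target_final_reads: int, retained_fraction: float) -> int:
    """Hypothetical helper: P1 reads required to reach a target final read count."""
    return math.ceil(target_final_reads / retained_fraction)

# Illustrative target only; choose the target from the required read depth.
print(p1_reads_needed(100_000, 0.25))
```

For the example data, roughly one quarter of the polymerase reads survive all filters, at the lower end of the expected range.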

After processing the sequencing run through the bioinformatics pipeline, several pieces of data are generated. All overhang and barcode sequences are always written in 5′-to-3′ direction in every output file. Example script output files are provided (see Supporting Information). Descriptions of the files generated by executing the script on the subreads.bam file (as described in Basic Protocol 3) follow.

Raw ligase fidelity data (01_fragments.csv)

This file provides the following for each PacBio polymerase read: read name (column “qname”); number of passes for the first strand (column “np1”); sequence of the left barcode region (column “left_bc1”), overhang (column “overhang1”), and the right barcode region (column “right_bc1”) in the first strand; number of passes for the second strand (column “np2”), sequence of the left barcode region (column “left_bc2”), overhang (column “overhang2”), and the right barcode region (column “right_bc2”) in the second strand; and the number of mismatching bases for each overhang pair (column “overhang_mismatch”). All other output tables are built based on this raw ligation fidelity data.
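Because every downstream table derives from this file, it can be convenient to inspect it directly. The sketch below tallies reads by the number of overhang mismatches using Python's csv module. The two rows are synthetic, and we assume the file has a header row with the column names listed above; check this against your own output.

```python
import csv
import io
from collections import Counter

# Two synthetic rows in the column layout described above (not real data).
example = io.StringIO(
    "qname,np1,left_bc1,overhang1,right_bc1,"
    "np2,left_bc2,overhang2,right_bc2,overhang_mismatch\n"
    "read/1,12,ACGTAC,ACCG,TTGCAT,11,ATGCAA,CGGT,GTACGT,0\n"
    "read/2,9,GGATCA,ACCG,CCTAGA,10,TCTAGG,CGGC,TGATCC,1\n"
)

def mismatch_histogram(handle, min_passes=3):
    """Count reads by mismatch number, requiring min_passes on both strands."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if int(row["np1"]) >= min_passes and int(row["np2"]) >= min_passes:
            counts[int(row["overhang_mismatch"])] += 1
    return counts

hist = mismatch_histogram(example)
print(dict(hist))
```

To use it on real output, replace the in-memory example with open("01_fragments.csv", newline="").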

Overhangs frequency (02_overhangs.csv)

This file provides the frequency of each observed overhang pair. The four-base overhang sequences observed in the top and bottom strands for each read and the number of times that particular combination of overhangs appeared in the sequencing data set are extracted into the overhangs data file. The identity of the overhangs in the pair is provided in columns “O1” and “O2” and the number of times this overhang pair was observed is provided in column “Count”. For example, the line in this file “ACCG,CGGT,3909” indicates that this Watson-Crick pair was detected 3,909 times in the sequencing run. Importantly, the overhang pair “ACCG,CGGT” can be considered in two equivalent ways:

Top strand 5′-ACCG-3′
Bottom strand 3′-TGGC-5′

or

Top strand 5′-CGGT-3′
Bottom strand 3′-GCCA-5′

The definition of top and bottom is arbitrary and both are summed in the corresponding count entry. All overhang and barcode sequences are always written in 5′-to-3′ direction in every output file.
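The equivalence of the two orientations follows from reverse complementation, as the sketch below illustrates. The helper names are ours and are not taken from the published scripts; the order-independent key is one possible way of summing both orientations into a single count.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_watson_crick(o1: str, o2: str) -> bool:
    """True if two overhangs (both written 5'->3') are fully complementary."""
    return o2 == revcomp(o1)

def pair_key(o1: str, o2: str) -> tuple:
    """Order-independent key so (O1, O2) and (O2, O1) are counted together."""
    return tuple(sorted((o1, o2)))

print(revcomp("ACCG"))
print(is_watson_crick("ACCG", "CGGT"))
print(pair_key("CGGT", "ACCG") == pair_key("ACCG", "CGGT"))
```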

Barcodes (03_barcodes.csv, 04_barcodes-counts.csv, 05_barcodes-percentages.csv)

The first file (03_barcodes.csv) provides the sequence and frequency of every detected barcode. This information is used to generate two additional files (04_barcodes-counts.csv and 05_barcodes-percentages.csv) that provide the frequency of the four bases (A, C, G, T) at every barcode position (N1, N2, N3, N4, N5, N6). The column "NN" provides the frequency of each base across all barcode positions combined.
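The per-position base frequencies can be recomputed from a table of barcode counts, as in the sketch below. This is our own minimal version, assuming barcodes are supplied as a mapping from sequence to count; the toy input is synthetic, and the published scripts may organize the calculation differently.

```python
from collections import Counter

def base_percentages(barcode_counts):
    """Map position labels (N1..Nk, plus pooled "NN") to {base: percent}.

    barcode_counts: dict mapping each barcode sequence to its observation count.
    """
    length = len(next(iter(barcode_counts)))
    per_pos = [Counter() for _ in range(length)]
    for barcode, n in barcode_counts.items():
        for i, base in enumerate(barcode):
            per_pos[i][base] += n
    pooled = sum(per_pos, Counter())  # all positions combined

    def to_pct(counter):
        total = sum(counter.values())
        return {b: 100 * counter[b] / total for b in "ACGT"}

    table = {f"N{i + 1}": to_pct(c) for i, c in enumerate(per_pos)}
    table["NN"] = to_pct(pooled)
    return table

toy = {"ACGTAC": 3, "TTGCAT": 1}  # synthetic counts
table = base_percentages(toy)
print(table["N1"])
print(table["NN"])
```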

As described previously, an empirically derived ratio of phosphoramidites was used during oligonucleotide synthesis to achieve maximally equal representation of all possible overhang sequences in the multiplex substrate pool. However, as only ligated overhang sequences are represented in our sequencing data, the overhang region is a reflection of ligation bias as well as synthesis bias. Therefore, the sequencing substrate contains the additional degenerate region distal from the site of ligation to serve as an internal control for randomization during oligonucleotide synthesis. The table of barcode base frequencies details the distribution of nucleobases in this internal synthesis control region.

If synthesis were completely unbiased, each nucleobase would be present at exactly 25% of the total base frequency. We note that we do not observe any positional bias within the barcode region, and the distribution of nucleobases at each of the six barcode positions is consistent within an oligo synthesis. Typically, we observe some modest deviation in this distribution, and consider a successful synthesis to have no more than 5% deviation from 25% for any individual nucleobase. It is possible to calculate the maximum relative over- and under-representation of an overhang sequence within the multiplex pool by calculating the predicted fraction of the homopolymer overhang for the most- and least-frequent nucleobases (e.g., GGGG vs. CCCC). Finally, we note that, while these data can be used to qualitatively evaluate the successful randomization of an oligo synthesis, we do not use the data to normalize for the predicted presence of each overhang. Any normalization process would involve making many assumptions about the dynamically annealing pool of substrates and the mechanism of end-joining ligation, which is not completely understood at this time.
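As a worked example of the homopolymer calculation, the sketch below assumes that bases are incorporated independently at each position (consistent with the absence of positional bias noted above) and reads the 5% criterion as ±5 percentage points, which is our interpretation. The frequencies of 30% and 20% are hypothetical limiting values, not measured data.

```python
def homopolymer_fraction(base_fraction: float, length: int = 4) -> float:
    """Predicted pool fraction of a homopolymer overhang, assuming independence."""
    return base_fraction ** length

uniform = homopolymer_fraction(0.25)  # 1/256 for an unbiased synthesis
most = homopolymer_fraction(0.30)     # hypothetical most-frequent base
least = homopolymer_fraction(0.20)    # hypothetical least-frequent base

print(f"Over-representation:  {most / uniform:.2f}-fold")
print(f"Under-representation: {least / uniform:.2f}-fold")
print(f"Most vs. least:       {most / least:.2f}-fold")
```

Under these assumptions, a synthesis at the edge of the acceptance criterion would still show about a fivefold difference between the most and least abundant homopolymer overhangs.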

Matrix representation of ligase fidelity data (06_matrix.csv)

This file displays the frequency of every possible ligation product in matrix form. These data are also represented in Figure 2A. Every combination of four-base overhangs is listed on both axes, creating a 256 × 256 matrix. Each square in the matrix represents a ligation event of the corresponding overhang combination. The color of the square denotes the ligation frequency (log-scaled) of that ligation event. The plot is organized so that Watson-Crick ligation events are located on the main diagonal of the matrix. As expected, these ligation events are found at the highest frequencies and the main diagonal is correspondingly the darkest color. Off-diagonal squares represent mismatch ligation events and range from very common (dark gray) to not observed (white).

Ligase fidelity data per overhang (07_fidelity.csv)

This file organizes ligase fidelity data for each of the 256 possible four-base overhang sequences, detailing the total number of reads for each overhang sequence (“Total”), how many of the reads were paired with the correct Watson-Crick partner (“Correct”), and how many reads contained a mismatch pairing (“Mismatch”). The table also includes the sequence identity and read counts for the five most common mismatch partners for each overhang.

These data are also represented by a linear transform plot where each bar represents one of the possible 256 four-base overhangs (Fig. 2B). The height of each bar denotes the number of ligation events of that overhang, and thus the relative ligation bias of each overhang can be observed. Each bar is further broken down into different colors representing Watson-Crick (orange) and mismatch (grey) ligation events. The proportion of these two colors represents the ligation fidelity of each overhang.
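Per-overhang fidelity can be expressed as the fraction of correct events among all events for that overhang. The sketch below computes this from the “Total” and “Correct” columns and ranks overhangs from lowest to highest fidelity. The rows are synthetic, and the “Overhang” column name is an assumption; check the header of your own 07_fidelity.csv.

```python
def fidelity(correct: int, total: int) -> float:
    """Fraction of ligation events paired with the Watson-Crick partner."""
    return correct / total if total else float("nan")

# Synthetic rows; "Overhang" is an assumed column name.
rows = [
    {"Overhang": "ACCG", "Total": 1000, "Correct": 950, "Mismatch": 50},
    {"Overhang": "GGGG", "Total": 400, "Correct": 300, "Mismatch": 100},
]

# Lowest-fidelity overhangs first.
ranked = sorted(rows, key=lambda r: fidelity(r["Correct"], r["Total"]))
for r in ranked:
    print(r["Overhang"], r["Total"], f'{fidelity(r["Correct"], r["Total"]):.3f}')
```

Reading "Total" alongside the fidelity value keeps bias (how often an overhang is ligated) separate from fidelity (how often it is ligated correctly).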

Frequency of specific base pair mismatches by position (08_mismatch-e.csv, 09_mismatch-m.csv)

The final piece of information extracted from the sequencing data compiles all observed ligation events to broadly report the location and identity of mismatches. Mismatches can be either at the edge positions of the overhang (N1 or N4, Fig. 3A) or in the middle of the overhang (N2 or N3, Fig. 3B). The percentage of reads containing each particular mismatch pairing is reported. In the example provided, the mismatch pairing that is most tolerated at both the edge and middle positions is G:T/T:G.
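The classification of mismatches into edge and middle positions can be sketched as follows. Both overhangs are written 5′-to-3′, so the first base of one overhang faces the last base of the other. The function and the position numbering relative to the first overhang are our assumptions for illustration and may not match the conventions used internally by the scripts.

```python
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def mismatches(o1: str, o2: str):
    """Yield (position, class, base pair) for each non-Watson-Crick pair.

    Both overhangs are written 5'->3', so base i of o1 faces base
    len(o2) - 1 - i of o2. Positions are numbered along o1.
    """
    for i, b1 in enumerate(o1):
        b2 = o2[len(o2) - 1 - i]
        if COMP[b1] != b2:
            cls = "edge" if i in (0, len(o1) - 1) else "middle"
            yield f"N{i + 1}", cls, f"{b1}:{b2}"

print(list(mismatches("ACCG", "CGGT")))  # fully complementary pair
print(list(mismatches("ACCG", "CGGC")))  # one edge mismatch
print(list(mismatches("AGCG", "CGTT")))  # one middle G:T mismatch
```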

Time Considerations

Each enzymatic step in Basic Protocol 1 (extension and digestion) takes ∼1.5 hr to set up and incubate. Each enzymatic step is followed by a cleanup step (1 hr) and analysis on a gel or Bioanalyzer (1 hr), so the total time for this protocol is 7 hr. The protocol timeline is flexible and can be paused at many points. Basic Protocol 1 should generate enough substrate to allow for multiple runs of Basic Protocol 2.

Basic Protocol 2 also consists of two enzymatic reactions (ligation and exonuclease digestion) that take ∼1.5 hr to set up and incubate. Each of these is also followed by cleanup and gel/Bioanalyzer steps, so the total time is again 7 hr. This protocol is also flexible and provides enough ligated substrates for multiple PacBio sequencing runs.

After using the SMRT Link software to plan for the sequencing reaction, preparation of the ligation libraries for sequencing in Support Protocol 1 consists of annealing the sequencing primer and binding the polymerase. In total, this takes 3 hr.

Setup of the PacBio instrument in Support Protocol 2 takes about 1 hr. We use 10-hr sequencing runs to sequence each sample. Thus, if all four SMRT Cells are used, sequencing will take 40 hr.

The overall time necessary to run the computational processing for PacBio Sequel II sequencing data in Basic Protocol 3 is ∼2.5 hr on a 24 CPU core Linux workstation. Splitting top and bottom strands to separate .bam files takes ∼30 min. Building CCS reads using the PacBio command line tool takes ∼1 hr for each strand. The slowest step (building CCS reads) can be significantly accelerated by using a multicore workstation or splitting the data into smaller chunks and processing in parallel in a computer cluster, if available.

Acknowledgments

We are grateful to Tasha José for assistance with figure design and production. We thank Kelly Zatopek and Andrew Sikkema for critical feedback and careful reading of the manuscript. New England Biolabs funded the research described, paid salaries for V.P., K.B., and G.J.S.L., and provided funding for the open access charge. The research reported in this publication was also supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number T32GM008349 to A.T.D.

    Author Contributions

    Alexander T. Duckworth: Methodology, writing original draft, writing review and editing; Katharina Bilotti: Investigation, methodology, writing original draft, writing review and editing; Vladimir Potapov: Data curation, formal analysis, methodology, software, validation, visualization, writing original draft, writing review and editing; Greg Lohman: Conceptualization, investigation, methodology, supervision, writing original draft, writing review and editing.

    Conflict of Interest

    K.B., V.P., and G.J.S.L. are employees of New England Biolabs, a manufacturer and vendor of molecular biology reagents including DNA ligases. This affiliation does not affect the authors’ impartiality, adherence to journal standards and policies, or availability of data.

    Data Availability Statement

    Sequencing data pertaining to this study have been deposited in the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) under accession number PRJNA894239. Custom software tools are available in the GitHub repository at https://github.com/potapovneb/CP-LigaseFidelity. Other data that support the findings of this study are available in the supplementary files (see Supporting Information).