Welcome to the NEB TV webinar series. Today we'll be hearing from Andrew Barry, who is one of our product marketing managers. Hey Andrew.
Hey, Deana, how are you doing?
Could you tell us a little bit about what you'll be talking about today?
Absolutely. So, in today's webinar we'll explore some of the challenges that translational researchers face as they try to apply target enrichment and next generation sequencing toward the understanding of human disease, and then we're going to get into NEBNext Direct, which is New England Biolabs' new solution for target enrichment.
Great. Let's get started.
Welcome everyone, and thanks for joining us today.
I think any story around target enrichment, and the opportunities for target enrichment, really begins with translational research. If we think about this on the most basic level, translational research applies findings from basic science to enhance human health, and well being. In the medical research context, it aims to translate these findings from fundamental research into medical practice and meaningful health outcomes.
There are many ways that we can break down the different levels of activity that fall under the realm of translational research. If we think about it as a tiered approach, we can think of fundamental resource-building projects like the Human Genome Project, 1000 Genomes, and TCGA, The Cancer Genome Atlas, as building the fundamental data sets and information that we can build upon.
Moving forward to clinical and molecular phenotyping, and understanding how these things apply in larger patient populations, understanding specifics around genes and any modifications to those that are going to impact those phenotypes, and then really breaking this down further into characterization of molecular mechanisms, all the way towards the application of this in patient care, through diagnostics, and choices, and therapies.
If we think about the specifics to genomics, really we think about casting our net quite broadly at the beginning in the discovery phase, sort of this intermediate translational phase, all the way into clinical practice.
Several different types of organizations are involved in these types of processes, and I think it's important that we realize that all of these contribute uniquely to this application of basic research towards clinical practice: everything from academic research institutions, through hospitals and pharmaceutical companies, all the way to industry. I think it's important to recognize also that each of these types of organizations engages in different activities that present unique constraints, including types of funding, the way that different projects are managed, all the way through regulatory aspects. So even though these organizations are using the same technologies, they're going to apply them in many different ways, and have very different constraints as they think about the best way to apply them.
If we can think about this towards the applications and population screening, from the discovery standpoint, towards some of the translational activities, including things like clinical trials, all the way through diagnostic tests that are used to impact patient treatment options, we can think about the whole range, in terms of the scope of different types of tools that actually get applied for this. So, at the very high level, things like comprehensive profiling, whole genome sequencing, whole transcriptome sequencing, and whole exome sequencing, all the way through more comprehensive target enrichment panels, down towards very focused panels, which are only going to be looking at clinically actionable type mutations and variants that are being used to really impact the way the decisions are being made.
When we think about the advances that have been made in next generation sequencing, here we're looking at the NIH data tracking the cost per genome each year. As you can see, over the last 15 or 20 years, the cost has dramatically decreased to the point where we're approaching, and potentially surpassing, the $1000 per genome mark. We always look at this relative to Moore's law, which comes from the semiconductor industry and is really a function of how densely transistors can be packed onto integrated circuits. We see that genomics has really outpaced this, and the cost of whole genome sequencing has decreased so much that we might actually ask, "Why enrich at all instead of just sequencing the whole genome, and get more comprehensive information?"
I think the answer to this really lies in the fact that we have to go deeper for certain translational activities. We need to sequence to high depths of coverage to find the variants that we're interested in. If we look at the chart on the bottom right, you can see the differences between these different types of applications, with whole genome sequencing covering three gigabases at roughly 30X, including 23,000 genes plus non-coding elements, and you can see the costs that are associated with that.
These applications are mainly used for variant discoveries, as well as population genomics, to really understand the frequency of some of these variants as they exist in normal populations, but also through specific subsets of patients that might be included in a patient cohort.
Whole exome sequencing was developed to look at just the protein coding regions of the genome. It can be done for a reasonable cost and provides a nice alternative to whole genome sequencing, but typical depths of coverage for whole exome sequencing still only get us to about 150X, and going beyond that becomes relatively financially intractable.
These have been applied in many cases for translational research, variant discovery, and theranostic applications, and we really look at these as ways to look at the things that we know are going to be most impactful, while also getting other contextual information.
Now, when we move towards targeted sequencing, with panels ranging from just a few kilobases, all the way up to about 10 megabases, you see that we're really looking for technology that's going to give us high depth of coverage to be able to find low frequency variants, and these tend to range from 10's to 100's of genes, with costs that are relatively lower than what we see for whole exome sequencing. So, it's not really just about the amount of data that we produce, but really the density of that data towards our practical application, and as things move closer to clinical applications, we see the need for more focused panels that are only going to provide the information that is necessary to help make the decisions that are being made based off of the tests.
So, if we are going to go ahead and enrich, we want to make sure we're doing so with high efficiency.
There are certain key metrics that I think are important to cover as we talk about some of the different approaches for target enrichment. The most important, primarily, is the sensitivity. When we think about this, we use the analogy of finding a needle in a haystack: we're really looking for the lowest frequency variant that we can find in a given data set, and this challenge is compounded by factors including false positive variants, low allele fraction variants, and stromal contamination.
The other metrics that we use for target enrichment are really a function of the efficiency of the enrichment itself. So, when we think about specificity, again with the needle in the haystack: how efficiently can I separate my targets from the rest of the genome? This gets confounded by any off-target reads that might be present in the data itself.
And, finally, uniformity. I think it's important, when we think about the efficiency of target enrichment, to really think about the efficiency of the sequencing. So, when we ask ourselves the question, "How many targets can we find, and can we find them with equivalent efficiency in a data set?", it's important to understand the factors that are going to limit our ability to do so efficiently, including PCR duplicates, PCR bias, and the ability to balance the specific baits present in a panel.
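To make the specificity and duplication metrics concrete, here's a minimal sketch in Python; the read and molecule counts are invented for illustration, not measured values from any panel:

```python
def specificity(on_target_reads: int, aligned_reads: int) -> float:
    """Fraction of aligned reads that map to the targeted regions."""
    return on_target_reads / aligned_reads


def duplicate_rate(total_reads: int, unique_molecules: int) -> float:
    """Fraction of reads that are PCR copies of an already-seen molecule."""
    return 1 - unique_molecules / total_reads


# Invented counts, purely for illustration:
print(specificity(9_200_000, 10_000_000))    # 0.92, i.e. 92% on target
print(duplicate_rate(10_000_000, 7_000_000)) # ~0.3, i.e. 30% duplication
```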
With clinical applications, there's a clear trend towards generating more types of data, and more densely packed data, from relatively limited sample inputs. A great example of this is formalin fixed paraffin embedded tissue. This is the most common form of clinical sample, and it's popular because it not only fixes the cells and preserves the nucleic acid, but does so in a way that allows samples to be stored for relatively long periods of time at room temperature.
This benefit doesn't come without its drawbacks, however, as we see lots of protein-DNA and DNA-DNA crosslinking happening from that fixation process, and the nucleic acids that we're able to extract from these tissues tend to have backbone breaks, including double stranded breaks, nicks, and gaps.
All of these are going to impact the amount and quality of the material that we have available for any DNA prep, including target enrichment.
In addition, we tend to see lots of errors that are introduced as a function of the fixation process and the chemicals that are used to extract DNA from these FFPE blocks. We see things like hydrolysis and base deamination, as well as oxidation events. All of these accumulate errors in our sequencing reads that appear to us as false positives, making it more challenging to sensitively call nucleic acid variants.
To make matters worse, we're often attempting to detect variants that, for many reasons, are present at extremely low variant allele fractions, and when we think about this, it's really, "How many reads possess the variant that we're interested in?" And, this is a function of a lot of the upfront actual characteristics of the cells that are going into the prep.
On the left you can see a pathologist-annotated FFPE section, where the nuclei shown to be malignant are circled, and potentially are going to be extracted for further analysis. It's important to note that tumors themselves are a heterogeneous mixture of both healthy stromal cells, which typically contain what we would call a normal genome, and malignant tumor cells that will often have more complex genomes because they're heavily mutated.
Also, it's important to understand that the variants themselves are present at a wide range of frequency. We think about things as driver mutations, versus passenger mutations, and when we're trying to detect driver mutations relatively early, for things like prognostic applications, they tend to be present at very low frequencies, which makes the challenge even harder.
Further to this, the variants in the complex genomes of the cancerous cells are often present at high copy number, which can make the challenge worse. In the image on the left, what we see is a focal amplification event, where instead of having just one copy of the region of interest, we actually have two copies. So, think about the impact of this on the variant allele fraction: if the copy number were one and the variant was present in 50% of the original copies, the variant allele frequency would be 0.5. However, when we have an increased copy number and the variant is still present on only the original copy, the variant allele frequency would only be 25%. So, the higher the copy number, which tends to be a characteristic of these mutated samples, the lower the allele fraction, and the greater the challenge.
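The copy number arithmetic above can be sketched as a small calculation. The function and its stromal-dilution term are an illustrative assumption, not part of any NEB pipeline: tumor cells carry `total_copies` of the locus, while each stromal cell contributes two wild-type copies.

```python
def variant_allele_fraction(variant_copies: float, total_copies: float,
                            tumor_fraction: float = 1.0) -> float:
    """VAF = variant-bearing copies / total copies sampled.

    Tumor cells carry `total_copies` of the locus; stromal (normal)
    cells each contribute two wild-type copies.
    """
    variant = variant_copies * tumor_fraction
    total = total_copies * tumor_fraction + 2 * (1 - tumor_fraction)
    return variant / total


print(variant_allele_fraction(1, 2))       # 0.5  (one of two copies mutated)
print(variant_allele_fraction(1, 4))       # 0.25 (amplified wild-type allele)
print(variant_allele_fraction(1, 4, 0.5))  # ~0.17 (plus 50% stromal cells)
```

Note how amplification of the wild-type allele and stromal contamination compound: the same single mutated copy is diluted twice.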
One of the primary challenges that we've seen in many target enrichment strategies is the fact that we often need to use PCR as a means to enrich, and this can vary in terms of being the primary means of enrichment, versus actually generating enough material to go into a hybridization, or onto a sequencer. And, the issue is sort of two fold. When we create copies of our target molecules, it becomes difficult to PCR amplify these regions equivalently, to the point where the number of variants present in the reads may not be indicative of the true number of variants that were present in the original molecules.
We also have the added problem that, as we enrich for these using PCR, we tend to introduce some false positive SNPs, and these are perpetuated as the amplification proceeds.
When we think about the impact of this on variant calling, if you look at the reads here, it's impossible to know which of the variants associated with these reads were actually present in the original sample; that is, which of these are true positive SNPs, and which are false positive SNPs.
One of the main developments in target enrichment that's taken place over the last few years is the concept of adding a unique molecular index, or UMI, also referred to as a unique identifier. These are short barcodes that are affixed to each molecule independently. They're present as a random mixture of sequences that provide a single molecule index, so each molecule in solution will be associated with one, and only one, of these indexes. If we think about the impact of this as we carry our samples through PCR amplification, we can later use these indexes to group the reads, and the variants associated with them, by original molecule.
So, as you see here, between the yellow and the teal, we're able to categorize these molecules as they were present in the original mixture. This allows us to identify the true PCR duplicates that are present, and really understand the true coverage of each of our target regions, but it also allows us to do some interesting things with the variants that are detected by variant calling algorithms. Taking this example, we can see that our red true positive variant is only present in the reads that are associated with the yellow UMI. This allows us to understand that it was present in the original copy of that yellow molecule, and not in the teal copy. Further, if we look at the reads associated with just the yellow UMI, we can filter based on that, and understand that while every read containing the yellow UMI contains the true positive variant, only a subset of these have the false positive variant.
Filtering based on this allows us to remove some of these false positives informatically, and to identify the true positives more accurately. This also allows for more accurate assessment of variant allele frequencies, which is going to help with our understanding of how these variants actually track with patient progression, as well as across populations.
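A minimal sketch of the UMI-based consensus filtering just described, assuming a simple per-molecule agreement threshold; the reads, UMI sequences, and 90% cutoff are invented for illustration:

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    umi: str
    variant_present: bool


def consensus_variants(reads, min_fraction=0.9):
    """Group reads by UMI; call the variant for a molecule only if it
    appears in at least min_fraction of that molecule's reads."""
    groups = defaultdict(list)
    for r in reads:
        groups[r.umi].append(r.variant_present)
    return {umi: sum(flags) / len(flags) >= min_fraction
            for umi, flags in groups.items()}


# "Yellow" molecule: variant in every copy -> true positive, retained.
# "Teal" molecule: variant in 1 of 4 copies -> PCR artifact, filtered.
reads = ([Read("AAAC", True)] * 4 +
         [Read("GGTT", True)] + [Read("GGTT", False)] * 3)
print(consensus_variants(reads))  # {'AAAC': True, 'GGTT': False}
```

A real pipeline would also build a base-by-base consensus sequence per molecule, but the grouping logic is the same.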
To summarize the challenges thus far: we talked about difficult samples, which include FFPE as well as, more recently, applications including cell-free DNA. We talked about tumor heterogeneity and the impact that has on variant allele frequencies, as well as PCR duplication, false positive variants, and estimating allele frequencies. Sensitivity is pretty clear: we want to detect the lowest frequency variant possible, and we're able to do that by improving the signal to noise ratio. As far as workflows are concerned for clinical applications, we always want faster turnaround, and we want them to be amenable to lab automation, which will allow us to scale across large sample numbers.
When we think about the sequencer efficiency, really this comes back to those metrics, including the specificity, as well as the coverage uniformity across targets.
Finally, we'll talk a little bit about content scalability; really, this comes down to having to support multiple workflows depending on the size of the panel that somebody might be interested in utilizing. So, if we think about available technologies to date, hybridization-based enrichment is one of the more common. This is really a function of making an upfront library, hybridizing to biotinylated baits, capturing those on streptavidin beads, and then having to elute those off and pass them on to sequencing.
It was developed originally for whole exome sequencing, and works really well for whole exome sequencing, as well as larger panels. One of the challenges to hybridization-based enrichment is that it often employs array-based synthesis, which makes balancing across the different targets present in the pool very difficult, since they're all synthesized together. This presents a challenge in efficiently amplifying or selecting each of those regions uniformly, and has an impact on the amount of sequence coverage necessary.
Because the library preparation precedes hybridization, there's PCR that takes place there, which can make pulling down off-target sequence more common with this method, as well as the PCR that takes place after the hybridization in order to generate sufficient material for sequencing. The specificity for these enrichment technologies is typically driven by the hybridization itself, and this typically requires a relatively long hybridization, from 18 to 72 hours. What we've seen in our hands is that the specificity for hybridization-based enrichment panels tends to decrease as the panels get smaller. So, while the technology is very good for more comprehensive, and whole exome, sequencing, the specificity makes it challenging for more focused panels.
Multiplex PCR is another technology that's been around for a very long time. It's a PCR-based enrichment where the primary method of enrichment is the PCR itself. This has manifested in a couple of different forms: either two rounds of PCR, one to amplify the targets and a second using tailed primers to add the specific adapters necessary for next generation sequencing, or a single PCR followed by ligating on adapters.
Uniformity is probably the biggest challenge with multiplex PCR, because you need to design a specific pair of primers for a given target. You're limited by the melting temperature of those oligos, which makes it very difficult to scale those panels beyond a certain point. We have seen the upper limit of this around 50 kilobases, and interactions between primers as we scale beyond that become very challenging, especially when we're talking about amplifying and generating even coverage across our targets. It also makes it difficult to tile over longer exons, and we tend to see high levels of PCR duplication since the enrichment itself is based on PCR.
Through the years, a number of alternative strategies have been developed to bridge the gap between hybridization-based panels and multiplex PCR: everything from selector probes to molecular inversion probes and multiplex extension ligation. These technologies have really fit in that range between the other two. Again, most of these use PCR as the primary enrichment, and they tend to have more complex workflows that require samples to be split across multiple wells, and oftentimes require highly customized consumables that can make running large sample numbers difficult to achieve.
So, when we look across the available tools today, we do see a gap in the performance necessary to really make target enrichment tractable for some of these challenges that we've discussed, and what we see is a need for continued innovation and development in this area.
So, moving on, we'll discuss NEBNext Direct, which is a novel target enrichment strategy from New England BioLabs. This technology was developed in partnership with a company called Directed Genomics, and it's the only hybridization-based enrichment that directly enriches genomic DNA and does not require an upfront library preparation. It is a hybridization-based enrichment, but it targets both DNA strands in parallel. Because there is no library prep, it allows us to capture both single and double stranded DNA that might be present in our clinical samples, and the fact that the enrichment precedes any amplification allows this to be used across a relatively wide range of target content.
The workflow itself, begins with fragmented DNA. We tend to fragment this to about 200 base pairs. We basically do this to achieve coverage across the middle of our targets, and make sure that we don't have too much overlap between those. Again, it uses the native DNA, and can use a range between 10 nanograms, to one microgram of input DNA. We do include an optional FFPE DNA phosphorylation step if the sample of choice is formalin fixed paraffin embedded tissue.
The very first step in the workflow, is the hybridization of the oligonucleotide baits. These are biotinylated, which allows us to select these out of solution using a streptavidin bead. The baits themselves, as you'll see, target both strands of DNA, and are relatively short at an average size of about 60 nucleotides in length. Again, we mention that we can target single stranded DNA, because of the lack of a need for an upfront library.
You'll notice in this image that the baits themselves are specific to the three prime end of the target, and then we do an extension step that allows us to capture the full region. The hybridization is relatively short, only 90 minutes, and the next step that we employ is the capture using streptavidin beads. This is a magnetic bead based isolation, and we're doing washes here to remove the off-target material.
I think it's important to note that the specificity itself is not solely driven by the hybridization; as we'll see in the next step, we can actually employ enzymatic removal of off-target sequence. What we do here is use an enzyme to remove any off-target sequence, and you'll notice that the three prime end of each target has been truncated to the point where the bait is bound. This also chews up any unbound DNA that's present in the sample. So, not only are we relying on the hybridization to drive specificity, but we're using enzymes to remove any off-target sequence.
You also notice that by removing the off-target sequence upstream of the three prime end, we are defining the three prime end of the target, which allows us to get higher specificity.
We then begin the process of inverting the captured material into a full length library. So, the first thing we do is ligate our three prime loop adaptor. You'll notice that the complex here remains bound to streptavidin beads throughout the protocol, so it is a single-tube protocol, which helps to eliminate any sample loss that we might incur.
As we've mentioned, we then extend the bait across the end of the target. What this does is create double-stranded DNA across the full length, and this is going to provide us with a random five prime end of each read, based on fragmentation.
The next step is to ligate our five prime UMI adapter. This is an adapter that has a 12 base pair unique molecular index sitting in the i5 position. So, we use the Illumina dual barcode scheme in order to read the unique molecule ID, and the UMI is actually read in the index 2 read.
Because we're using a 12 base pair UMI, we have a theoretical upper limit of roughly 16.7 million unique indexes that get incorporated. So this can be used even at relatively high depths of sequencing, in order to identify the PCR duplicates and to resolve any false positive artifacts.
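The 16.7 million figure is simply four bases to the twelfth power. The collision estimate below is a rough birthday-problem sketch under the pessimistic assumption that all molecules share identical fragment ends; in practice the variable 5' end adds further diversity, so real collision risk is lower:

```python
import math

umi_length = 12
umi_space = 4 ** umi_length  # four possible bases per position
print(umi_space)             # 16777216, roughly 16.7 million

# Birthday-problem approximation: probability that any two of n molecules
# draw the same UMI (n = 1000 is an arbitrary number for illustration).
n = 1000
p_collision = 1 - math.exp(-n * (n - 1) / (2 * umi_space))
print(round(p_collision, 3))  # about a 3% chance at this scale
```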
Once we have our five prime adapter, we actually cleave the loop adapter on the three prime end, and perform a PCR amplification step. This PCR is going to add the sample barcode, which sits in the i7 position, and this allows for pooling of samples prior to sequencing.
I think it's important to note that the bait strand is not carried through the PCR and is not present in the final library. So, the only fragments that we're actually sequencing are those that were present in the original samples, or copies of those that were marked with UMIs prior to any amplification.
The result of the prep is a fairly unique read profile. As you can see here, we have, for each strand, a fixed three prime end and a variable five prime end for each of the reads covering a target, and we actually use the five prime end of each read, along with the UMI itself, to identify PCR duplicates and perform false positive variant filtering.
What you see on the left here is an actual IGV view of exon 3 of the gene KIT. This is a great example of a gene where we have a longer exon, and we've had to employ tiling of baits across that region. One other thing you'll notice is the exceptional evenness of coverage across that target. This is achieved because we employ individual oligonucleotide synthesis, and can do really fine-tuned balancing of those probes for any given region. What you're seeing here in the green, red, and purple numbers one, two, and three are the three pairs of baits, overlapping one another, that are used for the targets and generate this coverage. And, as you'll notice, the coverage profile across the exon really looks more like a plateau than a mountain, because we don't have the normal distribution that you would typically expect if you were to create a library upfront, hybridize that to baits, and pull those down.
So, we do get very precise enrichment of the targets of interest, and again, we're able to use those UMIs and variable insert sizes to do PCR duplicate filtering and/or correction.
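The duplicate identification just described can be sketched as keying reads on their (UMI, 5' start position) pair, since the 3' end is fixed by the bait and carries no molecule-distinguishing information. The read records below are invented for illustration:

```python
def mark_unique(reads):
    """Keep one read per (UMI, 5' start) pair; the rest are treated as
    PCR duplicates of the same original molecule."""
    seen, unique = set(), []
    for read in reads:
        key = (read["umi"], read["start"])
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


reads = [
    {"umi": "ACGT", "start": 101},
    {"umi": "ACGT", "start": 101},  # PCR duplicate of the read above
    {"umi": "ACGT", "start": 117},  # same UMI but a different molecule
    {"umi": "TTGA", "start": 101},  # different molecule at the same locus
]
print(len(mark_unique(reads)))  # 3 unique molecules
```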
The first product that we launched with this technology was a cancer hotspot panel, and some of the data that we'll be reviewing is specific to this. It's a 37 kilobase panel that provides highly specific enrichment of 190 targets associated with cancer, across 50 genes. We target the full exons, with no gaps in coverage, so we get uniform coverage with no drop-outs, and within that 37 kilobases of target sequence, because it's the complete exon, we get over 18,000 COSMIC features.
The target size for this is designed specifically for Illumina 2 x 75 paired-end sequencing, and all the data that you'll see is based off of that configuration.
One of the key metrics that we discussed is specificity, and we look at that across some of the more challenging samples. So, in this figure, what we see is the percent of aligned reads that map to the targets using 100 nanograms of input DNA. We see that across fresh frozen DNA, cell free DNA, as well as formalin fixed paraffin embedded DNA.
Because we have a relatively short bait, we're able to capture some of the more degraded fragments that are present in these difficult sample types, which gives us a higher chance of actually converting them into a final library.
When we look at the specificity relative to traditional hybridization approaches, we wanted to test this. So, one of the things we did was actually design a custom hybridization panel specific to the regions that are included in the cancer hotspot targets. We looked at two commercially available solutions and screened these using a control DNA. Looking at the percentage of inserts mapping to targets, you can see that the NEBNext Direct cancer hotspot panel is able to achieve 90% of the inserts mapping to targets, whereas for some of the custom panels that we designed, we were having a difficult time getting above 40%.
One of the key aspects to the technology is the ability to scale well between very small focused panels and relatively larger panels. So, here we see four examples, ranging from a very focused two gene, 17 kilobase panel, all the way to a relatively large 360 kilobase panel that includes 133 genes plus some breakpoints and copy number variant controls. What we see is that we're able to maintain high specificity in terms of the percentage of reads on target. For the highly focused panel, the percentage of reads on target is 96%, whereas even with the larger panel, that number drops just slightly, down to 92% of reads on target.
When we think about uniformity, we tend to look at this as a function of variance from the mean target coverage. This can be a bit of a confusing way to look at it, but really what we're asking is: given the mean target coverage for the panel, what percentage of target bases are sequenced to within a given percentage of that mean depth, and here we're looking at this at 15%, 33%, and 25%. What we see is that 100% of the target bases are covered to within 25% of the mean target coverage, and at 50% of the mean target coverage, that number is at 90%. So, to put this in perspective, with a panel where the mean target coverage was 100X, that would mean that 100% of our target bases were covered at 75X or greater, and 90% of those target bases would be covered at 50X or greater.
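As a sketch of the uniformity metric being described here, assuming we have per-base depths across the panel's targets (the depth values below are invented and chosen to have a mean of 100X):

```python
def pct_bases_within(depths, fraction):
    """Percent of target bases whose depth is within +/- fraction
    of the mean target coverage."""
    mean = sum(depths) / len(depths)
    lo, hi = mean * (1 - fraction), mean * (1 + fraction)
    return 100 * sum(lo <= d <= hi for d in depths) / len(depths)


# Invented per-base depths, mean 100X, for illustration only:
depths = [80, 90, 95, 100, 100, 105, 110, 120]
print(pct_bases_within(depths, 0.25))  # 100.0 (all bases between 75X and 125X)
print(pct_bases_within(depths, 0.10))  # 75.0  (6 of 8 bases between 90X and 110X)
```

A tighter band with a high percentage of bases inside it means less over-sequencing is needed to bring the worst targets up to a callable depth.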
I think an easier way to conceptualize this is to look at figures like this. What we see in these histograms is the normalized mean coverage plotted for each target. On the X axis you see the targets themselves, and on the Y axis you see the normalized mean coverage, and really what we're looking for is a straight line that intersects the Y axis at one. As you can see with the cancer hotspot panel on the left, the NEBNext Direct cancer hotspot panel does a nice job of providing uniform coverage.
We tested several similar hotspot panels, and this is about what we would expect, where we see relatively large differences between the highest and lowest performing targets within those panels. And the impact here is again, on the sequencing itself, and the amount of sequencing necessary to generate the sufficient coverage to call variants, and what we end up doing in many cases is having to over sequence the high performers, in order to bring the low performers up to that normalized coverage, and make sure we have sufficient coverage for variant calling.
When we look at this relative to the other technologies that are out there, we looked at several commercially available panels of similar size and content. This includes multiplex PCR based panels, selector probes, extension ligation, as well as hybridization based panels, and this figure is really just looking at the measured uniformity as the percentage of bases within 25% of the mean target coverage. As you can see, the NEBNext Direct cancer hotspot panel, in covering 100% of those target bases within 25% of the mean coverage, is best in class.
Finally, looking at the overall sensitivity to detect variants: in this experiment, because we were running such a focused 37 kilobase panel, we actually started by pooling 24 HapMap samples. These HapMap samples have well characterized variants with well characterized variant allele frequencies, and the reason we did that was really just to increase the density of variants present within the targeted regions of the hotspot panel. Using 100 nanograms of input DNA, we screened this with the cancer hotspot panel and ran our internal variant calling pipeline.
What you see here is the expected versus observed variant allele frequency plotted. As you can see, we looked between 2% and 100% variant allele frequency, and were able to detect 100% of those variants across that range. Also, the fact that the R squared is relatively high shows that the data itself is highly predictive of variant allele frequency. All of the data here used UMIs to do PCR duplicate filtering. However, we haven't yet applied the newer algorithms for false positive resolution to this, so we're really hoping that these data continue to push the limits of sensitivity in terms of variant allele frequency.
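The R squared figure mentioned here is the coefficient of determination of observed against expected VAFs; a minimal sketch, using invented VAF pairs rather than the actual experimental data:

```python
def r_squared(expected, observed):
    """Coefficient of determination of observed VAFs against expected VAFs."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for e, o in zip(expected, observed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Invented expected/observed pairs spanning 2-100% VAF, for illustration:
expected = [0.02, 0.05, 0.10, 0.25, 0.50, 1.00]
observed = [0.021, 0.048, 0.11, 0.24, 0.52, 0.98]
print(round(r_squared(expected, observed), 3))
```

An R squared close to one means the observed allele fractions track the expected values closely, which is what makes the assay quantitative rather than merely detecting presence or absence.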
So, in summary, I think that despite lots of advances in sequencing, we really need to think about target enrichment as an area for continued development and innovation, to further support the challenges that translational research presents. We continue to push the envelope in terms of calling variants at lower allele frequencies, and the overarching trend we see is getting maximal, optimal data from minimal, and potentially severely compromised, samples.
NEBNext Direct does provide a tractable solution for target enrichment, and it does so across a relatively wide range of panel sizes. As you've seen, we have high specificity for target regions, uniform coverage across those targets, and high sensitivity to detect nucleic acid variants.
So, in conclusion, I'd like to acknowledge a great group of people, both at New England Biolabs and with our partners at Directed Genomics, who were responsible for the development of the technology. We're really looking forward to conversations, and to getting this technology out into the market for folks to test.
So, with that, I think we'll transition to our Q and A portion of the webinar, and I'm just going to take a look at some of the questions that have come in.
So, I have a question here about the length of the workflow. The workflow itself is a single-day workflow: it takes about six and a half to seven hours to do the prep, from genomic DNA all the way onto the sequencer. We run this routinely in our lab every day, so we have fairly high confidence in that time frame for an experienced user.
I have another question here: how are the UMIs incorporated and sequenced?
So, as we mentioned, the UMIs are incorporated in the 5′ adapter. They're actually read using the index 2 read of the dual-index scheme, so incorporating them is simply a matter of making sure the sequencer configuration is set up to read the UMIs in that index 2 read.
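Once the UMIs are read, the PCR duplicate filtering mentioned earlier amounts to grouping reads by UMI and alignment position and collapsing each group to a single molecule. Here's a minimal sketch of that idea, using hypothetical read records; the actual NEBNext Direct pipeline is more sophisticated.

```python
from collections import defaultdict

def dedupe_by_umi(reads):
    """Keep one read per (UMI, alignment start); the rest are PCR duplicates."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read["umi"], read["start"])].append(read)
    return [grp[0] for grp in groups.values()]

# Hypothetical reads: two share a UMI and start (PCR duplicates of one molecule),
# a third maps to the same locus but carries a different UMI (distinct molecule).
reads = [
    {"umi": "ACGT", "start": 100},
    {"umi": "ACGT", "start": 100},  # PCR duplicate
    {"umi": "TTAG", "start": 100},  # same locus, distinct molecule
]
print(len(dedupe_by_umi(reads)))  # 2 unique molecules
```

This is why UMIs matter for low-frequency variant calling: without them, two reads at the same position are indistinguishable as duplicates versus independent molecules.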
I have a question here on our experience with GC-rich regions. This is something the technology performs very well on. Again, this is a function of the uniformity that we showed: the GC bias tends to be quite low. This is possible because we're able to individually synthesize the baits themselves, and we can do very fine-tuned balancing of those in the actual prep.
I have a question here about how we deal with pseudogenes.
So, pseudogenes are obviously challenging, because what we see in many cases is that we're able to very precisely enrich material present in pseudogenes, but then we have issues with the mapping quality of those reads. So pseudogenes do pose a challenge, but we have strategies we can employ to overcome them, including moving the bait slightly, potentially into the intronic regions, which can be used to uniquely identify the target gene versus the pseudogene itself.
I have a question here about whether the technology lends itself to customization and developing new assays.
I think the answer to that is yes. We have a few different options for customization. There's the traditional true custom route, where you send us your specific genomic coordinates and we synthesize baits specific to those. The first thing we do there is run a design, pick targets, and try to provide you with upfront feedback on anything that might be challenging, such as pseudogenes or repetitive elements in those regions that might cause problems for specificity or, as I mentioned earlier, mapping quality. The other option, specifically in the realm of cancer and now starting to expand into other diseases, is that we have pre-designed, synthesized, and balanced baits covering the full coding regions of about 450 cancer genes. This allows us to pool specific genes on demand and provide a custom kit for you, assuming the genes of interest fall within what we've already developed, and we can do that in a fairly rapid timeframe.
I have another question here that I think is interesting. How does the approach deal with fragmented or degraded DNA?
The answer to that is: quite well. One of the benefits we have is that the probe is relatively short, at about 60 nucleotides in length, and there's only one of them per strand of DNA. With more fragmented DNA, the fragments are obviously smaller, but the probability of landing a bait on a given fragment stays high, because we only need one. Contrast that with, say, a multiplex PCR approach, where you have to design specific primers across the fragment: you might have a high probability of landing a single PCR primer, but landing both becomes a challenge. So the approach deals quite well with fragmented and degraded DNA, and on top of that we're able to capture single-stranded DNA in the prep, which is typically present in things like FFPE DNA.
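The one-site-versus-two-sites argument above can be made concrete with a back-of-the-envelope calculation: if each required binding site independently survives fragmentation with some probability, an approach needing one intact site fares better than one needing two. The per-site probability below is a hypothetical number chosen for illustration, not a measured value.

```python
def capture_probability(p_site, sites_required):
    """Probability that all required binding sites are intact,
    assuming each site survives fragmentation independently."""
    return p_site ** sites_required

p = 0.6  # hypothetical chance a given binding site survives fragmentation
print(f"single bait (1 site):  {capture_probability(p, 1):.2f}")
print(f"primer pair (2 sites): {capture_probability(p, 2):.2f}")
```

The gap widens as fragmentation gets worse (smaller p), which is the intuition behind why a single short bait tolerates degraded input better than an amplicon that needs both primer sites on the same fragment.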
I'm just looking at the questions.
There's a question here about the 60-base bait. The 60-base bait is DNA, not RNA; it's a DNA-based oligonucleotide. The second half of the question was whether a 60-base bait should be more prone to mishybridization than a 120-base bait.
From our experience, our 60-nucleotide baits are able to tolerate about 50% mismatches, which is useful if, for instance, there's an indel in the region you're targeting: we'll still capture it. So we don't see a lot of mishybridization; we find that the 60-nucleotide bait is enough to pull down the regions of interest. And, as we mentioned, specificity is further enhanced because we do an enzymatic reduction of any off-target regions, or regions that aren't bound to baits, so any DNA that wasn't hybridized to a bait is essentially eliminated from the prep.
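As a hypothetical illustration of that mismatch-tolerance point, here's a tiny sketch that counts the fraction of mismatched positions between a bait and a target sequence and checks it against an assumed ~50% tolerance threshold. The sequences and threshold logic are illustrative only, not NEB's actual hybridization model.

```python
def mismatch_fraction(bait, target):
    """Fraction of positions where bait and target disagree (equal lengths assumed)."""
    return sum(b != t for b, t in zip(bait, target)) / len(bait)

bait   = "ACGTACGTAC"
target = "ACGTTCGTAC"  # one substitution relative to the bait
frac = mismatch_fraction(bait, target)
print(frac)  # 0.1 -- well under an assumed 0.5 tolerance, so capture would proceed
```

In practice this is why a variant, or even a small indel, under the bait footprint doesn't prevent the target molecule from being captured.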
So, another question here about the optimal and maximum number of targets that can be addressed with the technology, and the average gene lengths.
So, on average gene length: it's highly variable. Genes can be very short if they only have a few small exons, or they can be quite long. So we tend to think about this in terms of total target territory. Today we feel confident in the technology's ability to produce highly specific and sensitive enrichment, from very focused panels of a single gene all the way up to about 500 kilobases of total target territory.
I think that's all the time that we have for today. If we have any other questions, I'd recommend looking in the bottom rail. You'll see some of the various assets that will provide some additional information, as well as to visit the website at NEBNextDirect.com. And, with that, I thank you all for your attention, and wish you the best of luck in your research.