Question

How to handle off-target regions in amplicon panel analysis?

6.6 years ago

lamteva.vera ▴ 220

Dear biostars inhabitants!

I'm trying to figure out how to analyse data from Illumina's TruSeq Custom Amplicon panel.

The manifest file that Illumina provides for the panel includes Probes and Targets sections with 788 and 822 entries, respectively. According to the documentation, the Targets section includes "an expected off-target region" "in addition to the submitted genomic region".

As far as I understand, these 34 expected off-target regions are regions that are highly likely to bind primer pairs originally designed for the regions of interest. Thus, some of the targeted regions are not actually well covered by the panel, since it's hard to map the amplicons unambiguously. Correct me if I'm wrong.

I'm looking for your expert advice: how can I use the information about predicted off-targets in sequencing data analysis? Should I exclude such regions from the interval list used to restrict variant calling?

Thank you for your time. Have a nice day!

off-target targeted resequencing • 3.0k views

updated 6.6 years ago by Kevin Blighe 87k • written 6.6 years ago by lamteva.vera ▴ 220

score 2 · Answer 1 · 2017-09-27

Your assumptions are correct, and it's important to realise that a large chunk of the human genome exhibits some level of homology (generally, for a DNA sequence to be regarded as homologous, it must exhibit at least 30% similarity to one or more other regions). Don't quote me, but I read somewhere that >50% of all genes have a processed or unprocessed pseudogene elsewhere in the genome. Thus, the 'expected off-target' regions provided by Illumina for your panel are most likely regions that exhibit high homology to your primary target regions of interest.

All of this poses major problems for alignment tools, which have to faithfully map each read to a position in the genome. If a read maps to more than one location, its mapping quality will suffer; if it maps to just a single region, it will generally have a high mapping quality. Base errors in the reads do not help either, as they further reduce mapping quality and make the aligner's task more difficult.
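To put numbers on "mapping quality": MAPQ in SAM/BAM files is Phred-scaled, i.e. MAPQ = -10 * log10(probability that the reported position is wrong). The short Python sketch below only illustrates that conversion. The function name is purely illustrative, and each aligner estimates and caps MAPQ in its own way, so values are not directly comparable between tools.

```python
def mapq_to_error_prob(mapq: int) -> float:
    """Convert a Phred-scaled MAPQ into the probability that the mapping is wrong."""
    return 10 ** (-mapq / 10)


# Print the implied error probability for a few typical MAPQ values.
for q in (0, 10, 20, 40, 60):
    print(q, mapq_to_error_prob(q))
```

Read this way, a MAPQ threshold is a cap on the tolerated probability of a misplaced read, which is why reads from homologous regions tend to fall below it.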

This issue also partly explains the very uneven depth-of-coverage profile that you get with this type of sequencing, whereby one region may have >1000 reads mapped to it whereas others may have just 20 (reads that could have mapped there were 'robbed' by homologous regions during PCR amplification and/or during in silico alignment).

From my experience of targeted sequencing using Illumina's kits, the proportion of off-target reads is generally 30-40% of all reads (i.e. 30-40% of reads in each sample will map to regions outside of the primary regions of interest). There is not much that you can do about this other than work with Illumina to try to mitigate the problem.
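If you want to measure the off-target fraction for your own samples, the idea is simply to count the reads that overlap none of the primary target intervals. Here is a minimal Python sketch, assuming reads and targets have already been parsed into (chrom, start, end) tuples with BED-style 0-based, half-open coordinates. The function names are hypothetical, and for real BAM files you would use a dedicated tool rather than this linear scan.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based, half-open


def overlaps_any(read: Interval, intervals: List[Interval]) -> bool:
    """True if the read shares at least one base with any interval."""
    chrom, start, end = read
    return any(c == chrom and start < e and s < end for c, s, e in intervals)


def off_target_fraction(reads: List[Interval], targets: List[Interval]) -> float:
    """Fraction of reads that do not overlap any primary target."""
    if not reads:
        return 0.0
    off = sum(1 for r in reads if not overlaps_any(r, targets))
    return off / len(reads)


# Toy data: two of the four reads touch the single target region.
targets = [("chr1", 100, 200)]
reads = [("chr1", 150, 250), ("chr1", 300, 400), ("chr2", 100, 200), ("chr1", 90, 110)]
print(off_target_fraction(reads, targets))
```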

Many regions of the genome are just not suited for massively parallel sequencing using short reads - the data from these regions cannot be trusted because the regions exhibit high homology to others in the genome. The way to tackle these is with long-range PCR or Sanger sequencing, where you can design primers far outside your region of interest, in a region of unique sequence.

From an analysis perspective, I manage this issue specifically with the following steps:

Trim bases from the ends of reads that fall below a Phred-scaled quality score of 30
Eliminate short reads (<50 or <70 bp)
Only include uniquely mapped reads (Bowtie allows this), or filter out reads with MAPQ <40 or <50 (BWA)
Use a BED file to filter out all reads mapped to, or variants called in, the off-target regions
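
The logic of these steps can be sketched in plain Python. This is only an illustration under stated assumptions, not a replacement for a real trimmer or BAM/BED toolkit: reads are modelled as a sequence plus a list of Phred scores before alignment, and as (chrom, start, end, MAPQ) after alignment, with BED-style 0-based, half-open coordinates. All function names and the default thresholds (Q30, 50 bp, MAPQ 40) are hypothetical choices taken from the list above.

```python
from typing import List, Optional, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based, half-open


def trim_ends(seq: str, quals: List[int], min_q: int = 30) -> Tuple[str, List[int]]:
    """Trim bases from both ends while their quality is below min_q."""
    left, right = 0, len(quals)
    while left < right and quals[left] < min_q:
        left += 1
    while right > left and quals[right - 1] < min_q:
        right -= 1
    return seq[left:right], quals[left:right]


def keep_before_alignment(
    seq: str, quals: List[int], min_q: int = 30, min_len: int = 50
) -> Optional[Tuple[str, List[int]]]:
    """Steps 1-2: trim the ends, then drop the read if it became too short."""
    seq, quals = trim_ends(seq, quals, min_q)
    return (seq, quals) if len(seq) >= min_len else None


def overlaps_any(chrom: str, start: int, end: int, intervals: List[Interval]) -> bool:
    """True if [start, end) on chrom shares at least one base with any interval."""
    return any(c == chrom and start < e and s < end for c, s, e in intervals)


def keep_after_alignment(
    chrom: str, start: int, end: int, mapq: int,
    off_targets: List[Interval], min_mapq: int = 40,
) -> bool:
    """Steps 3-4: require a confident mapping outside the off-target regions."""
    return mapq >= min_mapq and not overlaps_any(chrom, start, end, off_targets)
```

For example, `keep_after_alignment("chr1", 100, 200, 60, [("chr1", 500, 600)])` keeps the read, whereas the same read with MAPQ 10, or one overlapping `("chr1", 500, 600)`, is dropped.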

Other people will of course have their own ideas, which are welcome.

I really appreciate your question as it touches on what is a major issue in next generation sequencing.