Filtering both concordantly and discordantly mapped reads
2
0
Entering edit mode
4.3 years ago
valerie ▴ 90

Hi guys,

Is it possible to filter both concordantly and discordantly mapped reads?

I believe that there can be huge deletions/insertions in my reference sequence and the distance between reads can differ from 100 nucleotides that are expected by aligner (I use bowtie). I think this is the best way to find paired reads that mapped to the reference to study insertions and deletions.

Thank you!

ngs sequencing samtools • 1.5k views
2
Entering edit mode
4.3 years ago

BBMap is very good at correctly mapping reads with long indels. For example -

bbmap.sh in=r1.fq in2=r2.fq ref=ref.fa maxindel=400k out=mapped.sam


That takes care of deletions up to around 400kbp. Insertions are trickier. With 150bp reads, BBMap won't reliably map reads spanning insertions longer than ~50bp (obviously, insertions >150bp are fundamentally impossible to capture in a 150bp read). But it can spit out discordant pairs:

bbmap.sh in=r1.fq in2=r2.fq ref=ref.fa maxindel=400k  outm=mapped.sam outu=unmapped.fq po


This will send all unmapped reads to "outu" (unmapped.fq) as an interleaved fastq. Also, the "po" flag means all discordant pairs will be marked as unmapped, meaning all half-mapped pairs and discordant pairs will go to that file, which can then be re-mapped with different parameters. You may want to set the "pairlen" flag also to tell it the maximum distance allowed for concordant pairs.

0
Entering edit mode

Thank you, Brian! This is exactly what I was looking for!

1
Entering edit mode
4.3 years ago
d-cameron ★ 2.3k

There are over 50 publicly available structural variant callers, with about 1/4 able to detect small insertions and large deletions using the technique use describe. These include BreakDancer, HYDRA, GASVPro, and CLEVER. Other detection signals that can be used to detect structural variants include read alignment CIGAR, split reads, soft clipped reads, read pairs with only one read mapped, and multiple different assembly-based approaches. Tools using such methods (most of which combine one or more of these signals with read pair information) include Pindel, SRiC, CREST, SVseq2, PRISM, SoftSearch, Socrates, LUMPY, cortex-var, lasSV, AsmVar, SoftSV, manta, and GRIDSS.

I would strongly advise looking at the existing tools and literature before reinventing a wheel.

Disclaimer 1: I didn't check that every one of these tools reports small insertions. Some tools, such as DELLY, use read pair and split read information but do not report insertion events.

Disclaimer 2: As I wrote GRIDSS, it should come as no surprise that I recommend it. My extensive benchmarking shows it outperforms other callers (breakdancer, cortex, delly, hydra, lumpy, pindel, software, tigra) at 15x or greater coverage. That said, if you're interested in novel insertions larger than the library fragment size, you'll need to use a de novo assembly based technique (eg cortex) or a specialised caller such as NovelSeq.

0
Entering edit mode