Question: bwa mem perfectly align reads over their entire length
2
gravatar for goldberg.jm
2.7 years ago by
goldberg.jm30
United States
goldberg.jm30 wrote:

I would like use bwa mem to identify reads that align perfectly or almost perfectly with the reference genome over the entire read. I have tried:

bwa mem -B 40 -O 60 -E 10 H2007/genome.fa Left.fq Right.fq > stringent.sam.

This gives nearly perfect aligned regions, but the result includes many reads that align over only an internal contiguous region with their ends flapping around in the breeze. I would like set bwa mem to exclude these.

I have also played with the -T and -k flags and spent many minutes googling and reading, but no joy.

I would prefer not to have to resort to unix scripts to address this.

Thank you,

Jon

alignment • 4.6k views
ADD COMMENTlink modified 2.7 years ago by piet1.4k • written 2.7 years ago by goldberg.jm30
1

Have you tried increasing the penalty for end clipping (-L)?

" -L INT[,INT]  penalty for 5'- and 3'-end clipping [5,5] "

ADD REPLYlink written 2.7 years ago by donfreed1.1k
1

I -L is a good advise. Alternatively you can identify partially mapped reads in SAM by cigar. Partial hits start and/or end "S"/"H" (depending on your clipping settings). E.g. "10S75M15S" could be a perfect match of 75bp with 10 bp in front and 15bp at the end unaligned.

 

ADD REPLYlink written 2.7 years ago by thackl2.4k
1

Thanks for the tips! I increased the –L clipping penalty like so: 

bwa mem -B 40 -O 60 -E 10  –L100 genome.fa Left.fq Right.fq > stringent.sam

I got rid of unmapped and supplemental alignments like so:

samtools view -b -f 2 -F 2316 stringent.bam > mapped_primary.bam

This helped but did not entirely eliminate partial alignments. Counting CIGARs = 101M (my reads are 101nt):

Default –L [5]: 70.1% full length alignments 
For –L100: 77.4% full length alignments

It looks like I can keep raising –L to clear more partial alignments. The clipping penalty is subtracted from the score. Does anyone know where bwa's alignment score appears in the sam file? It does not seem to be the AS:i:xxx field, which seems to be just the number of matches from the CIGAR field. Maybe I’ll end up going the unix route, but the more manipulations the more chance for errors.

Thanks again,

Jon

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by goldberg.jm30
1

Hi Jon,

Since you are looking for global rather than local alignments, you might find that you have better control with a global aligner such as BBMap.  In this case, you could achieve the result you want with a command something like this:

bbmap.sh ref=genome.fa in1=Left.fq in2=Right.fq minid=0.95 maxindel=1 outm=mapped.sam

Hanging off the end of a scaffold is penalized by BBMap, not as much as mismatches, but about half as much - so those alignments won't show up.  If you only want pairs where both are mapped you can add the "po" (pairedonly) flag.  "outm" will only capture the mapped reads (keeping pairs together).

If you really only want perfect matches, then add the flag "perfectmode" instead of "minid=0.95 maxindel=1".  "perfect" is defined by BBMap as not going off the end of a contig.  "semiperfectmode" on the other hand allows going off the end but does not allow any mismatches.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Brian Bushnell14k

Thanks all, I am currently testing 'bwa mem ...-L 10...' (default is 5). The hard clipping filter 'bwa mem ...-H...' does not do anything at all. Just gives back the command prompt (bwa_0.7.10).

A question about using the CIGAR field for filtering. Can CIGAR filters be applied directly from within samtools? I don't think so. I can filter the CIGAR field on -v '\b[0-9]+S|H\b' or '\b101M\b' using unix (I happen to have 101nt reads), but I think this will disrupt pairs when one but not the other is fully aligned. For subsequent analysis steps (e.g. IDBA_ud )I would like to keep pairs together.

ADD REPLYlink written 2.7 years ago by goldberg.jm30
2
gravatar for piet
2.7 years ago by
piet1.4k
planet earth
piet1.4k wrote:

>the result includes many reads that align over only an internal contiguous region

it may be easier to remove those partial aligned reads from the SAM file instead of avoiding them altogether. Reads aligned only partially will have 'H' or 'S' in their CIGAR string (Hard or Soft clipped). It should not take much effort to filter them out.  

ADD COMMENTlink written 2.7 years ago by piet1.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 758 users visited in the last hour