Question

Removal Of Low Quality Reads With Trailing "B"

0

Entering edit mode

12.4 years ago

Bioscientist ★ 1.7k

This is kind of follow-up thread for: http://biostar.stackexchange.com/questions/14723/trim-the-low-quality-end-of-100bp-read

I tried to run bwa -aln -I -q 20 to trim off the low-quality sequences at the right end of reads. However, my concern is: for example, for some 100bp reads, there'll only be, say, 30 bp left after trimming (means 70bp from right end to left are kind of crappy). Then BWA just tried to map this 30bp onto the reference genome.

Well, 30bp is still enough to uniquely locate the read; but I would think, doesn't a 100bp read with only 30bp in good shape suggest the whole read is unreliable? So I would just throw the whole read.

Thus I would like to first trim off the trailing "B" by myself, but next question is: what's the cutoff for the percentage of good-shape sequences out of the whole 100bp? I hope to set as 70%; but simply my personal feeling.

bwa trimming • 3.9k views

ADD COMMENT • link updated 12.4 years ago by Chris Penkett ▴ 490 • written 12.4 years ago by Bioscientist ★ 1.7k

score 1 · Answer 1 · 2011-11-29

Filter the CIGARs for these dodgy reads and look at the base qualities:

samtools view out.bam | perl -F"\t" -ane 'print "$F[5]\t$F[10]\n" if $F[5] =~ /[7-9][0-9]S/' > test.txt

I've only got 76bp data to hand, so my CIGAR needs a different reg-ex:

samtools view out.bam | perl -F"\t" -ane 'print "$F[5]\t$F[10]\n" if $F[5] =~ /(4[6-9]|[5-9][0-9])S/' | head -10
29M47S  6;BABDBBAC@BCC?A;8@@CB9@/%1%)&+(%%%%**0%:%%(2((#%#&%$$#$%#%%#8,#,$#$)1*$$#$=
21M55S  ;;ACBDCCBC@CC@E>=CB62%>/&%%.$/%%%'$.,.0'%$$$%%%$%%%%%'(%#%$$$#$#$#%$76$,0%%$
29M47S  81:6<7:B$,4;?@8BB?4BBBBACB?=?*#/$#'$+%%%'%%%+5$2<5(-?%'81$26(6$)%%%$$+1&3%%$
51S15M3D10M =ACDDDCEC6EBB>ACEAE@AEEEDDEEDBEDEDEFEDDDFCDADEDEDECC=CDCDDDEEDECEDDBCDACC@AA
46S30M  %%)/$$)$%$$-#)&%$##$%$%6$%$$$$#$#"#""#'$'-(%%$0%%##$$/0.34):7A9C=BACCBDCBBA;
27M49S  66997?&07355<8>'9&?*05#01)$*%+7A%$9C.2.-B-@&4&&,423-95#B>);4)6%,'<+'*.0-=4?>
49S27M  6%=B2/4*2;/@<8?>>C@@B+?.18B@7?C>?-A6"#,4>/)49:--"1;>8=D2A;<?8>+287./5>C?5/5.
30M46S  <1@?CBAC7B==A@<46%$$%4*=294%03'$#?%-#',$'##%/(4?,'%2)/1$416,21%,6--3%$%30('$
10M2I5M1I11M47S <<BAC@BABDBCAB%##%%$58442%%$)%<3%%(-'""##+$$/$3'-)%%%-(2&4%$%$$%(%%$,%',$##+

Most look a bit messy to me, but sometimes they can help to pair a read I guess.

In this case you can filter after the BWA alignment if you want (or even add the unmapped flag to keep the read in the BAM file - $F[1] + 0x04 if not $F[1] & 0x04):

samtools view -h out.bam | perl -F"\t" -ane 'print, next if /^@/; print if $F[5] !~ /[7-9][0-9])S/' | samtools view -Sb -o out_good.bam -

Chris

score 0 · Answer 2 · 2011-11-28

I would look at the FastQC report to try identify patterns in the way sequencing qualities are distributed. That would be the starting point for coming up with a data filtering strategy.

Lots of trailing bad qualities do not necessarily indicate that the entire read is useless although it often correlates with that. This depends on sequencing methodology.

I often compile stats on how many read would be removed by a given filter, and most often I find that is not really worth pushing my luck - various parameters won't make that much of a difference in which case I try to be more stringent.