Question: Removal Of Low Quality Reads With Trailing "B"
gravatar for Bioscientist
7.5 years ago by
Bioscientist1.7k wrote:

This is kind of follow-up thread for:

I tried to run bwa -aln -I -q 20 to trim off the low-quality sequences at the right end of reads. However, my concern is: for example, for some 100bp reads, there'll only be, say, 30 bp left after trimming (means 70bp from right end to left are kind of crappy). Then BWA just tried to map this 30bp onto the reference genome.

Well, 30bp is still enough to uniquely locate the read; but I would think, doesn't a 100bp read with only 30bp in good shape suggest the whole read is unreliable? So I would just throw the whole read.

Thus I would like to first trim off the trailing "B" by myself, but next question is: what's the cutoff for the percentage of good-shape sequences out of the whole 100bp? I hope to set as 70%; but simply my personal feeling.

trimming bwa • 2.7k views
ADD COMMENTlink modified 7.5 years ago by Chris Penkett480 • written 7.5 years ago by Bioscientist1.7k
gravatar for Chris Penkett
7.5 years ago by
Chris Penkett480
Cambridge, UK
Chris Penkett480 wrote:

Filter the CIGARs for these dodgy reads and look at the base qualities:

samtools view out.bam | perl -F"\t" -ane 'print "$F[5]\t$F[10]\n" if $F[5] =~ /[7-9][0-9]S/' > test.txt

I've only got 76bp data to hand, so my CIGAR needs a different reg-ex:

samtools view out.bam | perl -F"\t" -ane 'print "$F[5]\t$F[10]\n" if $F[5] =~ /(4[6-9]|[5-9][0-9])S/' | head -10
29M47S  6;BABDBBAC@BCC?A;8@@CB9@/%1%)&+(%%%%**0%:%%(2((#%#&%$$#$%#%%#8,#,$#$)1*$$#$=
21M55S  ;;ACBDCCBC@CC@E>=CB62%>/&%%.$/%%%'$.,.0'%$$$%%%$%%%%%'(%#%$$$#$#$#%$76$,0%%$
29M47S  81:6<7:B$,4;?@8BB?4BBBBACB?=?*#/$#'$+%%%'%%%+5$2<5(-?%'81$26(6$)%%%$$+1&3%%$
46S30M  %%)/$$)$%$$-#)&%$##$%$%6$%$$$$#$#"#""#'$'-(%%$0%%##$$/0.34):7A9C=BACCBDCBBA;
27M49S  66997?&07355<8>'9&?*05#01)$*%+7A%$9C.2.-B-@&4&&,423-95#B>);4)6%,'<+'*.0-=4?>
49S27M  6%=B2/4*2;/@<8?>>C@@B+?.18B@7?C>?-A6"#,4>/)49:--"1;>8=D2A;<?8>+287./5>C?5/5.
30M46S  <1@?CBAC7B==A@<46%$$%4*=294%03'$#?%-#',$'##%/(4?,'%2)/1$416,21%,6--3%$%30('$
10M2I5M1I11M47S <<BAC@BABDBCAB%##%%$58442%%$)%<3%%(-'""##+$$/$3'-)%%%-(2&4%$%$$%(%%$,%',$##+

Most look a bit messy to me, but sometimes they can help to pair a read I guess.

In this case you can filter after the BWA alignment if you want (or even add the unmapped flag to keep the read in the BAM file - $F[1] + 0x04 if not $F[1] & 0x04):

samtools view -h out.bam | perl -F"\t" -ane 'print, next if /^@/; print if $F[5] !~ /[7-9][0-9])S/' | samtools view -Sb -o out_good.bam -


ADD COMMENTlink written 7.5 years ago by Chris Penkett480
gravatar for Istvan Albert
7.5 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

I would look at the FastQC report to try identify patterns in the way sequencing qualities are distributed. That would be the starting point for coming up with a data filtering strategy.

Lots of trailing bad qualities do not necessarily indicate that the entire read is useless although it often correlates with that. This depends on sequencing methodology.

I often compile stats on how many read would be removed by a given filter, and most often I find that is not really worth pushing my luck - various parameters won't make that much of a difference in which case I try to be more stringent.

ADD COMMENTlink written 7.5 years ago by Istvan Albert ♦♦ 80k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 960 users visited in the last hour