Question: QC-passed reads in BWA aln and BWA mem
1
gravatar for kai
17 months ago by
kai10
kai10 wrote:

Hey everyone,

currently I compare BWA aln vs BWA mem for EBV-transformed cellline data against the EBV genome just using the default settings Comparing the flagstat summary BWA mem produces much more (nearly all!) QC-passed reads than BWA aln, resulting into much more disk space consumption (340M for BWA aln BAM file vs. 114G for BWA mem BAM file)

I want to understand how QC-passing is done in BWA and if there are options to add more filters in BWA to reduce the amount of QC-passed reads (in a comprehensible way), does anyone have some good resources for me?

Here are the flagstat summaries:

BWA aln:

4926214 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
4797373 + 0 mapped (97.38% : N/A)
4926214 + 0 paired in sequencing
2463104 + 0 read1
2463110 + 0 read2
4657947 + 0 properly paired (94.55% : N/A)
4671269 + 0 with itself and mate mapped
126104 + 0 singletons (2.56% : N/A)
7032 + 0 with mate mapped to a different chr
3663 + 0 with mate mapped to a different chr (mapQ>=5)

BWA mem:

1625410367 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
5511 + 0 supplementary
0 + 0 duplicates
4982679 + 0 mapped (0.31% : N/A)
1625404856 + 0 paired in sequencing
812702428 + 0 read1
812702428 + 0 read2
4925242 + 0 properly paired (0.30% : N/A)
4956262 + 0 with itself and mate mapped
20906 + 0 singletons (0.00% : N/A)
8732 + 0 with mate mapped to a different chr
3013 + 0 with mate mapped to a different chr (mapQ>=5)

Thanks in advance

ADD COMMENTlink modified 17 months ago by swbarnes26.7k • written 17 months ago by kai10

Hello kaiho ,

Thank you for asking a question on the forum. As you're a new user, we'd like to know how the forum looks to you and to understand that, I'd like some feedback on a couple of things from you:

  1. Did you add a bookmark to this question? If you did, could you tell us why you felt it was necessary to do so please?
  2. Did you request a colleague or an acquaintance to upvote this post?
ADD REPLYlink modified 17 months ago • written 17 months ago by RamRS24k
1

Hello Ram,

that upvote was me, for a well formulated, well formated and interesting question.

fin swimmer

ADD REPLYlink written 17 months ago by finswimmer12k

Thank you, fin. You were mighty quick there, upvoting the question minutes after it was created! Thanks for letting us know the flag was a false positive :-)

ADD REPLYlink written 17 months ago by RamRS24k

Hello Ram,

1.) I add a bookmark to it since I initially thought then to get a mail-notification if someones answers, but this does not seem to be true?

2.) No, I do not requested an upvote from one of my colleagues.

ADD REPLYlink modified 17 months ago by RamRS24k • written 17 months ago by kai10

Thank you for the response, kaiho. Yes, you do not need a bookmark - all posts created by you will automatically be subjected to an email-follow, which means that unless you change the setting, you'll automatically get emails for any comment/answer activity in your post. You can switch to local messages in which case you will not get emails but you will continue to see messages in your inbox ( https://www.biostars.org/local/messages/ )

You can read more about this in the http://biostars.org/t/how-to posts, specifically: A: How to Use Biostars, Part II: Post types, Deleting, Linking and Bookmarking

ADD REPLYlink written 17 months ago by RamRS24k

Did you align the exact same sequences, so identical fastq files? Even if one of the algorithms does align a lot more reads than the other, the total number of sequences in the final file should be the same.

ADD REPLYlink written 17 months ago by ATpoint24k

The fastq input files are identical as well as the reference genome. BWA aln outputs less sequences and I do not understand whats going on.

ADD REPLYlink written 17 months ago by kai10
1
gravatar for swbarnes2
17 months ago by
swbarnes26.7k
United States
swbarnes26.7k wrote:

Notice the number mapped is nearly the same. Samtools flagsat is just reading the flags in the sam file, I don't know that either bwa program will assign a QC fail flag to the resulting sam file.

As a sanity check, how many reads does the fastq really have? 1.6 billion reads is just about an entire Hiseq flow cell. Is that really what you have?

Also, posting your command lines would help.

ADD COMMENTlink modified 17 months ago • written 17 months ago by swbarnes26.7k

I have in overall 4 runs with 2 fastq files each for which I individually perform BWA Alignment and afterwards merge the BAM files using Picard MergeSamFiles. 1 Fastqfiles has ~205 million reads, therefore all 8 fastq files have in total 1.6 billion reads.

However, I solved the mystery (took me two days...). Deep in my scripts there was an "awkfilter" activated after the bwa sampe step for bwa aln algorithm which I was not aware of.

awk '( $1 ~ /^@/) || (and ( $2 , 0x0004 ) && ! and ( $2 , 0x0008 )) || (! and ( $2 , 0x0004 ))'

I think It extracts only mapped reads and unmapped reads next to them and discards the rest (which is not needed for my purpose). Since I do not need a sampe step for BWA mem, the awkfilter was not active there, explaining the huge differences.

Thanks for your help!

ADD REPLYlink modified 17 months ago by RamRS24k • written 17 months ago by kai10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1666 users visited in the last hour