QC-passed reads in BWA aln and BWA mem
1
1
Entering edit mode
6.0 years ago
kai ▴ 10

Hey everyone,

currently I compare BWA aln vs BWA mem for EBV-transformed cellline data against the EBV genome just using the default settings Comparing the flagstat summary BWA mem produces much more (nearly all!) QC-passed reads than BWA aln, resulting into much more disk space consumption (340M for BWA aln BAM file vs. 114G for BWA mem BAM file)

I want to understand how QC-passing is done in BWA and if there are options to add more filters in BWA to reduce the amount of QC-passed reads (in a comprehensible way), does anyone have some good resources for me?

Here are the flagstat summaries:

BWA aln:

4926214 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
4797373 + 0 mapped (97.38% : N/A)
4926214 + 0 paired in sequencing
2463104 + 0 read1
2463110 + 0 read2
4657947 + 0 properly paired (94.55% : N/A)
4671269 + 0 with itself and mate mapped
126104 + 0 singletons (2.56% : N/A)
7032 + 0 with mate mapped to a different chr
3663 + 0 with mate mapped to a different chr (mapQ>=5)

BWA mem:

1625410367 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
5511 + 0 supplementary
0 + 0 duplicates
4982679 + 0 mapped (0.31% : N/A)
1625404856 + 0 paired in sequencing
812702428 + 0 read1
812702428 + 0 read2
4925242 + 0 properly paired (0.30% : N/A)
4956262 + 0 with itself and mate mapped
20906 + 0 singletons (0.00% : N/A)
8732 + 0 with mate mapped to a different chr
3013 + 0 with mate mapped to a different chr (mapQ>=5)

Thanks in advance

BWA BWA aln BWA mem QC-passed Alignment • 3.0k views
ADD COMMENT
0
Entering edit mode

Hello kaiho ,

Thank you for asking a question on the forum. As you're a new user, we'd like to know how the forum looks to you and to understand that, I'd like some feedback on a couple of things from you:

  1. Did you add a bookmark to this question? If you did, could you tell us why you felt it was necessary to do so please?
  2. Did you request a colleague or an acquaintance to upvote this post?
ADD REPLY
1
Entering edit mode

Hello Ram,

that upvote was me, for a well formulated, well formated and interesting question.

fin swimmer

ADD REPLY
0
Entering edit mode

Thank you, fin. You were mighty quick there, upvoting the question minutes after it was created! Thanks for letting us know the flag was a false positive :-)

ADD REPLY
0
Entering edit mode

Hello Ram,

1.) I add a bookmark to it since I initially thought then to get a mail-notification if someones answers, but this does not seem to be true?

2.) No, I do not requested an upvote from one of my colleagues.

ADD REPLY
0
Entering edit mode

Thank you for the response, kaiho. Yes, you do not need a bookmark - all posts created by you will automatically be subjected to an email-follow, which means that unless you change the setting, you'll automatically get emails for any comment/answer activity in your post. You can switch to local messages in which case you will not get emails but you will continue to see messages in your inbox ( https://www.biostars.org/local/messages/ )

You can read more about this in the http://biostars.org/t/how-to posts, specifically: A: How to Use Biostars, Part II: Post types, Deleting, Linking and Bookmarking

ADD REPLY
0
Entering edit mode

Did you align the exact same sequences, so identical fastq files? Even if one of the algorithms does align a lot more reads than the other, the total number of sequences in the final file should be the same.

ADD REPLY
0
Entering edit mode

The fastq input files are identical as well as the reference genome. BWA aln outputs less sequences and I do not understand whats going on.

ADD REPLY
1
Entering edit mode
6.0 years ago

Notice the number mapped is nearly the same. Samtools flagsat is just reading the flags in the sam file, I don't know that either bwa program will assign a QC fail flag to the resulting sam file.

As a sanity check, how many reads does the fastq really have? 1.6 billion reads is just about an entire Hiseq flow cell. Is that really what you have?

Also, posting your command lines would help.

ADD COMMENT
0
Entering edit mode

I have in overall 4 runs with 2 fastq files each for which I individually perform BWA Alignment and afterwards merge the BAM files using Picard MergeSamFiles. 1 Fastqfiles has ~205 million reads, therefore all 8 fastq files have in total 1.6 billion reads.

However, I solved the mystery (took me two days...). Deep in my scripts there was an "awkfilter" activated after the bwa sampe step for bwa aln algorithm which I was not aware of.

awk '( $1 ~ /^@/) || (and ( $2 , 0x0004 ) && ! and ( $2 , 0x0008 )) || (! and ( $2 , 0x0004 ))'

I think It extracts only mapped reads and unmapped reads next to them and discards the rest (which is not needed for my purpose). Since I do not need a sampe step for BWA mem, the awkfilter was not active there, explaining the huge differences.

Thanks for your help!

ADD REPLY

Login before adding your answer.

Traffic: 2241 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6