Question: GATK removing all primary reads with a secondary alignment
gravatar for tanya.copley
3.2 years ago by
tanya.copley10 wrote:

Hi, I am trying to do indel realignment using the GATK tool IndelRealigner. I did my alignment using bwa mem with the -M command for alternative alignments. When I go to use GATK IndelRealigner it removes all of my reads that have the potential to have secondary alignments even if the secondary alignment score is low. Everything that I have read says to align with secondary alignments kept before going into GATK, but if I do this then 95-98% of my reads are removed when performing the IndelRealigner step. I have tried the --filter_bases_not_stored, -rf NotPrimaryAlignment and still get 95-98% of my reads removed. Any input would be appreciated. Thanks

indelrealigner bwa gatk • 1.5k views
ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by tanya.copley10
gravatar for Santosh Anand
3.2 years ago by
Santosh Anand5.0k
Santosh Anand5.0k wrote:

That's extremely unusual. Are you sure that those many reads are of secondary alignments? Check using samtools flagstat

GATK applies these filters in IndelRealigner by default. Check if they are filtered out due to that

Note that indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect.

ADD COMMENTlink modified 3.2 years ago • written 3.2 years ago by Santosh Anand5.0k

So, if I put -rf NotPrimaryAlignment they fail because of this filter even though I have primary alignments for the read. If I put the -filter_bases_not_stored, they fail because of MalformedRead. When I check the headers using picard's ValidateSamFile, I do not get any errors returned. This is why I am stumped.
And yes, samtools flagstat gives a high amount of secondary alignments

ADD REPLYlink written 3.2 years ago by tanya.copley10

I decided to attempt the HaplotypeCaller instead to see if it would give the same error. Rather, it gave me failing for the MappingQuality. After looking at my bam file, most of my alignments have low MAPQ- mostly 8 or even 0 for primary alignments. I blasted a few sequences in ncbi and ma having mixed feelings about the MAPQ values- some with values of 0 mapped perfectly to the soybean genome without any filtering for organisms (i.e. blasting aganist the entire NCBI database), while others with high MAPQ (32) aligned in two portions and not in one piece. I've done alignments before, but always with TopHat and never encountered any problems like this. From my understanding, and please correct me it I am wrong, there is no need to change the MAPQ values from bwa alignments (unlike with other alignment programs), so why do I have such low MAPQ values?

ADD REPLYlink written 3.2 years ago by tanya.copley10

Are you mapping to a highly repetitive genome? BWA assigns MAPQ=0 to reads mapping to multiple locations, as they are anyway (almost) useless for variant calling. Did you see more than one equally plausible hits in Blast for reads with MPAQ=0? On the other hand, I find your results very unusual. Something is fundamentally wrong. Why don't you try mapping with TopHat just for comparison. What params are you using for BWA? What is your whole commandline? Lastly, bear in mind that GATK pipeline is highly optimized for human. Non-humans, especially multiploid organisms can show some unsuspecting surprises. You should also consult GATK forums, there is plethora of materials and usually they are swift in reply.

ADD REPLYlink written 3.2 years ago by Santosh Anand5.0k

The BWA-MEM algorithm performs local alignment. It may produce multiple primary alignments for different part of a query sequence. This is a crucial feature for long sequences. However, some tools such as Picard’s markDuplicates does not work with split alignments. One may consider to use option -M to flag shorter split hits as secondary.

Is it possible that your genome is not the right genome, due to which BWA is forced to split the reads?

ADD REPLYlink written 3.2 years ago by Santosh Anand5.0k

I am working with soybean, so it does have repetitive regions and is polyploid. My bwa command is as follows: bwa mem -r 1 -a -M -R (RG link) genome input.bam

My secondary hits are marked as such (flag 256), but it appears that GATK is not accepting any reads with duplicates. I am also finding that even reads with a MAPQ of 60 are getting secondary alignments.

As for the GATK pipeline, I was using sambamba markdup -r to remove duplicates, samtools index to index my bam files, and the above mentioned commands for the downstream processes of GATK.

I am simply beginning to suspect that GATK will not work for soybean. Too bad, I was appreciative of how it realigns indels to weed out any uncertainty. I'll find/create another similar script. Thank-you for your time

ADD REPLYlink written 3.2 years ago by tanya.copley10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1587 users visited in the last hour