Question

Strategies for calling variants in a transposon-rich genome

0

Entering edit mode

6.6 years ago

Dave Carlson ★ 1.7k

Hi All,

I'm trying to call variants in a set of plant genome samples using a bwa mem followed by GATK's HaplotypeCaller. However, I'm running into a potential roadblock that I'm hoping to get advice on. These genomes are rich in transposable elements, and when I visualize the alignments with IGV, I'm finding that that are sizable stretches in the alignment (that includes a non-trivial proportion of the total reads) corresponding to TE sequences where essentially all the reads are multi-mapping and thus have a MAPQ score of zero.

My understanding is that GATK's HaplotypeCaller will not call variants in these regions due to the MAPQ scores. For various reasons, I would like to be able to call variants in these repeat sequences, but I'm not sure how to go about doing so without removing the MAPQ minimum threshold (which seems likely to be a bad idea!).

One of my colleagues suggested that I could split my BAM files to separate out the reads mapping to TE sequences in the reference (I have already annotated TEs in the reference and have a GFF with their locations) and then change the MAPQ scores for these reads from zero to some arbitrary value that would pass the GATK threshold (minimum value of 10).

That's an interesting idea, but I'm not sure if there are other, better options. Anybody have suggestions for how I can call variants in these repeat regions?

Thanks!

genome gatk SNP • 1.2k views

ADD COMMENT • link 6.6 years ago by Dave Carlson ★ 1.7k

0

Entering edit mode

How similar are the TE copies? You could mask the original TEs, and retain just one copy of each family, so their mapping will be unique.

ADD REPLY • link 6.6 years ago by h.mon 35k