Question

filtering pindel results

2

Entering edit mode

7.8 years ago

Tootiki ▴ 40

I have BAMs from which I am trying to detect reliable variants, to find differences between two strains of Aspergillus. I have used HaplotypeCaller to call call SNPs and small InDels, and Pindel to call longer InDels and other structural variants. I understand that filtering is always a balancing act and there may be no objectively best answer to this question, but any advice or experience anyone has would be appreciated. Also, feel free to criticise any of the background I give you!

Cobbling together ideas from here and there, I have used the following filters on my HaplotypeCaller results (incidentally, do you like them?):

for SNPs:

--filterExpression " QD            <     2.0 " --filterName "LowQD" 
--filterExpression " FS            >    60.0 " --filterName "HighFS" 
--filterExpression " SOR           >     4.0 " --filterName "HighSOR" 
--filterExpression " MQ            <    40.0 " --filterName "LowMQ" 
--filterExpression " DP            <    10   " --filterName "LowDP" 
--filterExpression "MQRankSum      <   -12.5 " --filterName "LowMQRankSum" 
--filterExpression "ReadPosRankSum <    -8.0 " --filterName "LowReadPosRankSum"

for short InDels:

--filterExpression " QD            <    2.0 " --filterName  "LowQD" 
--filterExpression " FS            >  200.0 " --filterName  "HighFS" 
--filterExpression " SOR           >   10.0 " --filterName  "HighSOR" 
--filterExpression " DP            <   10   " --filterName "LowDP" 
--filterExpression "ReadPosRankSum <  -20.0 " --filterName  "LowReadPosRankSum"

This reduces the number of short variants I’m dealing with from an already modest 2,065 to 1,965 (in one example sample).

However, I also care about larger variants, so I ran Pindel on the same BAMs. I used pretty much default parameters, and have not yet attempted to filter. I get a whopping 37,856 variants to deal with (admittedly this is across my two samples, so not totally comparable to my above figures, but we are clearly not in the same ball park). These are:

27410   Deletions
 6821   Short insertions
 1730   Inversions (INV)
 1312   deletions with non-template sequences
  583   Tandem duplications
    0   Tandem duplications with non-template sequence (TD_NT)
    0   Inversions with non-template sequence (INV_NT)

I guess to be meaningful and usable along side my short variants, they need some amount of filtering, but exactly how should I go about this, particularly with the range of different types of variant?

I've had a bit of a google and found several half-answers:

There seems to be talk online of just filtering by DP and sample number (eg https://www.biostars.org/p/70476/). How safe is that? I guess it still leaves me prone to many kinds of error, such as any that are introduced by mapping errors. Also I have just two samples...

I've had a quick look at AgilePindelFilter, which looks cool if you have more biological knowledge than I do. I'll let the biologists know about this when I pass back the work, but I’d like to reduce the numbers for them a bit before then, to make it something like equivalent to the HaplotypeCaller results.

There are also a couple of scripts available, without much documentation: Filter_Pindel_del_vcf.py, to filter out deletions (and insertions?) with a variant quality of <30; and Filter_Pindel_inv_vcf.py, to filter out inversions that are less than 100 bp (why do we want to do that?) and (/or?) with a variant quality of <30. do these look like good parameters? Also, what about the other variant types that Pindel reports (e.g. tandem duplication)?

Thanks very much.

pindel indel filtering calling rearrangement • 2.9k views

ADD COMMENT • link 7.7 years ago by Tootiki ▴ 40

score 0 · Answer 1 · 2016-08-08

if anyone else is looking for the answer, i have it!

firstly, to be clear, i'm always talking about the vcf file made by pindel2vcf from my initial pindel results file.

this contains many sites that pindel doesn't predict as alt in any whole samples.
i think that any site that is predicted as alt in any single read ends up in the vcf, which seems a little useless and inconvenient to me, but i guess in other circumstances it must be desirable. so the vast majority of the reported sites listed actually turned out to be 0/0 in both samples. before i noticed this, I had assumed that i only had sites that were actually being called as alt in at least one whole sample.

once i filtered these out i had a much more manageable number of sites, and i was able to live out my days peacefully.

i did this using:

grep "\(0\/1\|1\/1\|1\/0\|#\)" .........../results.vcf >  ........../results.just_sites_called_in_samples.vcf