Question: VCF Filtering by Multiple Parameters
1
gravatar for rc16955
2.5 years ago by
rc1695560
rc1695560 wrote:

Dear all,

I think I need some ground-up assistance with filtering VCF files. I have generated a VCF file with samtools + bcftools after an initial alignment with bowtie2 and from what I can tell this VCF file looks as expected. Now I want to filter it in several ways and am struggling to figure out how to do it. The following is what I need to do:

  • DP >= 10
  • DP_STRAND >= 10
  • QUAL >= 50
  • MQ >= 30
  • Not on a repetitive region
  • Not less than 9nt away from a gap

Unfortunately I'm rather limited in the software I can use; I have bcftools and vcftools, as well as the ability to run any Python or Perl script, so any answers that use those would be ideal. As I'm not using my own computer I can't freely install other things, although if there's something I absolutely need I can probably ask for it.

Thanks in advance.

snp • 1.9k views
ADD COMMENTlink modified 2.5 years ago by Manuel Landesfeind1.2k • written 2.5 years ago by rc1695560
3
gravatar for Manuel Landesfeind
2.5 years ago by
Göttingen, Germany
Manuel Landesfeind1.2k wrote:

For the first four criteria, I recommend to use the BCFTools filter sub-command. To identify repetitive regions, you first need to have a database of these regions (preferably as BED file). Afterwards, you can employ BEDTools intersect (check out the -v parameter) to remove variants from the VCF or use BCFTools annotate to add the information to the VCF info field.

The same procedure can be used for the filtering for variants close to gaps: first identify them and write them to a BED file (you can potentially add the 9nt margins by using a custom AWK or Perl script). Then use the BED file to remove/annotate the variants.

When working with BED and VCF files at the same time, please keep in mind that VCFs uses 1-based positions while BED files uses 0-based and end-exclusive intervals.

EDIT: I recommend using variant normalization and - for multiple filter criteria - to use pipes, e.g.:

bcftools view my_variants.vcf \
   | bcftools norm -m - \
   | bcftools norm -f $reference_genome \
   | bcftools filter -e $FILTER_CRITERIA_1 \
   | bcftools filter -e $FILTER_CRITERIA_2 \
   | bcftools filter -e $FILTER_CRITERIA_3 \
   | ... \
   | bedtools intersect -v -a - -b $my_repetitive_regions.bed \
   | bedtools intersect -v -a - -b $my_gap_regions.bed \
   | bcftools view -o my_variants.filtered.vcf
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Manuel Landesfeind1.2k

EDIT: Just saw your edit; thanks I will take it on board :)

Hi, thank you very much for your answer, it's very informative. I have now used BCFTools to filter out reads with DP>=10 and/or MQ>=30 so I have made progress, however I'm a bit stumped about how to filter according to QUAL and DP_STRAND, as these do not appear in the vcf file header and BCFTools naturally cannot then refer to them. I apologise if this a very basic/stupid question, but do these values perhaps have different names depending on how the file was produced?

The file was produced using bcftools-1.2 and samtools-1.2, command line:

samtools mpileup -d1000 -uf s_ratti_reference.fa ED132_66_10304_55.sorted.bam | bcftools call -mv > ED132_55.vcf

The filtering I've done so far (in case I'm somehow doing it wrong) was:

bcftools filter -e "INFO/MQ <= 30" ED132_55.vcf > ED132_55_filtered.vcf 
bcftools filter -e "INFO/DP <= 10" ED132_55_filtered.vcf > ED132_55_filtered2.vcf
ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by rc1695560

Depending on what you effectively want to test with DB_STRAND, you could take a look at DP4. For example to filter variants with more than 10 reads on forward AND on reverse strand, you could do:

bcftools filter -e "DP4[0] + DP[2]) <= 10" | bcftools filter -e "DP4[1] + DP[3]) <= 10"

You can filter for all fields available. For example,

bcftools filter -e "QUAL < 50" | bcftools filter -e "FORMAT/AD[1] > 10"

Also, check the additional fields that can be reported by samtools during pileup (-t parameter). In particular, the DPR (samtools 1.2) or the AD (samtools >= 1.3) are of interest! In particular, the AD is important if you encounter multi-allelic variants...

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Manuel Landesfeind1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 821 users visited in the last hour