SNP subsetting using bcftools
Entering edit mode
16 months ago
alexcullsci ▴ 10

I am trying to subset SNPs from 32 cultivars. I have the following exclusion criteria:

  1. >20% missing data
  2. SNPs present in less than 50% of genotypes
  3. Each genotype must have at least 50% of included SNPs
  4. >=60% heterozygous calls
  5. Markers missing data across genotypes
  6. Bi-allelic SNPs only
  7. No very rare variants (i.e. MAF<0.05

With the help of the bcftools manual, I've managed to figure out how to subset only biallelic SNPs with a MAF of over 0.05, dealing with criteria 6 and 7.

bcftools view --output-type u --min-alleles 2 --max-alleles 2 \
--types snps --exclude MAF[0]<0.05 --exclude **   --known  variants_raw_sorted.bcf > SNP_filtered_sorted_kkf.bcf

Where the ** is, I am trying to implement the remainder my exclusion criteria, (criteria 1-5), however I am having a great deal of difficulty understanding how to use the expression in bcftools. I am looking to understand how to use the information in the info fields (AN,AC, AF/MAF, NS) to implement my exclusion criteria. Any help is very much appreciated.

Futher information : This is the upstream code that led to the bcf file.

fastp -i ACDC_R1.fastq.gz -I ACDC_R2.fastq.gz \
    -o fastp_ACDC_R1P.fastq.gz fastp_ACDC_R2P.fastq.gz \
    -R "ACDC_fastp_QC" -h "Fastp_ACDC" -j "ACDC_fastp" \
    -e 20 -c -a "auto" -L -r \
    -m --merged_out "ACDC_fastp_Merged" --out1 "ACDC_R1_fastp_Unmerged" --out2 "ACDC_R2_fastp_Unmerged" \
    --unpaired1 "ACDC_R1Pass_R2Fail" --unpaired2 "ACDC_R2Pass_R1Fail"

bowtie2 -p 20 -x cs10 -U ACDC_fastp_Merged.fastq.gz -S ACDC_aligned.sam

samtools view -bS ACDC.sam > ACDC.bam

bcftools mpileup --output-type u -f cs10.fasta -b bam_list.txt | bcftools call -vmO z -o all_raw_condition_1.vcf.gz --threads=20

bcftools view --output-type u --output-file variants_raw.bcf variants_raw.vcf.gz

bcftools sort --max-mem 14000 --output-type u --output-file variants_raw_sorted.bcf
SNP next-gen bcftools • 1.3k views
Entering edit mode
16 months ago


It should definitely be possible to find a way to apply these filters - there is always a way. However, it is importat to know which tags are currently in your VCF / BCF. Judging by how you have called your variants, there will be only 1 or 2 tags included by default. You can add more by calling variants like this:

bcftools mpileup \
  --redo-BAQ \
  --min-BQ 30 \
  --per-sample-mF \
  -f Ref_1_20040804.fa ref_1.sorted.bam | \
  bcftools call \
    --multiallelic-caller \
    --variants-only \
    --skip-variants indels \
    -Ov > ref_1_snps.vcf ;

You can add more via a BCFtools plugin, as I show here: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

With AF (Allele Frequency) and AN (Allele Number) included, you have immediately added 2 more key components to your 'filtering' repertoire.


For other filtering, it may involve more complex, indirect ways, like I show here:


Finally, I presume that you looked through the Expressions section at the end of the bCFtoools manual - ?

Sorry that I cannot help with each individual filter query, but i hope that I've provided enough information such that you can get through them (and I notice that you posted this question 2 days ago, with no answer).


Entering edit mode


Really appreciate the time you took with your answer and the useful information therein. I was not familiar with the method to get that INFO field full of useful data. I will take a look at your linked answers and see what I can figure out.

Take care, Alex


Login before adding your answer.

Traffic: 848 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6