SNP subsetting using bcftools
1
0
Entering edit mode
4.1 years ago
alexcullsci ▴ 10

I am trying to subset SNPs from 32 cultivars. I have the following exclusion criteria:

  1. >20% missing data
  2. SNPs present in less than 50% of genotypes
  3. Each genotype must have at least 50% of included SNPs
  4. >=60% heterozygous calls
  5. Markers missing data across genotypes
  6. Bi-allelic SNPs only
  7. No very rare variants (i.e. MAF<0.05

With the help of the bcftools manual, I've managed to figure out how to subset only biallelic SNPs with a MAF of over 0.05, dealing with criteria 6 and 7.

bcftools view --output-type u --min-alleles 2 --max-alleles 2 \
--types snps --exclude MAF[0]<0.05 --exclude **   --known  variants_raw_sorted.bcf > SNP_filtered_sorted_kkf.bcf

Where the ** is, I am trying to implement the remainder my exclusion criteria, (criteria 1-5), however I am having a great deal of difficulty understanding how to use the expression in bcftools. I am looking to understand how to use the information in the info fields (AN,AC, AF/MAF, NS) to implement my exclusion criteria. Any help is very much appreciated.


Futher information : This is the upstream code that led to the bcf file.

fastp -i ACDC_R1.fastq.gz -I ACDC_R2.fastq.gz \
    -o fastp_ACDC_R1P.fastq.gz fastp_ACDC_R2P.fastq.gz \
    -R "ACDC_fastp_QC" -h "Fastp_ACDC" -j "ACDC_fastp" \
    -e 20 -c -a "auto" -L -r \
    -m --merged_out "ACDC_fastp_Merged" --out1 "ACDC_R1_fastp_Unmerged" --out2 "ACDC_R2_fastp_Unmerged" \
    --unpaired1 "ACDC_R1Pass_R2Fail" --unpaired2 "ACDC_R2Pass_R1Fail"

bowtie2 -p 20 -x cs10 -U ACDC_fastp_Merged.fastq.gz -S ACDC_aligned.sam

samtools view -bS ACDC.sam > ACDC.bam

bcftools mpileup --output-type u -f cs10.fasta -b bam_list.txt | bcftools call -vmO z -o all_raw_condition_1.vcf.gz --threads=20

bcftools view --output-type u --output-file variants_raw.bcf variants_raw.vcf.gz

bcftools sort --max-mem 14000 --output-type u --output-file variants_raw_sorted.bcf
SNP next-gen bcftools • 3.8k views
ADD COMMENT
1
Entering edit mode
4.1 years ago

Hey,

It should definitely be possible to find a way to apply these filters - there is always a way. However, it is importat to know which tags are currently in your VCF / BCF. Judging by how you have called your variants, there will be only 1 or 2 tags included by default. You can add more by calling variants like this:

bcftools mpileup \
  --redo-BAQ \
  --min-BQ 30 \
  --per-sample-mF \
  --annotate FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR \
  -f Ref_1_20040804.fa ref_1.sorted.bam | \
  bcftools call \
    --multiallelic-caller \
    --variants-only \
    --skip-variants indels \
    -Ov > ref_1_snps.vcf ;

You can add more via a BCFtools plugin, as I show here: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

With AF (Allele Frequency) and AN (Allele Number) included, you have immediately added 2 more key components to your 'filtering' repertoire.

---------------

For other filtering, it may involve more complex, indirect ways, like I show here:

--------------

Finally, I presume that you looked through the Expressions section at the end of the bCFtoools manual - https://samtools.github.io/bcftools/bcftools.html ?

Sorry that I cannot help with each individual filter query, but i hope that I've provided enough information such that you can get through them (and I notice that you posted this question 2 days ago, with no answer).

Kevin

ADD COMMENT
1
Entering edit mode

Kevin,

Really appreciate the time you took with your answer and the useful information therein. I was not familiar with the method to get that INFO field full of useful data. I will take a look at your linked answers and see what I can figure out.

Take care, Alex

ADD REPLY

Login before adding your answer.

Traffic: 2281 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6