I am trying to subset SNPs from 32 cultivars. I have the following exclusion criteria:
- >20% missing data
- SNPs present in less than 50% of genotypes
- Each genotype must have at least 50% of included SNPs
- >=60% heterozygous calls
- Markers missing data across genotypes
- Bi-allelic SNPs only
- No very rare variants (i.e. MAF<0.05
With the help of the bcftools manual, I've managed to figure out how to subset only biallelic SNPs with a MAF of over 0.05, dealing with criteria 6 and 7.
bcftools view --output-type u --min-alleles 2 --max-alleles 2 \ --types snps --exclude MAF<0.05 --exclude ** --known variants_raw_sorted.bcf > SNP_filtered_sorted_kkf.bcf
** is, I am trying to implement the remainder my exclusion criteria, (criteria 1-5), however I am having a great deal of difficulty understanding how to use the expression in bcftools. I am looking to understand how to use the information in the info fields (AN,AC, AF/MAF, NS) to implement my exclusion criteria. Any help is very much appreciated.
Futher information : This is the upstream code that led to the bcf file.
fastp -i ACDC_R1.fastq.gz -I ACDC_R2.fastq.gz \ -o fastp_ACDC_R1P.fastq.gz fastp_ACDC_R2P.fastq.gz \ -R "ACDC_fastp_QC" -h "Fastp_ACDC" -j "ACDC_fastp" \ -e 20 -c -a "auto" -L -r \ -m --merged_out "ACDC_fastp_Merged" --out1 "ACDC_R1_fastp_Unmerged" --out2 "ACDC_R2_fastp_Unmerged" \ --unpaired1 "ACDC_R1Pass_R2Fail" --unpaired2 "ACDC_R2Pass_R1Fail" bowtie2 -p 20 -x cs10 -U ACDC_fastp_Merged.fastq.gz -S ACDC_aligned.sam samtools view -bS ACDC.sam > ACDC.bam bcftools mpileup --output-type u -f cs10.fasta -b bam_list.txt | bcftools call -vmO z -o all_raw_condition_1.vcf.gz --threads=20 bcftools view --output-type u --output-file variants_raw.bcf variants_raw.vcf.gz bcftools sort --max-mem 14000 --output-type u --output-file variants_raw_sorted.bcf