Question

Tutorial:How to extract and filter SNP data from the genotyping-by-sequencing (GBS) data in vcf format using bcftools

1

Entering edit mode

4.2 years ago

Asad Prodhan ▴ 10

A 'bcftools' script for:

Extracting SNP data from GBS data in vcf file format
Filtering out raw SNPs to a usable set of SNPs

I. Extracting the headers:

bcftools view -h  input_file.vcf.gz | awk '/CHROM/{print}' | cut -f 1-5,10-  > output_file_header.txt

h, header; 'awk' stands for "Aho, Weinberger, Kernighan (authors)" which is a printing and row-based filtering function, ‘awk’ prints the output; ‘cut’ selects columns

II. Extracting the SNPs from the GBS data:

bcftools query -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | \
  awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}'  > output_file_without_filter.txt

'f', format; '%CHROM', 'CHROM' column; '\t' tab separated; '\n' new line; gsub is a function to substitute (sub) something from everywhere in the file (global)- global substitution (gsub); '0', reference allele; '1', alternative allele; 'A', Parent 1 (reference parent in the alignment) and 'B', Parent 2 (this is an example of bi-parental breeding population)

Count the row numbers to keep tract of filtering outcomes later on:

awk 'END { print NR }'  output_file _auto_without_filter.txt

Total number of rows: XXX,XXX

III. Filtering out the extracted SNPs:

There are several filtering options
Filtering can be done separately using an individual option to see the outcome of the filter
Or, filtering can be done using all the options together using '&' in the command line

Filter 1A: Filtering against read depth (DP):

bcftools query -i'FMT/DP>10' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  
input_file.vcf.gz |\
  awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}'  > output_file_filter1A_DP.txt

'i', include; 'FMT', Format; 'DP', read depth; 'GQ', Genotype quality

Count the row numbers:

awk 'END { print NR }'  output_file_filter1A_DP.txt

Total number of rows: XXX

Filter 1B: Filtering against genotype quality (GQ):

bcftools query -i'FMT/GQ>20' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  
input_file.vcf.gz |\
  awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}'  > output_file_filter1B_GQ.txt

Count the row numbers:

awk 'END { print NR }'  output_file_filter1B_GQ.txt

Total number of rows: XXX

Filter 1C: Filtering against minor allele frequency (MAF):

bcftools query -i'MAF>=0.05' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  
input_file.vcf.gz |\
awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}'  > output_file_filter1C_MAF.txt

Count the row numbers:

awk 'END { print NR }'  output_file_filter1C_MAF.txt

Total number of rows: XXX

Filter 1D: Filtering against missing genotypes (F_MISSING):

bcftools query -i'F_MISSING<=0.1' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  
input_file.vcf.gz |\
  awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}'  > output_file_filter1D_F_MISSING.txt

Count the row numbers:

awk 'END { print NR }'  output_file_filter1D_F_MISSING.txt

Total number of rows: XXX

Filter 1E: Filtering against DP and GQ:

bcftools query -i'FMT/DP>10 & FMT/GQ>20' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  
input_file.vcf.gz |\
  awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}'  > output_file_filter1E_DP_GQ.txt

'&' enforces filters within a sample; '&&' enforces filters but not necessarily within a sample

Count the row numbers:

awk 'END { print NR }'  output_file_filter1E_DP_GQ.txt

Total number of rows: XXX

Filter 1F: Filtering against DP, GQ, MAF and F_MISSING:

bcftools query -i'FMT/DP>10 & FMT/GQ>20 & MAF>=0.05 & F_MISSING<=0.1' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  
input_file.vcf.gz |\
  awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}'  > output_file_filter1F_DP_GQ_MAF_F_MISSING.txt

Count the row numbers:

awk 'END { print NR }'  output_file_filter1F_DP_GQ_MAF_F_MISSING.txt

Total number of rows: XXX

Filter 2A: Remove the uncalled, monomorphic, and heterozygous sites from Parent 1 i.e keep only A:

Manually find the column number for Parent 1 in the 'header' text file and confirm:

awk '{print $8}' output_file_header.txt

Return: Parent 1

So, Parent 1 column number is confirmed to be $8; use this in the filtering command below

Filtering:

awk '{ if($8 == "A") print $0;}' output_file_filter1F_DP_GQ_MAF_F_MISSING.txt > output_file_filter2A_clean_Parent 1.txt

Count the row numbers:

awk 'END { print NR }'  output_file_filter2A_clean_Parent 1.txt

Total number of rows: XXX

Filter 3A: Remove the uncalled, monomorphic, and heterozygous sites from Parent 2 i.e keep only B:

Find the column number for Parent 2:

awk '{print $103}' output_file_header.txt

Returns: Parent 2

So, Parent 2 column number is $103

Filtering:

awk '{ if($103 == "B") print $0;}'  output_file_filter2A_clean_Parent 1.txt  > output_file_filter3A_clean_Parent 2.txt

Count the row numbers:

awk 'END { print NR }'  output_file_filter3A_clean_Parent 2.txt

Total number of rows: XXX

bcftools SNP-Filtering VCF • 3.1k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 4.2 years ago by Asad Prodhan ▴ 10

1

Entering edit mode

Hi Asad Prodhan,

thank you for your effort. In order to make the code well-readable please use the code option to highlight your code/data examples/error messages with markdown

enter image description here

ADD REPLY • link 4.2 years ago by ATpoint 81k