Tutorial: How to extract and filter SNP data from the genotyping-by-sequencing (GBS) data in vcf format using bcftools
0
gravatar for Asad Prodhan
3 months ago by
Australia Perth
Asad Prodhan0 wrote:

A 'bcftools' script for:

 Extracting SNP data from GBS data in vcf file format

 Filtering out raw SNPs to a usable set of SNPs

I. Extracting the headers:

bcftools view -h input_file.vcf.gz | awk '/CHROM/{print}' | cut -f 1-5,10- > output_file_header.txt

h, headrer; ‘awk’ stands for “Aho, Weinberger, Kernighan (authors)” which is a printing and row-based filtering function, ‘awk’ prints the output; ‘cut’ selects columns

II. Extracting the SNPs from the GBS data:

bcftools query -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_without_filter.txt

‘f’, format; ‘%CHROM’, ‘CHROM’ column; ‘\t’ tab separated; ‘\n’ new line; ‘gsub’ is a function to substitute (sub) something from everywhere in the file (global)- global substitution (gsub); '0', reference allele; '1', alternative allele; 'A', Parent 1 (reference parent in the alignment) and 'B', Parent 2 (this is an example of bi-parental breeding population)

Count the row numbers to keep tract of filtering outcomes later on:

awk 'END { print NR }' output_file _auto_without_filter.txt

Total number of rows: XXX,XXX

III. Filtering out the extracted SNPs:

• There are several filtering options

• Filtering can be done separately using an individual option to see the outcome of the filter

• Or, filtering can be done using all the options together using '&' in the command line

Filter 1A: Filtering against read depth (DP):

bcftools query -i'FMT/DP>10' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1A_DP.txt

‘i’, include; ‘FMT’, Format; ‘DP’, read depth; ‘GQ’, Genotype quality

Count the row numbers:

awk 'END { print NR }' output_file_filter1A_DP.txt

Total number of rows: XXX

Filter 1B: Filtering against genotype quality (GQ):

bcftools query -i'FMT/GQ>20' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1B_GQ.txt

Count the row numbers:

awk 'END { print NR }' output_file_filter1B_GQ.txt

Total number of rows: XXX

Filter 1C: Filtering against minor allele frequency (MAF):

bcftools query -i'MAF>=0.05' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1C_MAF.txt

Count the row numbers:

awk 'END { print NR }' output_file_filter1C_MAF.txt

Total number of rows: XXX

Filter 1D: Filtering against missing genotypes (F_MISSING):

bcftools query -i'F_MISSING<=0.1' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1D_F_MISSING.txt

Count the row numbers:

awk 'END { print NR }' output_file_filter1D_F_MISSING.txt

Total number of rows: XXX

Filter 1E: Filtering against DP and GQ:

bcftools query -i'FMT/DP>10 & FMT/GQ>20' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1E_DP_GQ.txt

‘&’ enforces filters within a sample; ‘&&’ enforces filters but not necessarily within a sample

Count the row numbers:

awk 'END { print NR }' output_file_filter1E_DP_GQ.txt

Total number of rows: XXX

Filter 1F: Filtering against DP, GQ, MAF and F_MISSING:

bcftools query -i'FMT/DP>10 & FMT/GQ>20 & MAF>=0.05 & F_MISSING<=0.1' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1F_DP_GQ_MAF_F_MISSING.txt

Count the row numbers:

awk 'END { print NR }' output_file_filter1F_DP_GQ_MAF_F_MISSING.txt

Total number of rows: XXX

Filter 2A: Remove the uncalled, monomorphic, and heterozygous sites from Parent 1 i.e keep only A:

Manually find the column number for Parent 1 in the 'header' text file and confirm:

awk '{print $8}' output_file_header.txt

Return: Parent 1

So, Parent 1 column number is confirmed to be $8; use this in the filtering command below

Filtering:

awk '{ if($8 == "A") print $0;}' output_file_filter1F_DP_GQ_MAF_F_MISSING.txt > output_file_filter2A_clean_Parent 1.txt

Count the row numbers:

awk 'END { print NR }' output_file_filter2A_clean_Parent 1.txt

Total number of rows: XXX

Filter 3A: Remove the uncalled, monomorphic, and heterozygous sites from Parent 2 i.e keep only B:

Find the column number for Parent 2:

awk '{print $103}' output_file_header.txt

Returns: Parent 2

So, Parent 2 column number is $103

Filtering:

awk '{ if($103 == "B") print $0;}' output_file_filter2A_clean_Parent 1.txt > output_file_filter3A_clean_Parent 2.txt

Count the row numbers:

awk 'END { print NR }' output_file_filter3A_clean_Parent 2.txt

Total number of rows: XXX


ADD COMMENTlink modified 3 months ago • written 3 months ago by Asad Prodhan0
1

Hi Asad Prodhan,

thank you for your effort. In order to make the code well-readable please use the code option to highlight your code/data examples/error messages with markdown

enter image description here

ADD REPLYlink modified 3 months ago • written 3 months ago by ATpoint35k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1262 users visited in the last hour