A 'bcftools' script for:
- Extracting SNP data from GBS data in vcf file format
- Filtering out raw SNPs to a usable set of SNPs
I. Extracting the headers:
bcftools view -h input_file.vcf.gz | awk '/CHROM/{print}' | cut -f 1-5,10- > output_file_header.txt
h, header; 'awk' stands for "Aho, Weinberger, Kernighan (authors)" which is a printing and row-based filtering function, ‘awk’ prints the output; ‘cut’ selects columns
II. Extracting the SNPs from the GBS data:
bcftools query -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' input_file.vcf.gz | \
awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_without_filter.txt
'f', format; '%CHROM', 'CHROM' column; '\t' tab separated; '\n' new line; gsub
is a function to substitute (sub) something from everywhere in the file (global)- global substitution (gsub); '0', reference allele; '1', alternative allele; 'A', Parent 1 (reference parent in the alignment) and 'B', Parent 2 (this is an example of bi-parental breeding population)
Count the row numbers to keep tract of filtering outcomes later on:
awk 'END { print NR }' output_file _auto_without_filter.txt
Total number of rows: XXX,XXX
III. Filtering out the extracted SNPs:
- There are several filtering options
- Filtering can be done separately using an individual option to see the outcome of the filter
- Or, filtering can be done using all the options together using '&' in the command line
Filter 1A: Filtering against read depth (DP):
bcftools query -i'FMT/DP>10' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'
input_file.vcf.gz |\
awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1A_DP.txt
'i', include; 'FMT', Format; 'DP', read depth; 'GQ', Genotype quality
Count the row numbers:
awk 'END { print NR }' output_file_filter1A_DP.txt
Total number of rows: XXX
Filter 1B: Filtering against genotype quality (GQ):
bcftools query -i'FMT/GQ>20' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'
input_file.vcf.gz |\
awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1B_GQ.txt
Count the row numbers:
awk 'END { print NR }' output_file_filter1B_GQ.txt
Total number of rows: XXX
Filter 1C: Filtering against minor allele frequency (MAF):
bcftools query -i'MAF>=0.05' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'
input_file.vcf.gz |\
awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1C_MAF.txt
Count the row numbers:
awk 'END { print NR }' output_file_filter1C_MAF.txt
Total number of rows: XXX
Filter 1D: Filtering against missing genotypes (F_MISSING):
bcftools query -i'F_MISSING<=0.1' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'
input_file.vcf.gz |\
awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1D_F_MISSING.txt
Count the row numbers:
awk 'END { print NR }' output_file_filter1D_F_MISSING.txt
Total number of rows: XXX
Filter 1E: Filtering against DP and GQ:
bcftools query -i'FMT/DP>10 & FMT/GQ>20' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'
input_file.vcf.gz |\
awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1E_DP_GQ.txt
'&' enforces filters within a sample; '&&' enforces filters but not necessarily within a sample
Count the row numbers:
awk 'END { print NR }' output_file_filter1E_DP_GQ.txt
Total number of rows: XXX
Filter 1F: Filtering against DP, GQ, MAF and F_MISSING:
bcftools query -i'FMT/DP>10 & FMT/GQ>20 & MAF>=0.05 & F_MISSING<=0.1' -f '%CHROM:%POS\t%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'
input_file.vcf.gz |\
awk '{gsub ("0/0","A");gsub ("0/1","H");gsub ("1/1","B");gsub ("./.","NA");print}' > output_file_filter1F_DP_GQ_MAF_F_MISSING.txt
Count the row numbers:
awk 'END { print NR }' output_file_filter1F_DP_GQ_MAF_F_MISSING.txt
Total number of rows: XXX
Filter 2A: Remove the uncalled, monomorphic, and heterozygous sites from Parent 1 i.e keep only A:
Manually find the column number for Parent 1 in the 'header' text file and confirm:
awk '{print $8}' output_file_header.txt
Return: Parent 1
So, Parent 1 column number is confirmed to be $8; use this in the filtering command below
Filtering:
awk '{ if($8 == "A") print $0;}' output_file_filter1F_DP_GQ_MAF_F_MISSING.txt > output_file_filter2A_clean_Parent 1.txt
Count the row numbers:
awk 'END { print NR }' output_file_filter2A_clean_Parent 1.txt
Total number of rows: XXX
Filter 3A: Remove the uncalled, monomorphic, and heterozygous sites from Parent 2 i.e keep only B:
Find the column number for Parent 2:
awk '{print $103}' output_file_header.txt
Returns: Parent 2
So, Parent 2 column number is $103
Filtering:
awk '{ if($103 == "B") print $0;}' output_file_filter2A_clean_Parent 1.txt > output_file_filter3A_clean_Parent 2.txt
Count the row numbers:
awk 'END { print NR }' output_file_filter3A_clean_Parent 2.txt
Total number of rows: XXX
Hi Asad Prodhan,
thank you for your effort. In order to make the code well-readable please use the code option to highlight your code/data examples/error messages with markdown