Question: Filtering results difference amongst VCF files
0
gravatar for prasundutta87
14 months ago by
prasundutta87360
prasundutta87360 wrote:

Hi,

I have performed a multisample variant calling on 81 non-human mammal samples using GATK. I wanted to do allele-specific expression analysis of 4 samples out of the 81 for which I have RNAseq data. For doing that, I selected only biallelic heterozygous variants from those 4 samples from the multisample VCF file to make separate VCFs for each sample and filtered variants whose GQ (Genotype quality) Phred score < 40.

When I compared the statistics of the remaining variants, the number of variants remaining in chromosome 1 of one sample was in the range of 300000 to 400000, but the other samples had >500000 variants remaining in chromosome 1.

Can there be a biological explanation to this? What all additional analyses can be done to explain this difference? Kindly let me know if any more information is needed for this question.

variants filtering snp gatk • 582 views
ADD COMMENTlink written 14 months ago by prasundutta87360

Yes, it may help to show your sequence of commands.

ADD REPLYlink written 14 months ago by Kevin Blighe52k

Ok. So, after obtaining the multisample VCF file after adapting the GATK multisample variant calling pipeline, only biallelic SNPs were obtained using this command:

bcftools view -v snps -m2 -M2 --threads 8 <multisample_VCF_file>|bgzip --threads 8 ><biallelic_gzipped_VCF>

Then I performed hard-filtering on the multisample VCF file to remove potential false positives (cannot perform soft-filtering due to the absence of truth sets)

bcftools filter -i '<GATK recommended hard-filtering parameters>' --threads 8 <biallelic_gzipped_VCF>|bgzip --threads 8 > <biallelic_filtered_gzipped_VCF>

Variant annotation was done after this:

java -jar snpEff.jar -v <reference_genome> <biallelic_filtered_gzipped_VCF>|bgzip ><biallelic_filtered_gzipped_VCF_snpeff>

Sample-specific VCFs obtained using the following command for 4 samples:

java -jar gatk-package-4.0.4.0-local.jar SelectVariants -R <reference_genome> -V <biallelic_filtered_gzipped_VCF_snpeff> -O <Sample_1_biallelic_filtered_gzipped_VCF_snpeff> -sn <Sample_1>

Heterozygous variants from the 4 samples with GQ >=40 was obtained using this command:

bcftools view -g het -i 'GQ >= 40' <Sample_1_biallelic_filtered_gzipped_VCF_snpeff> > <Sample_1_biallelic_het_GQ40.vcf>

I hope this is ok. Kindly let me know if any more information is needed.

P.S. bcftools version: 1.6, GATK version: 4.0.4.0, the 4 samples are from the same species and breed

ADD REPLYlink modified 14 months ago • written 14 months ago by prasundutta87360

Just a hint for the bcftools command: There is no need to pipe to bgzip. Just use -o to define the output name. bcftoolsrecognize the file extension and will compress on its own. Also you can use -O to define explicit how the output should looke like.

fin swimmer

ADD REPLYlink written 14 months ago by finswimmer13k

Thanks a lot for the hint! :-)

ADD REPLYlink written 14 months ago by prasundutta87360

Hello,

have you checked if there are more low depth regions for the sample with less remaining variants? Thsi could cause that there were less variants called.

fin swimmer

ADD REPLYlink written 14 months ago by finswimmer13k

Just checked that after running 'samtools depth' with -a option on only chromosome 1 for all 4 samples. There are more low depth regions for the sample with less remaining variants. There were 39118 more loci with 0 depth than the sample having the second highest number of loci with 0 depth. It was difficult for me to make a distribution graph for each depth file as the files are huge. So, I just counted the number for '0' for each depth file using 'grep'.

ADD REPLYlink modified 14 months ago • written 14 months ago by prasundutta87360
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1007 users visited in the last hour