I am new to analysing NGS data but have managed to get from fastq files to vcf files, however I can't seem to get the vcftools package to output a filtered vcf file, the manual here shows an example at the top with the recode tag in the format that I have used, here is my code:
vcftools --gzvcf $vcf_files/g14_variants_Q20.vcf.gz --bed $truseq --gzdiff $dbSNP --diff-indv-discordance --recode --recode-INFO-all --out $vcf_files/g14_filtered
here is the error I get with this line:
Error: Only one output function may be called.
The word 'recode' is there twice deliberatelty as that is what the manual suggests as an example. I have tried this is several ways as suggested by the manual examples, with 1 recode tag gives an output but no vcf file:
vcftools --gzvcf $vcf_files/g14_variants_Q20.vcf.gz --bed $truseq --gzdiff $dbSNP --diff-indv-discordance --recode-INFO-all --stdout | gzip -c > $vcf_files/g14_filtered.vcf.gz
the log file says:
Outputting Discordance By Individual... Read 213847 BED file entries. Found 0 sites common to both files. Found 33145 sites only in main file. Found 0 sites only in second file. After filtering, kept 33145 out of a possible 562965 Sites Run Time = 210.00 seconds
I double checked how many variants were in each file using:
vcftools --gzvcf $vcf_files//g14_variants_Q20.vcf.gz
returns 562965 variants
vcftools --gzvcf $dbSNP
returns 36838812 variants (this is the current dnSNP archive in compressed vcf format)
I have made the code simpler by removing the bed file and tag in an attemtp to just get a vcf file produced but still nothing. the log file says the same thing for each (except the BED file missing for the attempts where I havent used it).
The log says 0 sites common to both files, being that this is a human exome I find it surprizing, its also a bit strange that it says 0 sites found only in the second file (the dbSNP vcf file). I downloaded it directly from here. This seems odd behavior given that the standard vcftools command returns a large number of variants (shown above).
One thing I had noticed is that the out.diff.indv file points to the BAM linked to the sample file, its pointing to the correct place for the sample file but the dbSNP file obviously has no accompanying BAM file - could this be what is confusing vcftools?
Is there something I am missing? Or alternatively is there a better way to filter common variation from my sample_vcf file and keep the filtered calls that are within the bed file (the target regions of the exome capture kit I used).