Question

vcftools returns no vcf output after filtering

0

Entering edit mode

8.6 years ago

wes3985 ▴ 10

I am new to analysing NGS data but have managed to get from fastq files to vcf files, however I can't seem to get the vcftools package to output a filtered vcf file, the manual here shows an example at the top with the recode tag in the format that I have used, here is my code:

vcftools --gzvcf $vcf_files/g14_variants_Q20.vcf.gz --bed $truseq --gzdiff $dbSNP --diff-indv-discordance --recode --recode-INFO-all --out $vcf_files/g14_filtered

here is the error I get with this line:

Error: Only one output function may be called.

The word 'recode' is there twice deliberatelty as that is what the manual suggests as an example. I have tried this is several ways as suggested by the manual examples, with 1 recode tag gives an output but no vcf file:

vcftools --gzvcf $vcf_files/g14_variants_Q20.vcf.gz --bed $truseq --gzdiff $dbSNP --diff-indv-discordance --recode-INFO-all --stdout | gzip -c > $vcf_files/g14_filtered.vcf.gz

the log file says:

Outputting Discordance By Individual...
        Read 213847 BED file entries.
Found 0 sites common to both files.
Found 33145 sites only in main file.
Found 0 sites only in second file.
After filtering, kept 33145 out of a possible 562965 Sites
Run Time = 210.00 seconds

I double checked how many variants were in each file using:

vcftools --gzvcf $vcf_files//g14_variants_Q20.vcf.gz

returns 562965 variants

vcftools --gzvcf $dbSNP

returns 36838812 variants (this is the current dnSNP archive in compressed vcf format)

I have made the code simpler by removing the bed file and tag in an attemtp to just get a vcf file produced but still nothing. the log file says the same thing for each (except the BED file missing for the attempts where I havent used it).

The log says 0 sites common to both files, being that this is a human exome I find it surprizing, its also a bit strange that it says 0 sites found only in the second file (the dbSNP vcf file). I downloaded it directly from here. This seems odd behavior given that the standard vcftools command returns a large number of variants (shown above).

One thing I had noticed is that the out.diff.indv file points to the BAM linked to the sample file, its pointing to the correct place for the sample file but the dbSNP file obviously has no accompanying BAM file - could this be what is confusing vcftools?

Is there something I am missing? Or alternatively is there a better way to filter common variation from my sample_vcf file and keep the filtered calls that are within the bed file (the target regions of the exome capture kit I used).

Many thanks

sequencing software error snp next-gen • 6.6k views

ADD COMMENT • link 8.6 years ago by wes3985 ▴ 10

1

Entering edit mode

Could you check if your bed and vcf file use the same chromosome notation? e.g not one using 'chr6' and the other '6'.

ADD REPLY • link 8.6 years ago by WouterDeCoster 48k

0

Entering edit mode

You're right, there is a discrepancy, anyone else stuck with this problem here is the python script I used to fix it:

with open('dbSNP.vcf', 'r') as r:
dbSNP_in = [line.split('\t') for line in r]

active = False
active2 = False
with open('dbSNP_converted.vcf','w') as w:
    for i in range(len(dbSNP_in)):
        if active:
            active2=True
        if not active2:
            w.write('\t'.join(dbSNP_in[i]))
        if dbSNP_in[i][0] == '#CHROM':
            active=True
        if active2:
            chrom='chr'+dbSNP_in[i][0]
            out = [chrom] + dbSNP_in[i][1:]
            w.write('\t'.join(out))

ADD REPLY • link 8.6 years ago by wes3985 ▴ 10

0

Entering edit mode

I would suggest something like:

cat <(grep '^#' dbSNP.vcf) <(grep -v '^#' dbSNP.vcf | sed 's/^/chr/g') > dbSNP_converted.vcf

I love python, but for a lot of jobs it's not the most optimal tool.

ADD REPLY • link 8.6 years ago by WouterDeCoster 48k