Question

SnpEff output summary for hundreds of vcfs

0

Entering edit mode

9.8 years ago

moorem ▴ 250

Hi all!

I was wondering whether anyone is aware of any existing tools to summarise snpEffs vcf output.

I have a few hundred vcf files from snpEff and want to know how often there is, for example a high impact*, mutation in the same gene. So the output would indicate in how many of my vcf files was there a high impact mutation.

* As per snpEffs definitions. Relevant to microbes: stop_gained, stop_lost, frameshift

Thanks in advance for any help with this!

SNP sequence next-gen-sequencing alignment • 5.2k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by moorem ▴ 250

Ram · Accepted Answer · 2015-10-12

4

Entering edit mode

9.8 years ago

Pierre Lindenbaum 166k

I use my tool "GroupByGene" https://github.com/lindenb/jvarkit/wiki/GroupByGene to group this kind of VCF information to a gene. All samples must be grouped in the same filtered VCF (filtered= keep "high impact" variants ).

ADD COMMENT • link 9.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

This looks really promising and an impressive suite of tools overall!

The issue I'm having with the output right now is I'm getting crazy numbers. I'll check I haven't done anything wrong with snpEff but otherwise have you seen this before:

java -jar ../dist-1.139/groupbygene.jar file1_snpEff.vcf file2_snpEff.vcf > out.txt

produces:

#chrom  min.POS max.POS gene.name       gene.type       samples.affected        count.variations        LES1    LES2
NC_002516      2832    13283   PA0006  snpeff-gene-name        2       134     86      93

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.8 years ago by moorem ▴ 250

0

Entering edit mode

what's crazy ? everytime you have a variation (synonymous, non-synonymous, etc...) in this gene, the number is incremented.

ADD REPLY • link 9.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

yes, I too want to know what "crazy" is - right now if someone asked I couldn't even put an order of magnitude on what "crazy" is

ADD REPLY • link 9.8 years ago by Istvan Albert 102k

0

Entering edit mode

I often don't use taht tool. If you suspect an error, please report an issue: https://github.com/lindenb/jvarkit/issues

ADD REPLY • link 9.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I suspect it's my issue, otherwise I will do, thanks again for your help!

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by moorem ▴ 250

0

Entering edit mode

The example gene is only 537bp so 86 or 93 would be a crazy amount between this organism and the reference. It must be an error with my data as biologically that is not what would be expected --certainly not for every single gene. The sum of the mutations reported-within genes is also greater than the number of mutations in the entire genome so this seems to be a fair assumption.

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by moorem ▴ 250

0

Entering edit mode

not agree: min.POS =2832 max.POS=13283 so length > 10kb

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

thanks, I'll check this!

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.7 years ago by moorem ▴ 250

0

Entering edit mode

Thanks for the reply!

I've interpreted count.variations, as you say, as counting all the variations in each gene. The total mutations present in all genes with at least one variation ends up as 297,882, but there are only 30,599 mutations overall.

Have I misunderstood? :/ I think there has to be an issue with the reference that I've used with snpEff.

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by moorem ▴ 250

0

Entering edit mode

could you give an example command line? I'm trying to look if my genes of interest are present in all of my VCF files. I have few hundreds of annotated VCF files.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.7 years ago by jan ▴ 170

0

Entering edit mode

Hi, I have an error when I run the program

2015-11-17 18:05:30 "Your input file has a malformed header: unexpected tag count 6 in line <ID=GERM,Number=1,Type=Integer,Description="Counts of donor occurs this mutation, total recorded donor number",Source=/opt/local/genomeinfo/qannotate/icgc_germline_qsnp_PUBLIC.vcf,FileDate=null>"
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: unexpected tag count 6 in line <ID=GERM,Number=1,Type=Integer,Description="Counts of donor occurs this mutation, total recorded donor number",Source=/opt/local/genomeinfo/qannotate/icgc_germline_qsnp_PUBLIC.vcf,FileDate=null

The annotation was done by in-house qSNP and qAnnonate. Is there a way I could fix this?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.7 years ago by jan ▴ 170