So I'm trying to use snpEff to annotate the effects of variants, on an organism without much public data, and I'm getting a lot of warnings that the reference does not match the genome. I've tried both freeBayes and GATK to call variants and get these warnings from snpEff in either case, despite using the same genome reference.
In the case of freeBayes, I'm running it on BAM files we made from our own sequencing data, like so:
freebayes --fasta-reference organism_123.fa */*Realigned.bam > variants.fb.vcf
and then snpEff:
java -jar snpEff.jar -v organism.123 variants.fb.vcf > fb.eff.vcf
The database organism.123 was one I generated with snpEff from a .gff file since there isn't a db available publicly. The .gff is gzipped as data/organism.123/genes.gff.gz and a copy of organism_123.fa was gzipped as data/genomes/organism.123.fa.gz. I made the db with:
java -jar snpEff.jar build -gff3 -v organism.123
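One sanity check worth sketching here is whether the gzipped FASTA that snpEff built from really is byte-identical to the FASTA given to freebayes. This is only a rough sketch: `same_fasta` is a made-up helper name, it assumes `cksum` and `gunzip` are on the PATH, and the paths are the ones mentioned above.

```shell
# Hypothetical helper (not part of snpEff or freebayes): succeeds only if
# the plain FASTA in $1 and the gzipped FASTA in $2 have identical contents.
same_fasta() {
    plain_sum=$(cksum < "$1")
    gz_sum=$(gunzip -c "$2" | cksum)
    [ "$plain_sum" = "$gz_sum" ]
}

# Paths as described in the post; the guard skips the check if they are absent.
if [ -f organism_123.fa ] && [ -f data/genomes/organism.123.fa.gz ]; then
    if same_fasta organism_123.fa data/genomes/organism.123.fa.gz; then
        echo "FASTA copies match"
    else
        echo "FASTA copies differ"
    fi
fi
```

If the two differ, the copy in data/genomes is probably stale or came from a different assembly version, which would be enough on its own to explain reference mismatches.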
I get thousands of the ref-does-not-match-genome warnings in snpEff's output, along with an order of magnitude more no-start-codon warnings. The latter could mean my .gff is bad somehow, but that couldn't cause the former warnings, could it? FreeBayes never even saw that file. Anything obvious I'm doing wrong here?
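One way the .gff could matter for the reference itself: the GFF3 format allows sequences to be embedded in the annotation file after a `##FASTA` directive. Whether snpEff would prefer such an embedded sequence over the separate FASTA in my setup is an assumption I haven't verified, but checking for the directive is cheap. A rough sketch, with `gff_has_fasta` being a made-up helper name:

```shell
# Hypothetical helper: succeeds if the gzipped GFF3 file in $1 contains a
# "##FASTA" directive, i.e. it embeds sequence data alongside the features.
gff_has_fasta() {
    count=$(gunzip -c "$1" | grep -c '^##FASTA')
    [ "$count" -gt 0 ]
}

# Path as described in the post; the guard skips the check if it is absent.
if [ -f data/organism.123/genes.gff.gz ]; then
    if gff_has_fasta data/organism.123/genes.gff.gz; then
        echo "GFF embeds its own sequences"
    else
        echo "GFF has no embedded sequences"
    fi
fi
```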
It might be informative to see what snpEff thinks the reference is in these cases, but I don't see that in the annotations it produces.
Yes that dump command is very handy - should have thought of it. Has helped me out on some occasions.
Thanks for letting us know about the solution.