Question: Error building snpEff database "Transcript 'hypothetical_protein' already exists"
2.2 years ago, Lina F (Boston, MA) wrote:

Hi all,

I am trying to build a snpEff database but I'm running into the following error message:

java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.add(SnpEffPredictorFactory.java:135)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addMrna(SnpEffPredictorFactoryFeatures.java:183)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addFeatures(SnpEffPredictorFactoryFeatures.java:134)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:330)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)
java.lang.RuntimeException: Error reading file '/home/lina/snpEff/./data/my_organism/genes.gbk'
java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:344)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)

The file I am using for the database is a GenBank file that I downloaded from NCBI. It contains 12760 genes, and 2764 of them are annotated with product="hypothetical protein".

Based on the error message I assume having more than one gene labeled hypothetical protein is a problem for snpEff. However, I assume there must be many organisms where that is the case.

Does anyone have any insight into this?

Thanks!

~Lina


What is the accession number of this GenBank file?

Reply written 2.2 years ago by Pierre Lindenbaum

It's GCA_001007165.2.

Reply written 2.2 years ago by Lina F
2.2 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

It's because the name of the gene, 'hypothetical protein', should be unique.

You can try this (not tested):

 awk '/product="hypothetical protein"/ {sub(/hypothetical protein/, "&" NR)} {print}' in.gb > fixed.gb

Thank you, this worked! However, now it's finding other non-unique names... looks like I will have to run this fix a few times. Thanks!

Reply written 2.2 years ago by Lina F
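
Since other product names turned out to be repeated as well, a one-pass variant of the awk fix might save re-running it once per name. The following is only a minimal sketch, not tested against snpEff itself. It assumes that snpEff takes the transcript name from the /product qualifier (as the error message suggests) and that every /product value fits on a single line; values wrapped across lines are left untouched. The function name `uniquify_products` and the `_N` suffix format are made up for illustration.

```shell
# Hypothetical helper: keep the first occurrence of each /product value as-is
# and append _2, _3, ... to every later occurrence of the same value.
uniquify_products() {
  awk '
    match($0, /\/product="[^"]+"/) {
      # The prefix /product=" is 10 characters long, so the name starts at
      # RSTART + 10 and excludes the closing quote.
      name = substr($0, RSTART + 10, RLENGTH - 11)
      n = ++seen[name]
      if (n > 1) {
        # Insert the counter just before the closing quote.
        $0 = substr($0, 1, RSTART + RLENGTH - 2) "_" n substr($0, RSTART + RLENGTH - 1)
      }
    }
    { print }
  ' "$1"
}
```

With the file names used above, it would be called as `uniquify_products in.gb > fixed.gb`. Because the counter is tracked per name, it should cover 'hypothetical protein' and any other repeated product in the same run.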