Question: Error building snpEff database "Transcript 'hypothetical_protein' already exists"
2.5 years ago, Lina F (Boston, MA) wrote:

Hi all,

I am trying to build a snpEff database but I'm running into the following error message:

java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.add(SnpEffPredictorFactory.java:135)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addMrna(SnpEffPredictorFactoryFeatures.java:183)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addFeatures(SnpEffPredictorFactoryFeatures.java:134)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:330)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)
java.lang.RuntimeException: Error reading file '/home/lina/snpEff/./data/my_organism/genes.gbk'
java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:344)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)

The file I am using for the database is a GenBank file that I downloaded from NCBI. It contains 12760 genes, and 2764 of them are annotated with product="hypothetical protein".

Based on the error message I assume having more than one gene labeled hypothetical protein is a problem for snpEff. However, I assume there must be many organisms where that is the case.

Does anyone have any insight into this?

Thanks!

~Lina


What is the accession number of this GenBank file?

(comment written 2.5 years ago by Pierre Lindenbaum)

it's GCA_001007165.2

(reply written 2.5 years ago by Lina F)
2.5 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

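It's because the name 'hypothetical protein' should be unique. Judging from the error message, snpEff turns it into the transcript ID 'hypothetical_protein', so the second feature with that product collides with the first.

You can try this (not tested). It appends the line number to each matching product name:

 awk  '($0 ~ /product="hypothetical protein"/) {gsub(/ein/,"ein"NR);} {print;}'  in.gb > fixed.gb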

Thank you, this worked! However, now it's finding other non-unique names... looks like I will have to run this fix a few times. Thanks!

(reply written 2.5 years ago by Lina F)
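
For anyone who hits the same "run the fix a few times" situation: the idea in the answer can be generalized so that every repeated product name is made unique in one go, rather than one name per run. The following is only a sketch, not something from the thread. The function name `make_products_unique` and the `_N` suffix scheme are made up for illustration, and it assumes each `/product="..."` qualifier sits on a single line (values wrapped across lines are left untouched) and that the new suffixed names do not clash with names already in the file.

```shell
# Hypothetical helper (not part of snpEff): print INPUT.gb with every
# repeated /product="..." value made unique by appending _1, _2, ...
# Usage: make_products_unique INPUT.gb > OUTPUT.gb
make_products_unique() {
    # The file is passed twice so awk can count names before rewriting them.
    awk '
      # Return the value of a single-line /product="..." qualifier, or "".
      # match() leaves RSTART and RLENGTH set for the caller.
      function product(line) {
          if (match(line, /\/product="[^"]*"/))
              return substr(line, RSTART + 10, RLENGTH - 11)
          return ""
      }
      # Pass 1 (first copy of the file): count how often each name occurs.
      NR == FNR {
          p = product($0)
          if (p != "") total[p]++
          next
      }
      # Pass 2 (second copy): number every name that occurs more than once.
      {
          p = product($0)
          if (p != "" && total[p] > 1) {
              seen[p]++
              $0 = substr($0, 1, RSTART - 1) "/product=\"" p "_" seen[p] "\"" \
                   substr($0, RSTART + RLENGTH)
          }
          print
      }
    ' "$1" "$1"
}
```

It would be called as `make_products_unique in.gb > fixed.gb`. Names that occur only once are not changed, so the edit to the annotation stays as small as possible. Whether snpEff accepts the result still needs to be checked by rebuilding the database.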