Question: Error building snpEff database "Transcript 'hypothetical_protein' already exists"
Lina F (Boston, MA) wrote, 17 months ago:

Hi all,

I am trying to build a snpEff database but I'm running into the following error message:

java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.add(SnpEffPredictorFactory.java:135)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addMrna(SnpEffPredictorFactoryFeatures.java:183)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addFeatures(SnpEffPredictorFactoryFeatures.java:134)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:330)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)
java.lang.RuntimeException: Error reading file '/home/lina/snpEff/./data/my_organism/genes.gbk'
java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:344)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)

The file I am using for the database is a GenBank file that I downloaded from NCBI. It contains 12760 genes, and 2764 of them are annotated with product="hypothetical protein".

Based on the error message, I assume that having more than one gene labeled "hypothetical protein" is a problem for snpEff. However, I assume there must be many organisms where that is the case.

Does anyone have any insight into this?

Thanks!

~Lina

Pierre Lindenbaum commented, 17 months ago:

What is the accession number of this GenBank file?

Lina F replied, 17 months ago:

It's GCA_001007165.2

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 17 months ago:

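It's because the name 'hypothetical protein' has to be unique for each gene.

You can try the following (not tested). It appends the line number to every "hypothetical protein" product so that each name becomes unique:

 awk '/product="hypothetical protein"/ {sub(/hypothetical protein/, "hypothetical protein" NR)} {print}' in.gb > fixed.gb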

Lina F commented, 17 months ago:

Thank you, this worked! However, now it's finding other non-unique names... looks like I will have to run this fix a few times. Thanks!
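
Since other product names can repeat as well, numbering every duplicated /product value in a single pass may avoid re-running the fix once per name. Below is a minimal Python sketch of that idea, not tested against snpEff. It assumes that snpEff derives the transcript name from the /product qualifier (as the error message suggests) and that each /product value fits on one line; wrapped values are left untouched. The function name `uniquify_products` is hypothetical.

```python
import re
from collections import Counter

# Matches a GenBank qualifier line such as:   /product="hypothetical protein"
# Assumption: the whole quoted value sits on a single line.
PRODUCT_RE = re.compile(r'^(\s*/product=")([^"]*)("\s*)$')


def uniquify_products(lines):
    """Return a copy of lines where every repeated /product value gets a numeric suffix."""
    lines = list(lines)

    # First pass: count how often each product value occurs.
    totals = Counter()
    for line in lines:
        m = PRODUCT_RE.match(line)
        if m:
            totals[m.group(2)] += 1

    # Second pass: number only the values that occur more than once.
    seen = Counter()
    out = []
    for line in lines:
        m = PRODUCT_RE.match(line)
        if m and totals[m.group(2)] > 1:
            name = m.group(2)
            seen[name] += 1
            line = f"{m.group(1)}{name} {seen[name]}{m.group(3)}"
        out.append(line)
    return out


# Small in-memory demonstration (made-up feature lines, not from the real file).
sample = [
    '     CDS             1..300\n',
    '                     /product="hypothetical protein"\n',
    '     CDS             400..700\n',
    '                     /product="DNA polymerase"\n',
    '     CDS             800..1100\n',
    '                     /product="hypothetical protein"\n',
]
print("".join(uniquify_products(sample)), end="")
```

To apply it to a real file, read the lines of in.gb, pass them to the function, and write the returned lines to a new file before rebuilding the database. Names that occur only once are left unchanged, so the annotation stays as close to the NCBI original as possible.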