Question: Error building snpEff database "Transcript 'hypothetical_protein' already exists"
Lina F (Boston, MA) wrote 5 months ago:

Hi all,

I am trying to build a snpEff database but I'm running into the following error message:

java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.add(SnpEffPredictorFactory.java:135)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addMrna(SnpEffPredictorFactoryFeatures.java:183)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addFeatures(SnpEffPredictorFactoryFeatures.java:134)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:330)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)
java.lang.RuntimeException: Error reading file '/home/lina/snpEff/./data/my_organism/genes.gbk'
java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:344)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)

The file I am using for the database is a GenBank file that I downloaded from NCBI. It contains 12760 genes, and 2764 of them are annotated with product="hypothetical protein".

Based on the error message, I assume that having more than one gene labeled "hypothetical protein" is a problem for snpEff. However, I assume there must be many organisms where that is the case.

Does anyone have any insight into this?

Thanks!

~Lina


What is the accession number of this GenBank file?

Comment written 5 months ago by Pierre Lindenbaum

It's GCA_001007165.2.

Comment written 5 months ago by Lina F

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 5 months ago:

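This happens because the name 'hypothetical protein' has to be unique.

You can try this (not tested). It appends the input line number to each "hypothetical protein" product so that every name is different:

 awk '/product="hypothetical protein"/ {sub(/protein/, "protein" NR)} {print}' in.gb > fixed.gb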

Thank you, this worked! However, now it's finding other non-unique names... looks like I will have to run this fix a few times. Thanks!

Comment written 5 months ago by Lina F
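
Since other product names turned out to be duplicated as well, one pass over the file could rename every repeated /product value instead of fixing one name at a time. The following is a minimal sketch that has not been tested against snpEff. The function name dedupe_products is made up for illustration. It assumes, as the error and the fix above suggest, that snpEff takes the transcript name from /product, and that each /product="..." qualifier fits on a single line (wrapped qualifiers are left untouched). The first occurrence of a name is kept as-is; later ones get a numeric suffix.

```shell
# Hypothetical helper (not part of snpEff): make every /product value unique.
# Assumption: each /product="..." qualifier sits on one line of the GenBank file.
dedupe_products() {
  awk '
    match($0, /\/product="[^"]*"/) {
      # Text between the quotes: skip the 10 characters of /product=" and the closing quote.
      name = substr($0, RSTART + 10, RLENGTH - 11)
      seen[name]++
      if (seen[name] > 1) {
        # Second and later occurrences get a suffix, e.g. "hypothetical protein_2".
        $0 = substr($0, 1, RSTART + 9) name "_" seen[name] substr($0, RSTART + RLENGTH - 1)
      }
    }
    { print }
  ' "$@"
}
```

It could be run as dedupe_products in.gb > fixed.gb before rebuilding the database. If a suffixed name such as "kinase_2" already exists in the file, a collision is still possible, so a different separator may be needed in that case.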