Error building snpEff database "Transcript 'hypothetical_protein' already exists"
6.3 years ago
Lina F ▴ 200

Hi all,

I am trying to build a snpEff database but I'm running into the following error message:

java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactory.add(SnpEffPredictorFactory.java:135)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addMrna(SnpEffPredictorFactoryFeatures.java:183)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addFeatures(SnpEffPredictorFactoryFeatures.java:134)
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:330)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)
java.lang.RuntimeException: Error reading file '/home/lina/snpEff/./data/my_organism/genes.gbk'
java.lang.RuntimeException: Transcript  'hypothetical_protein' already exists
        at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:344)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
        at org.snpeff.SnpEff.run(SnpEff.java:1183)
        at org.snpeff.SnpEff.main(SnpEff.java:162)

The file I am using for the database is a GenBank file that I downloaded from NCBI. It contains 12760 genes, and 2764 of them are annotated with product="hypothetical protein".

Based on the error message I assume having more than one gene labeled hypothetical protein is a problem for snpEff. However, I assume there must be many organisms where that is the case.

Does anyone have any insight into this?

Thanks!

~Lina

snpeff database build • 1.8k views

What is the accession number of this GenBank file?


It's GCA_001007165.2.

6.3 years ago

It's because the name 'hypothetical protein' is used for more than one gene, and snpEff requires each name to be unique.

You can try this (not tested):

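 # Append the line number to each "hypothetical protein" product so every name is unique; print each line exactly once.
 awk '/product="hypothetical protein"/ { sub(/hypothetical protein/, "&" NR) } { print }' in.gb > fixed.gb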
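
The error is raised for any repeated transcript name, not only 'hypothetical protein', so a more general pass may save some reruns. The following is a minimal Python sketch, not tested against snpEff itself. It assumes that the transcript name is taken from the /product qualifier (as the error message suggests) and that each /product value fits on a single line; wrapped multi-line values are left untouched. The function names and the " N" suffix format are illustrative choices, not anything snpEff prescribes.

```python
import re
from collections import Counter

# Matches a /product qualifier whose value opens and closes on the same line.
PRODUCT_RE = re.compile(r'/product="([^"]*)"')


def make_products_unique(lines, sep=" "):
    """Yield lines, appending a running number to every repeated /product value.

    The first occurrence of a name is kept as-is; later ones become
    'name 2', 'name 3', ... (assumption: this is enough to make the
    names distinct for snpEff).
    """
    seen = Counter()
    for line in lines:
        m = PRODUCT_RE.search(line)
        if m:
            name = m.group(1)
            seen[name] += 1
            n = seen[name]
            if n > 1:
                # Rewrite only the text between the quotes.
                line = line[:m.start(1)] + f"{name}{sep}{n}" + line[m.end(1):]
        yield line


def fix_file(src_path, dst_path):
    """Read a GenBank file and write a copy with unique /product values."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(make_products_unique(src))
```

Calling `fix_file("in.gb", "fixed.gb")` would write the renamed copy, which can then replace genes.gbk in the snpEff data directory before rebuilding.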

Thank you, this worked! However, now it's finding other non-unique names... looks like I will have to run this fix a few times. Thanks!
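
To see up front which names are still repeated, rather than discovering them one build at a time, a short Python sketch like the one below can list them. It assumes the clashing names come from single-line /product qualifiers; `duplicated_products` is a hypothetical helper name, not part of snpEff.

```python
import re
from collections import Counter

# Matches a /product qualifier whose value opens and closes on the same line.
PRODUCT_RE = re.compile(r'/product="([^"]*)"')


def duplicated_products(lines):
    """Return {product name: count} for every /product value seen more than once."""
    counts = Counter()
    for line in lines:
        m = PRODUCT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

For example, `with open("genes.gbk") as fh: print(duplicated_products(fh))` would print each repeated product name with its count.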


Hi, I have the same issue. After running the command, I ran "sh scripts/buildDbNcbi.sh U00096.2" to download the file again, and the same issue appeared. Could you share the details of the steps you followed? Thank you very much! Jane
