Help required to rectify/reformat the FASTA headers of nr database sequences
7.1 years ago
bilal.sarwar ▴ 10

Hello all,

I am a beginner in bioinformatics. I have downloaded the complete nr database from NCBI; it contains 106785170 protein sequences altogether. The FASTA header of every sequence starts with the FASTA symbol > followed by the accession number and other information. Here are some examples:

>WP_003131952.1 30S ribosomal protein S18 [Lactococcus lactis]NP_26834..........................

>XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideu.............................

>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium d......................................

I want to enclose the accession number of every sequence between pipe characters "|" in the header, like this:

>|WP_003131952.1| 30S ribosomal protein S18 [Lactococcus lactis]NP_26834..........................

>|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideu.............................

>|XP_642837.1|hypothetical protein DDB_G0276911 [Dictyostelium d......................................

Kindly help me solve this issue.

Regards, Bilal

RNA-Seq blast fasta NCBI header • 2.1k views
7.1 years ago
 sed '/^>/s/>\([^ ]*\)/>|\1|/' input.fa > out.fa
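
To see what the one-liner does before running it on the full nr file, it may help to try it on a couple of records first. The sketch below wraps the same sed expression in a shell function (add_pipes is just an illustrative name, not part of any tool) and feeds it two made-up test records. The address /^>/ restricts the substitution to header lines, and \([^ ]*\) captures everything up to the first space, which is the accession in the headers shown above.

```shell
# add_pipes is a hypothetical helper name; it reads FASTA on stdin and
# wraps the first space-delimited token of each header line in pipes.
add_pipes() {
    # /^>/        only touch header lines
    # \([^ ]*\)   capture everything after ">" up to the first space
    sed '/^>/s/>\([^ ]*\)/>|\1|/'
}

# Two small test records (the sequences are placeholders, not real data).
printf '>WP_003131952.1 30S ribosomal protein S18 [Lactococcus lactis]\nMAAA\n>XP_642131.1 hypothetical protein DDB_G0277827\nMCCC\n' | add_pipes
```

Sequence lines pass through unchanged. Only the first accession on each header line is wrapped, so any further accessions merged into the same nr header are left as they are.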

Thanks for the help :)


Please check the green mark on the left to flag this question as answered.


Bro, I got an error while building the BLAST database with the -parse_seqids flag after reformatting the FASTA headers. Without -parse_seqids everything works well. I am using the Blast2GO software for mapping and annotation, and the page "How to create a Fasta file database for local Blast and to import XML results successfully into Blast2GO" gives instructions about the header format.

Here is the error:

volume: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi

file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.pin
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.phr
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.psq
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.psi
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.psd
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.pog

BLAST Database creation error: Defline lacks a proper ID around line 380

This is line 380:

>|CAD71090.1| conserved hypothetical protein [Neurospora crassa]
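
One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: with -parse_seqids, makeblastdb tries to interpret bar-delimited identifiers in the NCBI style (for example gnl|database|identifier or lcl|identifier), and an ID that starts with a bare "|" has no tag in front of the first pipe. A quick way to check is to pull out just the record around the reported line and run makeblastdb on that small file. The helper below is a minimal sketch; record_at is a hypothetical name, and it only assumes a plain FASTA file.

```shell
# record_at LINE FILE  (hypothetical helper)
# Print the FASTA record (header plus its sequence lines) that contains
# the given line number, e.g. the line number from the makeblastdb error.
record_at() {
    awk -v n="$1" '
        /^>/ { if (NR > n) exit; rec = "" }  # header past line n: stop; otherwise start a new record
        { rec = rec $0 "\n" }                # collect the lines of the current record
        END { printf "%s", rec }             # print the record that contains line n
    ' "$2"
}

# Example usage (file names are placeholders for your own files):
#   record_at 380 nr_modi.fa > sample.fa
```

The resulting sample.fa can then be passed to makeblastdb with -in sample.fa -dbtype prot -parse_seqids, once with the modified header and once with the original one, to see whether the added pipes are what triggers the message.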
