Question: Help required to Rectify/reformat the fasta header of nr database fasta sequences
3.6 years ago, bilal.sarwar (Pakistan/Lahore/CEMB-PU) wrote:

Hello all

I am a beginner in bioinformatics. I have downloaded the complete nr database from NCBI. It contains 106785170 nr protein sequences altogether. The FASTA header of every sequence starts with the FASTA symbol > followed by the accession number and other information. Here are some examples:

>WP_003131952.1 30S ribosomal protein S18 [Lactococcus lactis]NP_26834..........................

>XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideu.............................

>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium d......................................

I want to enclose the accession number of every sequence between pipe characters "|" in the header, like this:

>|WP_003131952.1| 30S ribosomal protein S18 [Lactococcus lactis]NP_26834..........................

>|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideu.............................

>|XP_642837.1|hypothetical protein DDB_G0276911 [Dictyostelium d......................................

Kindly help me solve this issue.

Regards, Bilal

blast rna-seq header ncbi fasta • 1.1k views
3.6 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:
 sed '/^>/s/>\([^ ]*\)/>|\1|/' input.fa > out.fa
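For readers who find the sed expression hard to read, here is a rough Python sketch of the same idea: on lines that start with ">", wrap the first whitespace-delimited token (the accession) in pipes, and leave sequence lines untouched. The function names `wrap_accession` and `rewrite_fasta` are made up for this illustration and are not part of any library. One small difference from the sed command: this version stops the accession at any whitespace, not only at a space.

```python
import re

# ">" at the start of the line, then the first run of non-whitespace characters.
HEADER_RE = re.compile(r"^>(\S*)")


def wrap_accession(line: str) -> str:
    """Wrap the accession in pipes if the line is a FASTA header; otherwise return it unchanged."""
    if line.startswith(">"):
        return HEADER_RE.sub(r">|\1|", line, count=1)
    return line


def rewrite_fasta(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path one line at a time, rewriting header lines.

    Streaming line by line keeps memory use small even for a file as large as nr.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(wrap_accession(line))
```

Calling `rewrite_fasta("input.fa", "out.fa")` would mirror the file names used in the sed command. Like the sed version, it only touches the first accession on each header line.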

thanks for help ..... :)

written 3.6 years ago by bilal.sarwar

Please check the green mark on the left to flag this question as answered.

written 3.6 years ago by Pierre Lindenbaum

I got an error while building the BLAST database with the -parse_seqids flag after reformatting the FASTA headers. Without -parse_seqids everything works well. I am actually using the Blast2GO software for mapping and annotation. The page "How to create a Fasta file database for local Blast and to import XML results successfully into Blast2GO" gives instructions about the header format.

Here is the error:

volume: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi

file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.pin
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.phr
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.psq
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.psi
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.psd
file: /data/storage_green/mbil/compressed_file/nr_database/nrdb2016/nr_modi.pog

BLAST Database creation error: Defline lacks a proper ID around line 380

This is line 380:

>|CAD71090.1| conserved hypothetical protein [Neurospora crassa]

modified 3.6 years ago • written 3.6 years ago by bilal.sarwar
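Since the error message points at a line number, a small helper can pull that line out of a very large FASTA file without opening it in an editor. This is only a sketch; `read_line` is a hypothetical name, and the path passed to it would be whatever the modified FASTA file is called.

```python
from itertools import islice


def read_line(path: str, line_number: int) -> str:
    """Return the 1-indexed line `line_number` of a text file, or "" if the file is shorter."""
    with open(path) as handle:
        # islice lazily skips the first line_number - 1 lines instead of loading the file.
        return next(islice(handle, line_number - 1, line_number), "").rstrip("\n")
```

For example, `read_line(path_to_modified_fasta, 380)` would return the defline that the error message refers to.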