accession to taxid for fasta files downloaded from refseq
2
0
Entering edit mode
5.8 years ago
badribio ▴ 270

I am trying to build a database for metagenomic analysis have all the genomic.fna.gz files for bacteria and virus from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/ release 77 which has 42080 species for bacteria and 5654 for viral

header for viral sequence looks like this

>gi|433660771|ref|NC_019944.1| Okra enation leaf curl alphasatellite, complete sequence

And for bacterial

>gi|759427590|ref|NZ_CDDW01000001.1| Aeromonas salmonicida subsp. salmonicida genome assembly PRJEB7036, contig F321_contig22, whole genome shotgun sequence

Is there a way i can modify all the fasta files (n =659 for bacterial and n=2 for viral) in such a way that my header look like this for both viral and bacterial :

>NZ_CDDW01000001.1|some_text|taxid Aeromonas salmonicida subsp. salmonicida genome assembly PRJEB7036 contig F321_contig22, whole genome shotgun sequence

Thanks

Badri

sequence bacteria viral refseq fasta • 2.7k views
ADD COMMENT
0
Entering edit mode
5.8 years ago

can you write scripts, if yes. you can parse the gi_taxid_nucl.dmp.gz file from ncbi taxonomy ftp and replace GIs in fasta header.

ADD COMMENT
0
Entering edit mode

Yes I can write scripts, but the problem is i have concatenated all the fna to one huge fasta file which is 235Gb. And my awk script is not helping to slow.

ADD REPLY
0
Entering edit mode
5.8 years ago

You can do this with the BBMap package's gi2taxid.sh tool, by following the guide in /bbmap/docs/guides/TaxonomyGuide.txt - specifically, the step labeled "Renaming gi numbers to NCBI Tax ID numbers".

ADD COMMENT
0
Entering edit mode

that's cool. should be fast

ADD REPLY
0
Entering edit mode

Thats awesome!!

Second question, is there a way in bbmap which i can use to extract out reads from a fasta file based on a one column text file with accession id for each sequence.

Thanks

ADD REPLY
0
Entering edit mode

it has, visit the homepage

ADD REPLY
0
Entering edit mode

is it filterbyname.sh ? if so it not no working for me..

ADD REPLY
0
Entering edit mode

filterbyname.sh is the correct program. The names need to be reasonably similar. Post examples of your ID's of interest and some fasta sequence headers.

ADD REPLY
0
Entering edit mode

My multifasta file has (have turncated reads to fit in here) bash-4.1$ head viral.2.1.genomic.fna

>gi|433660771|ref|NC_019944.1| Okra enation leaf curl alphasatellite, complete sequence
CTCGAATAGTTTAATGGTATTGGGTTGCGGCCCAATAAATTAAAAGTCCACTAGATAAGCGAATATTGACTGGTCAACTC
>gi|433660773|ref|NC_019946.1| Tomato yellow mottle virus segment DNA-A, complete sequence
TTGGTAAAGTCATGTGCATCTCAGACATCACACGAGGTGGTGGTATTACCCACCGTGTAGGTAAACGTTT

My text file bash-4.1$ head trial.txt

NC_019944.1
NC_019946.1
NC_019947.1
NC_019704.1
NC_019519.1
NC_019521.1
NC_019945.1
NC_020072.1
NC_020074.1
NC_020083.1

command used:

./filterbyname.sh -Xmx10g in=viral.2.1.genomic.fna names=trial.txt include=t out=viral_trial.fasta
ADD REPLY
0
Entering edit mode

Do you get an error or just an empty file?

ADD REPLY
0
Entering edit mode

Just an empty file, see log

Input is being processed as unpaired Time: 0.257 seconds.

Reads Processed: 469 1.83k reads/sec

Bases Processed: 23151316 90.17m bases/sec

Reads Out: 0

Bases Out: 0

ADD REPLY
0
Entering edit mode

I think this time the "|" (pipe) characters may be causing trouble. I will take a look at this later but perhaps @Brian will be along before that.

ADD REPLY
0
Entering edit mode

I think i fixed it changed the headers using awk '$1 ~ />/ { split($0,a,"|"); print ">"a[4],"\t",a[5];next } { print $0 }'

my new headers

NC_019944.1 Okra enation leaf curl alphasatellite, complete sequence

NC_019946.1 Tomato yellow mottle virus segment DNA-A, complete sequence

NC_019947.1 Tomato yellow mottle virus segment DNA-B, complete sequence

NC_019704.1 Enterobacteria phage mEp237, complete genome

NC_019519.1 Agrobacterium phage 7-7-1, complete genome

NC_019521.1 Sphingomonas phage PAU, complete genome

NC_019945.1 Thysanoplusia orichalcea NPV isolate p2, complete genome

NC_020072.1 Colombian datura virus, complete genome

NC_020074.1 Bovine adenovirus 6 strain 671130, complete genome

NC_020083.1 Serratia phage phiMAM1, complete genome

This did the job.

Thanks

ADD REPLY
0
Entering edit mode

I tried using it on my original fasta file (235Gb and trial.txt file (117 Gb , with 6515957 accession id) and the step failed just like before no output or error just a blank file, note I used 60Gb of memory for this step.

ADD REPLY

Login before adding your answer.

Traffic: 867 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6