Question: accession to taxid for fasta files downloaded from refseq
0
gravatar for badribio
2.7 years ago by
badribio240
badribio240 wrote:

I am trying to build a database for metagenomic analysis have all the genomic.fna.gz files for bacteria and virus from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/ release 77 which has 42080 species for bacteria and 5654 for viral

header for viral sequence looks like this

>gi|433660771|ref|NC_019944.1| Okra enation leaf curl alphasatellite, complete sequence

And for bacterial

>gi|759427590|ref|NZ_CDDW01000001.1| Aeromonas salmonicida subsp. salmonicida genome assembly PRJEB7036, contig F321_contig22, whole genome shotgun sequence

Is there a way i can modify all the fasta files (n =659 for bacterial and n=2 for viral) in such a way that my header look like this for both viral and bacterial :

>NZ_CDDW01000001.1|some_text|taxid Aeromonas salmonicida subsp. salmonicida genome assembly PRJEB7036 contig F321_contig22, whole genome shotgun sequence

Thanks

Badri

ADD COMMENTlink modified 2.7 years ago by Brian Bushnell16k • written 2.7 years ago by badribio240
0
gravatar for shenwei356
2.7 years ago by
shenwei3564.5k
China
shenwei3564.5k wrote:

can you write scripts, if yes. you can parse the gi_taxid_nucl.dmp.gz file from ncbi taxonomy ftp and replace GIs in fasta header.

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by shenwei3564.5k

Yes I can write scripts, but the problem is i have concatenated all the fna to one huge fasta file which is 235Gb. And my awk script is not helping to slow.

ADD REPLYlink written 2.7 years ago by badribio240
0
gravatar for Brian Bushnell
2.7 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can do this with the BBMap package's gi2taxid.sh tool, by following the guide in /bbmap/docs/guides/TaxonomyGuide.txt - specifically, the step labeled "Renaming gi numbers to NCBI Tax ID numbers".

ADD COMMENTlink written 2.7 years ago by Brian Bushnell16k

that's cool. should be fast

ADD REPLYlink written 2.7 years ago by shenwei3564.5k

Thats awesome!!

Second question, is there a way in bbmap which i can use to extract out reads from a fasta file based on a one column text file with accession id for each sequence.

Thanks

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by badribio240

it has, visit the homepage

ADD REPLYlink written 2.7 years ago by shenwei3564.5k

is it filterbyname.sh ? if so it not no working for me..

ADD REPLYlink written 2.7 years ago by badribio240

filterbyname.sh is the correct program. The names need to be reasonably similar. Post examples of your ID's of interest and some fasta sequence headers.

ADD REPLYlink written 2.7 years ago by genomax65k

My multifasta file has (have turncated reads to fit in here) bash-4.1$ head viral.2.1.genomic.fna

>gi|433660771|ref|NC_019944.1| Okra enation leaf curl alphasatellite, complete sequence
CTCGAATAGTTTAATGGTATTGGGTTGCGGCCCAATAAATTAAAAGTCCACTAGATAAGCGAATATTGACTGGTCAACTC
>gi|433660773|ref|NC_019946.1| Tomato yellow mottle virus segment DNA-A, complete sequence
TTGGTAAAGTCATGTGCATCTCAGACATCACACGAGGTGGTGGTATTACCCACCGTGTAGGTAAACGTTT

My text file bash-4.1$ head trial.txt

NC_019944.1
NC_019946.1
NC_019947.1
NC_019704.1
NC_019519.1
NC_019521.1
NC_019945.1
NC_020072.1
NC_020074.1
NC_020083.1

command used:

./filterbyname.sh -Xmx10g in=viral.2.1.genomic.fna names=trial.txt include=t out=viral_trial.fasta
ADD REPLYlink modified 2.7 years ago by genomax65k • written 2.7 years ago by badribio240

Do you get an error or just an empty file?

ADD REPLYlink written 2.7 years ago by genomax65k

Just an empty file, see log

Input is being processed as unpaired Time: 0.257 seconds.

Reads Processed: 469 1.83k reads/sec

Bases Processed: 23151316 90.17m bases/sec

Reads Out: 0

Bases Out: 0

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by badribio240

I think this time the "|" (pipe) characters may be causing trouble. I will take a look at this later but perhaps @Brian will be along before that.

ADD REPLYlink written 2.7 years ago by genomax65k

I think i fixed it changed the headers using awk '$1 ~ />/ { split($0,a,"|"); print ">"a[4],"\t",a[5];next } { print $0 }'

my new headers

NC_019944.1 Okra enation leaf curl alphasatellite, complete sequence

NC_019946.1 Tomato yellow mottle virus segment DNA-A, complete sequence

NC_019947.1 Tomato yellow mottle virus segment DNA-B, complete sequence

NC_019704.1 Enterobacteria phage mEp237, complete genome

NC_019519.1 Agrobacterium phage 7-7-1, complete genome

NC_019521.1 Sphingomonas phage PAU, complete genome

NC_019945.1 Thysanoplusia orichalcea NPV isolate p2, complete genome

NC_020072.1 Colombian datura virus, complete genome

NC_020074.1 Bovine adenovirus 6 strain 671130, complete genome

NC_020083.1 Serratia phage phiMAM1, complete genome

This did the job.

Thanks

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by badribio240

I tried using it on my original fasta file (235Gb and trial.txt file (117 Gb , with 6515957 accession id) and the step failed just like before no output or error just a blank file, note I used 60Gb of memory for this step.

ADD REPLYlink written 2.7 years ago by badribio240
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 892 users visited in the last hour