Add taxID for each record of a fasta file
1
1
Entering edit mode
4.4 years ago
wyc661217 ▴ 10

Hello, I have a fasta file contains thousands of records like this:

X17276.1 Giant Panda satellite 1 DNA GATCCTCCCCAGGCCCCTACACCCAATGTGGAACCGGGGTCCCGAATGAAAATGCTGCTGTTCCCTGGAGGTGTTTTCCT GGACGCTCTGCTTTGTTACCAATGAGAAGGGCGCTGAATCCTCGAAAATCCTGACCCTTTTAATTCATGCTCCCTTACTC ACGAGAGATGATGATCGTTGATATTTCCCTGGACTGTGTGGGGTCTCAGAGACCACTATGGGGCACTCTCGTCAGGCTTC CGCGACCACGTTCCCTCATGTTTCCCTATTAACGAAGGGTGATGATAGTGCTAAGACGGTCCCTGTACGGTGTTGTTTCT GACAGACGTGTTTTGGGCCTTTTCGTTCCATTGCCGCCAGCAGTTTTGACAGGATTTCCCCAGGGAGCAAACTTTTCGAT GGAAACGGGTTTTGGCCGAATTGTCTTTCTCAGTGCTGTGTTCGTCGTGTTTCACTCACGGTACCAAAACACCTTGATTA TTGTTCCACCCTCCATAAGGCCGTCGTGACTTCAAGGGCTTTCCCCTCAAACTTTGTTTCTTGGTTCTACGGGCTG

now want to add the taxID based on the GI for each line, which should be the best way?

Thank you!!

fasta GI taxID • 1.8k views
ADD COMMENT
2
Entering edit mode
4.4 years ago

The BBMap package has a tool called "gi2taxid.sh" which will do this. First you'd also need to fetch the accessions using /bbmap/pipelines/fetchtaxonomy.sh as an example which will get the accession translation tables from ncbi.

You can also do it yourself programmatically, without downloading any files, using JGI's taxonomy server. For example:

curl "https://taxonomy.jgi-psf.org/tax/pt_header/X17276.1 Giant Panda satellite 1 DNA"

...will return the taxid, 9646.

ADD COMMENT
0
Entering edit mode

Thank you Brian!! I figured it out with the JGI.

ADD REPLY
0
Entering edit mode

Hello Brian,

I am in a similar position to the original poster, having a fasta file containing hundreds of thousands of sequences for many species which currently don't have taxids. Here is my current format:

gi|939462896|gb|KT878822.1| Aphis fabae voucher S06340 cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial ATTATTAGTCAAGAAAGAAATAAAAATGAAACATTTGGTAATATTAGAATAATTTATGCAATATTAACTA etc.

My understanding is that BBMap will allow me to add the taxid using this "renamebytaxa.sh" tool, which is exactly what I would like to do. I've downloaded BBMap and ran the fetchtaxonomy shell script, but I can't seem to find a "renamebytaxa" file. I've found "rename.sh", but I can't see a simple way to use this to the same effect.

I also tried following the instructions at: http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/taxonomy-guide/ but I keep getting an error for every shell script I try to run (taxtree,sh, for example) saying "could not find or load main class". Do you have any advice for this issue? Any help is massively appreciated.

ADD REPLY
0
Entering edit mode

Oh, sorry... not sure why I called it "renamebytaxa.sh"; it's actually "gi2taxid.sh". And the instructions on the website are a little outdated; the latest version of documentation is always distributed with BBMap (in this case/docs/guides/TaxonomyGuide.txt, which still could use a little updating).

In general, for NCBI-style names, the process is like this:

gi2taxid.sh -Xmx63g in=nt.fa.gz out=renamed.fa.gz tree=auto accession=auto table=auto path=/path/to/taxonomy/

...where /path/to/taxonomy/ is the location where you ran fetchtaxonomy.sh. Even though it's called "gi2taxid.sh", it works with both gi numbers or accessions, as long as the headers are in the standard NCBI format. It alternatively works with the Silva header format if you add the "silva" flag.

The before/after results look like this:

zcat /global/projectb/sandbox/gaag/bbtools/nt/jul25_2017/nt.fa.gz | head -1
>X17276.1 Giant Panda satellite 1 DNA

zcat /global/projectb/sandbox/gaag/bbtools/nt/jul25_2017/renamed.fa.gz | head -1
>tid|9646|X17276.1 Giant Panda satellite 1 DNA

...where 9646 is the NCBI tax ID.

ADD REPLY
0
Entering edit mode

Hello Brian,

Many thanks for your kind advice. I am, however, having issues using the shell scripts in the package. Whilst "fetchtaxonomy" seems to run at first, it meets an error when the other shell scripts become involved:

line 7: module: command not found
line 18: shrinkaccession.sh: command not found
line 25: taxtree.sh: command not found
line 26: gitable.sh: command not found
  

Errors occur for each of the shell scripts individually, too:

Error: Could not find or load main class EXT.Obitools.bbmap.current

I'm assuming this is an issue with how I have things set up? Any advice would be massively appreciated!

Many thanks,

Jordan

ADD REPLY

Login before adding your answer.

Traffic: 2430 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6