Question: Filter the non-redundant NCBI database based on taxonomy
0
gravatar for dieunelderilus
4 weeks ago by
dieunelderilus0 wrote:

Please is is there any way to extract all the eukaryotic reads from the non-redundant NCBI database ? What would be the exact command line that would help me to do that ? or where can I find the non-redundant Eukaryotic database ?

Thanks in advance for your support

DD

database genebank ncbi • 309 views
ADD COMMENTlink modified 4 weeks ago by vkkodali860 • written 4 weeks ago by dieunelderilus0
3
gravatar for shenwei356
4 weeks ago by
shenwei3564.4k
China
shenwei3564.4k wrote:

Making nr blastdb for specific taxids

ADD COMMENTlink written 4 weeks ago by shenwei3564.4k

This tutorial recursively gets all taxID's which belong to a higher level taxID. These would be needed to get all sequences under the top level taxID (e.g. eukaryota).

ADD REPLYlink written 4 weeks ago by genomax60k

Thanks for sharing the tutorial , I am following the discussion and read the tutorial to see if could filter the the nr based on taxonomy

DD

ADD REPLYlink written 4 weeks ago by dieunelderilus0

I am following to this tutorial to filter the non-redundant database but I get stock in step 3. You suggest to NOT directly download nr.gz from ncbi ftp, in which the FASTA headers are not well formatted. However the link you provide to download the pre-formatted nr database is not working.

Please where can get the nr database that is well formatted to wok with this tutorial ?

Thanks in advance

ADD REPLYlink written 4 weeks ago by dieunelderilus0

ftp://ftp.ncbi.nlm.nih.gov/blast/db

ADD REPLYlink written 4 weeks ago by shenwei3564.4k

I am getting a problem with taxonkit tutorial. I get stock in the step 3 (option 1). After retrieving FASTA sequences from pre-formated blastdb, the perl one-liner that used to unfold records having mulitple accessions output an empty nr.$id.fa.gz.

Please any suggestions ?

ADD REPLYlink modified 26 days ago by shenwei3564.4k • written 29 days ago by dieunelderilus0

so, is it 6656 or 6665?

ADD REPLYlink written 28 days ago by shenwei3564.4k

Sorry, it was a problem in editing the message. the id is 6656

ADD REPLYlink written 28 days ago by dieunelderilus0

What's your taxid ??? 6656 is just the example.

ADD REPLYlink written 28 days ago by shenwei3564.4k

I need to retrieve get a new database for Chlorophyta with taxid 3041. the perl one-liner script outputs an empty file for watever taxid I use . I could not even reproduce the tutorial. It could be trivial but I don't know why I don't get it.

thank you for your assistance

DD

ADD REPLYlink written 28 days ago by dieunelderilus0

please paste results of

wc -l 3041.taxid.txt 
wc -l 3041.acc.txt

I'll check it on monday.

ADD REPLYlink written 27 days ago by shenwei3564.4k

Thanks they are copied here :

wc -l 3041.acc.txt 
430402 3041.acc.txt

wc -l 3041.taxid.txt 
9872 3041.taxid.txt

Note that I could not reproduce the tutorial, the option 2 of the step three works but , but could not pass the stp4 (make blasted). You did not provide how to deal with this error in the case a user would se the the option 2 of the step3. I try try to remove duplicate reads, and suspect this is not the good option.

Thanks again for your assistance

DD

ADD REPLYlink modified 27 days ago • written 27 days ago by dieunelderilus0

Well, I tried 3041, and every step worked fine.

$ wc -l $id.taxid.txt
9703 3041.taxid.txt

$ wc -l $id.acc.txt
383332 3041.acc.txt

$ pigz -dc nr.$id.fa.gz | grep '^>' -c
383332

$ time makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id
Building a new DB, current time: 12/22/2018 23:33:14
New DB name:   /home/shenwei/data/data/blastdb/nr.3041
New DB title:  nr.3041.fa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 383332 sequences in 8.6429 seconds.

real    0m8.806s
user    0m8.320s
sys     0m0.466s

I also add a faster way to get nr.$taxid.fa.gz from nr.fa.gz, in step 3 option 1.

ADD REPLYlink written 26 days ago by shenwei3564.4k

I can send nr.3041.fa.gz (70M) to you via email, if you give my address or write email to me.

ADD REPLYlink written 26 days ago by shenwei3564.4k

Thanks a lot! Please I would be happy if you can send the nr.$id.fa forme to me at derilus.dieunel@upr.edu or dieunelderilus@gmail.com.

But should keep testing to see where I am wrong, because this is a very interesting tools.

ADD REPLYlink written 26 days ago by dieunelderilus0
0
gravatar for vkkodali
4 weeks ago by
vkkodali860
United States
vkkodali860 wrote:

If you don't mind using the -remote option, you can consider using the -entrez_query with txid2759[Organism].

ADD COMMENTlink written 4 weeks ago by vkkodali860

I am curious as to how NCBI separates txid2759[Organism] sequence in nr? If I look for entries in a local copy of nr I don't find anything matching txid2579.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by genomax60k
1

You probably mean txid2759 not txid2579. I have myself not used blast from command line this way but it is possible that the tax IDs int he local copy are the species level tax IDs where as 2759 includes all eukaryotes.

ADD REPLYlink written 4 weeks ago by vkkodali860

Yes, he/she may just give an invalid taxid. I also checked, and 2579 is not found in NCBI taxonomy database, but 2759 is right.

$ echo 2579 | taxonkit lineage 
2579
$ echo 2759 | taxonkit lineage 
2759    cellular organisms;Eukaryota
ADD REPLYlink written 4 weeks ago by shenwei3564.4k

My bad. Even though I wrote 2579 I was checking the right taxID.

With a local copy of nr, I see results from (573, a random taxID example)

blastdbcmd -db nr -entry all -outfmt %T,%a | awk -F "," '{OFS="\t"} {if ($1 == 573) print $1,$2}'

but the same search with 2759 generate zero results.

blastdbcmd -db nr -entry all -outfmt %T,%a | awk -F "," '{OFS="\t"} {if ($1 == 2759) print $1,$2}'
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by genomax60k
1

That's why I wrote the tutorial, because most of the protein sequences are linked to species not higher taxonomy rank. You have to get all taxids of species belonging to the Eukayota.

ADD REPLYlink written 4 weeks ago by shenwei3564.4k

I see. I will move your comment above to an answer.

ADD REPLYlink written 4 weeks ago by genomax60k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1099 users visited in the last hour