Question: How to retrieve single protein fasta file for multiple species?
arsilan324 wrote, 2.7 years ago:

Hi all,

We are trying to build a protein database covering multiple organisms, e.g. E. coli, T. ferrooxidans, B. subtilis, etc. We want to use it to match our Orbitrap output, restricted to the species we found through Illumina sequencing, which is approximately 400+ genera. Can you suggest a smart way of doing this? For example, could I provide the names of the organisms and retrieve a single FASTA file?

Thank you very much!


You can use @5heikki's script here.

Concatenating the individual protein FASTA files into one large file afterwards (with cat) should be a simple task.

Note: See the new answer/comment below.

written 2.7 years ago by genomax

Running this code didn't generate any FASTA file, although both the list of species (species.txt) and assembly_summary.txt are in the same folder. Am I missing something?

written 2.7 years ago by arsilan324
genomax wrote, 2.7 years ago:

Try this if you need RefSeq (modified version of @5heikki's code):

$ more species.txt 
Bifidobacterium adolescentis

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

$ IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES){print $20}}' assembly_summary_refseq.txt | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_protein.faa.gz"}'; done | sh

Otherwise

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

$ IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES){print $20}}' assembly_summary.txt | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_protein.faa.gz"}'; done | sh

You will get many strains etc. with this method. If you need very specific strains, you could run awk 'BEGIN{FS=OFS="\t"}{print $8,$9,$10}' assembly_summary.txt > species (the file is tab-delimited, so the field separator has to be set), then keep only the entries you need.
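
The one-liner above is hard to read and to debug, so here is a minimal sketch of the same idea written as a function. It assumes the layout used in the commands above: a tab-delimited assembly summary with the organism name in column 8, the strain information in column 9 and the FTP directory in column 20. The function name list_protein_urls is made up for illustration. Unlike the one-liner it matches the species name as a literal prefix rather than as a regular expression, and it only prints what it finds, so the strains can be reviewed before anything is downloaded.

```shell
#!/usr/bin/env bash
# list_protein_urls SPECIES_FILE SUMMARY_FILE   (hypothetical helper)
# Prints: organism <TAB> strain <TAB> URL of the *_protein.faa.gz file.
list_protein_urls() {
    local species_file=$1 summary_file=$2
    awk -F'\t' '
        # First file: remember each non-empty species name (strip Windows CR).
        NR == FNR { sub(/\r$/, ""); if ($0 != "") wanted[$0] = 1; next }
        # Second file: skip comment lines and rows without an FTP path.
        /^#/ || $20 == "" || $20 == "na" { next }
        {
            for (name in wanted) {
                # Column 8 must start with the species name.
                if (index($8, name) == 1) {
                    n = split($20, parts, "/")
                    print $8 "\t" $9 "\t" $20 "/" parts[n] "_protein.faa.gz"
                    break
                }
            }
        }
    ' "$species_file" "$summary_file"
}

# Review first, then download (needs network access):
#   list_protein_urls species.txt assembly_summary_refseq.txt > hits.tsv
#   cut -f3 hits.tsv | wget -i -
```

Deleting unwanted rows from hits.tsv before the wget step is one way to restrict the download to specific strains.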


Thanks!! This worked perfectly. I now have a list of files such as GCF_000164035.1_ASM16403v1_protein.faa.gz, and the next step is to combine them. Can you guide me there as well? Thanks a lot!!! :)

written 2.7 years ago by arsilan324

If you want the final data file uncompressed: zcat G*.gz > final.faa
If you want to keep the final data compressed: cat G*.gz > final.faa.gz
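
Both commands work because several gzip files concatenated together still form a valid gzip stream. As a sanity check after combining, one option is to compare the number of FASTA header lines in the inputs with the number in the output. The sketch below does that; the function name combine_faa is hypothetical, and gzip -dc is used instead of zcat only because zcat behaves differently on some systems.

```shell
#!/usr/bin/env bash
# combine_faa OUTPUT.faa FILE.faa.gz [FILE.faa.gz ...]   (hypothetical helper)
# Writes one uncompressed FASTA file and checks that no headers were lost.
combine_faa() {
    local out=$1; shift
    local total=0 n n_out f

    # Decompress all inputs, in order, into a single output file.
    gzip -dc "$@" > "$out" || return 1

    # Count header lines (starting with ">") file by file.
    for f in "$@"; do
        n=$(gzip -dc "$f" | grep -c '^>')
        total=$((total + n))
    done

    # Count header lines in the combined file and compare.
    n_out=$(grep -c '^>' "$out")
    echo "inputs: $total sequences, output: $n_out sequences"
    [ "$total" -eq "$n_out" ]
}

# Usage: combine_faa final.faa G*.gz
```

A mismatch usually means an input file does not end with a newline, so the first header of the next file was glued onto the previous sequence line.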

written 2.7 years ago by genomax

I have prepared another list, of archaea this time, but this command is not working. Is there a different assembly summary for archaea?

written 2.7 years ago by arsilan324

Post examples of names that are not working.

written 2.7 years ago by genomax

Here are examples:
1. Halodesulfurarchaeum formicicum
2. Methanosphaera cuniculi

The whole list can be seen here...

https://gold.jgi.doe.gov/organisms?Organism.Domain=ARCHAEAL&Organism.Type%20Strain=Yes&Organism.Active=Yes

written 2.7 years ago by arsilan324

The first one should work: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/886/955/GCF_001886955.1_ASM188695v1/GCF_001886955.1_ASM188695v1_protein.faa.gz

The second does not have a RefSeq genome. You may have to try the second option (the GenBank genomes). These may at times have only genomic sequence. https://www.ncbi.nlm.nih.gov/protein/?term=txid1077256[Organism:noexp]
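
When a name produces no download, a quick first check is whether it appears in the assembly summary at all. The sketch below lists every species from the list that matches no organism name, assuming the same tab-delimited layout with the organism name in column 8. The function name report_missing_species is invented for this example. It also strips Windows line endings from the species list, which is one possible (not confirmed) reason for names failing to match.

```shell
#!/usr/bin/env bash
# report_missing_species SPECIES_FILE SUMMARY_FILE   (hypothetical helper)
# Prints each species name that matches no organism name in column 8.
report_missing_species() {
    awk -F'\t' '
        # First file: register every non-empty species name with zero hits.
        NR == FNR { sub(/\r$/, ""); if ($0 != "") hits[$0] = 0; next }
        # Second file: ignore comment lines.
        /^#/ { next }
        # Count a hit when column 8 starts with the species name.
        { for (name in hits) if (index($8, name) == 1) hits[name]++ }
        END { for (name in hits) if (hits[name] == 0) print name }
    ' "$1" "$2" | sort
}

# Usage: report_missing_species species.txt assembly_summary_refseq.txt
```

Names reported here have no entry in that particular summary file, so they need a different source, such as the GenBank summary or the NCBI protein search linked above.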

written 2.7 years ago by genomax
Elisabeth Gasteiger wrote, 2.7 years ago:

If you are working with UniProt, you can retrieve the data programmatically as described here (with code examples): https://www.uniprot.org/help/api_downloading https://www.uniprot.org/help/api_queries
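
As a rough illustration of what a programmatic download could look like, the sketch below builds one download URL per organism from an NCBI taxonomy ID. The rest.uniprot.org/uniprotkb/stream endpoint and the organism_id query field are assumptions based on the current UniProt REST interface and may differ from what the help pages above described when this was written, so check them against the linked documentation. The function name uniprot_fasta_url and the file taxids.txt are hypothetical.

```shell
#!/usr/bin/env bash
# uniprot_fasta_url TAXON_ID   (hypothetical helper)
# Builds a URL for all UniProtKB entries of one organism in FASTA format.
# Assumption: the endpoint and the organism_id field are as in the current
# UniProt REST interface; verify against the UniProt help pages.
uniprot_fasta_url() {
    printf 'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:%s\n' "$1"
}

# Example (needs network access), with one taxonomy ID per line in taxids.txt:
#   while read -r id; do
#       curl -s "$(uniprot_fasta_url "$id")"
#   done < taxids.txt > uniprot_combined.fasta
```

Because each request returns plain FASTA, redirecting the output of the whole loop gives a single combined file.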
