Question: How can I get FASTA sequences if I have the names of proteins?
0
2.4 years ago by
Xylanaser10
Poland - Warsaw - SGGW/WULS
Xylanaser10 wrote:

Hey, I have a problem. I have names of proteins, for example lpg_200 etc. How can I get FASTA sequences for them?

regards X

protein bioinformatics fasta • 1.1k views
ADD COMMENTlink modified 2.4 years ago by Elisabeth Gasteiger1.6k • written 2.4 years ago by Xylanaser10

Elaborate more please.

ADD REPLYlink written 2.4 years ago by st.ph.n2.4k

I have <1000 names of proteins (kds_0989, xyz_3999, etc.) and I need to get a file of FASTA sequences for them. I tried querying UniProt and NCBI with them, but the query is too long.

ADD REPLYlink written 2.4 years ago by Xylanaser10

Please post a few real examples of IDs. Database identifiers can differ from db to db, and depending on what kind you have the answer may be different.

Also use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax67k
4
2.4 years ago by
Buffo1.6k
Buffo1.6k wrote:

Go to ftp://ftp.ncbi.nih.gov/refseq/ and download the corresponding database (human? mouse? etc.), then extract the sequences with a simple Python script:

from Bio import SeqIO
import sys

syntax = '''
------------------------------------------------------------------------------------
Syntax:        python extract_sequence_by_name_list.py *file1.fasta **file2.txt
*Sequences in fasta format
**List of sequences to extract; must have the same name as in fasta file without '>'
------------------------------------------------------------------------------------
'''
if len(sys.argv) != 3:
    print(syntax)
    sys.exit(1)

# Read the wanted IDs into a set so each membership test is fast
with open(sys.argv[2]) as handle:
    wanted = {line.strip() for line in handle if line.strip()}

# Stream the FASTA file and write only the records whose ID is in the list
seqiter = SeqIO.parse(sys.argv[1], 'fasta')
SeqIO.write((seq for seq in seqiter if seq.id in wanted), sys.stdout, "fasta")
ADD COMMENTlink written 2.4 years ago by Buffo1.6k

:P You won, I've installed Biopython xD

ADD REPLYlink written 2.4 years ago by Xylanaser10

Of course I did!! jejeje Best regards :)

ADD REPLYlink written 2.4 years ago by Buffo1.6k
0
2.4 years ago by
Xylanaser10
Poland - Warsaw - SGGW/WULS
Xylanaser10 wrote:

They are Legionella proteins.

I wrote this ;/ but something's wrong...

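#!/bin/bash

# Download FASTA sequences given a file of UniProt IDs.
# Usage: script.sh names.txt output_dir

names=$1
file_of_seqs=$2

# Read the IDs before changing directory, so a relative path to the list still works
list=$(cat "${names}")

mkdir -p "${file_of_seqs}"
cd "${file_of_seqs}" || exit 1

for word in ${list}
do
    # The URL must be quoted: unquoted, the shell treats every '&' as
    # "run in background" and wget only sees the part before the first '&'
    wget -nv -O "${word}.fasta" "http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=${word}&fil=&limit=10&force=no&preview=true&format=fasta"
done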

EXAMPLE INPUT:

Lpar_2881 Lpar_2978 Lpar_3608 lpg0403

I'll try it in Python now (I'm also trying to learn Perl).

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Xylanaser10

XD, it's not complicated, you just have to do what I said:

1.- Go to the NCBI FTP page.
2.- Download your database.
3.- Copy and paste my script into any text editor and save it as extract_sequence_by_name_list.py (I use the Notepad++ text editor for that).
4.- Save your list of wanted proteins in a separate txt file, one name per line (make sure they have the same names as in the FASTA database).
5.- Run in bash as: python extract_sequence_by_name_list.py database.fasta wanted.txt > wanted_proteins.fasta
6.- Be happy :)
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Buffo1.6k

XD, but I don't use Biopython, I'd rather use my own scripts ;/ and xD this database is NR, I don't want to download the whole NR xD but thanks. Maybe tomorrow I will write a better script ;)
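
For anyone who wants the same extraction without Biopython, here is a minimal sketch in plain Python. It assumes the sequence ID is the first whitespace-delimited word of each header line and that the list file has one name per line; the function names `read_fasta` and `extract` are only illustrative.

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence_lines) pairs from an open FASTA file."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line[1:], []
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield header, lines


def extract(fasta_handle, wanted, out):
    """Write records whose ID is in `wanted`; return the set of IDs found."""
    found = set()
    for header, lines in read_fasta(fasta_handle):
        parts = header.split(None, 1)
        seq_id = parts[0] if parts else ""  # assumption: ID = first word
        if seq_id in wanted:
            out.write(">" + header + "\n")
            for seq_line in lines:
                out.write(seq_line + "\n")
            found.add(seq_id)
    return found


# Usage: python script.py database.fasta wanted.txt > wanted_proteins.fasta
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[2]) as names:
        wanted = {line.strip() for line in names if line.strip()}
    with open(sys.argv[1]) as fasta:
        found = extract(fasta, wanted, sys.stdout)
    # Report names that never matched, so mismatched IDs are easy to spot
    for missing in sorted(wanted - found):
        sys.stderr.write("not found: " + missing + "\n")
```

The file is streamed record by record, so a large database does not have to fit in memory.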

ADD REPLYlink written 2.4 years ago by Xylanaser10
0
2.4 years ago by
Geneva
Elisabeth Gasteiger1.6k wrote:

You can use the UniProt ID mapping service (http://www.uniprot.org/uploadlists): upload your list of identifiers and select to map from Gene names to UniProtKB ACs. The results can be downloaded in tab-separated format.

Alternatively use URLs like http://www.uniprot.org/uniprot/?query=gene%3ALpar_2881&format=fasta in your program.
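
A minimal sketch of building such URLs in Python, using only the standard library. The base URL and the `gene:` query prefix are copied from the example above and may have changed since; `fasta_url` and `fetch_fasta` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as given in the post; treat it as an assumption, the service may have moved
BASE_URL = "http://www.uniprot.org/uniprot/"


def fasta_url(gene_name):
    """Return the query URL for one gene name, with the ':' percent-encoded."""
    params = urlencode({"query": "gene:" + gene_name, "format": "fasta"})
    return BASE_URL + "?" + params


def fetch_fasta(gene_name, timeout=30):
    """Download the FASTA text for one gene name (performs a network request)."""
    with urlopen(fasta_url(gene_name), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_fasta` once per name and appending the results to one file avoids the "query too long" problem, since each request carries a single identifier. With close to a thousand names it is polite to pause briefly between requests.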

ADD COMMENTlink written 2.4 years ago by Elisabeth Gasteiger1.6k

I used esearch and efetch (in my script) :), but for a single ID esearch sometimes found several sequences instead of one: duplicates (which I deleted with genome tools) and other related sequences. This is OK, but I have too much trash.

ADD REPLYlink written 2.4 years ago by Xylanaser10

If you are using UniProtKB, you can of course add additional search criteria to avoid duplication, e.g. the taxonomy identifier:

gene:Lpar_2881 and organism:45071

For this particular organism, there are only unreviewed entries, but in other cases there may be reviewed and unreviewed ones, in which case it can be useful to also add reviewed:yes in case of redundancy/duplication.
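
As a rough sketch, the query string with these optional criteria could be assembled like this. The field names (`gene`, `organism`, `reviewed`) and the lowercase `and` are taken from the text above; `compose_query` is a hypothetical helper, not part of any UniProt client.

```python
from urllib.parse import quote


def compose_query(gene_name, organism=None, reviewed_only=False):
    """Join the search criteria from the post into one query string."""
    terms = ["gene:" + gene_name]
    if organism is not None:
        terms.append("organism:" + str(organism))  # taxonomy identifier
    if reviewed_only:
        terms.append("reviewed:yes")
    return " and ".join(terms)


def encoded_query(gene_name, organism=None, reviewed_only=False):
    """Percent-encode the query so it can be placed after 'query=' in a URL."""
    return quote(compose_query(gene_name, organism, reviewed_only), safe="")
```

For example, `compose_query("Lpar_2881", 45071)` gives the query shown above.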

An alternative approach may be to generate a list of all Legionella parisiensis entries with their ORFnames, and then look up your identifiers locally in this list:

http://www.uniprot.org/uniprot/?query=organism:45071

Customize your display, remove all irrelevant columns and add one for 'Gene name (ORFname)' as described in http://www.uniprot.org/help/customize :

e.g. http://www.uniprot.org/uniprot/?query=organism:45071&format=tab&columns=id,genes%28ORF%29,protein%20names
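
Once that tab-separated list is saved locally, the lookup could be sketched as below. This assumes the file starts with one header row, that the columns come in the order requested in the URL (accession, ORF names, protein names), and that several ORF names in one cell are separated by spaces; `build_orf_index` and `lookup` are illustrative names.

```python
import csv


def build_orf_index(handle):
    """Map each ORF name to the list of accessions that carry it."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    index = {}
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        accession, orf_field = row[0], row[1]
        for orf in orf_field.split():
            index.setdefault(orf, []).append(accession)
    return index


def lookup(index, identifiers):
    """Return {identifier: [accessions]}; unmatched identifiers get an empty list."""
    return {name: index.get(name, []) for name in identifiers}
```

Identifiers that map to an empty list are the ones to check by hand, and identifiers that map to more than one accession point to the redundancy discussed above.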

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Elisabeth Gasteiger1.6k