How can I get FASTA sequences if I have the names of proteins?
7.3 years ago
Xylanaser ▴ 80

Hey, I have a problem. I have names of proteins, for example lpg_200 etc. How can I get FASTA sequences for them?

Regards
X

protein FASTA • 3.8k views

Elaborate more please.


I have <1000 protein names (kds_0989, xyz_3999, etc.) and I need to get a file of FASTA sequences for them. I tried querying UniProt and NCBI with them, but the query is too long.


Please post a few real examples of the IDs. Database identifiers differ from database to database, and the answer may depend on which kind you have.

Also use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

7.3 years ago
Buffo ★ 2.4k

Go to ftp://ftp.ncbi.nih.gov/refseq/ and download the corresponding database (human? mouse? etc.), then extract the sequences with a simple Python script:

from Bio import SeqIO
import sys

syntax = '''
------------------------------------------------------------------------------------
Syntax:        python extract_sequence_by_name_list.py *file1.fasta **file2.txt
*Sequences in fasta format
**List of sequences to extract; must have the same name as in fasta file without '>'
------------------------------------------------------------------------------------
'''
if len(sys.argv) != 3:
    print(syntax)
    sys.exit(1)

# Read the wanted IDs (one per line) into a set for fast lookups; skip blank lines
with open(sys.argv[2]) as handle:
    wanted = {line.strip() for line in handle if line.strip()}

# Stream the FASTA file and write only the records whose ID is in the list
seqiter = SeqIO.parse(sys.argv[1], 'fasta')
SeqIO.write((seq for seq in seqiter if seq.id in wanted), sys.stdout, "fasta")

:P You won, I've installed Biopython xD


Of course I did!! jejeje Best regards :)

7.3 years ago
Xylanaser ▴ 80

They're Legionella proteins.

I wrote this ;/ but something's wrong...

#!/bin/bash

#download fasta seqs given file of uniprot ids

names=$1
file_of_seqs=$2

list=$(cat ${1})

mkdir ${file_of_seqs}
cp ${2} ${file_of_seqs}
cd ${file_of_seqs}

for word in ${list}
do
    wget -nv http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=$word&fil=&limit=10&force=no&preview=true&format=fasta

done

EXAMPLE INPUT:

Lpar_2881 Lpar_2978 Lpar_3608 lpg0403

I'll try it in Python now (I'm also trying to learn Perl).
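
One likely cause of the problem in the bash script above: the URL is not quoted, so the shell treats every `&` as "run in the background" and wget only receives the part up to `?sort=score`. Wrapping the URL in double quotes avoids that. The same idea as a minimal Python sketch, using only the standard library: the function names `build_fasta_url` and `read_identifiers` are made up for illustration, the parameters are the ones from the wget line, and whether the UniProt endpoint still accepts them in this form is an assumption.

```python
from urllib.parse import urlencode

# Endpoint copied from the wget line above (assumed to still be valid)
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_fasta_url(identifier, limit=10):
    """Return a search URL for one identifier with all parameters URL-encoded."""
    params = {
        "sort": "score",
        "query": identifier,
        "limit": limit,
        "format": "fasta",
    }
    return BASE_URL + "?" + urlencode(params)


def read_identifiers(text):
    """Split whitespace-separated identifiers, as in the example input."""
    return text.split()


if __name__ == "__main__":
    for name in read_identifiers("Lpar_2881 Lpar_2978 Lpar_3608 lpg0403"):
        print(build_fasta_url(name))
```

Each printed URL can then be fetched one at a time (for example with `urllib.request.urlopen`), which also avoids the "query too long" problem of sending all names at once.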


XD, it's not complicated, you just have to do what I said:

1.- Go to the NCBI FTP page.
2.- Download your database.
3.- Copy and paste my script into any text editor and save it as extract_sequence_by_name_list.py (I use Notepad++ for that).
4.- Save your list of wanted proteins in a separate txt file, one name per line (make sure the names are the same as in the FASTA database).
5.- Run it in bash as: python extract_sequence_by_name_list.py database.fasta wanted.txt > wanted_proteins.fasta
6.- Be happy :)

XD, but I don't use Biopython, I'd rather use my own scripts ;/ and xD this database is NR, I don't want to download the whole NR xD but thanks. Maybe tomorrow I will write a better script ;)
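
For anyone who also prefers to avoid Biopython, the extraction step can be sketched in plain Python. This is only a rough equivalent of the script above, not a replacement for it: the function names `read_fasta` and `extract` are made up, the ID is assumed to be the first word of the header line, and the demo record is invented.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract(handle, wanted):
    """Yield only the records whose ID (first word of the header) is in `wanted`."""
    wanted = set(wanted)
    for header, seq in read_fasta(handle):
        words = header.split()
        record_id = words[0] if words else ""
        if record_id in wanted:
            yield header, seq


if __name__ == "__main__":
    # Invented two-record FASTA, only to show the call
    demo = io.StringIO(">lpg0403 made-up example\nMKT\nAAL\n>other_1\nGGG\n")
    for header, seq in extract(demo, ["lpg0403"]):
        print(">" + header)
        print(seq)
```

Because the file is read record by record, memory use stays small even for a large database file.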

7.3 years ago

You can use the UniProt ID mapping service at http://www.uniprot.org/uploadlists. Upload your list of identifiers and select to map from Gene names to UniProtKB ACs. The results can be downloaded in tab-separated format.

Alternatively use URLs like http://www.uniprot.org/uniprot/?query=gene%3ALpar_2881&format=fasta in your program.


I used esearch and efetch (in my script) :), but for a single ID esearch sometimes found several sequences instead of one: duplicates (which I removed with genome tools) and other related sequences. This is OK, but I get too much junk.
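
For the duplicate part, a minimal Python sketch of one way to drop exact repeats is shown here. It is not what genome tools does internally; it simply keeps the first record seen for each distinct sequence (compared case-insensitively), and the function name `deduplicate` and the demo records are made up. It does not help with the related but non-identical hits.

```python
def deduplicate(records):
    """Yield (header, sequence) pairs, keeping the first record per distinct sequence."""
    seen = set()
    for header, seq in records:
        key = seq.upper()  # treat 'mkt' and 'MKT' as the same sequence
        if key not in seen:
            seen.add(key)
            yield header, seq


if __name__ == "__main__":
    # Invented records, only to show the call
    hits = [("hit_1", "MKT"), ("hit_2", "mkt"), ("hit_3", "GGG")]
    for header, seq in deduplicate(hits):
        print(">" + header)
        print(seq)
```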


If you are using UniProtKB, you can of course add additional search criteria to avoid duplication, e.g. the taxonomy identifier:

gene:Lpar_2881 and organism:45071

For this particular organism, there are only unreviewed entries, but in other cases there may be reviewed and unreviewed ones, in which case it can be useful to also add reviewed:yes in case of redundancy/duplication.
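
Composing such a query in a program could look like the following sketch. The helper names `build_query` and `build_url` are hypothetical, the field names (`gene`, `organism`, `reviewed`) and the endpoint are the ones quoted in this thread, and it is an assumption that the service still accepts them in this form.

```python
from urllib.parse import quote

# Endpoint as quoted in this thread (assumed to still be valid)
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query(gene, taxon_id=None, reviewed_only=False):
    """Combine a gene name with optional taxonomy and reviewed filters."""
    parts = ["gene:" + gene]
    if taxon_id is not None:
        parts.append("organism:" + str(taxon_id))
    if reviewed_only:
        parts.append("reviewed:yes")
    return " and ".join(parts)


def build_url(query, fmt="fasta"):
    """URL-encode the query (':' becomes %3A, spaces become %20) and add the format."""
    return BASE_URL + "?query=" + quote(query, safe="") + "&format=" + fmt


if __name__ == "__main__":
    print(build_url(build_query("Lpar_2881", taxon_id=45071)))
```

Keeping the query text and the URL encoding in separate functions makes it easy to print the plain query for checking before it is sent.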

An alternative approach may be to generate a list of all Legionella parisiensis entries with their ORFnames, and then look up your identifiers locally in this list:

http://www.uniprot.org/uniprot/?query=organism:45071

Customize your display, remove all irrelevant columns and add one for 'Gene name (ORFname)' as described in http://www.uniprot.org/help/customize :

e.g. http://www.uniprot.org/uniprot/?query=organism:45071&format=tab&columns=id,genes%28ORF%29,protein%20names
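
The local lookup could then be sketched as below. The assumptions are that the downloaded table has a header row, that the columns come in the order requested in the URL (entry, ORF names, protein names), and that several ORF names in one cell are separated by spaces. The function name `load_orf_table` is made up, and the accessions in the demo are placeholders, not real entries.

```python
import csv
import io


def load_orf_table(handle):
    """Map each ORF name to the list of entries that carry it."""
    table = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        entry, orf_names = row[0], row[1]
        for name in orf_names.split():
            table.setdefault(name, []).append(entry)
    return table


if __name__ == "__main__":
    # Placeholder rows in the assumed layout; X00001 and X00002 are not real accessions
    demo = io.StringIO(
        "Entry\tGene names (ORF)\tProtein names\n"
        "X00001\tLpar_2881\tUncharacterized protein\n"
        "X00002\tLpar_2978 Lpar_2979\tUncharacterized protein\n"
    )
    table = load_orf_table(demo)
    for name in ["Lpar_2881", "Lpar_2978", "Lpar_3608"]:
        print(name, table.get(name, "not found"))
```

Identifiers that map to more than one entry show up as lists with several items, which makes the redundant cases easy to spot.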
