Question: Parse Uniprot Fasta Record, Save It, Trim It And Quantify Each Residue
gravatar for Pals
7.5 years ago by
Pals1.3k wrote:


I have a list of UniProt Id's that I want to analyse. I want to extract fasta record for each uniprot id, save those locally with filename_nn (nn is the length of signal peptide) and calculate total number of each residue. How could I do this?

Please suggest me if there is alternate way for the solution.


fasta uniprot • 3.5k views
ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Pals1.3k

From where is the value of nn obtained? It's something you know already, in the file of IDs, or something you have to calculate from the UniProt record?

ADD REPLYlink written 7.5 years ago by Neilfws48k

It's from UniProt record under Sequence Annotation . Feature Key: Signal peptide , Positions: 1-nn, length: nn

ADD REPLYlink written 7.5 years ago by Pals1.3k

Total number of each residue == sequence length you mean? What you want to do with total number of each residue? just to place them after filename (e.g filename_312)?

ADD REPLYlink written 7.5 years ago by Thaman3.3k

Sorry, its total count of each residue (No of Gly, Pro, Ala... in the sequence). When Fasta record is retrieved, it needs to be saved as UniProtID_nn, where nn is the length of signal peptide in the sequence as described in my comment above.

ADD REPLYlink written 7.5 years ago by Pals1.3k
gravatar for JC
7.5 years ago by
JC11k wrote:

You can get the Fasta sequence if you query UniProt as:[UniProtID].fasta

So you can script in Perl/Python to get your sequences, store locally and get the statistics, something like:

use strict;
use warnings;
use LWP::Simple;

my @ids  = qw'P22314';
foreach my $id (@ids) {
    my $req = get("$id.fasta");
    if (defined $req) {
        open  FA, ">$id.fa" or die;
        print FA  "$req\n";
        close FA;
        open  ST, ">$id.stat" or die;
        my ($name, @seq) = split (/\n/, $req);
        my %aa;
        my $seq = join ("", @seq);
        while (my $aa = chop $seq) {
        while (my ($aa, $nn) = each %aa) {
            print ST "$aa\t$nn\n";
        close ST; 
    else {
        warn "cannot retrieve $id\n";
ADD COMMENTlink written 7.5 years ago by JC11k

Thanks a lot JC. However, how would I exclude Nucleotide binding region (478-507) in the UniProt entry, in P22314 to calculate statistics? In my case, it is signal peptide that I want to get rid of. Take this example, - when we scroll down to Sequence Annotation, under Molecular Processing, there is signal peptide of 24 residues which I want to exclude. In my analysis set, all sequences contain signal peptide.

ADD REPLYlink written 7.5 years ago by Pals1.3k

then you need another query or source to exclude some positions, it's not hard to implement in the code.

ADD REPLYlink written 7.5 years ago by JC11k
gravatar for Thaman
7.5 years ago by
Thaman3.3k wrote:

Here, I have written in Python which does the job. If you want to get more information regarding the any uniprot ID then its better to fetch xml ( file rather then fasta and parse the needed content using either lxml, SAX or dom. If you want then I can write xml parser too.

*This script assumes that all the Uniprot accession are available so haven't written
any exception for unknown accession.

*The fasta file will be saved as O60243_411.fasta (filname_lengthofsequence.fasta)

import urllib2

ids_list = ['O60243', 'Q8IZP7', 'P02070'] # or read from the file (.txt, csv)
uniprot_url = ""  # constant Uniprot Namespace

def get_fasta(id):
    # get the fasta record of the Uniprot ID 

    url_with_id = "%s%s%s" %(uniprot_url, id, ".fasta") # <>
    file_from_uniprot = urllib2.urlopen(url_with_id)

    # file_from_uniprot = urllib2.urlopen(url_with_id).read() is better no extra variable

    data = # read the retrieve file 

    - replace all the newlines from the file_from_uniprot by replace('\n') method
    and make it one line.
    - after that split the line by split('SV=') and get first second item from the list which is 1
    - then compute the length of the sequence starting from 1 to length of sequence.
    get_only_sequence = data.replace('\n', '').split('SV=')[1]
    length_of_sequence =  len(get_only_sequence[1:len(get_only_sequence)])
    file_output_name = "%s%s%s%s" %(id,"_", length_of_sequence, ".fasta")

    # print file_output_name

    # write file to local computer
    with open(file_output_name, "wb") as fasta_file:
    print "completed"

def main():
    # iterate over the list of IDS

    for k,v in enumerate(ids_list):

    # or read from a text file
    # input_file = open("myuniprot_id.txt").readlines()
    # get_fasta(input_file)

if __name__ == '__main__':


My first attempt wasn't to your requirement which I misunderstood. But, I spare sometime and wrote new script and sure its what you trying to achieve.

I have tried to clear by writing precise comments with lengthy variable names. It creates three different files (fasta, fasta without signal peptide and statistics). Here is a code which I haven't done any refactoring.

import xml.etree.ElementTree as ET
import urllib2
from collections import Counter # require Python 2.7 and higher

ids_list = ['P01165', 'P53634', 'Q8CW46']
uniprot_xml_namespace = "{}"
uniprot_url = ""  # constant Uniprot Namespace

def get_xml_feature_signal_peptide(id):
    # get id and fetched the content as xml from Uniprot.
    # the xml file from uniprot has all the information which is parsed using ElementTree

    url_with_id = "%s%s%s" %(uniprot_url, id, ".xml") # <>

    file_to_parse = urllib2.urlopen(url_with_id)

    tree = ET.parse(file_to_parse) # parse the xml file
    from the tree get feature element with attribute signal peptide and find the
    position. Here its assume the length of the signal peptide will be equivalent
    to end position.
    get_feature = tree.findall(".//"+uniprot_xml_namespace+"feature[@type='signal peptide']")
    if get_feature:
        feature_location_position = []
        for eachitem in get_feature:
            for item in eachitem.getchildren()[0]:
        get_fasta_file_and_statistics(id, feature_location_position)
        print "no signal peptide found for id: %s" %id 

def get_fasta_file_and_statistics(id, feature):
    url_with_id = "%s%s%s" %(uniprot_url, id, ".fasta") # <>
    file_from_uniprot = urllib2.urlopen(url_with_id)
    data = # read the retrieve file
    save_fasta_name = "%s%s%s%s" % (id, "_", feature[1], ".fasta")
    save_fasta_stat_name = "%s%s%s%s%s" % (id, "_", feature[1], "_stat", ".txt")
    save_fasta_without_sigpep = "%s%s%s%s%s" % (id, "_", feature[1], "_without_pepsig", ".fasta")

    # write fasta file to the local computer
    with open(save_fasta_name, "wb") as fasta_file:
    print "completed writing fasta file"

    # remove signal peptide from the sequence
    peptide_seq_region = int(feature[1])
    get_only_sequence = data.replace('\n', '').split('SV=')[1]
    length_of_sequence =  len(get_only_sequence[1:len(get_only_sequence)])
    complete_sequence = get_only_sequence[1:len(get_only_sequence)]
    rm_sig_pep_seq = complete_sequence[peptide_seq_region:len(complete_sequence)]
    statistics = Counter(rm_sig_pep_seq)

    # write statistics file to local computer

    stat_file = open(save_fasta_stat_name, 'w')
    aa_one_letter = "ARNDBCEQZGHILKMFPSTWYV"
    for letter in aa_one_letter:
        stat_file.writelines('%s : %d \n' % (letter, statistics[letter]))
    print "completed writing statistics"

    # write fasta file without peptide signal
    accession = data.split('\n')[0]
    without_sigpep = open(save_fasta_without_sigpep,'w')
    # write 100 residues per line and do line break
    without_sigpep.writelines(accession+'\n'+'\n'.join(rm_sig_pep_seq[pos:pos+100] for pos in xrange(0, len(rm_sig_pep_seq), 100)))
    print "Completed writing fasta file without signal peptide residues"

def main():
    for k,v in enumerate(ids_list):

    # OR
    # get_xml_feature_signal_peptide('Q8CW46')

    # OR 
    #read from a text file
    #input_file = open("myuniprot_id.txt").readlines()

if __name__ == '__main__':

Hope it helps :)

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Thaman3.3k

Wonderful! Thanks a lot :-)

ADD REPLYlink written 7.5 years ago by Pals1.3k
gravatar for Elisabeth Gasteiger
7.5 years ago by
Elisabeth Gasteiger1.7k wrote:

You can also use the Batch Retrieval on the UniProt web site ( and then use the GFF format by following the instructions from this FAQ:

ProtParam could help to compute the amino acid composition.

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Elisabeth Gasteiger1.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 616 users visited in the last hour