Question: How to retrieve protein sequences from FASTA by using accession number?
5 months ago by
gaiboyan230
gaiboyan230 wrote:

Newbie to bioinformatics here. I'm struggling with some code that shouldn't be hard. I have a list of 800+ accession numbers for proteins of interest, and I'm trying to get the corresponding protein sequence for each of them.

I've downloaded the FASTA file from UniProt, and I'm trying to figure out a way to get the sequences into a list using the Biopython module. So far my code looks something like this:

Creating the original list of 800+ accession numbers (this part is fine)

import openpyxl
file=openpyxl.load_workbook('substrate_1.xlsx')
Y_100= file.get_sheet_by_name ('Supplementary Table 2. Y100Bpa')
rownumber=Y_100.max_row
Acc=[]
for r in range (3, rownumber+1):
    Acc.append (Y_100.cell(column=1, row=r).value)

Trying (and failing) to parse the FASTA file

import Bio
from Bio import SeqIO

for seq_record in SeqIO.parse("uniprot.fasta.", "fasta"): 
    if seq_record.id in Acc:   #is this how I would select for only the accession numbers from my original list
        print seq_record.id, repr(seq_record.seq, len(seq_record))  
    else:
        continue

So far this code doesn't work at all. What am I doing wrong here? I also tried creating a dictionary instead of a list; would that be a better solution?

Thanks in advance from someone lost in the world of bioinformatics

ADD COMMENTlink modified 5 months ago by jrj.healey12k • written 5 months ago by gaiboyan230

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

ADD REPLYlink written 5 months ago by RamRS21k

Just reformatted. First-time poster here, thanks for the tip!

ADD REPLYlink written 5 months ago by gaiboyan230

When you say "this part is fine", do you mean:

  1. there is no error
  2. there is no error and a list of items is generated
  3. there is no error and a list of valid Uniprot identifiers as seen in the spreadsheet is generated

Have you tried simply using a text file with the IDs alongside any tool from one of the gazillion answers on the site addressing "retrieve sequence by identifier" questions?

ADD REPLYlink written 5 months ago by RamRS21k

There is no error, and a list was generated.

I've looked at some other posts, but they don't seem to fit my question exactly, or they use different modules or programming languages. I'll keep looking, I guess.

ADD REPLYlink written 5 months ago by gaiboyan230

Have you checked manually that any of the accessions from your sheet match the accessions in the FASTA?

Note that Biopython by default only uses the header from > to the first space as the record ID, and you're doing direct string comparisons (though in rather than == is the better choice, it is still not guaranteed to work).

Can you show us some of the format of your list of accessions and the format of your uniprot fasta?
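
A rough sketch of how that manual check could be done in code: compare the list against the record IDs by exact match and see what is left over. The helper name overlap_report and the placeholder IDs below are made up for illustration; with a real file the IDs would come from something like [rec.id for rec in SeqIO.parse(path, "fasta")].

```python
def overlap_report(record_ids, accessions):
    """Split accessions into those found / not found among record IDs (exact match)."""
    id_set = set(record_ids)
    matched = [acc for acc in accessions if acc in id_set]
    missing = [acc for acc in accessions if acc not in id_set]
    return matched, missing


# Placeholder values for illustration only, not real data.
record_ids = ["db|X1|ID_A", "db|X2|ID_B"]
accessions = ["ID_A", "ID_C"]

matched, missing = overlap_report(record_ids, accessions)
print(len(matched), "matched;", len(missing), "missing")
```

If nothing matches even though the names look the same by eye, the record IDs probably carry extra fields around the name, as in the placeholder IDs above.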

ADD REPLYlink written 5 months ago by jrj.healey12k

Yes, I've checked and the accessions match.

Here's the first line of the list of my accession numbers (cut short for space):

['SYAC_HUMAN', 'PGAM4_HUMAN', 'SYEP_HUMAN', 'H14_HUMAN', 'K1C9_HUMAN', 'COPA_HUMAN', 'SYQ_HUMAN', ]

Here's from the FASTA file (I only included SYAC_HUMAN, the first accession number):

sp|P49588|SYAC_HUMAN Alanine--tRNA ligase, cytoplasmic OS=Homo sapiens OX=9606 GN=AARS PE=1 SV=2
MDSTLTASEIRQRFIDFFKRNEHTYVHSSATIPLDDPTL...... (goes further on)

P.S. Actually, maybe using == is better than in, because I realized some of the accession numbers in the FASTA are the same as ones in my list but with an extra letter/number.
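
To make the == versus in point concrete, here is a small sketch using the header shown above. It assumes the record ID is the header text up to the first whitespace, and that the ID has three fields separated by "|", as in this one example. The second ID (P00000, XSYAC_HUMAN) is made up to show a lookalike name.

```python
header = ("sp|P49588|SYAC_HUMAN Alanine--tRNA ligase, cytoplasmic "
          "OS=Homo sapiens OX=9606 GN=AARS PE=1 SV=2")

# The record ID is the first whitespace-delimited token of the header.
record_id = header.split()[0]

# Assumed layout for this header: database | accession | entry name.
database, accession, entry_name = record_id.split("|")

wanted = "SYAC_HUMAN"

# Substring test: succeeds here, but also succeeds on lookalikes.
substring_hit = wanted in record_id

# Exact test on the third field only.
exact_hit = wanted == entry_name

# Made-up ID whose entry name merely contains the wanted name.
lookalike_id = "sp|P00000|XSYAC_HUMAN"
lookalike_substring_hit = wanted in lookalike_id
lookalike_exact_hit = wanted == lookalike_id.split("|")[2]

print(substring_hit, exact_hit)
print(lookalike_substring_hit, lookalike_exact_hit)
```

The substring test accepts the lookalike, while the exact comparison against the third field rejects it, so comparing with == after splitting the ID is the safer of the two.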

ADD REPLYlink modified 5 months ago by RamRS21k • written 5 months ago by gaiboyan230

Your code should be Acc in seq_record.id, not the other way round. This is because you are checking whether SYAC_HUMAN is in sp|P49588|SYAC_HUMAN Alanine... and not the other way.

ADD REPLYlink written 5 months ago by vkkodali1.1k

Acc seems to be an array so OP needs to account for that as well.
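
Since Acc is a list, one possible way to put these suggestions together is sketched below. This is not a tested solution for the poster's actual files: it assumes every record ID has the three-field form shown earlier in the thread, and the function names (entry_name, select_records, select_from_fasta) are made up here. The Biopython call is kept inside select_from_fasta so the filter itself can be tried on stand-in records; the second stand-in ID is invented.

```python
from types import SimpleNamespace  # only used for the stand-in records below


def entry_name(record_id):
    """Return the third '|' field of the ID, or the whole ID if it has fewer fields."""
    parts = record_id.split("|")
    return parts[2] if len(parts) >= 3 else record_id


def select_records(records, accessions):
    """Keep records whose entry name exactly equals one of the accessions."""
    wanted = set(accessions)  # set membership is an exact match and fast
    return [rec for rec in records if entry_name(rec.id) in wanted]


def select_from_fasta(fasta_path, accessions):
    """Apply the same filter to a FASTA file (requires Biopython)."""
    from Bio import SeqIO
    return select_records(SeqIO.parse(fasta_path, "fasta"), accessions)


# Stand-in records: the first uses the ID and truncated sequence from the
# thread, the second is a made-up lookalike that should not be selected.
stand_ins = [
    SimpleNamespace(id="sp|P49588|SYAC_HUMAN",
                    seq="MDSTLTASEIRQRFIDFFKRNEHTYVHSSATIPLDDPTL"),
    SimpleNamespace(id="sp|P00000|XSYAC_HUMAN", seq="M"),
]
acc = ["SYAC_HUMAN", "PGAM4_HUMAN"]

picked = select_records(stand_ins, acc)
for rec in picked:
    print(rec.id, len(rec.seq))
```

With the real data, the call would be along the lines of select_from_fasta("uniprot.fasta", Acc), where the file name is whatever the downloaded file is actually called.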

ADD REPLYlink written 5 months ago by RamRS21k

Please use the code (101010) formatting option to differentiate programmatic content from other content.

ADD REPLYlink written 5 months ago by RamRS21k