Question

How to retrieve protein sequences from FASTA by using accession number?

0

5.3 years ago

gaiboyan23 ▴ 30

Newbie to bioinformatics here. I'm struggling with some code that shouldn't be hard. I have a list of 800+ accession numbers for proteins of interest, and I'm trying to get the corresponding protein sequence for all of them.

I've downloaded the FASTA file from UniProt, and I'm trying to figure out a way to get the sequences into a list using the Biopython module. So far my code looks something like this:

Creating the original list of 800+ accession numbers (this part is fine)

import openpyxl
file=openpyxl.load_workbook('substrate_1.xlsx')
Y_100= file.get_sheet_by_name ('Supplementary Table 2. Y100Bpa')
rownumber=Y_100.max_row
Acc=[]
for r in range (3, rownumber+1):
    Acc.append (Y_100.cell(column=1, row=r).value)

trying (and failing) to parse Fasta

import Bio
from Bio import SeqIO

for seq_record in SeqIO.parse("uniprot.fasta.", "fasta"): 
    if seq_record.id in Acc:   #is this how I would select for only the accession numbers from my original list
        print seq_record.id, repr(seq_record.seq, len(seq_record))  
    else:
        continue

So far this code doesn't work at all. What am I doing wrong here? I also tried creating a dictionary instead of a list; would that be a better solution?

Thanks in advance from someone lost in the world of bioinformatics

Protein FASTA sequence • 2.8k views

Updated 5.3 years ago by Joe 21k • written 5.3 years ago by gaiboyan23 ▴ 30

1

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

5.3 years ago by Ram 43k

0

Just reformatted. First-time poster here, thanks for the tip!

5.3 years ago by gaiboyan23 ▴ 30

1

When you say "this part is fine", do you mean:

there is no error
there is no error and a list of items is generated
there is no error and a list of valid Uniprot identifiers as seen in the spreadsheet is generated

Have you tried simply using a text file with the IDs alongside any tool from one of the gazillion answers on the site addressing "retrieve sequence by identifier" questions?

5.3 years ago by Ram 43k

0

There is no error, and the list was generated.

I've looked at some other posts, but they don't exactly fit my question, or they use different modules or programming languages. I'll keep looking, I guess.

5.3 years ago by gaiboyan23 ▴ 30

1

Have you checked manually that any of the accessions from your sheet match the accessions in the FASTA?

Note that Biopython by default only uses the header from > to the first space as the record ID, and you're doing direct string comparisons (though in rather than == is the better choice, it is still not guaranteed to work).
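To make that concrete, here is a minimal sketch in plain Python of what ends up in the record ID for a UniProt-style header, and how the entry name could be pulled out of it. The helper name entry_name is made up for illustration, and the code assumes the usual db|accession|entry_name layout; it mimics the "first token of the header" rule rather than calling Biopython itself.

```python
# Hypothetical helper (not part of Biopython): extract the UniProt entry name
# from a FASTA header line, assuming the db|accession|entry_name layout.
def entry_name(header):
    # First whitespace-delimited token after '>' is what SeqIO stores as record.id
    token = header.lstrip(">").split(None, 1)[0]
    fields = token.split("|")
    # Fall back to the whole token if the header is not in the three-field layout
    return fields[2] if len(fields) == 3 else token


header = (">sp|P49588|SYAC_HUMAN Alanine--tRNA ligase, cytoplasmic "
          "OS=Homo sapiens OX=9606 GN=AARS PE=1 SV=2")
print(entry_name(header))  # SYAC_HUMAN
```

So an ID such as sp|P49588|SYAC_HUMAN will never be equal to the bare string SYAC_HUMAN, which is why a direct comparison against the list finds nothing.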

Can you show us some of the format of your list of accessions and the format of your uniprot fasta?

5.3 years ago by Joe 21k

0

Yes, I've checked and the accessions match.

Here's the first line of my list of accession numbers (cut short for space):

['SYAC_HUMAN', 'PGAM4_HUMAN', 'SYEP_HUMAN', 'H14_HUMAN', 'K1C9_HUMAN', 'COPA_HUMAN', 'SYQ_HUMAN', ]

Here's from the FASTA file (I only included SYAC_HUMAN, the first accession number):

>sp|P49588|SYAC_HUMAN Alanine--tRNA ligase, cytoplasmic OS=Homo sapiens OX=9606 GN=AARS PE=1 SV=2
MDSTLTASEIRQRFIDFFKRNEHTYVHSSATIPLDDPTL...... (goes further on)

P.S. Actually, maybe using == is better than in, because I realized that some of the accession numbers in the FASTA are the same as the ones in my list but with an extra letter or number.

Updated 5.3 years ago by Ram 43k • written 5.3 years ago by gaiboyan23 ▴ 30

1

Your code should be Acc in seq_record.id, not the other way round. This is because you want to check whether SYAC_HUMAN is in sp|P49588|SYAC_HUMAN Alanine..., not the reverse.
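A small illustration of the difference, using the ID and a few names quoted earlier in the thread. The second ID, sp|FAKE01|SYAC_HUMAN2, is invented purely to show the "extra letter/number" case and is not a real UniProt entry.

```python
record_id = "sp|P49588|SYAC_HUMAN"
acc = ["SYAC_HUMAN", "PGAM4_HUMAN", "SYEP_HUMAN"]

# List membership compares whole strings, so the full ID is never found
in_list = record_id in acc

# Substring test, one accession at a time, does find the match
substring_hits = [a for a in acc if a in record_id]

# ...but a substring test also accepts longer names (made-up ID)
false_positive = "SYAC_HUMAN" in "sp|FAKE01|SYAC_HUMAN2"

print(in_list, substring_hits, false_positive)
```

The last line is the situation described in the postscript above: substring matching picks up entries whose name merely starts with one of the listed accessions.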

5.3 years ago by vkkodali_ncbi ★ 3.7k

0

Acc seems to be a list, so OP needs to account for that as well.
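One possible way to account for that, sketched under some assumptions: the IDs follow the db|accession|entry_name layout shown above, and an exact match on the entry name is what is wanted. The function names (uniprot_entry_name, select_records, select_from_fasta) are invented for this sketch, and the second demo record is a made-up ID. Putting the list into a set also keeps lookups fast for 800+ names.

```python
from types import SimpleNamespace


def uniprot_entry_name(record_id):
    """Return the third '|' field of a db|accession|entry_name ID, else the ID itself."""
    fields = record_id.split("|")
    return fields[2] if len(fields) == 3 else record_id


def select_records(records, wanted):
    """Yield records whose entry name exactly matches one of the wanted names."""
    wanted = set(wanted)  # set membership is an exact, constant-time lookup
    for record in records:
        if uniprot_entry_name(record.id) in wanted:
            yield record


def select_from_fasta(path, wanted):
    """Same filter applied to a FASTA file read with Biopython."""
    from Bio import SeqIO  # imported here so the helpers above run without Biopython
    return list(select_records(SeqIO.parse(path, "fasta"), wanted))


# Stand-in records for demonstration; the sequences are truncated or dummy values
demo = [
    SimpleNamespace(id="sp|P49588|SYAC_HUMAN", seq="MDSTLTASEIRQ"),
    SimpleNamespace(id="sp|FAKE01|SYAC_HUMAN2", seq="AAAA"),  # made-up ID
]
for rec in select_records(demo, ["SYAC_HUMAN", "PGAM4_HUMAN"]):
    print(rec.id, rec.seq, len(rec.seq))
```

With the real data this would be called as something like select_from_fasta("uniprot.fasta", Acc), where the file name is whatever the downloaded FASTA is actually called.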

5.3 years ago by Ram 43k

0

Please use the code (101010) formatting option to differentiate programmatic content from other content.

5.3 years ago by Ram 43k