Newbie to bioinformatics here. I'm struggling with some code that shouldn't be hard. I have a list of 800+ accession numbers for proteins of interest, and I'm trying to get the corresponding protein sequence for each of them.
I've downloaded the FASTA file from UniProt, and I'm trying to figure out a way to get the sequences into a list using the Biopython module. So far my code looks something like this:
Creating the original list of 800+ accession numbers (this part is fine):

    import openpyxl

    file = openpyxl.load_workbook('substrate_1.xlsx')
    Y_100 = file.get_sheet_by_name('Supplementary Table 2. Y100Bpa')
    rownumber = Y_100.max_row
    Acc = []
    for r in range(3, rownumber + 1):
        Acc.append(Y_100.cell(column=1, row=r).value)
Trying (and failing) to parse the FASTA file:

    import Bio
    from Bio import SeqIO

    for seq_record in SeqIO.parse("uniprot.fasta.", "fasta"):
        if seq_record.id in Acc:  # is this how I would select for only the accession numbers from my original list?
            print seq_record.id, repr(seq_record.seq, len(seq_record))
        else:
            continue
So far this code doesn't work at all. What am I doing wrong here? I also tried creating a dictionary instead of a list; would that be a better solution?
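For reference, here is a minimal sketch of one way the selection could work. It assumes the FASTA headers follow the usual UniProt layout, where Biopython's `seq_record.id` is the whole token `sp|P12345|NAME_HUMAN` and so never equals a bare accession such as `P12345`. The helper names `uniprot_accession`, `select_by_accession`, and `load_selected` are made up for illustration, and the sketch also assumes the values in `Acc` are plain strings that match the accessions exactly. Two other things in the snippet above would likely fail on their own: `repr()` accepts only one argument, and under Python 3 `print` needs parentheses.

```python
def uniprot_accession(record_id):
    """Return the accession from an ID like 'sp|P12345|NAME_HUMAN'.

    IDs without the pipe-delimited UniProt layout are returned unchanged.
    """
    parts = record_id.split("|")
    return parts[1] if len(parts) >= 3 else record_id


def select_by_accession(records, wanted):
    """Map accession -> record for records whose accession is in `wanted`."""
    wanted = set(wanted)  # set membership is fast; a list is scanned on every check
    selected = {}
    for record in records:
        acc = uniprot_accession(record.id)
        if acc in wanted:
            selected[acc] = record
    return selected


def load_selected(fasta_path, wanted):
    """Parse a FASTA file with Biopython and keep only the wanted accessions."""
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import SeqIO
    return select_by_accession(SeqIO.parse(fasta_path, "fasta"), wanted)
```

With Biopython installed, `load_selected("uniprot.fasta", Acc)` should return a dictionary keyed by accession, so `found[acc].seq` and `len(found[acc])` give the sequence and its length. On the list-versus-dictionary question: a dictionary (or a set, for the membership test alone) looks like the better fit here, since each lookup does not have to scan all 800+ entries, and any accession in `Acc` that is missing from the result is one that was not found in the file.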
Thanks in advance from someone lost in the world of bioinformatics.