I have a reference FASTA protein database (~1M lines) which contains a mix of UniProt amino acid sequences and some of my own protein sequences. It looks something like this:
>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISS
>my_peptide_43624534
GNTSKTDEQFIHQECIAKSSLWKYTKITKSNVTSYQILWSCSASIDFCFIFYLNLLAGRFALLNTLTATRLLLCW
I also have a list (~1k lines) of unannotated amino acid sequences, which looks like this:
-unknown_pep1 ECIAKSSLWKY
-unknown_pep2 SNVTSYQILWSCS
I am trying to search the unknown amino acid sequences against the reference FASTA file and annotate each unknown peptide with either a UniProt name or a "my_peptide" name. I am a Python user. I tried loading the reference file into a pandas data frame and then using str.contains() to locate each peptide in the FASTA, but loading the FASTA into pandas takes forever because the file is just too big. I am thinking about using df.readline() to iterate over the FASTA instead, but that would still be 1M*1k iterations. Does anyone have a good idea for solving this problem quickly?
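
To make the line-by-line idea concrete, here is a minimal sketch of what I have in mind, using only the standard library instead of pandas. The helper names (read_fasta, read_peptides, annotate) are made up for illustration. It assumes every peptide entry is a name followed by its sequence, separated by whitespace or a newline, and that the first word of a FASTA header is a good enough label. It streams the FASTA one record at a time, so nothing large is held in memory, but it still performs records x peptides substring checks.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA handle, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_peptides(handle):
    """Return {name: sequence}, assuming tokens alternate name, sequence."""
    tokens = handle.read().split()
    return {name.lstrip("-"): seq for name, seq in zip(tokens[0::2], tokens[1::2])}


def annotate(fasta_handle, peptides):
    """Map each peptide name to the labels of all records containing it."""
    hits = {name: [] for name in peptides}
    for header, sequence in read_fasta(fasta_handle):
        label = header.split()[0]  # e.g. sp|P62258|1433E_HUMAN or my_peptide_...
        for name, peptide in peptides.items():
            if peptide in sequence:  # plain substring test, no regex needed
                hits[name].append(label)
    return hits


# Small in-memory demo built from the example records in this question.
fasta_text = """>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISS
>my_peptide_43624534
GNTSKTDEQFIHQECIAKSSLWKYTKITKSNVTSYQILWSCSASIDFCFIFYLNLLAGRFALLNTLTATRLLLCW
"""
peptide_text = "-unknown_pep1 ECIAKSSLWKY\n-unknown_pep2 SNVTSYQILWSCS\n"

hits = annotate(io.StringIO(fasta_text), read_peptides(io.StringIO(peptide_text)))
print(hits)
```

On the two example records above, both unknown peptides come back annotated with my_peptide_43624534. For the real files, the io.StringIO objects would be replaced by open(...) handles on the actual paths.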