Question: Annotate unknown peptides against a FASTA reference
8 months ago, robinycfang wrote:


I have a reference FASTA protein database (~1M lines) that contains a mix of UniProt amino acid sequences and some of my own protein sequences. The headers look something like this:

>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1


I also have a list (~1k lines) of unannotated amino acid sequences, which looks like this:

-unknown_pep1    ECIAKSSLWKY

-unknown_pep2    SNVTSYQILWSCS

I am trying to search the unknown amino acid sequences against the reference FASTA file and annotate each unknown peptide with either a UniProt name or a "my_peptide" name. I am a Python user. I tried loading the reference file into a pandas data frame and using str.contains() to locate each peptide in the FASTA, but loading the FASTA into pandas takes forever because the file is just too big. I am thinking about using readline() to iterate over the FASTA instead, but that would still be 1M * 1k iterations. Does anyone have a good idea for solving this problem quickly?
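For reference, the exact-substring search described above can be sketched without pandas by streaming the FASTA once and testing every peptide against each record. This is a minimal sketch, not a tuned solution: the function names (read_fasta, read_peptides, annotate) and the toy records are hypothetical, and it assumes the peptide file has a name and a sequence separated by whitespace, as in the two example lines. Only exact matches are found; a peptide with a single mismatch would be missed.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_peptides(handle):
    """Parse lines like '-unknown_pep1    ECIAKSSLWKY' into {name: sequence}."""
    peptides = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            peptides[fields[0].lstrip("-")] = fields[1]
    return peptides


def annotate(fasta_handle, peptides):
    """Map each peptide name to the headers of records that contain it exactly."""
    hits = {name: [] for name in peptides}
    for header, seq in read_fasta(fasta_handle):
        for name, pep in peptides.items():
            if pep in seq:  # plain substring test, no mismatches allowed
                hits[name].append(header)
    return hits


# Toy data (made up for illustration; not real UniProt records).
toy_fasta = io.StringIO(">toy_A\nMAAAECIAK\nSSLWKYGG\n>toy_B\nTTSNVTSYQILWSCSKK\n")
toy_peptides = io.StringIO(
    "-unknown_pep1    ECIAKSSLWKY\n\n-unknown_pep2    SNVTSYQILWSCS\n"
)
for name, headers in annotate(toy_fasta, read_peptides(toy_peptides)).items():
    print(name, headers or "no exact match")
```

With real files, the two StringIO objects would be replaced by open() handles. The loop still performs (number of records) * (number of peptides) substring tests, but each test runs in C and only one record is held in memory at a time.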



8 months ago, JC wrote:

Align your sequences with BLAST or FASTA, save the results as a table (-outfmt 6 in BLAST), and parse the table in Python.
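A rough sketch of the parsing step, assuming BLAST+ was run with the default -outfmt 6 columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The table could come from something like `makeblastdb -in reference.fasta -dbtype prot` followed by `blastp -task blastp-short -query unknown.fasta -db reference.fasta -outfmt 6 -out hits.tsv`, where the file names are placeholders and the peptide list has first been converted to FASTA. The function name best_hits and the demo rows are made up for illustration.

```python
import csv
import io

# Default column order for BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query id: best-scoring hit record}, ranked by bit score."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(BLAST6_COLUMNS, row))
        rec["pident"] = float(rec["pident"])
        rec["bitscore"] = float(rec["bitscore"])
        query = rec["qseqid"]
        if query not in best or rec["bitscore"] > best[query]["bitscore"]:
            best[query] = rec
    return best


# Made-up rows for illustration only; real values come from the BLAST run.
demo = io.StringIO(
    "unknown_pep1\ttoy_A\t90.9\t11\t1\t0\t1\t11\t5\t15\t0.5\t20.0\n"
    "unknown_pep1\ttoy_B\t100.0\t11\t0\t0\t1\t11\t40\t50\t0.01\t25.5\n"
)
for query, hit in best_hits(demo).items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```

The sseqid of the best hit is the reference identifier to use as the annotation. Depending on how the database was built, it may need a lookup back into the FASTA headers to recover the full description line.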

8 months ago, GenoMax (United States) wrote:

BLAT can be very handy here since you are looking for very similar sequences. There is no need to create indexes or anything: just pass two files as inputs, one as your database and one with your queries. More info here.
