Question: annotate unknown peptides by fasta reference
3 months ago, robinycfang wrote:

Hi,

I have a reference fasta protein database (~ 1M lines) which contains a mix of UniProt amino acid sequences and some of my own protein sequences. It looks something like this:

>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISS

>my_peptide_43624534
GNTSKTDEQFIHQECIAKSSLWKYTKITKSNVTSYQILWSCSASIDFCFIFYLNLLAGRFALLNTLTATRLLLCW

I also have a list (~ 1k lines) of unannotated amino acid sequences, which looks like this:

-unknown_pep1    ECIAKSSLWKY

-unknown_pep2    SNVTSYQILWSCS

I am trying to search the unknown amino acid sequences against the reference fasta file and annotate each unknown peptide with either a UniProt name or a "my_peptide" name. I am a Python user. I tried loading the reference file into a pandas data frame and then using str.contains() to locate each peptide in the fasta, but loading the fasta into pandas takes forever because the file is just too big. I am thinking about using readline() to iterate over the fasta file instead, but that would still be 1M*1k iterations. Does anyone have a good idea for solving this problem quickly?

Thanks!

Robin
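
For reference, the brute-force scan described in the question can be sketched in plain Python without pandas. This is only a minimal sketch: the function names read_fasta and annotate are made up for illustration, and it only finds exact substring matches. It streams the fasta one record at a time, so the file never has to fit in memory, but it still does roughly (number of records) x (number of peptides) substring checks.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def annotate(peptides, fasta_lines):
    """Map each peptide name to the reference headers whose sequence contains it exactly."""
    found = {name: [] for name in peptides}
    for header, sequence in read_fasta(fasta_lines):
        for name, peptide in peptides.items():
            if peptide in sequence:
                found[name].append(header)
    return found


# The two reference records and the two unknown peptides shown in the question.
reference = [
    ">sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1",
    "MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISS",
    ">my_peptide_43624534",
    "GNTSKTDEQFIHQECIAKSSLWKYTKITKSNVTSYQILWSCSASIDFCFIFYLNLLAGRFALLNTLTATRLLLCW",
]
peptides = {"unknown_pep1": "ECIAKSSLWKY", "unknown_pep2": "SNVTSYQILWSCS"}
result = annotate(peptides, reference)
print(result)
```

With a real file, an open file handle can be passed in place of the list, for example annotate(peptides, open("reference.fasta")), where reference.fasta is a placeholder name.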

3 months ago, JC wrote:

Align your sequences with BLAST or FASTA, save the results as a table (-outfmt 6 in BLAST+), and parse the table in Python.
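
A rough sketch of the parsing step, assuming the table uses the default 12 tab-separated columns of BLAST+ -outfmt 6. The function name best_hits and the example rows (including the subject ID sp|P00000|FAKE_HUMAN and all scores) are made up for illustration. It keeps the highest-bitscore hit per query; other selection rules, such as requiring 100% identity over the full peptide length, may suit this problem better.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return {query_id: (subject_id, percent_identity, bitscore)} for the top-scoring hit."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > current[2]:
            best[row["qseqid"]] = (row["sseqid"], float(row["pident"]), score)
    return best


# Made-up rows for illustration only; real rows come from the BLAST output file.
example = (
    "unknown_pep1\tmy_peptide_43624534\t100.000\t11\t0\t0\t1\t11\t14\t24\t1e-05\t25.0\n"
    "unknown_pep1\tsp|P00000|FAKE_HUMAN\t72.727\t11\t3\t0\t1\t11\t5\t15\t2.1\t12.3\n"
    "unknown_pep2\tmy_peptide_43624534\t100.000\t13\t0\t0\t1\t13\t30\t42\t1e-07\t30.1\n"
)
hits = best_hits(io.StringIO(example))
for query, (subject, pident, score) in hits.items():
    print(query, subject, pident, score)
```

For peptides this short, the blastp-short task of blastp (-task blastp-short) may be worth trying, since it is intended for short queries.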

3 months ago, genomax wrote:

BLAT can be very handy here since you are looking for very similar sequences. There is no need to create indexes or anything: just use two files (one as your database and one with your queries) as inputs. More info here.
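
One practical detail: alignment tools generally expect the query file to be in fasta format, while the list in the question is "name, whitespace, sequence". A small sketch of the conversion follows. The function name peptide_list_to_fasta is made up, and it assumes the leading "-" on each line is a list marker rather than part of the peptide name.

```python
def peptide_list_to_fasta(lines):
    """Turn 'name<whitespace>sequence' lines into FASTA text."""
    records = []
    for line in lines:
        # Assumption: a leading '-' is a bullet, not part of the name.
        fields = line.strip().lstrip("-").split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, sequence = fields
        records.append(f">{name}\n{sequence}\n")
    return "".join(records)


example_lines = [
    "-unknown_pep1    ECIAKSSLWKY",
    "",
    "-unknown_pep2    SNVTSYQILWSCS",
]
fasta_text = peptide_list_to_fasta(example_lines)
print(fasta_text, end="")
```

The resulting text can be written to a file and used as the query input.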
