Question

annotate unknown peptides by fasta reference

3.9 years ago

robinycfang ▴ 20

Hi,

I have a reference FASTA protein database (~1M lines) that contains a mix of UniProt amino acid sequences and some of my own protein sequences. It looks something like this:

>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISS

>my_peptide_43624534
GNTSKTDEQFIHQECIAKSSLWKYTKITKSNVTSYQILWSCSASIDFCFIFYLNLLAGRFALLNTLTATRLLLCW

I also have a list (~1k lines) of unannotated amino acid sequences, which looks like this:

-unknown_pep1    ECIAKSSLWKY

-unknown_pep2    SNVTSYQILWSCS

I am trying to search the unknown amino acid sequences against the reference FASTA file and annotate each unknown peptide with either a UniProt name or a "my_peptide" name. I am a Python user, and I tried to load the reference file into a pandas data frame and then use str.contains() to locate each peptide in the FASTA, but loading the FASTA into pandas takes forever because the file is just too big. I am thinking about using df.readline() to iterate over the FASTA, but that would still be 1M × 1k iterations. Does anyone have a good idea for solving this problem quickly?

Thanks!

Robin

python amino acid • 726 views

updated 3.9 years ago by GenoMax 141k • written 3.9 years ago by robinycfang ▴ 20

score 0 · Answer 1 · 2020-06-05

3.9 years ago

JC 13k

Align your sequences with BLAST or FASTA, save the results as a table (-outfmt 6 in BLAST), and parse the table in Python.
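The parsing step could look roughly like the sketch below. It assumes the table uses the default 12-column layout of BLAST `-outfmt 6` and keeps, for each unknown peptide, the hit with the highest bit score. The function name `best_hits` and the example rows (including their e-values and bit scores) are made up for illustration; in practice you would open your own results file instead of the in-memory string.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query id: row dict}, keeping the highest-bitscore hit per query."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=OUTFMT6_COLUMNS, delimiter="\t")
    for row in reader:
        if row["qseqid"].startswith("#"):
            continue  # skip comment lines, if the table has any
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Hypothetical rows for illustration only; scores and e-values are invented.
example = (
    "unknown_pep1\tsp|P62258|1433E_HUMAN\t60.000\t5\t2\t0\t1\t5\t3\t7\t5.0\t12.0\n"
    "unknown_pep1\tmy_peptide_43624534\t100.000\t11\t0\t0\t1\t11\t14\t24\t1e-05\t25.0\n"
    "unknown_pep2\tmy_peptide_43624534\t100.000\t13\t0\t0\t1\t13\t30\t42\t1e-07\t30.0\n"
)

# Map each unknown peptide to the name of its best-scoring reference sequence.
annotations = {
    query: hit["sseqid"]
    for query, hit in best_hits(io.StringIO(example)).items()
}
```

If you only want exact matches, you could additionally require `pident` of 100 and an alignment `length` equal to the peptide length before accepting a hit.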


score 0 · Answer 2 · 2020-06-05

3.9 years ago

GenoMax 141k

blat can be very handy here since you are looking for very similar sequences. There is no need to create indexes or anything: just pass two files as inputs, one as your database and one with your queries. More info here.
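The query file needs to be in FASTA format, so the peptide list from the question would have to be converted first. A minimal sketch of that conversion is below. It assumes each line holds a name and a sequence separated by whitespace, and that the leading "-" is a list marker rather than part of the name; the function name `peptide_list_to_fasta` is hypothetical.

```python
def peptide_list_to_fasta(lines):
    """Yield FASTA records for lines shaped like '-unknown_pep1    ECIAKSSLWKY'."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, seq = parts
        # Assumption: the leading '-' is a list marker, not part of the name.
        yield f">{name.lstrip('-')}\n{seq}\n"


# The two example peptides from the question, with a blank line between them.
example = [
    "-unknown_pep1    ECIAKSSLWKY",
    "",
    "-unknown_pep2    SNVTSYQILWSCS",
]
fasta_text = "".join(peptide_list_to_fasta(example))
```

Writing `fasta_text` to a file gives you the query input; the reference database from the question is already FASTA and can be used as-is.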
