annotate unknown peptides by fasta reference
3.9 years ago
robinycfang ▴ 20

Hi,

I have a reference FASTA protein database (~1M lines) that contains a mix of UniProt amino acid sequences and some of my own protein sequences. It looks something like this:

>sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens GN=YWHAE PE=1 SV=1
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISS

>my_peptide_43624534
GNTSKTDEQFIHQECIAKSSLWKYTKITKSNVTSYQILWSCSASIDFCFIFYLNLLAGRFALLNTLTATRLLLCW

I also have a list (~1k lines) of unannotated amino acid sequences, which looks like this:

-unknown_pep1    ECIAKSSLWKY

-unknown_pep2    SNVTSYQILWSCS

I am trying to search the unknown amino acid sequences against the reference FASTA file and annotate each unknown peptide with either a UniProt name or a "my_peptide" name. I am a Python user, and I tried to load the reference file into a pandas data frame and then use str.contains() to locate each peptide in the FASTA, but loading the FASTA into pandas takes forever because the file is just too big. I am thinking about using df.readline() to iterate over the FASTA, but that would still be 1M*1k iterations. Does anyone have a good idea for a fast way to solve this?

Thanks!

Robin
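
For exact substring matches like the two examples above, a minimal plain-Python sketch (no pandas) could stream the FASTA one record at a time and test each peptide with the `in` operator. The function names `read_fasta`, `read_peptides`, and `annotate` are hypothetical, and the code assumes the leading "-" in the peptide list is not part of the peptide name and that the first whitespace-delimited token of each FASTA header is a sufficient annotation. It is still one substring check per record per peptide, but only one record is held in memory at a time.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time.

    Sequence lines belonging to one record are joined, so multi-line
    records are handled; blank lines are skipped.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_peptides(handle):
    """Parse lines like '-unknown_pep1    ECIAKSSLWKY' into {name: sequence}.

    Assumption: the leading '-' is list formatting, not part of the name.
    """
    peptides = {}
    for line in handle:
        parts = line.split()
        if len(parts) == 2:
            peptides[parts[0].lstrip("-")] = parts[1]
    return peptides


def annotate(fasta_handle, peptides):
    """Return {peptide_name: [reference IDs whose sequence contains it]}."""
    hits = defaultdict(list)
    for header, seq in read_fasta(fasta_handle):
        ref_id = header.split()[0]  # e.g. 'sp|P62258|1433E_HUMAN'
        for name, pep in peptides.items():
            if pep in seq:  # exact substring match only
                hits[name].append(ref_id)
    return dict(hits)


# Small in-memory demo; real use would pass open file handles instead.
demo_fasta = io.StringIO(">my_peptide_1\nGNTSKTDEQF\n")
demo_peps = read_peptides(io.StringIO("-unknown_pep1    TSKT\n"))
demo_hits = annotate(demo_fasta, demo_peps)
```

This only finds perfect matches; peptides with mismatches or gaps relative to the reference would need an alignment tool.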

python amino acid • 726 views
3.9 years ago
JC 13k

Align your sequences with BLAST or FASTA, save the results as a table (-outfmt 6 in BLAST), and parse the table in Python.
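
As a rough sketch of the parsing step, the code below reads BLAST tabular output (-outfmt 6 with its default 12 columns) and keeps the highest-bitscore subject for each query. The function name `best_hits` is hypothetical, and the code assumes the default column order was not customized when running BLAST.

```python
import csv
import io

# Default column order of BLAST -outfmt 6.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query_id: subject_id} keeping the highest-bitscore hit per query."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (present in -outfmt 7 style output).
        if not row or row[0].startswith("#"):
            continue
        rec = dict(zip(BLAST6_COLUMNS, row))
        score = float(rec["bitscore"])
        query = rec["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (rec["sseqid"], score)
    return {query: subject for query, (subject, _) in best.items()}


# Small in-memory demo; real use would pass an open file handle instead.
demo_table = io.StringIO(
    "unknown_pep1\tmy_peptide_43624534\t100.000\t11\t0\t0\t1\t11\t14\t24\t1e-05\t30.0\n"
)
demo_best = best_hits(demo_table)
```

For stricter annotation, the same loop could also require `pident` to be 100 and `length` to equal the peptide length, so that only full-length perfect matches are reported.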

3.9 years ago
GenoMax 141k

blat can be very handy here since you are looking for very similar sequences. No need to create indexes or anything. Just use two files (one as your database and one with your queries) as inputs. More info here.
