I encountered the need to 'collapse' a set of peptides today in to a single string. Not very difficult I thought, and then realised that this is a familiar problem:- it's basically kmer assembly.
Now, what I want to do doesn't need anything as heavy duty as de Bruijn graph assembly, but the existing StackOverflow answers are various shades of inefficient, so I thought this might be a fun programming challenge for the site.
So, here's a list of peptides:
["MSTSTSQIA","STSTSQIAV","TSTSQIAVE","STSQIAVEY","TSQIAVEYP","SQIAVEYPI","QIAVEYPIP","IAVEYPIPV","AVEYPIPVY","VEYPIPVYR","EYPIPVYRF","YPIPVYRFI","PIPVYRFIV","IPVYRFIVS","PVYRFIVSV","VYRFIVSVG","YRFIVSVGD","RFIVSVGDE","FIVSVGDEK","IVSVGDEKI","VSVGDEKIP","SVGDEKIPF","VGDEKIPFN","GDEKIPFNS","DEKIPFNSV","EKIPFNSVS","KIPFNSVSG","IPFNSVSGL","PFNSVSGLD","FNSVSGLDI","NSVSGLDIS","SVSGLDISY","VSGLDISYD","SGLDISYDT","GLDISYDTI","LDISYDTIE","DISYDTIEY","ISYDTIEYR","SYDTIEYRD","YDTIEYRDG","DTIEYRDGV","TIEYRDGVG","IEYRDGVGN","EYRDGVGNW","YRDGVGNWF","RDGVGNWFK","DGVGNWFKM","GVGNWFKMP","VGNWFKMPG","GNWFKMPGQ","NWFKMPGQS","WFKMPGQSQ","FKMPGQSQS","KMPGQSQST","MPGQSQSTN","PGQSQSTNI","GQSQSTNIT","QSQSTNITL","SQSTNITLR","QSTNITLRK","STNITLRKG","TNITLRKGV","NITLRKGVF","ITLRKGVFP","TLRKGVFPG","LRKGVFPGK","RKGVFPGKT","KGVFPGKTE","GVFPGKTEL","VFPGKTELF","FPGKTELFD","PGKTELFDW","GKTELFDWI","KTELFDWIN","TELFDWINS","ELFDWINSI","LFDWINSIQ","FDWINSIQL","DWINSIQLN","WINSIQLNQ","INSIQLNQV","NSIQLNQVE","SIQLNQVEK","IQLNQVEKK","QLNQVEKKD","LNQVEKKDI","NQVEKKDIT","QVEKKDITI","VEKKDITIS","EKKDITISL","KKDITISLT","KDITISLTN","DITISLTND","ITISLTNDA","TISLTNDAG","ISLTNDAGT","SLTNDAGTE","LTNDAGTEL","TNDAGTELL","NDAGTELLM","DAGTELLMT","AGTELLMTW","GTELLMTWN","TELLMTWNV","ELLMTWNVS","LLMTWNVSN","LMTWNVSNA","MTWNVSNAF","TWNVSNAFP","WNVSNAFPT","NVSNAFPTS","VSNAFPTSL","SNAFPTSLT","NAFPTSLTS","AFPTSLTSP","FPTSLTSPS","PTSLTSPSF","TSLTSPSFD","SLTSPSFDA","LTSPSFDAT","TSPSFDATS","SPSFDATSN","PSFDATSND","SFDATSNDI","FDATSNDIA","DATSNDIAV","ATSNDIAVQ","TSNDIAVQE","SNDIAVQEI","NDIAVQEIT","DIAVQEITL","IAVQEITLM","AVQEITLMA","VQEITLMAD","QEITLMADR","EITLMADRV","ITLMADRVI","TLMADRVIM","LMADRVIMQ","MADRVIMQA","ADRVIMQAV"]
The challenge is to come up with the fastest/most efficient approach to reconstitute the originating sequence.
>Sequence
MSTSTSQIAVEYPIPVYRFIVSVGDEKIPFNSVSGLDISYDTIEYRDGVGNWFKMPGQSQ
STNITLRKGVFPGKTELFDWINSIQLNQVEKKDITISLTNDAGTELLMTWNVSNAFPTSL
TSPSFDATSNDIAVQEITLMADRVIMQAV
In this case, all peptides overlap perfectly with a 1 amino acid overlap, and the sequences are already in the correct order.
Bonus points if you:
- come up with some code that can take arbitrary overlap lengths (potentially of different lengths?!)
- come up with some code that can handle different length peptides/'kmers"
- come up with some code that can handle the list of peptides being out of order
- come up with a lightweight de Bruijn implementation!
I'm interested in python solutions, but in the interests of the challenge, other languages are welcome. No external dependencies beyond packages in that particular language allowed (e.g. no spades
, mafft
, clustal
etc).