Question

Overlapping fragments as efficiently as possible

1

Entering edit mode

4.5 years ago

Joe 21k

I encountered the need to 'collapse' a set of peptides today in to a single string. Not very difficult I thought, and then realised that this is a familiar problem:- it's basically kmer assembly.

Now, what I want to do doesn't need anything as heavy duty as de Bruijn graph assembly, but the existing StackOverflow answers are various shades of inefficient, so I thought this might be a fun programming challenge for the site.

So, here's a list of peptides:

["MSTSTSQIA","STSTSQIAV","TSTSQIAVE","STSQIAVEY","TSQIAVEYP","SQIAVEYPI","QIAVEYPIP","IAVEYPIPV","AVEYPIPVY","VEYPIPVYR","EYPIPVYRF","YPIPVYRFI","PIPVYRFIV","IPVYRFIVS","PVYRFIVSV","VYRFIVSVG","YRFIVSVGD","RFIVSVGDE","FIVSVGDEK","IVSVGDEKI","VSVGDEKIP","SVGDEKIPF","VGDEKIPFN","GDEKIPFNS","DEKIPFNSV","EKIPFNSVS","KIPFNSVSG","IPFNSVSGL","PFNSVSGLD","FNSVSGLDI","NSVSGLDIS","SVSGLDISY","VSGLDISYD","SGLDISYDT","GLDISYDTI","LDISYDTIE","DISYDTIEY","ISYDTIEYR","SYDTIEYRD","YDTIEYRDG","DTIEYRDGV","TIEYRDGVG","IEYRDGVGN","EYRDGVGNW","YRDGVGNWF","RDGVGNWFK","DGVGNWFKM","GVGNWFKMP","VGNWFKMPG","GNWFKMPGQ","NWFKMPGQS","WFKMPGQSQ","FKMPGQSQS","KMPGQSQST","MPGQSQSTN","PGQSQSTNI","GQSQSTNIT","QSQSTNITL","SQSTNITLR","QSTNITLRK","STNITLRKG","TNITLRKGV","NITLRKGVF","ITLRKGVFP","TLRKGVFPG","LRKGVFPGK","RKGVFPGKT","KGVFPGKTE","GVFPGKTEL","VFPGKTELF","FPGKTELFD","PGKTELFDW","GKTELFDWI","KTELFDWIN","TELFDWINS","ELFDWINSI","LFDWINSIQ","FDWINSIQL","DWINSIQLN","WINSIQLNQ","INSIQLNQV","NSIQLNQVE","SIQLNQVEK","IQLNQVEKK","QLNQVEKKD","LNQVEKKDI","NQVEKKDIT","QVEKKDITI","VEKKDITIS","EKKDITISL","KKDITISLT","KDITISLTN","DITISLTND","ITISLTNDA","TISLTNDAG","ISLTNDAGT","SLTNDAGTE","LTNDAGTEL","TNDAGTELL","NDAGTELLM","DAGTELLMT","AGTELLMTW","GTELLMTWN","TELLMTWNV","ELLMTWNVS","LLMTWNVSN","LMTWNVSNA","MTWNVSNAF","TWNVSNAFP","WNVSNAFPT","NVSNAFPTS","VSNAFPTSL","SNAFPTSLT","NAFPTSLTS","AFPTSLTSP","FPTSLTSPS","PTSLTSPSF","TSLTSPSFD","SLTSPSFDA","LTSPSFDAT","TSPSFDATS","SPSFDATSN","PSFDATSND","SFDATSNDI","FDATSNDIA","DATSNDIAV","ATSNDIAVQ","TSNDIAVQE","SNDIAVQEI","NDIAVQEIT","DIAVQEITL","IAVQEITLM","AVQEITLMA","VQEITLMAD","QEITLMADR","EITLMADRV","ITLMADRVI","TLMADRVIM","LMADRVIMQ","MADRVIMQA","ADRVIMQAV"]

The challenge is to come up with the fastest/most efficient approach to reconstitute the originating sequence.

>Sequence
MSTSTSQIAVEYPIPVYRFIVSVGDEKIPFNSVSGLDISYDTIEYRDGVGNWFKMPGQSQ
STNITLRKGVFPGKTELFDWINSIQLNQVEKKDITISLTNDAGTELLMTWNVSNAFPTSL
TSPSFDATSNDIAVQEITLMADRVIMQAV

In this case, all peptides overlap perfectly with a 1 amino acid overlap, and the sequences are already in the correct order.

Bonus points if you:

come up with some code that can take arbitrary overlap lengths (potentially of different lengths?!)
come up with some code that can handle different length peptides/'kmers"
come up with some code that can handle the list of peptides being out of order
come up with a lightweight de Bruijn implementation!

I'm interested in python solutions, but in the interests of the challenge, other languages are welcome. No external dependencies beyond packages in that particular language allowed (e.g. no spades, mafft, clustal etc).

Assembly python code-challenge • 662 views

ADD COMMENT • link 4.5 years ago by Joe 21k

score 2 · Answer 1 · 2019-12-03

A super basic solution for this specific case that I'm using. Should be pretty effiicent as string concatenation isn't that complex. Also fits on one line (just).

>>> reference = "MSTSTSQIAVEYPIPVYRFIVSVGDEKIPFNSVSGLDISYDTIEYRDGVGNWFKMPGQSQSTNITLRKGVFPGKTELFDWINSIQLNQVEKKDITISLTNDAGTELLMTWNVSNAFPTSLTSPSFDATSNDIAVQEITLMADRVIMQAV"

>>> concat = "".join([pep[0] for pep in peptides] + [peptides[-1][1:]])

>>> concat == reference
True

I think this will work for any length 'kmer', and could be fairly easily modified to deal with any length overlap (I think).