Question: Overlapping fragments as efficiently as possible
1
gravatar for Joe
5 days ago by
Joe15k
United Kingdom
Joe15k wrote:

I encountered the need to 'collapse' a set of peptides today in to a single string. Not very difficult I thought, and then realised that this is a familiar problem:- it's basically kmer assembly.

Now, what I want to do doesn't need anything as heavy duty as de Bruijn graph assembly, but the existing StackOverflow answers are various shades of inefficient, so I thought this might be a fun programming challenge for the site.

So, here's a list of peptides:

["MSTSTSQIA","STSTSQIAV","TSTSQIAVE","STSQIAVEY","TSQIAVEYP","SQIAVEYPI","QIAVEYPIP","IAVEYPIPV","AVEYPIPVY","VEYPIPVYR","EYPIPVYRF","YPIPVYRFI","PIPVYRFIV","IPVYRFIVS","PVYRFIVSV","VYRFIVSVG","YRFIVSVGD","RFIVSVGDE","FIVSVGDEK","IVSVGDEKI","VSVGDEKIP","SVGDEKIPF","VGDEKIPFN","GDEKIPFNS","DEKIPFNSV","EKIPFNSVS","KIPFNSVSG","IPFNSVSGL","PFNSVSGLD","FNSVSGLDI","NSVSGLDIS","SVSGLDISY","VSGLDISYD","SGLDISYDT","GLDISYDTI","LDISYDTIE","DISYDTIEY","ISYDTIEYR","SYDTIEYRD","YDTIEYRDG","DTIEYRDGV","TIEYRDGVG","IEYRDGVGN","EYRDGVGNW","YRDGVGNWF","RDGVGNWFK","DGVGNWFKM","GVGNWFKMP","VGNWFKMPG","GNWFKMPGQ","NWFKMPGQS","WFKMPGQSQ","FKMPGQSQS","KMPGQSQST","MPGQSQSTN","PGQSQSTNI","GQSQSTNIT","QSQSTNITL","SQSTNITLR","QSTNITLRK","STNITLRKG","TNITLRKGV","NITLRKGVF","ITLRKGVFP","TLRKGVFPG","LRKGVFPGK","RKGVFPGKT","KGVFPGKTE","GVFPGKTEL","VFPGKTELF","FPGKTELFD","PGKTELFDW","GKTELFDWI","KTELFDWIN","TELFDWINS","ELFDWINSI","LFDWINSIQ","FDWINSIQL","DWINSIQLN","WINSIQLNQ","INSIQLNQV","NSIQLNQVE","SIQLNQVEK","IQLNQVEKK","QLNQVEKKD","LNQVEKKDI","NQVEKKDIT","QVEKKDITI","VEKKDITIS","EKKDITISL","KKDITISLT","KDITISLTN","DITISLTND","ITISLTNDA","TISLTNDAG","ISLTNDAGT","SLTNDAGTE","LTNDAGTEL","TNDAGTELL","NDAGTELLM","DAGTELLMT","AGTELLMTW","GTELLMTWN","TELLMTWNV","ELLMTWNVS","LLMTWNVSN","LMTWNVSNA","MTWNVSNAF","TWNVSNAFP","WNVSNAFPT","NVSNAFPTS","VSNAFPTSL","SNAFPTSLT","NAFPTSLTS","AFPTSLTSP","FPTSLTSPS","PTSLTSPSF","TSLTSPSFD","SLTSPSFDA","LTSPSFDAT","TSPSFDATS","SPSFDATSN","PSFDATSND","SFDATSNDI","FDATSNDIA","DATSNDIAV","ATSNDIAVQ","TSNDIAVQE","SNDIAVQEI","NDIAVQEIT","DIAVQEITL","IAVQEITLM","AVQEITLMA","VQEITLMAD","QEITLMADR","EITLMADRV","ITLMADRVI","TLMADRVIM","LMADRVIMQ","MADRVIMQA","ADRVIMQAV"]

The challenge is to come up with the fastest/most efficient approach to reconstitute the originating sequence.

>Sequence
MSTSTSQIAVEYPIPVYRFIVSVGDEKIPFNSVSGLDISYDTIEYRDGVGNWFKMPGQSQ
STNITLRKGVFPGKTELFDWINSIQLNQVEKKDITISLTNDAGTELLMTWNVSNAFPTSL
TSPSFDATSNDIAVQEITLMADRVIMQAV

In this case, all peptides overlap perfectly with a 1 amino acid overlap, and the sequences are already in the correct order.

Bonus points if you:

  • come up with some code that can take arbitrary overlap lengths (potentially of different lengths?!)
  • come up with some code that can handle different length peptides/'kmers"
  • come up with some code that can handle the list of peptides being out of order
  • come up with a lightweight de Bruijn implementation!

I'm interested in python solutions, but in the interests of the challenge, other languages are welcome. No external dependencies beyond packages in that particular language allowed (e.g. no spades, mafft, clustal etc).

ADD COMMENTlink modified 5 days ago • written 5 days ago by Joe15k
2
gravatar for Joe
5 days ago by
Joe15k
United Kingdom
Joe15k wrote:

A super basic solution for this specific case that I'm using. Should be pretty effiicent as string concatenation isn't that complex. Also fits on one line (just).

>>> reference = "MSTSTSQIAVEYPIPVYRFIVSVGDEKIPFNSVSGLDISYDTIEYRDGVGNWFKMPGQSQSTNITLRKGVFPGKTELFDWINSIQLNQVEKKDITISLTNDAGTELLMTWNVSNAFPTSLTSPSFDATSNDIAVQEITLMADRVIMQAV"

>>> concat = "".join([pep[0] for pep in peptides] + [peptides[-1][1:]])

>>> concat == reference
True

I think this will work for any length 'kmer', and could be fairly easily modified to deal with any length overlap (I think).

ADD COMMENTlink modified 5 days ago • written 5 days ago by Joe15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1112 users visited in the last hour