Best Way To Get A Unique List Of Items Using Two Dicts...
3
3
Entering edit mode
11.2 years ago

I have two dicts; one with start coordinates and the other with end coordinates, paired using an incremental integer value e.g. starts[0] = 1000 and ends[0] = 2000. I also have a third dict that holds the corresponding id e.g. ids[0] = "EXON1".

I want to get the number of unique values from both the dicts (i.e. all those that appear at least once as 1000 or 2000 may be duplicated) and have used the following to do so (benchmarks I have seen have shown it to be the fastest method):

unique_starts = {}.fromkeys(exon_seq_region_starts.values()).keys()
unique_ends = {}.fromkeys(exon_seq_region_ends.values()).keys()


I then need to iterate through the smallest of these lists and get all the values that match in the larger list for each of the unique values (e.g. ends at 2000 may have starts at 900, 950 and 1000). I need to retrieve the smallest value (e.g. 900 in this example) and its key in order to associate with the third, exon id dict.

Any ideas on the best way?

exon python map sort • 2.1k views
1
Entering edit mode

@Gawbul Then why don't you create an ordered (start/end) array of object Exon(start,end) and scan from 5' to 3' each overlaping Exons ??

0
Entering edit mode

Any chance you could tell us what you are trying to achieve, instead of asking how to implement the method you think is best?

0
Entering edit mode

Hi Casbon, I have a list of exon start and end coordinates retrieved from ensembl, but they aren't unique, in that some sequences overlap. I need to retrieve the exon ids and coordinates for the unique exons.

6
Entering edit mode
11.2 years ago

I'll take your word for it. However, I do think there is a better way. Check out bedtools.

Let me know if this gives you the desired output.

from collections import namedtuple
from itertools import groupby

Bed = namedtuple("Bed", "seqid start end exonid")

def get_nonredundant_exons(beds):
"""
>>> beds = []
>>> beds.append(Bed("chr1", 100, 200, "e1"))
>>> beds.append(Bed("chr1", 59, 200, "e2"))
>>> beds.append(Bed("chr1", 59, 300, "e3"))
>>> print list(get_nonredundant_exons(beds))
[Bed(seqid='chr1', start=59, end=200, exonid='e2'), Bed(seqid='chr1', start=59, end=300, exonid=
'e3')]
"""
beds.sort(key=lambda x:(x.seqid, x.end))
for end, beds_with_same_end in groupby(beds, key=lambda x:(x.seqid, x.end)):
yield min(beds_with_same_end, key=lambda x: x.start)

0
Entering edit mode

I'll give this a try once I get on my computer (on my phone at the moment)! Certainly seems like it might be an excellent solution to my problem! Although I have just been informed about the canonical transcripts defined in the Ensembl database, which may be a little easier and quicker to use?

2
Entering edit mode
11.2 years ago

It sounds like an bsearch/equal_range/lower_bound/upper_bound search . See google code to see an implementation, for example: http://www.sgi.com/tech/stl/stl_tree.h

0
Entering edit mode

Thanks Pierre! I'll have to have a good read up on this!

0
Entering edit mode

just changed the last url

0
Entering edit mode

Yeah, I bookmarked them both, thanks :-)

0
Entering edit mode
9.3 years ago

You might also take a look at the bedmap tool in the BEDOPS suite. The bedmap tool allows the user to quickly find elements of a map BED file (say, a list of exons) which overlap a reference BED file (say, a coordinate range that bounds a subset of exons). You can apply operators which include returning (or echo-ing) a nicely-formatted list of IDs from mapped elements (such as those which identify exons).