Question

Efficient way to add annotation to a list of genome positions with python

1

Entering edit mode

9.7 years ago

juncheng ▴ 220

Dear all,

I have a list of genome positions, like this (thousands of ):

chr1 7860756 7973866
chr1 7973763 7976860
chr3 8155768 8156554
chr5 8729338 8729482
chr5 8846800 8860030
chr6 8848105 8848254
chr6 11904306 11904806
chr7 11904687 11904840
chr8 11904718 11904836
chr10 11904792 11904841

I have a genome annotation file like this (repetitive region) 7 Million lines:

chr1    rn6_rmsk        exon    471     533     32.000000       +       .       gene_id "(ATAC)n"; transcript_id "(ATAC)n";
chr1    rn6_simpleRepeat        exon    480     510     53.000000       .       .       gene_id "trf"; transcript_id "trf";
chr1    rn6_rmsk        exon    717     862     588.000000      +       .       gene_id "RMER10B"; transcript_id "RMER10B";
chr1    rn6_rmsk        exon    2099    2452    627.000000      -       .       gene_id "MTD"; transcript_id "MTD_dup9";
chr1    rn6_rmsk        exon    2902    2936    15.000000       +       .       gene_id "(CTTC)n"; transcript_id "(CTTC)n";
chr1    rn6_rmsk        exon    3227    3251    26.000000       +       .       gene_id "(TGTC)n"; transcript_id "(TGTC)n";
chr1    rn6_simpleRepeat        exon    3227    3251    50.000000       .       .       gene_id "trf"; transcript_id "trf_dup1";
chr1    rn6_rmsk        exon    3252    3288    13.000000       +       .       gene_id "(ATCAC)n"; transcript_id "(ATCAC)n";
chr1    rn6_simpleRepeat        exon    3500    3538    53.000000       .       .       gene_id "trf"; transcript_id "trf_dup2";
chr1    rn6_rmsk        exon    3500    3539    25.000000       +       .       gene_id "(AC)n"; transcript_id "(AC)n_dup5";
chr1    rn6_rmsk        exon    3539    3599    46.000000       +       .       gene_id "(AG)n"; transcript_id "(AG)n";
chr1    rn6_simpleRepeat        exon    3539    3599    88.000000       .       .       gene_id "trf"; transcript_id "trf_dup3";
chr1    rn6_rmsk        exon    3602    4047    1096.000000     +       .       gene_id "RLTR20B1"; transcript_id

Basically I want to figure out whether region in the first file is in the repetitive regions denoted in the second files. I don't want to loop through the second file every time, is there any efficient ways to do this?

I know there is existing script to do this, but want to write this program my self.

Could anyone just give me a hint a algorithm or something?

Best,
Jun

genome python • 2.5k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by juncheng ▴ 220

score 3 · Answer 1 · 2014-08-18

3

Entering edit mode

9.7 years ago

dariober 14k

If you want to do it within python, I think pybedtools is what you need. Example, untested: Print the intervals in a with an overlap in b:

import pybedtools

a= pybedtools.BedTool('a.bed')
b= pybedtools.BedTool('b.gff')
print(a.intersect(b, u= True))

ADD COMMENT • link 9.7 years ago by dariober 14k

1

Entering edit mode

you don't need the loop to do this. that's what intersect does by default

ADD REPLY • link 9.7 years ago by brentp 24k

0

Entering edit mode

Thanks! I edited my answer

ADD REPLY • link 9.7 years ago by dariober 14k

0

Entering edit mode

Thank both of you!

ADD REPLY • link 9.7 years ago by juncheng ▴ 220

score 1 · Answer 2 · 2014-08-18

1

Entering edit mode

9.7 years ago

lelle ▴ 830

If you want to implement this yourself, I think an intervall tree is the data structure of choice. Here is a blog post with an implementation and some discussion that I found rather useful.