Question

genomics overlap python error

0

Entering edit mode

8.2 years ago

novicebioinforesearcher ▴ 70

I am trying to do some interval range operations on two files with folowing conditions check if chrom is equal, then check if my start and end of of co0rdinatefile is within or equal to start and end of gene_annotation file (condition if the strand is "+" the start and end will be for eg 10-20, if its "-" it will be 20-10) , if match print start end strand from coordinate and gene_id, gene_name from geneannotation file. (for representation purpose I have head annoataion file )

number of rows in annotation file ~50000 number of rows in coordinated file ~200,000

gene_annotationfile

import csv 

with open('~/coordintes.txt', 'r') as source:
     coordinates = list(csv.reader(source, delimiter="\t"))
#      print coordinates


with open('~/gene_final_anno.csv', 'r') as source:
     annotations = list(csv.reader(source, delimiter=","))
#     print annotations

for index,line in enumerate(coordinates):
    for index2,line2 in enumerate(annotations):
#         print line2[1]
        if line[0] == line2[0] and line[3] == "-":
           line2[1] >= line[1] and line2[2] <= line[2]
        else:
            line[1] <= line2[1] and line2[2] >= line[2]

        print "%s\t%s\t%s\t%s\t%s\t%s" % (line[0],line[1],line[2],line[5],line2[4],line2[5])
        break

this is the output that I get

chrom   start   end strand  gene_id gene_name
4   3356893 3359510 +   gene_id gene_name
4   3363294 3424125 -   gene_id gene_name
4   3413587 3424125 -   gene_id gene_name
4   3424201 3425964 -   gene_id gene_name
4   3424201 3424529 -   gene_id gene_name
4   3424197 3425964 -   gene_id gene_name

Not sure where am I going wrong any help is appreciated

python • 4.7k views

ADD COMMENT • link updated 8.2 years ago by Andrzej Zielezinski 11k • written 8.2 years ago by novicebioinforesearcher ▴ 70

2

Entering edit mode

In your script, you forgot to convert start and end coordinates into integer numbers. If you show us a sample of your query files (i.e. coordinates, gene_final_anno.csv) we would probably be more helpful.

ADD REPLY • link 8.2 years ago by combio.pl ▴ 20

1

Entering edit mode

sorry should have done this when I posted the question gene_final_anno.csv (head)

chrom,start,end,strand,gene_id,gene_name
17,71223692,71274336,-,ENSMUSG00000085299,Gm16627
17,18186448,18211184,+,ENSMUSG00000067978,Vmn2r-ps113
11,84645863,84684319,-,ENSMUSG00000020530,Ggnbp2
7,51097639,51106551,+,ENSMUSG00000074155,Klk5
13,31711037,31712238,+,ENSMUSG00000087276,Gm11378
9,44887266,44916613,+,ENSMUSG00000048534,Amica1
13,31889579,31892618,+,ENSMUSG00000086144,Gm11379
19,33534099,33547681,-,ENSMUSG00000086875,Gm8975
7,51136530,51141174,+,ENSMUSG00000006948,Klk4
13,31976094,31977093,+,ENSMUSG00000083149,Gm11380
11,3883639,3899329,+,ENSMUSG00000049721,Gal3st1
12,86426764,86447381,-,ENSMUSG00000042320,Prox2
13,32471776,32475557,-,ENSMUSG00000085770,Gm11381
19,33572527,33592270,-,ENSMUSG00000079344,Lipo4

coordinates head

chrom   start   end strand
1   4247322 4247912 -
1   4427449 4432604 +
1   4763414 4764404 -
1   4764597 4767606 -
1   4764597 4766491 -
1   4766882 4767606 -
1   4767729 4772649 -
1   4767729 4768829 -
1   4767729 4775654 -
1   4772382 4772649 -
1   4772814 4774032 -
1   4772814 4774159 -
1   4772814 4775654 -
1   4772814 4774032 +
1   4774186 4775654 -
1   4774186 4775654 +
1   4774186 4775699 -
1   4775960 4798536 +
1   4798063 4798536 +
1   4798063 4818665 +
1   4798567 4818665 +
1   4818730 4820349 +
1   4820396 4822392 +
1   4822462 4827082 +
1   4822462 4829468 +
1   4827155 4829468 +
1   4827155 4831037 +
1   4829569 4831037 +
1   4831213 4835044 +

gene_final_anno, is nothing but annotation file from biomart, for mm9.37.67, and coordinates file is from tophat junction bed representing junctions need to annotate each junction with gene id and name.

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

1

Entering edit mode

As written by combio.pl

you forgot to convert start and end coordinates into integer numbers

That's important.

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

0

Entering edit mode

Ok will change that, any more pointers?

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

1

Entering edit mode

Why are you using enumerate? I don't see where you use those index and index2 variables.

It's to me not really clear what you aim to achieve, but I have the idea that more appropriate tools exist rather than rolling your own solution.

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

0

Entering edit mode

@WouterDeCoster upstream and downstream of this code is in python hence..

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

0

Entering edit mode

There might be solution using the htseq-count python module, or similar.

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

0

Entering edit mode

htseq count uses sorted bam (by name) and gtf right? how can I use it for such operations?

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

0

Entering edit mode

I don't have code ready for that, but have a look at http://www-huber.embl.de/HTSeq/doc/tour.html#genomic-intervals-and-genomic-arrays

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

0

Entering edit mode

8.2 years ago

christophersugai • 0

I'm not sure, but this basically sounds like a Granges findoverlaps or subsetbyoverlaps job. If I had to guess, I would say it's something with that enumerate(). I would create csvs that are compatible with Granges via dplyr and just subsetbyoverlaps https://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges_Usage.html

ADD COMMENT • link 8.2 years ago by christophersugai • 0

score 3 · Accepted Answer · 2017-05-13

3

Entering edit mode

8.2 years ago

Andrzej Zielezinski 11k

Python:

C = []
with open('coordinates.txt') as fh:
    fh.readline()  # Skip header
    for line in fh:
        sl = line.strip().split()
        C.append((sl[0], int(sl[1]), int(sl[2]), sl[3]))

A = []
with open('gene_final_anno.csv') as fh:
    fh.readline()
    for line in fh:
        sl = line.strip().split(',')
        A.append((sl[0], int(sl[1]), int(sl[2]), sl[3], sl[4], sl[5]))

for cchr, cst, cen, cstr in C:
    for achr, ast, aen, _, gid, gname in A:
        if cchr == achr:
            l = sorted([(cst, cen), (ast, aen)])
            if l[1][0] <= l[0][1]:  # Overlap
                st, en = [cst, cen] if cstr == '+' else [cen, cst]
                output = [cchr, st, en, cstr, gid, gname]
                print('{}\t{}\t{}\t{}\t{}\t{}'.format(*output))

coordinates.txt

chrom   start   end strand
17   71223692 71274336 +
17   71223692 71274336 -
1   4247322 4247912 -
1   4427449 4432604 +
1   4763414 4764404 -

gene_final_anno.csv

chrom,start,end,strand,gene_id,gene_name
17,71223692,71274336,-,ENSMUSG00000085299,Gm16627
17,18186448,18211184,+,ENSMUSG00000067978,Vmn2r-ps113
11,84645863,84684319,-,ENSMUSG00000020530,Ggnbp2
7,51097639,51106551,+,ENSMUSG00000074155,Klk5
13,31711037,31712238,+,ENSMUSG00000087276,Gm11378
9,44887266,44916613,+,ENSMUSG00000048534,Amica1
13,31889579,31892618,+,ENSMUSG00000086144,Gm11379
19,33534099,33547681,-,ENSMUSG00000086875,Gm8975
7,51136530,51141174,+,ENSMUSG00000006948,Klk4
13,31976094,31977093,+,ENSMUSG00000083149,Gm11380
11,3883639,3899329,+,ENSMUSG00000049721,Gal3st1
12,86426764,86447381,-,ENSMUSG00000042320,Prox2
13,32471776,32475557,-,ENSMUSG00000085770,Gm11381
19,33572527,33592270,-,ENSMUSG00000079344,Lipo4

Output:

17  71274336    71223692    +   ENSMUSG00000085299  Gm16627
17  71223692    71274336    -   ENSMUSG00000085299  Gm16627

ADD COMMENT • link 8.2 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

@a.zielezinski thank you for the code what does it mean when I get this error

ValueError                                Traceback (most recent call last)
<ipython-input-2-9ef2214e1af0> in <module>()
     11     for line in fh:
     12         sl = line.strip().split(',')
---> 13         A.append((int(sl[0]), int(sl[1]), int(sl[2]), sl[3], sl[4], sl[5]))
     14 
     15 for cchr, cst, cen, cstr in C:

ValueError: invalid literal for int() with base 10: 'X'

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

3

Entering edit mode

Chromosome X, lovely annoying :)

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

2

Entering edit mode

As @WouterDeCoster noticed, I forgot to consider non-integer chromosomes (e.g. X, Mitochondrial etc). I updated the code in my answer.

ADD REPLY • link 8.2 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

@a.zielezinski How can I modify the script in a case where a co ordinate maps to two or more genes? for example

coordinate

1   4831213 4857551 +

output

1   4857551 4831213 +   ENSMUSG00000025903  Lypla1
1   4857551 4831213 +   ENSMUSG00000033813  Tcea1

desired output

1   4857551 4831213 +     ENSMUSG00000033813,ENSMUSG00000025903  Tcea1,Lypla

many thanks.

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

2

Entering edit mode

Since the script already reports cases where a coordinate maps to multiple genes, the easiest way to get to your desired output would be to write another script that would reformat existing output.

ADD REPLY • link 8.2 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

thanks two questions regarding this

1) how do i write the output to a file from your script 2) to get desired output from what i understood multiple keys has multiple values need to write a code to set multiple key and values from reading on stackoverflow

     out["chrom"].update(dict(zip(["start", "end", "strand"] #this would add multiple keys to dict

is this approach correct?

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

3

Entering edit mode

To write the output to a file, just run the script and redirect the output to a file (python script.py > output.txt).

To combine cases where a coordinate maps to multiple genes use something like this:

d = {}
fh = open('output.txt')
for line in fh:
    sl = line.split()
    chrom = sl[0]
    st = int(sl[1])
    en = int(sl[2])
    strand = sl[3]
    gene_name = sl[4]
    gene_id = sl[5]
    key = (chrom, st, en, strand)
    if key not in d:
        d[key] = [[], []]
    d[key][0].append(gene_name)
    d[key][1].append(gene_id)
fh.close()

for key in d:
    l = [key[0], key[1], key[2], key[3], ",".join(d[key][0]), ",".join(d[key][1])]
    print('{}\t{}\t{}\t{}\t{}\t{}'.format(*l))

From the output.txt:

1   4857551 4831213 +   ENSMUSG00000025903  Lypla1
1   4857551 4831213 +   ENSMUSG00000033813  Tcea1
17  71274336    71223692    +   ENSMUSG00000085299  Gm16627
17  71223692    71274336    -   ENSMUSG00000085299  Gm16627

You will get the following output:

17  71223692    71274336    -   ENSMUSG00000085299  Gm16627
1   4857551 4831213 +   ENSMUSG00000025903,ENSMUSG00000033813   Lypla1,Tcea1
17  71274336    71223692    +   ENSMUSG00000085299  Gm16627

ADD REPLY • link 8.2 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

saved my life! thanks

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

0

Entering edit mode

@a.zielezinski I was reading up on python and stumbled upon pandas, just out of curiosity is such operations viable using pandas? or is it purely for data science and not bioinformatics

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

1

Entering edit mode

Pandas is probably one of the best things since bread came sliced.

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

0

Entering edit mode

can this be done using pandas? or has someone done it?

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

0

Entering edit mode

I had a question why is that when then strand is positive the start and end is inverted meaning the start is greater than end, for example the output you showed above

1   4857551 4831213 +   ENSMUSG00000025903,ENSMUSG00000033813   Lypla1,Tcea1
17  71274336    71223692    +   ENSMUSG00000085299  Gm16627

is'nt suppose to be

1    4831213 4857551 +   ENSMUSG00000025903,ENSMUSG00000033813   Lypla1,Tcea1
    17      71223692 71274336    +   ENSMUSG00000085299  Gm16627

same trend seen in the actual data :

1       52176756        52176467        +       ENSMUSG00000026104      Stat1
11      115850204       115850069       +       ENSMUSG00000020758,ENSMUSG00000084827   Itgb4,B230344G16Rik
7       4416961 4414002 +       ENSMUSG00000006154,ENSMUSG00000087297,ENSMUSG00000091177        Eps8l1,Gm15494,Gm15494
9       30240915        30239958        +       ENSMUSG00000031993      Snx19
17      24818054        24817976        +       ENSMUSG00000041130      Zfp598
8       87831485        87830217        +       ENSMUSG00000031697      Orc6

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

0

Entering edit mode

Oh i see it is because of this line is it?

st, en = [cst, cen] if cstr == '-' else [cen, cst]

when printing out is there a way to print start less than end?

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70

2

Entering edit mode

Sorry for that, I made a mistake. It should be: st, en = [cst, cen] if cstr == '+' else [cen, cst]. I also corrected the answer. If you want start always less than end (regardless of the strand), just change this line to st, en = cst, cen.

ADD REPLY • link 8.2 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Thank you...much appreciated. I was so fascinated by pandas was trying to fix it by using that (not sure if logic is correct)

import pandas as pd
df = pd.read_clipboard()
df.columns = ["chrom", "start", "end", "strand", "gene_id", "gene_name"]
df
if df.strand == '+':
     df['start_1'] = df.end & df['end_1'] = df.start
 df

ADD REPLY • link 8.2 years ago by novicebioinforesearcher ▴ 70