Question: genomics overlap python error
0
gravatar for novicebioinforesearcher
19 months ago by

I am trying to do some interval range operations on two files with folowing conditions check if chrom is equal, then check if my start and end of of co0rdinatefile is within or equal to start and end of gene_annotation file (condition if the strand is "+" the start and end will be for eg 10-20, if its "-" it will be 20-10) , if match print start end strand from coordinate and gene_id, gene_name from geneannotation file. (for representation purpose I have head annoataion file )

number of rows in annotation file ~50000 number of rows in coordinated file ~200,000

gene_annotationfile

import csv 

with open('~/coordintes.txt', 'r') as source:
     coordinates = list(csv.reader(source, delimiter="\t"))
#      print coordinates


with open('~/gene_final_anno.csv', 'r') as source:
     annotations = list(csv.reader(source, delimiter=","))
#     print annotations

for index,line in enumerate(coordinates):
    for index2,line2 in enumerate(annotations):
#         print line2[1]
        if line[0] == line2[0] and line[3] == "-":
           line2[1] >= line[1] and line2[2] <= line[2]
        else:
            line[1] <= line2[1] and line2[2] >= line[2]

        print "%s\t%s\t%s\t%s\t%s\t%s" % (line[0],line[1],line[2],line[5],line2[4],line2[5])
        break

this is the output that I get

chrom   start   end strand  gene_id gene_name
4   3356893 3359510 +   gene_id gene_name
4   3363294 3424125 -   gene_id gene_name
4   3413587 3424125 -   gene_id gene_name
4   3424201 3425964 -   gene_id gene_name
4   3424201 3424529 -   gene_id gene_name
4   3424197 3425964 -   gene_id gene_name

Not sure where am I going wrong any help is appreciated

python • 892 views
ADD COMMENTlink modified 19 months ago by a.zielezinski8.5k • written 19 months ago by novicebioinforesearcher40
2

In your script, you forgot to convert start and end coordinates into integer numbers. If you show us a sample of your query files (i.e. coordinates, gene_final_anno.csv) we would probably be more helpful.

ADD REPLYlink written 19 months ago by combio.pl20
1

sorry should have done this when I posted the question gene_final_anno.csv (head)

chrom,start,end,strand,gene_id,gene_name
17,71223692,71274336,-,ENSMUSG00000085299,Gm16627
17,18186448,18211184,+,ENSMUSG00000067978,Vmn2r-ps113
11,84645863,84684319,-,ENSMUSG00000020530,Ggnbp2
7,51097639,51106551,+,ENSMUSG00000074155,Klk5
13,31711037,31712238,+,ENSMUSG00000087276,Gm11378
9,44887266,44916613,+,ENSMUSG00000048534,Amica1
13,31889579,31892618,+,ENSMUSG00000086144,Gm11379
19,33534099,33547681,-,ENSMUSG00000086875,Gm8975
7,51136530,51141174,+,ENSMUSG00000006948,Klk4
13,31976094,31977093,+,ENSMUSG00000083149,Gm11380
11,3883639,3899329,+,ENSMUSG00000049721,Gal3st1
12,86426764,86447381,-,ENSMUSG00000042320,Prox2
13,32471776,32475557,-,ENSMUSG00000085770,Gm11381
19,33572527,33592270,-,ENSMUSG00000079344,Lipo4

coordinates head

chrom   start   end strand
1   4247322 4247912 -
1   4427449 4432604 +
1   4763414 4764404 -
1   4764597 4767606 -
1   4764597 4766491 -
1   4766882 4767606 -
1   4767729 4772649 -
1   4767729 4768829 -
1   4767729 4775654 -
1   4772382 4772649 -
1   4772814 4774032 -
1   4772814 4774159 -
1   4772814 4775654 -
1   4772814 4774032 +
1   4774186 4775654 -
1   4774186 4775654 +
1   4774186 4775699 -
1   4775960 4798536 +
1   4798063 4798536 +
1   4798063 4818665 +
1   4798567 4818665 +
1   4818730 4820349 +
1   4820396 4822392 +
1   4822462 4827082 +
1   4822462 4829468 +
1   4827155 4829468 +
1   4827155 4831037 +
1   4829569 4831037 +
1   4831213 4835044 +

gene_final_anno, is nothing but annotation file from biomart, for mm9.37.67, and coordinates file is from tophat junction bed representing junctions need to annotate each junction with gene id and name.

ADD REPLYlink modified 19 months ago • written 19 months ago by novicebioinforesearcher40
1

As written by combio.pl

you forgot to convert start and end coordinates into integer numbers

That's important.

ADD REPLYlink written 19 months ago by WouterDeCoster35k

Ok will change that, any more pointers?

ADD REPLYlink written 19 months ago by novicebioinforesearcher40
1

Why are you using enumerate? I don't see where you use those index and index2 variables.

It's to me not really clear what you aim to achieve, but I have the idea that more appropriate tools exist rather than rolling your own solution.

ADD REPLYlink written 19 months ago by WouterDeCoster35k

@WouterDeCoster upstream and downstream of this code is in python hence..

ADD REPLYlink written 19 months ago by novicebioinforesearcher40

There might be solution using the htseq-count python module, or similar.

ADD REPLYlink written 19 months ago by WouterDeCoster35k

htseq count uses sorted bam (by name) and gtf right? how can I use it for such operations?

ADD REPLYlink written 19 months ago by novicebioinforesearcher40

I don't have code ready for that, but have a look at http://www-huber.embl.de/HTSeq/doc/tour.html#genomic-intervals-and-genomic-arrays

ADD REPLYlink written 19 months ago by WouterDeCoster35k
3
gravatar for a.zielezinski
19 months ago by
a.zielezinski8.5k
a.zielezinski8.5k wrote:

Python:

C = []
with open('coordinates.txt') as fh:
    fh.readline()  # Skip header
    for line in fh:
        sl = line.strip().split()
        C.append((sl[0], int(sl[1]), int(sl[2]), sl[3]))

A = []
with open('gene_final_anno.csv') as fh:
    fh.readline()
    for line in fh:
        sl = line.strip().split(',')
        A.append((sl[0], int(sl[1]), int(sl[2]), sl[3], sl[4], sl[5]))

for cchr, cst, cen, cstr in C:
    for achr, ast, aen, _, gid, gname in A:
        if cchr == achr:
            l = sorted([(cst, cen), (ast, aen)])
            if l[1][0] <= l[0][1]:  # Overlap
                st, en = [cst, cen] if cstr == '+' else [cen, cst]
                output = [cchr, st, en, cstr, gid, gname]
                print('{}\t{}\t{}\t{}\t{}\t{}'.format(*output))

coordinates.txt

chrom   start   end strand
17   71223692 71274336 +
17   71223692 71274336 -
1   4247322 4247912 -
1   4427449 4432604 +
1   4763414 4764404 -

gene_final_anno.csv

chrom,start,end,strand,gene_id,gene_name
17,71223692,71274336,-,ENSMUSG00000085299,Gm16627
17,18186448,18211184,+,ENSMUSG00000067978,Vmn2r-ps113
11,84645863,84684319,-,ENSMUSG00000020530,Ggnbp2
7,51097639,51106551,+,ENSMUSG00000074155,Klk5
13,31711037,31712238,+,ENSMUSG00000087276,Gm11378
9,44887266,44916613,+,ENSMUSG00000048534,Amica1
13,31889579,31892618,+,ENSMUSG00000086144,Gm11379
19,33534099,33547681,-,ENSMUSG00000086875,Gm8975
7,51136530,51141174,+,ENSMUSG00000006948,Klk4
13,31976094,31977093,+,ENSMUSG00000083149,Gm11380
11,3883639,3899329,+,ENSMUSG00000049721,Gal3st1
12,86426764,86447381,-,ENSMUSG00000042320,Prox2
13,32471776,32475557,-,ENSMUSG00000085770,Gm11381
19,33572527,33592270,-,ENSMUSG00000079344,Lipo4

Output:

17  71274336    71223692    +   ENSMUSG00000085299  Gm16627
17  71223692    71274336    -   ENSMUSG00000085299  Gm16627
ADD COMMENTlink modified 19 months ago • written 19 months ago by a.zielezinski8.5k

@a.zielezinski thank you for the code what does it mean when I get this error

ValueError                                Traceback (most recent call last)
<ipython-input-2-9ef2214e1af0> in <module>()
     11     for line in fh:
     12         sl = line.strip().split(',')
---> 13         A.append((int(sl[0]), int(sl[1]), int(sl[2]), sl[3], sl[4], sl[5]))
     14 
     15 for cchr, cst, cen, cstr in C:

ValueError: invalid literal for int() with base 10: 'X'
ADD REPLYlink modified 19 months ago • written 19 months ago by novicebioinforesearcher40
3

Chromosome X, lovely annoying :)

ADD REPLYlink written 19 months ago by WouterDeCoster35k
2

As @WouterDeCoster noticed, I forgot to consider non-integer chromosomes (e.g. X, Mitochondrial etc). I updated the code in my answer.

ADD REPLYlink modified 19 months ago • written 19 months ago by a.zielezinski8.5k

@a.zielezinski How can I modify the script in a case where a co ordinate maps to two or more genes? for example

coordinate

1   4831213 4857551 +

output

1   4857551 4831213 +   ENSMUSG00000025903  Lypla1
1   4857551 4831213 +   ENSMUSG00000033813  Tcea1

desired output

1   4857551 4831213 +     ENSMUSG00000033813,ENSMUSG00000025903  Tcea1,Lypla

many thanks.

ADD REPLYlink modified 19 months ago • written 19 months ago by novicebioinforesearcher40
2

Since the script already reports cases where a coordinate maps to multiple genes, the easiest way to get to your desired output would be to write another script that would reformat existing output.

ADD REPLYlink written 19 months ago by a.zielezinski8.5k

thanks two questions regarding this

1) how do i write the output to a file from your script 2) to get desired output from what i understood multiple keys has multiple values need to write a code to set multiple key and values from reading on stackoverflow

     out["chrom"].update(dict(zip(["start", "end", "strand"] #this would add multiple keys to dict

is this approach correct?

ADD REPLYlink written 19 months ago by novicebioinforesearcher40
3

To write the output to a file, just run the script and redirect the output to a file (python script.py > output.txt).

To combine cases where a coordinate maps to multiple genes use something like this:

d = {}
fh = open('output.txt')
for line in fh:
    sl = line.split()
    chrom = sl[0]
    st = int(sl[1])
    en = int(sl[2])
    strand = sl[3]
    gene_name = sl[4]
    gene_id = sl[5]
    key = (chrom, st, en, strand)
    if key not in d:
        d[key] = [[], []]
    d[key][0].append(gene_name)
    d[key][1].append(gene_id)
fh.close()

for key in d:
    l = [key[0], key[1], key[2], key[3], ",".join(d[key][0]), ",".join(d[key][1])]
    print('{}\t{}\t{}\t{}\t{}\t{}'.format(*l))

From the output.txt:

1   4857551 4831213 +   ENSMUSG00000025903  Lypla1
1   4857551 4831213 +   ENSMUSG00000033813  Tcea1
17  71274336    71223692    +   ENSMUSG00000085299  Gm16627
17  71223692    71274336    -   ENSMUSG00000085299  Gm16627

You will get the following output:

17  71223692    71274336    -   ENSMUSG00000085299  Gm16627
1   4857551 4831213 +   ENSMUSG00000025903,ENSMUSG00000033813   Lypla1,Tcea1
17  71274336    71223692    +   ENSMUSG00000085299  Gm16627
ADD REPLYlink modified 19 months ago • written 19 months ago by a.zielezinski8.5k

saved my life! thanks

ADD REPLYlink written 19 months ago by novicebioinforesearcher40

@a.zielezinski I was reading up on python and stumbled upon pandas, just out of curiosity is such operations viable using pandas? or is it purely for data science and not bioinformatics

ADD REPLYlink written 19 months ago by novicebioinforesearcher40
1

Pandas is probably one of the best things since bread came sliced.

ADD REPLYlink written 19 months ago by WouterDeCoster35k

can this be done using pandas? or has someone done it?

ADD REPLYlink written 19 months ago by novicebioinforesearcher40

I had a question why is that when then strand is positive the start and end is inverted meaning the start is greater than end, for example the output you showed above

1   4857551 4831213 +   ENSMUSG00000025903,ENSMUSG00000033813   Lypla1,Tcea1
17  71274336    71223692    +   ENSMUSG00000085299  Gm16627

is'nt suppose to be

1    4831213 4857551 +   ENSMUSG00000025903,ENSMUSG00000033813   Lypla1,Tcea1
    17      71223692 71274336    +   ENSMUSG00000085299  Gm16627

same trend seen in the actual data :

1       52176756        52176467        +       ENSMUSG00000026104      Stat1
11      115850204       115850069       +       ENSMUSG00000020758,ENSMUSG00000084827   Itgb4,B230344G16Rik
7       4416961 4414002 +       ENSMUSG00000006154,ENSMUSG00000087297,ENSMUSG00000091177        Eps8l1,Gm15494,Gm15494
9       30240915        30239958        +       ENSMUSG00000031993      Snx19
17      24818054        24817976        +       ENSMUSG00000041130      Zfp598
8       87831485        87830217        +       ENSMUSG00000031697      Orc6
ADD REPLYlink modified 19 months ago • written 19 months ago by novicebioinforesearcher40

Oh i see it is because of this line is it?

st, en = [cst, cen] if cstr == '-' else [cen, cst]

when printing out is there a way to print start less than end?

ADD REPLYlink written 19 months ago by novicebioinforesearcher40
2

Sorry for that, I made a mistake. It should be: st, en = [cst, cen] if cstr == '+' else [cen, cst]. I also corrected the answer. If you want start always less than end (regardless of the strand), just change this line to st, en = cst, cen.

ADD REPLYlink written 19 months ago by a.zielezinski8.5k

Thank you...much appreciated. I was so fascinated by pandas was trying to fix it by using that (not sure if logic is correct)

import pandas as pd
df = pd.read_clipboard()
df.columns = ["chrom", "start", "end", "strand", "gene_id", "gene_name"]
df
if df.strand == '+':
     df['start_1'] = df.end & df['end_1'] = df.start
 df
ADD REPLYlink written 19 months ago by novicebioinforesearcher40
0
gravatar for christophersugai
19 months ago by
christophersugai0 wrote:

I'm not sure, but this basically sounds like a Granges findoverlaps or subsetbyoverlaps job. If I had to guess, I would say it's something with that enumerate(). I would create csvs that are compatible with Granges via dplyr and just subsetbyoverlaps https://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges_Usage.html

ADD COMMENTlink written 19 months ago by christophersugai0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1654 users visited in the last hour