Question: Python Script Help for Cooccupancy
0
gravatar for dally
4.0 years ago by
dally180
United States
dally180 wrote:

I need a script or way to search for pol II summits that are greater than -5k +5k from the TSS start site all the way until the beginning of the next gene.

so basically something like this

-------(-5k)-------[|tss|==GENE1===]----(-5k)----(+5k)----[=====GENE2=====]-------(+5k)----------------------(-5k)-------------[===GENE3===]

So i need to find pol II summits > than -/+5k away from the TSS and then have it search for summits till it reaches the -5k region of the next gene.

So if you look at the figure I want it to search for Pol II summits from (-5k of GENE 1) to (+5k of GENE 2). Does that make sense?

Then using those summits I want to search for H3K4me3 marks -250 250 from the Pol II summit.

 

I have a current script that does something similar but it simply searches for summits in a certain integer region from the TSS (for instance -250 downstream, +1000 upstream) and then searches for all peaks of interested mark in a 250 window from that summit, but I need to modify it to accomplish this new task and I simply have no idea how.

This is a section of the script that I would need to update to search for summits from TSS.

    

cntdict = {}

cntdict["PolII"] = 0

for mark in marks.split(","):

    cntdict[mark] = 0

counter = 1

totalhits = 0

intersectlist = []

for locus in loci:

print "Working on line# " + str(counter)

    counter += 1

    direction = str(locus[5])

    if direction == "+":

        tss = int(locus[1])

        cur.execute("select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500))

        print "select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500)

else:

        tss = int(locus[2])

        cur.execute("select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500))

        print "select summit from CHIP_PEAK where mark = '%s' and chr = '%s' and ((summit > %i and summit < %i))" % ("PolII-Chip",str(locus[0]),tss-2500,tss+2500)

 

EDIT: Sorry for the lack of code formatting, i'm trying to figure it out now.

cooccupancy python • 1.2k views
ADD COMMENTlink modified 4.0 years ago by Devon Ryan92k • written 4.0 years ago by dally180
1
gravatar for Devon Ryan
4.0 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

Much of bioinformatics is knowing when not to reinvent the wheel. In this case, you're looking for bedtools flank and bedtools intersect -v.

ADD COMMENTlink written 4.0 years ago by Devon Ryan92k

Hmm, I see how this might work. I'm unsure what number would be appropriate for the "number of base pairs" to add to the start and end coordinates since all the regions between the gene lengths vary. Would I use a rather large number and let bedtools clip the coordinates as it does by default?

ADD REPLYlink written 4.0 years ago by dally180

Add 5000, since that's what you specified in your question. You then exclude all intersecting summits and have your list of PolII summits to proceed with.

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

bedtools doesn't take into account that some intergenic regions between genes might be less than 5kb from the TSE does it? In that case it wouldn't give me intergenic pol II summits, but summits inside that gene as well

 

I'd want to exclude those summits that get pulled if they fall into that category

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by dally180

Just flank both ends of the transcript bounds. Using a refseq BED file from UCSC would be convenient for this.

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

Now if I want to overlap those Pol II summits I managed to pull up with say chip seq H3K4me1 marks in a -250, +250 window from the pol II summit to identify possible enhancer sites, can i use intersect for this too?

ADD REPLYlink written 4.0 years ago by dally180

Yup, just flank either the H3K4me1 or PolII peaks  by 250 on each side and then run the intersect.
 

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

This is what I actually did yesterday, so it's good to hear that that was the way to go about it.

 

I'm running into a problem where i'm picking up false positives (pol summits inside intragenic sites, or inside genes) ... I tried to weed these out by taking -v of my pol summit files + my flanked tss files and then intersecting (using a -v command so I only keep intergenic sites) it with a pol summit file + tss file (since this should give the intragenic summits, no -v option here) ... It made sense when I drew it down on paper, but these false positives obviously mean that i'm doing something wrong or my data is incorrect. 

 

I'm assuming there is multiple tss and tse within a gene which means it's impossible to weed out all the pol II summits inside genes (right now i'd say 20% of data is false positives). What do you suggest or think?

ADD REPLYlink written 4.0 years ago by dally180

If you intersect with "transcript" or "gene" entries from a GTF (or appropriate BED) file with the -v option you will never get intragenic peaks.

ADD REPLYlink written 4.0 years ago by Devon Ryan92k

Also when I flank the Pol II summits, it gives me some that have a end site than the start site (ex: chr1: 1000-999) so it messes up the H3K1 and Pol intersect.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by dally180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1928 users visited in the last hour