Question: Detect Regions In A List That Are Within 1 Megabase Of Each Other
1
gravatar for Rubal
7.6 years ago by
Rubal220
Germany
Rubal220 wrote:

Hello,

I am analyzing the results from two different statistical tests of genome wide SNP data. Each test outputs a list of rows with score value and also the genomic position. I have already filtered the files so they contain only the highest scores for each test. I am interested in observing the concordance between the two tests. So I would like to use a python script to go through the two files and identify any genomic positions that are identified by both tests. However I do not want to be too stringent so I want the list to include any regions that are within 1mb of each other. I am a beginner programmer and am not sure how to make a loop that can identify matches conditional on being within 1mb of each other. Any regions that match this criteria should then be written into a new file.

Here are examples of the two files (no headers):

Rn34_2155934833 155934833 1.30383e+06 1.50241e+06 -0.141762
Rn34_2167031291 167031291 1.96651e+06 3.47144e+06 -0.568305
Rn34_2178882599 178882599 1.89353e+06 2.00596e+06 -0.0576771

and

2 90152 439 180348277.000000 40.978182 10.406474 0.150000
2 97679 311 195402277.000000 44.399545 19.557207 0.150000
2 102957 486 205958277.000000 46.798636 13.764937 0.150000

In the first file the 2nd column is physical distance (eg 155934833) and in the second file the 4th column is physical distance (eg 180348277.000000). So I suppose a loop that reads in and compares the 2nd column of each row in file 1 to all the 4th column values in the rows in file 2 and then outputs both rows if a match is found within 1 megabase would be perfect. Though I am sure more experienced people know tidier ways. The other columns contain score information, chromosome etc but I think are irrelevant for the required function.

Any help is really appreciated, especially in python as I am learning this language.

Many thanks,

Rubal7

overlap python snp genomics • 2.0k views
ADD COMMENTlink written 7.6 years ago by Rubal220

by physical distance, you mean location on a chromosome?

ADD REPLYlink written 7.6 years ago by brentp23k

Do you want to do a run time comparison? My guess is 1. (fastest) BEDTools, 2. R/IRanges, 3. python looping, 4. Galaxy

ADD REPLYlink written 7.6 years ago by Michael Dondrup46k
3
gravatar for brentp
7.6 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

Convert your data into BED format (chrom, start, stop) and this becomes an easy problem thanks to BedTools.

You can use simple unix commands to get to BED format and PAD by your distance, then intersectBed will find the intersection of those padded regions.

#!/bin/bash

DIST=1000000
PATH=$PATH:/usr/local/src/bedtools/bin

#convert to bed. pad by 1/2 the distance.
awk -vd=$(($DIST/2)) 'BEGIN {OFS="\t"}{ print "chrFake",$2-d-1,$2+d,$0 }' file1.txt > file1.bed
awk -vd=$(($DIST/2)) 'BEGIN {OFS="\t"}{ print "chrFake",$4-d-1,$4+d,$0 }' file2.txt > file2.bed

# cut -f 4,8 works because the oringal data is space-seperated. you'll have 
# to adjust # if it's actually tab-delimited.
intersectBed -a file1.bed -b file2.bed -wo | cut -f 4,8 > theanswer.txt
ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by brentp23k
2
gravatar for Sean Davis
7.6 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

What you propose seems reasonable. However, it might be interesting to convert your files to gff format (using python, of course). Then, tools such as BEDtools, Galaxy, and others could be applied on the files directly to find the overlaps of interest.

For a more general solution to range queries in python, consider a library that includes one of several variations on interval trees such as bx-python.

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Sean Davis25k

thanks, I'll look into this library.

ADD REPLYlink written 7.6 years ago by Rubal220
2
gravatar for Niek De Klein
7.6 years ago by
Niek De Klein2.5k
Netherlands
Niek De Klein2.5k wrote:

easy but probably not so fast python solution:

#opeing file1 and the file to write to
file1 = open("file.txt", "r")
outfile = open("result.txt","w")
fileString = "" #empty string

#looping trough every line of file1
for line in file1:
    print line
    #opening second file
     file2 = open("file2.txt","r")
    #parsing the distance out of file 1
    physicDistance1 = int(line.split(" ")[1])   #looks like it's space delimited, if it's tab delimited use line.split(" \t")
    #for every line in file1, loop trough every line in file2
    for line2 in file2:
        #parse distance out of file 2
        physicDistance2 = int(line2.split(" ")[3].split(".")[0]) #or line2.split("\t")
        #if distance1 - distance 2 or distance2 - distance1 equal or smaller than 1000 
        if physicDistance1 - physicDistance2 <= 1000 or physicDistance2 - physicDistance1 <= 1000:
             #dont know what exactly you want in the new file, but I put the lines that are equal behind eachother
            fileString += str(line) +"\t" + str(line2)

outfile.write(fileString)
outfile.close()
ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Niek De Klein2.5k
2

Niek, you need to open 'file2' within the file1 loop. Otherwise file2 will be ready completely on the first 'line' in file1 and will be empty for each subsequent line.

ADD REPLYlink written 7.6 years ago by Brad Chapman9.4k
1

there are still quite a few syntax errors in there. =< should be <= and you need an extra ")" on the line where you define physicDistance2. You can also write directly to outfile rather than saving up a long string.

ADD REPLYlink written 7.6 years ago by brentp23k

Thanks very much, I'll play around with this script and let you know if it does the job!

ADD REPLYlink written 7.6 years ago by Rubal220

You're right, wrote this on the fly. I will edit it. Thanks!

ADD REPLYlink written 7.6 years ago by Niek De Klein2.5k

I get a syntax error on the first =< on line 16. Any idea why?

ADD REPLYlink written 7.6 years ago by Rubal220

@Rubal see brentp's comment. Got those syntaxes out.

ADD REPLYlink written 7.6 years ago by Niek De Klein2.5k

I've tried with the new syntax and there is still a problem with this line, now the syntax error is at the = . Any ideas? Rest of it seems to work fine thanks

ADD REPLYlink written 7.6 years ago by Rubal220

Hmm, a problem occurs that is maybe due to the many .0000s after the physical position values in one file. Anyone know what the root problem is likely to be here -see error message below:

File "Detectoverlappingregions.py", line 14, in <module> physicDistance2 = int(line2.split(" ")[3]) #or line2.split("\t") ValueError: invalid literal for int() with base 10: '4709301.000000

ADD REPLYlink written 7.6 years ago by Rubal220

When there is a . in a value python can't make an int out of it. Are those zeros important? Then you could use float(line2.split(" ")[3]) instead of using the int() or if what is after the . is not important, you could use int(line2.split(" ")[3].split(".")[0])

ADD REPLYlink written 7.6 years ago by Niek De Klein2.5k

Traceback (most recent call last): @Niek de Klein, thanks, now I think hopefully the final issue is this:

File "Detectoverlappingregions.py", line 20, in <module> outfile.write(fileString) AttributeError: 'tuple' object has no attribute 'write'

ADD REPLYlink written 7.6 years ago by Rubal220

@Niek de Klein, thanks, now hopefully this is the last issue:

Traceback (most recent call last): File "Detectoverlappingregions.py", line 20, in <module> outfile.write(fileString) AttributeError: 'tuple' object has no attribute 'write'

ADD REPLYlink written 7.6 years ago by Rubal220

Little mistake, forgot to put open infront of ("result.txt","w"), changed it now

ADD REPLYlink written 7.6 years ago by Niek De Klein2.5k

Script runs (after changing '=' to '-' on the 'if PhysicDistance1' line. But the output only contains 1 line from the first file, which is repeated every other line (see below) along with what I believe are potentially the correct lines from the 2nd file. Any ideas how to fix this greatly appreciated:

Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 85419 532 171085301.000000 30.506786 30.562953 0.150000 Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826

ADD REPLYlink written 7.6 years ago by Rubal220

Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 85419 532 171085301.000000 30.506786 30.562953 0.150000 Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 77944 323 156135301.000000 27.837143 19.561030 0.150000 Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 83239 296 166725301.000000 29.728214 30.324539 0.150000 Rn34_1002190962 2190962 1.06905e+06 1.89194e+06 -0.570826 1 85419 532 171085301.000000 30.506786 30.562953 0.150000

ADD REPLYlink written 7.6 years ago by Rubal220

So the script seems to only print the first line of the first input file, but several different lines from the 2nd input file. Presumably to output the several different lines from the 2nd input file it is looking at more than the first line of the first file, but not sure how. Thanks for any help.

ADD REPLYlink written 7.6 years ago by Rubal220

can you try it with adding +"n" at the end of fileString += str(line) +"t" + str(line2)+"n" (see the example above)

ADD REPLYlink written 7.6 years ago by Niek De Klein2.5k

Thank Niek, but same output, only with an empty line spacing between each duplicate pair (still with 1st line being always the first line from file1 and the 2nd line being a different line from file2)

ADD REPLYlink written 7.6 years ago by Rubal220

if you do print line in the first loop, what does it print?

ADD REPLYlink written 7.6 years ago by Niek De Klein2.5k

sorry python newby, could you add in the example above how i implement your suggestion. thank you

ADD REPLYlink written 7.6 years ago by Rubal220

It prints out all the lines in order from file1, with a space between each line

ADD REPLYlink written 7.6 years ago by Rubal220

I would discourage such a 'brute-force' approach, without a supporting data structure, the performance of such a naive implementation is terrible.

ADD REPLYlink written 7.6 years ago by Michael Dondrup46k
0
gravatar for Michael Dondrup
7.6 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Oh, too tempting to give a quick R answer. Given your data is in file1.txt and file2.txt and the files have exactly the layout described in your example.

library(IRanges) # need to install this library before

## read in the input files efficiently using scan
int1 = scan(file="file1.txt", what=list(NULL, numeric(),NULL, NULL, NULL))
int2 = scan( file="file2.txt", what=list(NULL,NULL, NULL, numeric(), NULL, NULL, NULL))

## create the objects holding the genomic positions
ir1 = IRanges(start=as.integer(int1[[2]]), width=1) # use column 2 SNP is an interval of length 1
ir2 = IRanges(start=as.integer(int2[[4]]), width=1) # use column 4, make an integer of the float

## output the positions from file1 which overlap file2 with maximum gap of 1e6 bp, one per line
cat(start(ir1[countOverlaps(ir1, ir2, maxgap=1e6)>0]), file="", sep="\n")
## output postions matching from file2 in file1
cat(start(ir2[countOverlaps(ir2, ir1, maxgap=1e6)>0]), file="", sep="\n")

## quick'n'dirty solution to get the full output back:

ind1 = countOverlaps(ir1, ir2, maxgap=1e6)>0
ind2 = countOverlaps(ir2, ir1, maxgap=1e6)>0
write.table( read.table(file="file1.txt")[ind1,], col.names=F, quote=F, row.names=F, file=stdout() ) # replace stdout() with a filename, read ?read.table, ?write.table for more
write.table( read.table(file="file2.txt")[ind2,], col.names=F, quote=F, row.names=F, file=stdout() )

Note: add a filename to file="" to print to a file instead of to stdout. Your example didn't contain overlapping positions, then output is empty.

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Michael Dondrup46k

great, this also seems to work, but can you make it output the entire row with the matching column, not just the column value?

ADD REPLYlink written 7.6 years ago by Rubal220

sure xD, will take a while though

ADD REPLYlink written 7.6 years ago by Michael Dondrup46k

If you have time to figure it out, please do let me know! Cheers

ADD REPLYlink written 7.6 years ago by Rubal220

here you are, you might have to adapt a little to fit your file format

ADD REPLYlink written 7.6 years ago by Michael Dondrup46k

@Michael Dondrup, thanks very much for all your help, this is working great

ADD REPLYlink written 7.6 years ago by Rubal220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2286 users visited in the last hour