Question: How Can I Divide SNP Data Into Fixed Windows Based On Physical Distance?
Rubal220 wrote:

Hi all,

I have a tab-delimited text file of SNP data that I need to split into smaller files, with each file containing the SNPs that fall within one 20 Mb window. My problem is how to split the file conditional on the numerical value in one of the columns.

File format:

SNP ID     Physical distance
rs_123132  12343 
rs_123134  304354
rs_123434  8930044

I need a way to keep track of the distance between values in column 2 and when it becomes >= 20,000,000 to export all the rows within this block into a new file, and to do this for each block of 20,000,000 until the end of the file.

If possible I'd love to see this done in Python, as this is the language I am learning.

Thanks very much for any help!

Rubal

python snp • 1.4k views
modified 7.7 years ago by brentp23k • written 7.7 years ago by Rubal220
brentp23k wrote:

I'm a bit confused as to whether your 2nd column is the rs location or the distance. Below, I assume it's the location, and that you want all SNPs with location < 20 million in one file, then SNPs between 20 and 40 million in another, and so on. (I ignore chromosome, since you seem to have done so also.)

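import sys

# Split each line of the input file (first argument) into tab-separated fields.
file_iter = (x.strip().split("\t") for x in open(sys.argv[1]))
next(file_iter)  # drop header

files = {}  # window index -> open output file
SPLIT = 20000000  # window size in base pairs

for rsid, start in file_iter:
    # n is the index of the window that contains this position
    (n, rem) = divmod(int(start), SPLIT)
    if n not in files:
        files[n] = open('snps.%i.txt' % (n * SPLIT), 'w')
    files[n].write("%s\t%s\n" % (rsid, start))

for fh in files.values():
    fh.close()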

Call this like:

python splitter.py your-snps.txt

and it will create files like: snps.0.txt, snps.20000000.txt, etc.
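
The window arithmetic can be checked on its own before running the script on a real file. This is only a small sketch: `window_start` is a hypothetical helper name (it is not in the script above), the first two positions come from the question, and the last two are made up to show what happens at and beyond 20 Mb.

```python
SPLIT = 20000000  # window size in base pairs, same value as in the script


def window_start(position, size=SPLIT):
    """Return the start coordinate of the fixed window containing position."""
    n, _rem = divmod(int(position), size)  # n is the window index
    return n * size


# First two positions are from the question; the last two are invented.
for pos in ("12343", "8930044", "20000000", "41234567"):
    print("%s -> snps.%i.txt" % (pos, window_start(pos)))
```

A position of exactly 20000000 lands in the second file, because each window covers its start coordinate up to, but not including, the next multiple of SPLIT.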

written 7.7 years ago by brentp23k

Yes it's the physical rs location. Thanks! I'll try this out and let you know if it works for me.

written 7.7 years ago by Rubal220

I get the following error (perhaps I have compiled it incorrectly?):

Traceback (most recent call last):
  File "makewindows.py", line 10, in <module>
    for rsid, start in file_iter:
ValueError: need more than 1 value to unpack

written 7.7 years ago by Rubal220

You have to call it with the name of your SNP file as the first argument. That error indicates that your SNP file is empty or is not tab-delimited. If it is not tab-delimited, use .split() in place of .split("\t").

written 7.7 years ago by brentp23k

Thanks for the feedback. Ironically, now that I have fixed that issue, I seem to get the opposite message (thanks for your patience):

Traceback (most recent call last):
  File "makewindows.py", line 10, in <module>
    for rsid, start in file_iter:
ValueError: too many values to unpack

written 7.7 years ago by Rubal220

So your columns are separated by multiple spaces. Fix that, or use re.split(r"\s+", x) instead of x.split().
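
A more forgiving way to read each line might look like the sketch below, assuming every data line should hold exactly two whitespace-separated fields. `parse_snp_line` is a hypothetical helper, not part of the script above. It returns None instead of raising when a line has too few or too many fields, which are the two unpacking errors reported in this thread.

```python
def parse_snp_line(line):
    """Return (rsid, position) for one data line, or None if it cannot be parsed."""
    fields = line.split()  # no argument: split on any run of spaces or tabs
    if len(fields) != 2:
        return None  # blank line, multi-word header, or extra columns
    rsid, position = fields
    if not position.isdigit():
        return None  # second field is not a plain non-negative integer
    return rsid, int(position)


# Sample lines in the format shown in the question
print(parse_snp_line("rs_123132  12343 \n"))
print(parse_snp_line("rs_123134\t304354\n"))
print(parse_snp_line("SNP ID     Physical distance\n"))
print(parse_snp_line("\n"))
```

Lines that come back as None can be counted or printed, so that a stray header or an extra column shows up directly rather than as a traceback.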

written 7.7 years ago by brentp23k