Question: restructure rows and columns in perl or python (Interleaving columns by row pairs)
Ana50 wrote:

Hi all, I have a SNPs file (containing 11 million SNPs) which I was using to create a covariance matrix in Bayenv. Each column in this file corresponds to a population and the rows are SNPs, but every SNP has 2 rows (one per allele). It looks like this (2 * nsnps "rows" and npops "columns"):

7        2       2       0        6        2       2
1        0       0       0        0        0       0
0        2       2       0        0        0       0
1        0       0       0        0        0       0

So in the example above I have 7 populations (columns) and 2 SNPs (4 rows, two per SNP). I need to modify the format of this file a bit. In the new file each row should correspond to one SNP, and the number of columns should be twice the number of populations, because each population gets a pair of numbers, one per allele. So the new file should look like this (nsnps "rows" and 2 * npops "columns"):

7   1   2    0    2   0    0   0    6   0   2   0   2   0
0   1   2    0    2   0    0   0    0   0   0   0   0   0

I have R code that does this manipulation for me, but R seems to be very slow. Can anyone help me figure out whether there is a way to do it in Perl or Python? I am new to both of them and would appreciate any help with this. Thanks

dataframe columns rows python perl • 312 views
modified 5 months ago by WouterDeCoster23k • written 5 months ago by Ana50

Can you show your R code and tell the size of your matrix and how much RAM you have available? Transposing a matrix should be quick in R, unless your matrix is too big and you are swapping to disk.

written 5 months ago by h.mon9.2k

It's not exactly transposing.

written 5 months ago by WouterDeCoster23k

You are right, it is not near transposing.

written 5 months ago by h.mon9.2k

Which is great because there is no need to load the entire matrix.

written 5 months ago by WouterDeCoster23k

Is it just me, or did someone else also not understand the output format? I feel like such a noob and still cannot figure out what the OP wanted. I am glad Wouter figured it out, but I would like to understand what the OP is trying to achieve. It would be nice to learn something new. :)

written 5 months ago by vchris_ngs4.2k

Ah, now I understand the format and what the OP is trying to achieve. It is not actually a wholesale swap of rows and columns. Could a moderator help change the question title? Otherwise it will be misleading.

written 5 months ago by vchris_ngs4.2k

I changed it to "restructure", can't think of anything more specific.

written 5 months ago by WouterDeCoster23k

Yes, it is much better now and readers will not be misled. Thanks, @Wouter. At least other readers will read the posted question rather than simply copy the code if they need help with this.

written 5 months ago by vchris_ngs4.2k

Interleaving columns by row pairs? Interleaving columns by every two rows?

written 5 months ago by h.mon9.2k

That's a good one! ;)

written 5 months ago by WouterDeCoster23k

WouterDeCoster23k wrote:

I think this should do the job:

Save as rearrangingalleles.py and execute as python rearrangingalleles.py myinput.txt > myoutput.txt
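
The script body itself is not reproduced above. As a minimal sketch of the approach (not necessarily the code that was originally posted), assuming whitespace-separated counts and rows that come in consecutive pairs, the file can be read two lines at a time and each pair interleaved column by column. The function name interleave_pairs is made up for illustration.

```python
import sys


def interleave_pairs(lines):
    """Yield one output row per SNP from an iterable of allele rows.

    Assumes rows come in consecutive pairs (allele 1, allele 2) with
    whitespace-separated counts and the same number of columns.
    """
    rows = iter(lines)
    for first in rows:
        a = first.split()
        if not a:
            continue  # skip blank lines between pairs
        b = next(rows, "").split()
        if len(a) != len(b):
            raise ValueError("unpaired or ragged allele rows")
        # a1 b1 a2 b2 ...: both alleles of each population side by side
        yield " ".join(x + " " + y for x, y in zip(a, b))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Stream pair by pair so the whole file is never held in memory
    with open(sys.argv[1]) as handle:
        for row in interleave_pairs(handle):
            print(row)
```

Because the rows are printed to stdout, the redirect in the command above decides where the result ends up.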

modified 5 months ago • written 5 months ago by WouterDeCoster23k

More professional: use .write:

import sys

output = open('output.txt', 'w')
with open(sys.argv[1]) as input:
    while True:
        # First allele row of the SNP; an empty list means end of file
        line1 = input.readline().split()
        if len(line1) == 0:
            break
        # Second allele row of the same SNP
        line2 = input.readline().split()
        # Interleave the two rows column by column
        output.write(' '.join([line1[i] + " " + line2[i] for i in range(len(line1))]) + '\n')
print('\n' + '\t' + 'Job completed!')

Save as rearrangingalleles.py and execute as python rearrangingalleles.py myinput.txt

modified 5 months ago • written 5 months ago by Buffo550

I firmly disagree.
It's very convenient when scripts write to stdout, because then you can use them in pipes. It also lets the user specify both the output name and the output directory.
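
One way to get both behaviours, sketched here with a hypothetical -o/--output option that is not part of either script in this thread: write to stdout by default and to a file only when a path is given.

```python
import argparse
import sys


def parse_args(argv=None):
    """Parse a hypothetical command line: INPUT [-o OUTPUT]."""
    parser = argparse.ArgumentParser(description="Interleave allele row pairs")
    parser.add_argument("input", help="path to the SNP count file")
    parser.add_argument("-o", "--output", default=None,
                        help="output path (default: write to stdout)")
    return parser.parse_args(argv)


def write_rows(rows, output_path=None):
    """Write rows to output_path, or to stdout when no path is given."""
    if output_path is None:
        # stdout keeps the script usable in the middle of a pipeline
        for row in rows:
            sys.stdout.write(row + "\n")
    else:
        with open(output_path, "w") as out:
            for row in rows:
                out.write(row + "\n")
```

With something like this, python rearrangingalleles.py myinput.txt | gzip > myoutput.txt.gz and python rearrangingalleles.py myinput.txt -o results/myoutput.txt would both work.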

While talking about more professional:

  • didn't close the output file
  • you should use the 'with' statement for opening files

e.g.:

with open(sys.argv[1]) as input, open('output.txt', 'w') as output:

modified 5 months ago • written 5 months ago by WouterDeCoster23k

"Also, it allows specifying both the output name and output directory."

prefix = sys.argv[1].split('.')[0]
output = open(prefix + '_output.txt', 'w')    #you will never have to specify output name or directory

"you should use the 'with' statement for opening files"

??? Why?

modified 5 months ago • written 5 months ago by Buffo550

It's nice that you defend your opinion, but I would suggest considering you might be wrong.

Your code will crash if the current directory is not writable for the user, which is not uncommon for directories with (shared) data files. Also, the user doesn't have the freedom to

  • Stream the output to another command's stdin (a fundamental concept in Unix pipelines)
  • Choose the output name and directory

With regard to "why using the with statement" see this page of Python for beginners.
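
As a small illustration of the point about the with statement (a sketch using a throwaway temporary file, not code from this thread): the file is closed and its contents flushed as soon as the block exits, even when an exception interrupts it.

```python
import os
import tempfile

# Throwaway file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

try:
    with open(path, "w") as handle:
        handle.write("7 1 2 0\n")
        raise RuntimeError("simulated failure mid-write")
except RuntimeError:
    pass

# The with block closed (and flushed) the file despite the exception
print(handle.closed)  # True

with open(path) as check:
    print(check.read() == "7 1 2 0\n")  # True
```

With a bare open() and no close(), whether and when buffered output reaches the disk after an error is left to the interpreter.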

written 5 months ago by WouterDeCoster23k

You are a very funny person WouterDeCoster :). I'm glad you're learning to program but, in addition to online courses I recommend using common sense, sometimes it is very useful!

Best.

written 5 months ago by Buffo550

Due to the limited meta-communication online I'm not sure if you are just trolling me or are genuinely an arrogant fool. Anyway, thanks for the advice.

Have a nice day.

modified 5 months ago • written 5 months ago by WouterDeCoster23k

Thanks so much WouterDeCoster, it produced exactly the file I wished in less than 1 minute.

written 5 months ago by Ana50

If my answer resolved your question you can mark it as accepted.

written 5 months ago by WouterDeCoster23k