Question: copy number file format issue
9 months ago by
jim.paredes wrote:

Hello, I have a problem with a .csv file of copy number data. The original looks like this:

genes               Log2
PIK3CA,TET2          -0.35
MLH2,NRAS            0.54

And, what I need is:

genes               Log2
PIK3CA              -0.35
TET2                -0.35
MLH2                0.54
NRAS                0.54

I have tried many things so far, without success. The file was created with CNVkit from gastric cancer samples. The real file is much bigger and the gene lists are longer, but this is essentially what I need to do in order to analyze our CNV data.

I use Linux (Ubuntu 16.04). I would appreciate it if you could help me with an R or Python script, but at this point any solution would be good.

Thank you

python cnv copynumber R • 312 views
  1. Please remove the bold formatting; it does not make sense to have a full post in bold.
  2. Please select a more descriptive title for your thread. You want to reformat a file; the fact that this is copy number analysis data is less relevant in the title.
  3. Explain what you tried and how it didn't work; we'll be more eager to point out your mistake and get you back on track.
written 9 months ago by WouterDeCoster

Adding to Wouter's comment, please explain whether you've tried awk. Also, what have you tried using Python/R?

written 9 months ago by RamRS

Thanks! I was trying with this mostly:

awk -F , -v OFS='\t' 'NR == 1 || $0 > 0 {print $4}' AGM3.call.prueba.cns.csv |less

But it doesn't work well. Also, I need to repeat the Log2 value for each gene in the row (in the comma-separated list). Would transposing the columns work for this?

written 9 months ago by jim.paredes (modified 9 months ago by genomax)

@OP: this can be done in awk, especially given the format shown in your post. Use split and a loop.

written 9 months ago by cpad0112 (modified 9 months ago)

That is not what a CSV file looks like. If it were me, I would do this with a Python script because it's a bit messy.

written 9 months ago by goodez

The format is not really the problem; I can export it to any other format, but the genes column looks the same. How can I do it with Python?

written 9 months ago by jim.paredes

To me it is a problem because you are showing PIK3CA,TET2 as a single column in a CSV, even though there is a comma separating them. But then you also show the columns separated by tabs(?). If I could see the exact structure of the file, I could write up something quick in Python.

written 9 months ago by goodez

OK, well, the original file is a .cns file, a plain text file that looks like this (header plus first line):

chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067

We converted it to a .csv so we could separate it by tabs. The .cns is separated by commas, but the genes are a single string delimited by quotes. I hope this is more useful.

written 9 months ago by jim.paredes (modified 9 months ago by RamRS)

Take a look at csvkit: it works well with quoted CSV columns. Or use R to read the file into a 2D array, then create another 2D array by splitting the 4th column and assigning the 5th column's value to each part of the 4th column.
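Since the OP also asked about Python, here is a minimal sketch of the same split-the-gene-column idea using the standard library `csv` module, which handles the quoted gene field. It assumes the header names shown in the .cns sample (`gene`, `log2`); the helper name `explode_genes` is made up for illustration, and the sample text stands in for the real file.

```python
import csv
import io


def explode_genes(rows, gene_col="gene", value_col="log2"):
    """Yield one (gene, value) pair per gene name in each row.

    Column names default to those in the .cns header shown in the thread.
    Repeated gene names within a segment are kept as-is.
    """
    for row in rows:
        for gene in row[gene_col].split(","):
            gene = gene.strip()
            if gene:
                yield gene, row[value_col]


# Sample copied from the thread; for a real run, open the .cns file
# with open(path, newline="") and pass that handle to csv.DictReader.
sample = '''chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067
'''

# csv.DictReader treats the quoted gene list as a single field.
reader = csv.DictReader(io.StringIO(sample))
pairs = list(explode_genes(reader))

print("gene\tlog2")
for gene, log2 in pairs:
    print(f"{gene}\t{log2}")
```

Working from the original .cns avoids the intermediate tab-separated export, because the quotes already mark the gene list as one field.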

written 9 months ago by RamRS
9 months ago by
cpad0112 wrote:

output:

$ awk -v OFS="\t" '{n = split($1, a, ","); for (i = 1; i <= n; i++) print a[i], $2}' test.txt
genes   Log2
PIK3CA  -0.35
TET2    -0.35
MLH2    0.54
NRAS    0.54

input:

$ cat test.txt 
genes   Log2
PIK3CA,TET2 -0.35
MLH2,NRAS   0.54
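
For anyone who prefers Python, a rough equivalent of the awk one-liner above could look like the sketch below. It assumes the same whitespace-separated two-column layout as test.txt; the function name `split_gene_rows` is just illustrative. As in the awk version, the header line passes through unchanged because it contains no comma.

```python
def split_gene_rows(lines):
    """Return (gene, value) pairs, one per gene in each comma-separated list."""
    out = []
    for line in lines:
        fields = line.split()  # split on any run of spaces or tabs
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        genes, value = fields[0], fields[1]
        for gene in genes.split(","):
            out.append((gene, value))
    return out


# Same content as test.txt above; for a real file, pass an open file handle.
sample = """genes   Log2
PIK3CA,TET2 -0.35
MLH2,NRAS   0.54
"""

result = split_gene_rows(sample.splitlines())
for gene, value in result:
    print(f"{gene}\t{value}")
```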