copy number file format issue
4.4 years ago

Hello, I have a problem with a .csv file of copy number data. The original looks like this:

genes               Log2
PIK3CA,TET2          -0.35
MLH2,NRAS            0.54


And, what I need is:

genes     Log2
PIK3CA    -0.35
TET2      -0.35
MLH2      0.54
NRAS      0.54


I have tried many things so far, and none have been successful. The file was created with CNVkit from gastric cancer samples. The real file is much bigger, and the list of genes is longer, but this is essentially what I need to do in order to analyze our CNV data.

I use Linux (Ubuntu 16.04). I would appreciate it if you could help me with an R or Python script, but at this point any solution would be good.

Thank you

cnv copynumber R python • 1.2k views
1. Please remove the text in bold; it does not make sense to have a full post in bold.
2. Please select a more descriptive title for your thread. You want to reformat a file; the fact that this is copy number analysis data is less relevant in the title.
3. Explain what you tried and how it didn't work; we'll be more eager to point out your mistake and get you back on track.

Adding to Wouter's comment, please explain whether you have tried awk. Also, what have you tried using Python/R?


Thanks! I was trying with this mostly:

awk -F , -v OFS='\t' 'NR == 1 || $0 > 0 {print$4}' AGM3.call.prueba.cns.csv |less


But it doesn't work well. I also need to repeat the Log2 value for each gene in the row (in the comma-separated list). Would transposing the columns work for this?


@OP This can be done in awk, especially with the format shown in the original post: use split and a loop.


That is not what a CSV file looks like. If it were me, I would do this with a Python script because it's a bit messy.


The format is not really the problem. I can export it to any other format, but the genes column still looks the same. How can I do it with Python?


To me it is a problem because you are showing PIK3CA,TET2 as a single column in a CSV, even though there is a comma separating them. But then you also show the columns separated by tabs(?). If I could see the exact structure of the file, I could write up something quick in Python.


OK, well, the original file is a .cns file. It is a text file that looks like this (header plus first line):

chromosome,start,end,gene,log2
chr1,13402,861395,"LOC102725121,DDX11L1,OR4F5,LOC100133331,LOC100132062,LOC100132287,LOC100133331,LINC00115,SAMD11",-0.28067


We converted it to a .csv so we could separate the columns by tab. The .cns is comma-separated, but the genes are a single string enclosed in quotes. I hope this is more useful.


Take a look at csvkit; it works well with quoted CSV columns. Or use R to read the data into a 2D array, then create another 2D array by splitting the 4th column and assigning the 5th column's value to each part of the 4th column.
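
Since Python was also requested, the same split-and-repeat idea can be sketched with the standard `csv` module, which handles the quoted gene list. This is only a minimal sketch: it assumes the column layout shown in the .cns excerpt above (`chromosome,start,end,gene,log2`), and the function name `explode_genes` and the shortened sample row are illustrative, not part of CNVkit.

```python
import csv
import io


def explode_genes(lines):
    """Yield one (gene, log2) pair per gene in each segment row.

    `lines` is any iterable of text lines (an open file works) in the
    comma-separated layout assumed above, including the header row.
    """
    reader = csv.DictReader(lines)  # parses the quoted gene list as one field
    for row in reader:
        for gene in row["gene"].split(","):
            yield gene, row["log2"]


# Illustrative sample, shortened from the excerpt in this thread.
sample = io.StringIO(
    'chromosome,start,end,gene,log2\n'
    'chr1,13402,861395,"DDX11L1,OR4F5,SAMD11",-0.28067\n'
)

rows = list(explode_genes(sample))
for gene, log2 in rows:
    print(gene, log2, sep="\t")
```

For a real file, passing `open(path, newline="")` instead of the `io.StringIO` sample should work the same way. A gene that is listed twice within one segment will be emitted twice, so deduplicate afterwards if that matters.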

4.4 years ago

output:

$ awk -v OFS="\t" '{split ($1,a,",")} {for (i in a) {print a[i],$2}}' test.txt
genes	Log2
PIK3CA	-0.35
TET2	-0.35
MLH2	0.54
NRAS	0.54

input:

$ cat test.txt
genes   Log2
PIK3CA,TET2 -0.35
MLH2,NRAS   0.54
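
For the Python route, roughly the same logic can be sketched as below. It assumes the two-column, whitespace-separated layout of test.txt above; `split_gene_rows` is a hypothetical helper name. One difference worth noting: the iteration order of `for (i in a)` in awk is not specified, whereas this version always keeps genes in the order they are listed.

```python
def split_gene_rows(lines):
    """Return (gene, value) pairs, one per gene in each input line."""
    out = []
    for line in lines:
        fields = line.split()  # split on any run of spaces or tabs
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        genes, value = fields
        for gene in genes.split(","):
            out.append((gene, value))
    return out


# Same content as test.txt in this answer.
test_txt = """genes   Log2
PIK3CA,TET2 -0.35
MLH2,NRAS   0.54
"""

result = split_gene_rows(test_txt.splitlines())
for gene, value in result:
    print(gene, value, sep="\t")
```

As in the awk command, the header line passes through unchanged because it contains no comma.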