3.3 years ago
princy149
Hi all, I have a CSV file with three columns: the first is the sample ID, the second is the gene name, and the third is the allele frequency. How can I create a count matrix from this information? The CSV file format is shown below:
Sample gene AlleleFrequency
ABP AGRN 0.480662983
ABP SPATA21 0.571428571
ABP RHD 0.458333333
ABP AUNIP 0.480769231
ABP COL16A1 0.461538462
MNV BEND5 0.3
MNV BEND5 0.333333333
MNV BEND5 0.337349398
MNV DNAJC6 0.55
MNV FPGT-TNNI3K 0.367346939
CCR SLC44A5. 0.375
CCR SLC25A24 0.59375
CCR TCHH 0.142857143
CCR TCHH 0.153846154
CCR TCHH 0.236220472
KKL TCHH 0.231343284
KKL TCHH 0.271428571
KKL NPR1 0.503968254
KKL ADAR 0.466666667
KKL CD1E 0.418604651
BPP SPTA1 0.566666667
BPP PIGM 0.56
BPP SLAMF7 0.451219512
BPP LINC01720 0.548387097
BPP ELF3 0.482758621
CCQ NUAK2 0.46969697
CCQ MARK1 0.490196078
CCQ CDC42BPA;ZNF678 0.411764706
CCQ RYR2 0.739130435
CCQ PXDN 0.517006803
"so how can I create a count matrix"....start by describing what you'd like to count. What is it that you would like to count?
Thank you for the prompt response to my question.
I want to create a format like the one below from the data in my CSV file:
Is there any script through which I can make this format?
I wouldn't call that a count matrix; maybe a "somehow summarized full matrix" built from a "sparse matrix representation". So you would like an MxN matrix, call it A, with M rows of genes and N columns of samples. Let i in 1..M and j in 1..N; then
Please fill in the missing ???. Also, genes don't have allele frequencies per se, so does it make sense at all?
Thank you for your help and concept.
Do you have any idea for an R script or a Linux command-line script for this?
If you tell me what to do in case 3 with the ???, I can give you an R script of about three lines of code. I hope this is motivation to describe the problem in full. It might, however, be that this is an ad hoc approach that is not really valid for the problem you are trying to address (like "average allele frequency over multiple SNPs in a gene").
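As a rough illustration of what such a short script would have to decide, here is a minimal sketch in Python (standard library only) rather than R. The names `pivot` and `reduce_duplicates` are hypothetical, and the mean used in the example call is only a placeholder for the undefined "???" case, not a recommendation; as noted above, averaging may not be valid for the underlying problem.

```python
from collections import defaultdict
from statistics import mean


def pivot(rows, reduce_duplicates, fill=0.0):
    """Build a gene x sample matrix from (sample, gene, value) rows.

    reduce_duplicates decides what goes into a cell when one sample has
    several values for the same gene (the "???" case in the thread).
    """
    cells = defaultdict(list)
    samples, genes = [], []
    for sample, gene, value in rows:
        if sample not in samples:
            samples.append(sample)
        if gene not in genes:
            genes.append(gene)
        cells[(gene, sample)].append(float(value))

    matrix = {}
    for gene in genes:
        row = []
        for sample in samples:
            if (gene, sample) in cells:
                row.append(reduce_duplicates(cells[(gene, sample)]))
            else:
                row.append(fill)  # no observation for this gene in this sample
        matrix[gene] = row
    return samples, matrix


# A few rows taken from the example data in the question.
rows = [
    ("MNV", "BEND5", "0.3"),
    ("MNV", "BEND5", "0.333333333"),
    ("MNV", "DNAJC6", "0.55"),
    ("CCR", "TCHH", "0.142857143"),
    ("KKL", "TCHH", "0.231343284"),
]

# ASSUMPTION: the mean stands in for the unspecified rule for duplicates.
samples, matrix = pivot(rows, reduce_duplicates=mean)

print("gene", *samples, sep="\t")
for gene, values in matrix.items():
    print(gene, *values, sep="\t")
```

Swapping `mean` for `max`, `min`, or `len` changes only the rule for duplicated cells, which is exactly the part of the problem that still has to be defined.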
Here is the part of your example that bothers me:
So what do you want to put as a value into the matrix here, and how is that valid, given that genes (in the sense of the genomics age) don't have allele frequencies? Ideally, you could show me a published paper that applies this calculus in the same way.
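To make the ambiguity concrete, the following sketch lists every (sample, gene) pair in the example data that has more than one allele frequency, that is, every matrix cell for which a single value is not defined. It is written in Python with the standard library only; `ambiguous_cells` is a hypothetical helper name, and the rows are copied from the question.

```python
from collections import defaultdict


def ambiguous_cells(rows):
    """Return {(sample, gene): [values]} for pairs seen more than once."""
    grouped = defaultdict(list)
    for sample, gene, value in rows:
        grouped[(sample, gene)].append(float(value))
    return {key: vals for key, vals in grouped.items() if len(vals) > 1}


# Rows copied from the example data in the question.
rows = [
    ("MNV", "BEND5", "0.3"),
    ("MNV", "BEND5", "0.333333333"),
    ("MNV", "BEND5", "0.337349398"),
    ("MNV", "DNAJC6", "0.55"),
    ("CCR", "TCHH", "0.142857143"),
    ("CCR", "TCHH", "0.153846154"),
    ("CCR", "TCHH", "0.236220472"),
    ("KKL", "TCHH", "0.231343284"),
    ("KKL", "TCHH", "0.271428571"),
    ("KKL", "NPR1", "0.503968254"),
]

dups = ambiguous_cells(rows)
for (sample, gene), values in dups.items():
    print(sample, gene, values)
```

Each pair printed here presumably corresponds to several variants within one gene, which is why a per-gene value needs an explicit and justified rule.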
Hi, sorry for my late message; I was trying to work out what the algorithm should be.
Here is the format of the tab-delimited file that I want to prepare from my data (CSV file). The features combine the chromosome location with the gene name, so I think each feature name given this way is unique, meaning no feature name is repeated within a sample.
FEATURES SAMPL-1 SAMPL-2 SAMPLE-3 SAMPLE-4 SAMPLE-5
chr1::5967245::5967245::exonic::NPHP4 0 0 0.351648352 0 0
chr1::9642569::9642569::UTR3::SLC25A33 0 0 0 0 0.307692308
chr1::10355741::10355741::exonic::KIF1B 0.285714286 0.344827586 0.2 0.318181818 0.388888889
chr1::16459709::16459709::exonic::EPHA2 0 0.211864407 0.304 0 0
chr1::22924187::22924187::exonic::EPHA8 0.336956522 0.391304348 0.445652174 0.441558442 0.486111111
chr1::26360245::26360245::exonic::EXTL1 0 0 0.075268817 0 0
chr1::26689642::26689642::exonic::ZNF683 0.102564103 0 0.151515152 0 0
chr1::35260188::35260188::exonic::GJA4 0.362416107 0.415254237 0.405405405 0.446969697 0.43902439
chr1::36752561::36752561::exonic::THRAP3 0 0.151785714 0 0 0
Please suggest how I can write R code to produce this format.
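If each feature really is one variant, so that a (feature, sample) pair occurs at most once, the reshaping itself is mechanical. The sketch below shows one way it might be done in Python with the standard library, not R. It assumes, hypothetically, that the input is available as (sample, feature, allele frequency) rows with the feature string already assembled; the thread does not show that input, so the sample names and row layout here are illustrative. Missing combinations are filled with 0, as in the desired output, and a repeated pair raises an error instead of being silently merged.

```python
import csv
import io


def to_wide(rows, fill="0"):
    """Turn (sample, feature, value) rows into a tab-delimited wide table."""
    samples, features, cells = [], [], {}
    for sample, feature, value in rows:
        if sample not in samples:
            samples.append(sample)
        if feature not in features:
            features.append(feature)
        if (feature, sample) in cells:
            # ASSUMPTION: features are unique per sample; refuse to guess otherwise.
            raise ValueError(f"duplicate entry for {feature} in {sample}")
        cells[(feature, sample)] = value

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["FEATURES"] + samples)
    for feature in features:
        writer.writerow([feature] + [cells.get((feature, s), fill) for s in samples])
    return out.getvalue()


# Illustrative long-format rows; values are taken from the table above.
rows = [
    ("SAMPLE-1", "chr1::10355741::10355741::exonic::KIF1B", "0.285714286"),
    ("SAMPLE-2", "chr1::10355741::10355741::exonic::KIF1B", "0.344827586"),
    ("SAMPLE-3", "chr1::5967245::5967245::exonic::NPHP4", "0.351648352"),
    ("SAMPLE-3", "chr1::10355741::10355741::exonic::KIF1B", "0.2"),
]

table = to_wide(rows)
print(table, end="")
```

Values are kept as the original strings so that no rounding is introduced; rows and columns appear in order of first occurrence.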
Hi, I am very sorry, but I am giving up. Whenever I think I have almost understood the problem, there is a new twist. I think there is no good solution to your problem given these examples, because the problem is ill-defined. I hope you can solve it somehow anyway.
Thank you for your reply; it's OK. I highly appreciate your efforts on my query.
Thank you once again!