Question

Making a count matrix

3.3 years ago

princy149 • 0

Hi all, I have a CSV file with three columns: the first is the sample ID, the second is the gene name, and the third is the allele frequency. How can I create a count matrix from this information? The CSV file format is shown below:

Sample  gene             AlleleFrequency
ABP     AGRN             0.480662983
ABP     SPATA21          0.571428571
ABP     RHD              0.458333333
ABP     AUNIP            0.480769231
ABP     COL16A1          0.461538462
MNV     BEND5            0.3
MNV     BEND5            0.333333333
MNV     BEND5            0.337349398
MNV     DNAJC6           0.55
MNV     FPGT-TNNI3K      0.367346939
CCR     SLC44A5.         0.375
CCR     SLC25A24         0.59375
CCR     TCHH             0.142857143
CCR     TCHH             0.153846154
CCR     TCHH             0.236220472
KKL     TCHH             0.231343284
KKL     TCHH             0.271428571
KKL     NPR1             0.503968254
KKL     ADAR             0.466666667
KKL     CD1E             0.418604651
BPP     SPTA1            0.566666667
BPP     PIGM             0.56
BPP     SLAMF7           0.451219512
BPP     LINC01720        0.548387097
BPP     ELF3             0.482758621
CCQ     NUAK2            0.46969697
CCQ     MARK1            0.490196078
CCQ     CDC42BPA;ZNF678  0.411764706
CCQ     RYR2             0.739130435
CCQ     PXDN             0.517006803

scRNA Mutational seq data • 1.7k views

ADD COMMENT • link 3.3 years ago by princy149 • 0

"So how can I create a count matrix"... Start by describing what it is that you would like to count.

ADD REPLY • link 3.3 years ago by seidel 11k

Thank you for the prompt response to my question.

I want to create the format below from the data in my CSV file:

features  sample1      sample2      sample3      sample4      sample5
NPHP4     0            0            0.351648352  0            0
SLC25A33  0            0            0            0            0.307692308
KIF1B     0.285714286  0.344827586  0.2          0.318181818  0.388888889
EPHA2     0            0.211864407  0.304        0            0
EPHA8     0.336956522  0.391304348  0.445652174  0.441558442  0.486111111
EXTL1     0            0            0.075268817  0            0
ZNF683    0.102564103  0            0.151515152  0            0
GJA4      0.362416107  0.415254237  0.405405405  0.446969697  0.43902439

Is there a script I can use to produce this format?

ADD REPLY • link updated 3.3 years ago by Michael 55k • written 3.3 years ago by princy149 • 0

I wouldn't call that a count matrix; maybe a "somehow summarized full matrix" built from a "sparse matrix representation". So you would like an MxN matrix, call it A, with M rows of genes and N columns of samples. For i in 1..M and j in 1..N:

A_ij := 0                                   if gene_i does not occur in sample_j
A_ij := allele_frequency(gene_i, sample_j)  if gene_i occurs in sample_j exactly *once*
A_ij := ???                                 if gene_i occurs in sample_j *more than once*

Please fill in the missing ???. Also, genes don't have allele frequencies per se, so does it make sense at all?
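The three cases above can be sketched in code. This is a minimal illustration in Python rather than R, using only the standard library; `long_to_wide` and its `combine` argument are hypothetical names, and the sketch deliberately does not decide what ??? should be. Case 3 raises an error unless the caller supplies a rule.

```python
from collections import defaultdict

def long_to_wide(records, combine=None):
    """Pivot (sample, gene, value) records into {gene: {sample: value}}.

    combine is a hypothetical hook for case 3: a function that reduces
    several values for one (gene, sample) pair to a single number.
    """
    grouped = defaultdict(list)
    samples, genes = [], []
    for sample, gene, value in records:
        if sample not in samples:
            samples.append(sample)
        if gene not in genes:
            genes.append(gene)
        grouped[(gene, sample)].append(value)

    matrix = {}
    for gene in genes:
        row = {}
        for sample in samples:
            values = grouped.get((gene, sample), [])
            if not values:            # case 1: gene absent from sample
                row[sample] = 0.0
            elif len(values) == 1:    # case 2: exactly one value
                row[sample] = values[0]
            elif combine is None:     # case 3: left undefined on purpose
                raise ValueError(f"{gene} occurs {len(values)} times in {sample}")
            else:
                row[sample] = combine(values)
        matrix[gene] = row
    return samples, matrix

# A few rows taken from the example data in the question.
records = [
    ("ABP", "AGRN", 0.480662983),
    ("MNV", "DNAJC6", 0.55),
    ("CCR", "TCHH", 0.142857143),
    ("CCR", "TCHH", 0.153846154),
    ("KKL", "NPR1", 0.503968254),
]

# max is only an arbitrary stand-in for ???, not a recommendation.
samples, matrix = long_to_wide(records, combine=max)
```

Calling `long_to_wide(records)` without `combine` fails on the repeated CCR/TCHH pair, which is exactly the open question: the pivot itself is mechanical, but the rule for repeated pairs has to come from the biology.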

ADD REPLY • link 3.3 years ago by Michael 55k

Thank you for your help and for the concept.

Do you have any idea for an R script or a Linux command-line script for this?

ADD REPLY • link 3.3 years ago by princy149 • 0

If you tell me what to do in case 3 with the ???, I can give you an R script that has about 3 lines of code. I hope this is motivation to describe the problem in full. It might, however, be that this is an ad-hoc approach that is not really valid for the problem you are trying to address (like "average allele frequency over multiple SNPs in a gene").

Here is the part of your example that bothers me:

MNV     BEND5            0.3
MNV     BEND5            0.333333333
MNV     BEND5            0.337349398

So what do you want to put as a value into the matrix here, and how is that valid, given that genes (in the sense of the genomics age) don't have allele frequencies? It would be best if you could show me a published paper that applies this calculation in the same way.
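To see how many entries are affected before choosing a rule, the repeated sample/gene pairs can be listed first. The following is a small sketch in Python (standard library only); `repeated_pairs` is a hypothetical helper name, and the records are the MNV rows from the question.

```python
from collections import defaultdict

def repeated_pairs(records):
    """Return {(sample, gene): [values]} for pairs that occur more than once."""
    seen = defaultdict(list)
    for sample, gene, value in records:
        seen[(sample, gene)].append(value)
    return {pair: values for pair, values in seen.items() if len(values) > 1}

# The MNV rows from the example data in the question.
records = [
    ("MNV", "BEND5", 0.3),
    ("MNV", "BEND5", 0.333333333),
    ("MNV", "BEND5", 0.337349398),
    ("MNV", "DNAJC6", 0.55),
    ("MNV", "FPGT-TNNI3K", 0.367346939),
]

for (sample, gene), values in repeated_pairs(records).items():
    print(sample, gene, values)
```

For these rows only the MNV/BEND5 pair is reported, with its three values. If such pairs correspond to different variants within one gene, that would suggest the gene name alone is not a suitable row key.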

ADD REPLY • link 3.3 years ago by Michael 55k

Hi, sorry for my late message. I was trying to work out what the algorithm should do.

Here is the format of the tab-delimited file that I want to prepare from my data (CSV file). In this file the feature name combines the chromosome location with the gene name, so I think each feature name is unique, meaning the same feature name is not repeated within a sample.

FEATURES                                  SAMPLE-1     SAMPLE-2     SAMPLE-3     SAMPLE-4     SAMPLE-5
chr1::5967245::5967245::exonic::NPHP4     0            0            0.351648352  0            0
chr1::9642569::9642569::UTR3::SLC25A33    0            0            0            0            0.307692308
chr1::10355741::10355741::exonic::KIF1B   0.285714286  0.344827586  0.2          0.318181818  0.388888889
chr1::16459709::16459709::exonic::EPHA2   0            0.211864407  0.304        0            0
chr1::22924187::22924187::exonic::EPHA8   0.336956522  0.391304348  0.445652174  0.441558442  0.486111111
chr1::26360245::26360245::exonic::EXTL1   0            0            0.075268817  0            0
chr1::26689642::26689642::exonic::ZNF683  0.102564103  0            0.151515152  0            0
chr1::35260188::35260188::exonic::GJA4    0.362416107  0.415254237  0.405405405  0.446969697  0.43902439
chr1::36752561::36752561::exonic::THRAP3  0            0.151785714  0            0            0

Please suggest how I can write R code to produce this format.
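If each feature really is unique within a sample, the table can be built without any rule for repeats. Below is a rough sketch in Python rather than R, using only the standard library. It assumes the input has one row per variant with the columns sample, chromosome, start, end, region, gene and allele frequency; that layout is a guess, since the thread only shows the three-column CSV. `write_feature_table` is a hypothetical name, and the rows are read back from the target table above.

```python
import csv
import io

def write_feature_table(rows, sample_order, out):
    """Write one tab-separated line per unique feature, 0 where a sample has no value."""
    table = {}
    for sample, chrom, start, end, region, gene, freq in rows:
        # Feature key in the chr::start::end::region::gene style shown above.
        feature = "::".join([chrom, str(start), str(end), region, gene])
        by_sample = table.setdefault(feature, {})
        if sample in by_sample:
            # The uniqueness assumption is violated; stop rather than guess.
            raise ValueError(f"{feature} is repeated in {sample}")
        by_sample[sample] = freq
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["FEATURES"] + sample_order)
    for feature, by_sample in table.items():
        writer.writerow([feature] + [by_sample.get(s, 0) for s in sample_order])

# Hypothetical long-format rows reconstructed from the target table.
rows = [
    ("SAMPLE-3", "chr1", 5967245, 5967245, "exonic", "NPHP4", 0.351648352),
    ("SAMPLE-1", "chr1", 26689642, 26689642, "exonic", "ZNF683", 0.102564103),
    ("SAMPLE-3", "chr1", 26689642, 26689642, "exonic", "ZNF683", 0.151515152),
    ("SAMPLE-2", "chr1", 36752561, 36752561, "exonic", "THRAP3", 0.151785714),
]
samples = ["SAMPLE-1", "SAMPLE-2", "SAMPLE-3", "SAMPLE-4", "SAMPLE-5"]

buffer = io.StringIO()
write_feature_table(rows, samples, buffer)
print(buffer.getvalue())
```

The function raises an error when a feature appears twice in the same sample, so the uniqueness assumption is checked rather than trusted. Writing to an open file instead of the `StringIO` buffer would produce the tab-delimited file.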

ADD REPLY • link 3.3 years ago by princy149 • 0

Hi, I am very sorry, but I am giving up. Whenever I think I have almost understood the problem, there is a new twist. I think there is no good solution to your problem given these examples, because the problem is ill-defined. I hope you can solve it somehow anyway.

ADD REPLY • link 3.3 years ago by Michael 55k

Thank you for your reply, it's OK. I highly appreciate your efforts on my query.

Thank you once again!

ADD REPLY • link 3.3 years ago by princy149 • 0

score 1 · Answer 1 · 2021-05-27

3.3 years ago

jared.andrews07 ★ 17k

What exactly makes you feel this data is sufficient to generate/impute a count matrix? As far as I am aware, there is no way to do that from mutational frequency data.

ADD COMMENT • link 3.3 years ago by jared.andrews07 ★ 17k