Question

Salmon

7 weeks ago

Bizhan • 0

I have this GEO data (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE141183), which contains files in the formats barcodes.tsv.gz, genes.tsv.gz, and matrix.mtx.gz. I used Seurat to integrate these files. I can perform analysis and other tasks with this data. However, I am facing an issue. I need to generate a gene expression matrix where rows represent genes and columns represent sample IDs. But when I use Seurat, I find that genes are represented as rows and nucleotides as columns.

To address this, I executed the following commands:

gene_expression_matrix <- GetAssayData(object = mtx_obj.seurat.obj, slot = "counts")
write.table(gene_expression_matrix, file = "gene_expression_matrix.tsv", sep = "\t", row.names = TRUE, col.names = TRUE)

I then ran Salmon on this matrix to obtain an expression matrix with genes as rows and sample IDs as columns, but I am getting an error at this step.

Here is how I run Salmon on the gene expression matrix file:

salmon quant -i salmon_index -l A -r gene_expression_matrix.tsv -o salmon_output

Now, I have two questions:

Are these steps correct for obtaining the gene expression matrix? If yes, how can I extract the resulting matrix?

Edit: I need to convert the cell barcode to the corresponding sample id.

Edit: I need to run PACNet (http://ec2-44-201-176-192.compute-1.amazonaws.com/PACNet/webApp/), the updated version of CellNet. The input consists of a metadata file (which is easy to create) and an expression matrix. The expression matrix must have gene symbols as row names and sample names as column names. In my case, I have gene symbols as row names and cellular barcodes as column names. My metadata has a sample ID, and according to PACNet, "column names of the expression matrix must match the sample_name column of the sample metadata table". So, I need to convert the cellular barcodes to sample IDs.

For example, if I have 1 sample and 10 genes, then I should have a 10x1 matrix (10 rows and 1 column), but when I read the files (either with readMM or Read10X) I get a matrix of, for example, 10x100, because there are 100 cellular barcodes.

Thanks

Salmon Seurat Gene-Expression • 504 views

ADD COMMENT • link updated 7 weeks ago by Ram 43k • written 7 weeks ago by Bizhan • 0

score 0 · Answer 1 · 2024-03-02

7 weeks ago

ATpoint 82k

You don't need to do any of that. The mtx format is essentially what you need; it is just stored in a sparse format so that the data compresses better. Use either readMM from the Matrix package to read it into a gene x cell matrix in R, or something like the Read10X function from Seurat.

Salmon is definitely the wrong tool here, as it does not accept this format. It takes sequencing reads in FASTQ format or alignments in BAM format, not a count matrix. By the way, the 'nucleotides' are cellular barcodes.
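As a rough illustration of the readMM route, here is a minimal sketch. It first writes a tiny made-up dataset to a temporary directory so that it runs on its own; the directory name, gene symbols, and barcodes are all hypothetical, and the toy files are left uncompressed for simplicity. It also assumes the second column of genes.tsv holds the gene symbols, which is the usual 10x layout but worth checking for your files.

```r
library(Matrix)

# Write a toy stand-in for the three 10x-style files (made-up contents).
dir10x <- file.path(tempdir(), "toy_10x")
dir.create(dir10x, showWarnings = FALSE)

toy <- sparseMatrix(
  i = c(1, 2, 3, 3), j = c(1, 1, 2, 4), x = c(5, 1, 2, 7),
  dims = c(3, 4)
)
writeMM(toy, file.path(dir10x, "matrix.mtx"))
write.table(
  data.frame(id = c("ENSG1", "ENSG2", "ENSG3"),
             symbol = c("GeneA", "GeneB", "GeneC")),
  file.path(dir10x, "genes.tsv"),
  sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE
)
writeLines(c("AAAC-1", "AAAG-1", "AACC-1", "AACG-1"),
           file.path(dir10x, "barcodes.tsv"))

# Read the sparse counts and attach gene symbols and barcodes as dimnames.
counts   <- readMM(file.path(dir10x, "matrix.mtx"))
genes    <- read.delim(file.path(dir10x, "genes.tsv"), header = FALSE)
barcodes <- readLines(file.path(dir10x, "barcodes.tsv"))

rownames(counts) <- genes$V2   # assumption: column 2 = gene symbol
colnames(counts) <- barcodes
counts <- as(counts, "CsparseMatrix")  # column-compressed, convenient for slicing

dim(counts)  # genes x cells
```

The result is still one column per cellular barcode, not per sample; it is only the starting point for any aggregation.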

ADD COMMENT • link 7 weeks ago by ATpoint 82k

Thanks for your reply. I have used the suggestions before and had the same issue. Let me elaborate a bit more.

I need to run PACNet (http://ec2-44-201-176-192.compute-1.amazonaws.com/PACNet/webApp/), the updated version of CellNet. The input consists of a metadata file (which is easy to create) and an expression matrix. The expression matrix must have gene symbols as row names and sample names as column names. In my case, I have gene symbols as row names and cellular barcodes as column names. My metadata has a sample ID, and according to PACNet, "column names of the expression matrix must match the sample_name column of the sample metadata table". So, I need to convert the cellular barcodes to sample IDs.

For example, if I have 1 sample and 10 genes, then I should have a 10x1 matrix (10 rows and 1 column), but when I read the files (either with readMM or Read10X) I get a matrix of, for example, 10x100, because there are 100 cellular barcodes.

Thank you so much

ADD REPLY • link 7 weeks ago by Bizhan • 0

It seems to me that this is a method developed for bulk RNA-seq, which is where one typically calls a column a "sample". In single-cell data, however, each column is a cell. You might want to aggregate (pseudobulk) your cells into samples somehow, but how this needs to be done for your study I cannot tell. Generally, one would sum the counts per gene across all cells belonging to each group.
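A minimal sketch of that summing step, assuming you already know which sample each barcode belongs to. The count matrix, cell names, and the cell_to_sample mapping below are all made up for illustration; whether plain summed counts are appropriate input for PACNet is a separate question this sketch does not answer.

```r
library(Matrix)

# Toy gene x cell count matrix: 3 genes, 5 cells (made-up values).
counts <- sparseMatrix(
  i = c(1, 1, 2, 3, 3, 3), j = c(1, 2, 3, 3, 4, 5), x = c(2, 3, 1, 4, 5, 6),
  dims = c(3, 5),
  dimnames = list(c("GeneA", "GeneB", "GeneC"), paste0("cell", 1:5))
)

# Hypothetical barcode -> sample mapping; in practice this comes from your metadata.
cell_to_sample <- c(cell1 = "S1", cell2 = "S1",
                    cell3 = "S2", cell4 = "S2", cell5 = "S2")
sample_of_cell <- factor(cell_to_sample[colnames(counts)])

# Cells x samples indicator matrix: entry is 1 where the cell belongs to the sample.
indicator <- sparseMatrix(
  i = seq_along(sample_of_cell),
  j = as.integer(sample_of_cell),
  x = rep(1, length(sample_of_cell)),
  dims = c(length(sample_of_cell), nlevels(sample_of_cell)),
  dimnames = list(colnames(counts), levels(sample_of_cell))
)

# Multiplying sums the counts per gene over the cells of each sample.
pseudobulk <- as.matrix(counts %*% indicator)
pseudobulk  # genes x samples
```

The resulting genes x samples matrix can be written out with write.table. If every downloaded matrix corresponds to exactly one sample, rowSums on that matrix gives the same single column. Seurat's AggregateExpression function is, as far as I know, an alternative that works directly on a Seurat object.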