[SOLVED] How to build a table with gene expression per cell type with Seurat ?
3
2
Entering edit mode
16 months ago
Evan ▴ 90

Hello, I'm new on single-cell analysis and the use of deconvolution methods.

I would like to create my own signature matrix from single-cell rna data to use it in Cibersortx as a reference profile. Currently, I'm using Seurat to cluster my cells in cell type following this tutorial : https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html

Is it possible to get a table with in column the cells labeled with their cell type and in rows the genes with their expression in each cells (Count/RPKM/TPM ?).

In fact I would like a table which look like the picture below to use it as single cell reference sample file to build a signature matrix file to use in Cibersortx.

Reference sample file format

I would be very grateful if someone could explain me how to do it. Thank you.

single-cell cibersort deconvolution seurat • 6.2k views
ADD COMMENT
5
Entering edit mode
16 months ago
Evan ▴ 90

Thank you for your answers. I succeed to extract two tables, one with two colums with cell sequence (UMI) associated with its cell label (CD4, CD8, DC etc...) and another with gene expression per cell sequence. In case if someone is getting the same problem I use this command in R to write it in two files :

Cell Sequence and Cell Label (pmbc is my data)

write.table(pbmc@active.ident, file='Convert_UMI_Label.tsv', quote=FALSE, sep='\t', col.names = TRUE)

Gene counts per cell

write.table(pbmc@assays[["RNA"]]@counts, file='Gene_Count_per_Cell.tsv', quote=FALSE, sep='\t', col.names = TRUE)

After I used a little script in Python to merge these two files getting Gene Counts per cell labeled with their cell type :) (I could surely do it in R but my knowledge in this language is limited).

Visit this page it explains how to extract some interessant content from seurat object : https://satijalab.org/seurat/v3.0/interaction_vignette.html

ADD COMMENT
0
Entering edit mode

Thank you for sharing. I face the same problem. Can you share Python script which can merge these two files? Thanks a lot!

ADD REPLY
0
Entering edit mode

Hi, I also have the same problem. Did you find a solution?

ADD REPLY
0
Entering edit mode

Hi, sorry for the delay. Please find the code in my answer for oomoru.

ADD REPLY
0
Entering edit mode

Hi Evan, as you exported the raw counts (stored in pbmc@assays[["RNA"]]@counts), did you normalize your reference sample file prior creation of the signature matrix (e.g., RPKM) or did you submit the raw counts? Did you make any filtration on raw data prior creation of the signature matrix? Thanks

ADD REPLY
0
Entering edit mode

Hi, at the moment I don't normalize my Raw Counts matrix, as it's mentionned on Cibersortx tutorial (on it's website), raw counts are recommended. Don't forget that to get the bests results as possible, your signature matrix and your Bulk RNA-Seq must be in the same space normalization (in my case I use Raw Counts). I will not recommand RPKM to perform cellular deconvolution even if Cibersortx is able to convert it into TPM.

CIBERSORTx will automatically normalize the input data such that the sum of all normalized reads are the same for each transcriptome. If a gene length-normalized expression matrix is provided (e.g., RPKM), then the signature matrix will be in TPM (transcripts per million). If a count matrix is provided, the signature matrix will be in CPM (counts per million). Regardless of the input, the signature matrix and mixture files should be represented in the same normalization space.

Read this excellent article to know more about that :

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648640/

> Did you make any filtration on raw data prior creation of the signature matrix?

No more than the filtrations I made in Seurat to process the data.

ADD REPLY
0
Entering edit mode

Hello, please can you share the Python script for merging the files? Thanks.

ADD REPLY
0
Entering edit mode

Hi, here is the code, it's way too long for a simple thing but it has the advantage to be fast and the ram consumption is very low compared to pandas or R.

# This script takes the output file created in seurat and process
# it to be taken by Cibersortx to create signature matrix
# The file generated by this script has cell type per column
# and raw count for gene in the intersection of a cell type and a 
# gene.
import click # pip install click

# Dictionary
dict_sequence_label = {}
# Key = cell sequence ; value : cell label 

# Opening file with the conversion between sequence and cell label
# Column 1 : Single-Cell Barcode; Column 2 : Label (CD4, B, CD8 etc...)
input_directory = click.prompt('> Enter absolute path of input files directory')
output_directory = click.prompt('> Enter absolute path of output directory')

with open(input_directory+"Convert_UMI_Label.tsv","r") as file_cell_label:
    next(file_cell_label)
    #useless first line
    for line in file_cell_label:
        clean_line = line.rstrip("\n").split("\t")
        sequence = clean_line[0].replace("-",".") #Store barcode
        cell_label = clean_line[1] # Store Cell Label
        dict_sequence_label[sequence] = cell_label

print("dictionnary done !")

header = []
# Now i got the conversion between sequence and cell type
# i'll read the file with gene count per cell sequence
# and write a new file with the dictionnary above
with open(input_directory+"Gene_Count_Per_Cell.tsv","r") as file_gene_count:
    first_line_cell_sequence = file_gene_count.readline()
    #The first line of export in R present a mistake 
    #i save it and correct after
    #next(file_gene_count) #Delete first line
    with open(output_directory+"preSigMatrix.tsv","w") as file_matrix:
        split_first_line_cell_sequence = first_line_cell_sequence.rstrip("\n").split("\t")
        for element in split_first_line_cell_sequence:
            header.append(dict_sequence_label[element])
        file_matrix.write("GENES\t"+'\t'.join(header)+"\n")
        # Once I wrote all the cells label I can write the gene counts
        for line in file_gene_count:
            file_matrix.write(line)
# The script in finished and the file is now created.

print("Single Cell Raw Counts Annotated Matrix susccessfully created!")
ADD REPLY
1
Entering edit mode
16 months ago

Use meta.data of the seurat object to get the number of counts and save that using write.csv to get your table

ADD COMMENT
0
Entering edit mode

Thank you, it helped me a lot :)

ADD REPLY
1
Entering edit mode
16 months ago
igor 12k

Use the AverageExpression() function to get the averaged feature expression by identity class.

ADD COMMENT
0
Entering edit mode

Thank you I will try this solution :)

ADD REPLY

Login before adding your answer.

Traffic: 2491 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6