[SOLVED] How to build a table with gene expression per cell type with Seurat ?
3
4
Entering edit mode
4.2 years ago
Evan ▴ 220

Hello, I'm new on single-cell analysis and the use of deconvolution methods.

I would like to create my own signature matrix from single-cell rna data to use it in Cibersortx as a reference profile. Currently, I'm using Seurat to cluster my cells in cell type following this tutorial : https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html

Is it possible to get a table with in column the cells labeled with their cell type and in rows the genes with their expression in each cells (Count/RPKM/TPM ?).

In fact I would like a table which look like the picture below to use it as single cell reference sample file to build a signature matrix file to use in Cibersortx.

Reference sample file format

I would be very grateful if someone could explain me how to do it. Thank you.

single-cell cibersort deconvolution seurat • 28k views
ADD COMMENT
6
Entering edit mode
4.2 years ago
Evan ▴ 220

Thank you for your answers. I succeed to extract two tables, one with two colums with cell sequence (UMI) associated with its cell label (CD4, CD8, DC etc...) and another with gene expression per cell sequence. In case if someone is getting the same problem I use this command in R to write it in two files :

Cell Sequence and Cell Label (pmbc is my data)

write.table(pbmc@active.ident, file='Convert_UMI_Label.tsv', quote=FALSE, sep='\t', col.names = TRUE)

Gene counts per cell

write.table(pbmc@assays[["RNA"]]@counts, file='Gene_Count_per_Cell.tsv', quote=FALSE, sep='\t', col.names = TRUE)

After I used a little script in Python to merge these two files getting Gene Counts per cell labeled with their cell type :) (I could surely do it in R but my knowledge in this language is limited).

Visit this page it explains how to extract some interessant content from seurat object : https://satijalab.org/seurat/v3.0/interaction_vignette.html

ADD COMMENT
0
Entering edit mode

Thank you for sharing. I face the same problem. Can you share Python script which can merge these two files? Thanks a lot!

ADD REPLY
0
Entering edit mode

Hi, I also have the same problem. Did you find a solution?

ADD REPLY
0
Entering edit mode

Hi, sorry for the delay. Please find the code in my answer for oomoru.

ADD REPLY
0
Entering edit mode

Hi Evan, as you exported the raw counts (stored in pbmc@assays[["RNA"]]@counts), did you normalize your reference sample file prior creation of the signature matrix (e.g., RPKM) or did you submit the raw counts? Did you make any filtration on raw data prior creation of the signature matrix? Thanks

ADD REPLY
2
Entering edit mode

Hi, at the moment I don't normalize my Raw Counts matrix, as it's mentionned on Cibersortx tutorial (on it's website), raw counts are recommended. Don't forget that to get the bests results as possible, your signature matrix and your Bulk RNA-Seq must be in the same space normalization (in my case I use Raw Counts). I will not recommand RPKM to perform cellular deconvolution even if Cibersortx is able to convert it into TPM.

CIBERSORTx will automatically normalize the input data such that the sum of all normalized reads are the same for each transcriptome. If a gene length-normalized expression matrix is provided (e.g., RPKM), then the signature matrix will be in TPM (transcripts per million). If a count matrix is provided, the signature matrix will be in CPM (counts per million). Regardless of the input, the signature matrix and mixture files should be represented in the same normalization space.

Read this excellent article to know more about that :

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648640/

> Did you make any filtration on raw data prior creation of the signature matrix?

No more than the filtrations I made in Seurat to process the data.

ADD REPLY
0
Entering edit mode

Hello, please can you share the Python script for merging the files? Thanks.

ADD REPLY
2
Entering edit mode

Hi, here is the code, it's way too long for a simple thing but it has the advantage to be fast and the ram consumption is very low compared to pandas or R.

# This script takes the output file created in seurat and process
# it to be taken by Cibersortx to create signature matrix
# The file generated by this script has cell type per column
# and raw count for gene in the intersection of a cell type and a 
# gene.
import click # pip install click

# Dictionary
dict_sequence_label = {}
# Key = cell sequence ; value : cell label 

# Opening file with the conversion between sequence and cell label
# Column 1 : Single-Cell Barcode; Column 2 : Label (CD4, B, CD8 etc...)
input_directory = click.prompt('> Enter absolute path of input files directory')
output_directory = click.prompt('> Enter absolute path of output directory')

with open(input_directory+"Convert_UMI_Label.tsv","r") as file_cell_label:
    next(file_cell_label)
    #useless first line
    for line in file_cell_label:
        clean_line = line.rstrip("\n").split("\t")
        sequence = clean_line[0].replace("-",".") #Store barcode
        cell_label = clean_line[1] # Store Cell Label
        dict_sequence_label[sequence] = cell_label

print("dictionnary done !")

header = []
# Now i got the conversion between sequence and cell type
# i'll read the file with gene count per cell sequence
# and write a new file with the dictionnary above
with open(input_directory+"Gene_Count_Per_Cell.tsv","r") as file_gene_count:
    first_line_cell_sequence = file_gene_count.readline()
    #The first line of export in R present a mistake 
    #i save it and correct after
    #next(file_gene_count) #Delete first line
    with open(output_directory+"preSigMatrix.tsv","w") as file_matrix:
        split_first_line_cell_sequence = first_line_cell_sequence.rstrip("\n").split("\t")
        for element in split_first_line_cell_sequence:
            header.append(dict_sequence_label[element])
        file_matrix.write("GENES\t"+'\t'.join(header)+"\n")
        # Once I wrote all the cells label I can write the gene counts
        for line in file_gene_count:
            file_matrix.write(line)
# The script in finished and the file is now created.

print("Single Cell Raw Counts Annotated Matrix susccessfully created!")
ADD REPLY
0
Entering edit mode

Thank you very much, this is so helpful to me. I used to the code but had error with this section-


KeyError Traceback (most recent call last)

<ipython-input-105-3265a1ddf5eb> in <module> 11 split_first_line_cell_sequence = first_line_cell_sequence.rstrip("\n").split("\t") 12 for element in split_first_line_cell_sequence: ---> 13 header.append(dict_sequence_label[element]) 14 file_matrix.write("GENES\t"+'\t'.join(header)+"\n") 15 # Once I wrote all the cells label I can write the gene counts

KeyError: '1_AAACCTGAGACTTGAA.1'

I don't know why it keeps giving the error. Please, how can I avoid this. I appreciate your time and kind assistance.

ADD REPLY
0
Entering edit mode

Hi, it seems the cells abels in the file "Convert_UMI_Label.tsv" are not the same than in the raw counts matrix "Gene_Count_Per_Cell.tsv". Maybe, in one of these file the cell label (barcode) is 1_AAACCTGAGACTTGAA.1 and in the other "1_AAACCTGAGACTTGAA_1". Could you give me one cell barcode per file to adjust the code for you ?

ADD REPLY
0
Entering edit mode

Thanks Evan! Gene_Count_Per_Cell.tsv, first barcode- 1_AAACCTGAGACTTGAA.1 Convert_UMI_Label.tsv, first barcode- 1_AAACCTGAGACTTGAA.1

The barcodes are the same for both files. At first, I was getting an error that stated- KeyError: ' ', when my files looked like this- Gene_Count_Per_Cell.tsv

Convert_UMI_Label.tsv

so I adjusted them in excel to this- Gene_Count_Per_Cell.tsv

Convert_UMI_Label.tsv

Then the error became this- KeyError: '1_AAACCTGAGACTTGAA.1' - this is the first barcode in both files.

I think the error is occurring because we are taking elements from the dictionary which has the quotation mark as headers in the new file, but I don't know much, and how to go about it.

ADD REPLY
0
Entering edit mode

Hum okay I see, could you upload the both files somewhere to help you ?

ADD REPLY
0
Entering edit mode

Thanks a lot Evan! I have the files in the link below. https://drive.google.com/drive/folders/1GkwubWLgyUgStveemX43Ud4l1_e5Hnt9?usp=sharing

-it is publicly available data from geo ncbi.

ADD REPLY
1
Entering edit mode

Thanks ! I modified some stuff in your files, firstly, I removed the " in the Convert_UMI_Label.tsv, secondly the 'X' in front of the barcodes in the Gene_Count_Per_Cell.tsv and finally the the tabulation before the first barcode in the count matrix.

I ran the script after these little modifications and the (pre) signature matrix has been generated without problems :)

Here is the link where you can download the corrected files and the (pre)signature matrix generated : https://mega.nz/folder/REUBzCaZ#uhFr6F82imiky2LVVf1nSg

Now you're able to build your signature matrix with Cibersortx. You can do it on their website (storage limited to 1Gb and this sigmatrix is ~ 670mb) or directly from Docker. You must request a token before using Docker. Enjoy !

ADD REPLY
0
Entering edit mode

Thank you very much Evan!

ADD REPLY
1
Entering edit mode
4.2 years ago

Use meta.data of the seurat object to get the number of counts and save that using write.csv to get your table

ADD COMMENT
0
Entering edit mode

Thank you, it helped me a lot :)

ADD REPLY
1
Entering edit mode
4.2 years ago
igor 13k

Use the AverageExpression() function to get the averaged feature expression by identity class.

ADD COMMENT
0
Entering edit mode

Thank you I will try this solution :)

ADD REPLY

Login before adding your answer.

Traffic: 2398 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6