Analyzing digital gene expression data (DGE) from drop-seq pipeline with Seurat.
2
1
Entering edit mode
2.4 years ago
chilifan ▴ 80

I am using Seurat to cluster data that previously has been filtered, aligned and turned into DGE by the Drop-Seq alignment pipline from Drop-seq tools. This has created a file sample_DGE.txt.gz. I then want to cluster my data and do a QC analysis through calculating the percent mithocondrial genes. I am following the Seurat Clustering tutorial found here: https://satijalab.org/seurat/pbmc3k_tutorial.html In this tutorial they use the files barcodes.tsv, genes.tsv and matrix.mtx generated by 10x genomics as raw data, and read it with the command Read10X(). I have generated these three files from our DGE data inspired by this biostars page: A: Storing a gene expression matrix in a matrix.mtx It works fine, except that the row name title "GENE" is stored as a column name, saved into barcodes.tsv, which later in Seurat is a problem because seurat uses "GENE" as one of the cell barcodes when calculating the percent mitochondrial DNA per cell. Example below:

This of course, makes it impossible to use VlnPlot, generating the error:

VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3) Error in if(all(data[,feature] == data,feature)) { : missing value where TRUE/FALSE needed

Simple removing "GENE" manually from the barcodes.tsv file creates a error in dimensions at the Read10X step.

pbmc.data <- Read10X(data.dir = "dir/to/barcode_matrix_and_gene_files") Error in dimnamesGets(x, value) : invalid dimnames given for "dgTMatrix" object stop(gettextf("invalid dimnames given for %s object" dQuote(class(x))), domail + NA) dimnamesGets(x, value)

SO my question is: do anyone know a workaround to this problem? Or is there an equivalent to Read10X(), say ReadDGE() or ReadDropseq() that can be used directly on my DGE file?

RNA-Seq R Seurat • 4.7k views
0
Entering edit mode

Thank you @Igor that is the answer to a question I've been pondering for a long time. However, ?CreateSeuratObject uses this example:

pbmc_raw <- read.table(
file = system.file('extdata', 'pbmc_raw.txt', package = 'Seurat'),
as.is = TRUE
)
pbmc_small <- CreateSeuratObject(raw.data = pbmc_raw)


which in my case would be:

# Load the PBMC dataset
pbmc.data <- read.table(file=system.file("DGE.txt", package = 'Seurat'), as.is =TRUE)


but it yields the error:

Error in read.table(file = system.file("DGE.txt", package = "Seurat"), : no lines available in input

1
Entering edit mode

If you are not sure what a function does, you can check by putting a ? in front of it. For example, ?system.file. That will tell you that system.file takes "character vectors, specifying subdirectory and file(s) within some package". In the example, they are using pbmc_raw.txt from the Seurat package. Your file is not stored in the Seurat package. You should specify the exact path where it is. Using system.file is not needed.

0
Entering edit mode

Thank you @Igor, got it! :)

0
Entering edit mode

thanks @chilifan. I started working with scRNA-seq recently and was stuck in the same issue. you saved my life :-).

2
Entering edit mode
2.4 years ago
chilifan ▴ 80

Because I like closure I will answer my own question. You need to read in the DGE data before Creating the seurat object. You also need to define column and row names manually and set the data type of data for both the row names (character) and the rest of the data columns (numeric)

#Read DGE file
pbmc.data <- read.table(file = "DGE.txt", header = TRUE, row.names = 1, colClasses =c("character", rep("numeric", 10000)))


I can hardcode 10000 since I chose to include 10000 cells already in the DGE step.

# Initialize the Seurat object with the raw (non-normalized data).
# Keep all genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at least 200 detected genes
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, project = "10X_PBMC")


Thanks again @igor for giving me the clue to figure this out! :)

1
Entering edit mode
2.4 years ago
igor 12k

If you don't have 10x data, then you don't need to use Read10X(). This is a function to make reading 10x data easier since it's not stored as a simple CSV/TSV table. There is no need to try to recreate that format. In the same tutorial, you can skip to the next step, which is CreateSeuratObject() and then give it your data matrix. When you create the matrix, make sure it has the appropriate columns and rows.

0
Entering edit mode

So my question is "what are 'appropriate columns and rows'"? Does it include requirements imposed on column and row names? I was following a vignette for seurat and it clearly made assumptions about the row names (expecting mitochondrial genes to have row names that start with "MT-"). Plus, there appear to be steps that assume there are no commas in the row names, because they generate errors like "Error in parse(text = x) : <text>:1:7: unexpected ','". And I'm also concerned about the column names and possible format assumptions there. When I was debugging an issue with a step in the vignette, I took a look at the output of the method that an error was coming from, by supplying all my column names:

> CellsByIdentities(object=pbmc, cells=colnames(pbmc))
$AAACATCGAAACATCG [1] "AAACATCGAAACATCG_0" "AAACATCGAAACATCG_1"$AAACATCGAACGTGAT
[1] "AAACATCGAACGTGAT_0"

$AACGTGATAAACATCG [1] "AACGTGATAAACATCG_0" "AACGTGATAAACATCG_1"$AACGTGATAACGTGAT
[1] "AACGTGATAACGTGAT_0" "AACGTGATAACGTGAT_1"


I'm not convinced that this is doing the correct thing. My understanding of split-seq data is that the _# differentiates different cells, but from the above, I infer that cells are considered to be the same if the 16 nts are the same. Is that what it's assuming?

The other assumptions have to do with data size. I was running a small set of test data, and I had to adjust some parameters. So I would like to know if there's a hard lower limit on the data. And do I need to restructure my data to separate counts of genes from different species (e.g. mouse & human)? I.e. Should I create separate matrices that have only the human-identified cells and gene annotations to use it with Seurat?

0
Entering edit mode

So my question is "what are 'appropriate columns and rows'"?

Column names are cells. Row names are genes. Matrix values are expression values (such as counts or TPMs).

I was following a vignette for seurat and it clearly made assumptions about the row names (expecting mitochondrial genes to have row names that start with "MT-")

If you have data that does not look like example data, you have to make modifications to the protocol. If your mitochondrial genes do not have "MT-", then you have to define them using an alternate method.

My understanding of split-seq data is that the _# differentiates different cells

You should go over the CreateSeuratObject documentation. For example, there are parameters names.field and names.delim that assign samples to groups based on cell names (column names).

I was running a small set of test data, and I had to adjust some parameters. So I would like to know if there's a hard lower limit on the data.

Some of the steps have a lower limit. For example, you won't be able to get 50 PCs if you have less than 50 genes and you cannot calculate 30 neighbors if you have less than 30 cells. Seurat comes with a build-in dataset pbmc_small that has 230 genes and 80 cells which is expected to work for all steps. In general, if you use all the genes and 100 cells, almost every step will take less than a few seconds, so there is no reason to be too aggressive with test data size.