Question

I need help with Starsolo and scRNA-seq

1


3.6 years ago

Rafael Soler ★ 1.2k

Hi! My name is Rafa and I am a beginner in the world of scRNA-seq. I've been looking at workflows like https://scrnaseq-course.cog.sanger.ac.uk/website/index.html or https://broadinstitute.github.io/2019_scWorkshop/index.html#course-overview and I do not understand the creation of the SCE object or the STARsolo alignment.

I'm using the https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/neurons_900 dataset for practice because it is small, which keeps memory use and wait times low. I'm analyzing it with STARsolo, using the following command:

STAR --genomeDir /home/victor/Escritorio/Curso_Single_Cell/indices/STAR \
  --runThreadN 16 \
  --readFilesIn neurons_900_fastqs/neurons_900_S1_L001_R2_001.fastq,neurons_900_fastqs/neurons_900_S1_L002_R2_001.fastq neurons_900_fastqs/neurons_900_S1_L001_R1_001.fastq,neurons_900_fastqs/neurons_900_S1_L002_R1_001.fastq \
  --soloType CB_UMI_Simple \
  --soloCBwhitelist /home/victor/Escritorio/Curso_Single_Cell/whitelist/737K-august-2016.txt \
  --outFileNamePrefix results/STAR/

After that, STARsolo returns raw and filtered output, each containing the matrix, barcodes and genes/features files. But when I load these three files and create an SCE object, the counts in the assay do not look correct.

> dir.name <- "/home/victor/Escritorio/Curso_Single_Cell/results/STAR/Solo.out/Gene/raw"
> list.files(dir.name)
[1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"  
> sce <- DropletUtils::read10xCounts(dir.name, col.names = TRUE)
> sce

class: SingleCellExperiment 
dim: 55487 737280 
metadata(1): Samples
assays(1): counts
rownames(55487): ENSMUSG00000102693 ENSMUSG00000064842 ... ENSMUSG00000096730 ENSMUSG00000095742
rowData names(3): ID Symbol NA
colnames(737280): AAACCTGAGAAACCAT AAACCTGAGAAACCGC ... TTTGTCATCTTTAGTC TTTGTCATCTTTCCTC
colData names(2): Sample Barcode
reducedDimNames(0):
spikeNames(0):
altExpNames(0):

> summary(assay(sce, "counts"))
55487 x 737280 sparse Matrix of class "dgCMatrix", with 5113008 entries 
        i   j x
1    2681   1 1
2   26019   1 1
3   30593   1 1
4   30624   1 1
5   30756   1 1
6   36144   1 1
7   38875   1 1
8   53732   1 1
9   46321   3 1
10  55399   5 1
11   4333   6 1
12   7768   6 1
13  10051   6 1
14  15470   6 1
15  25255   6 1
16  32249   6 1
17  33914   6 1
18  37100   6 1
19  40026   6 1
20  40180   6 1
21  41019   6 1
22  49661   6 1
23  49669   6 1
24  18081   7 1
25  16776   9 1
26  54018  11 1
27    272  12 1
28   9832  12 1
29  13560  12 1
30  14856  12 1
31  15490  12 1
32  18592  12 1
33  23950  12 1
34  25910  12 1
35  28138  12 1
36  28177  12 1
37  35881  12 1
38  36144  12 1
39  36692  12 1
40  37663  12 1
41  38459  12 1
42  39978  12 1
43  40156  12 1
44  41019  12 1
45  41030  12 1
46  43773  12 1
47  46411  12 2
48  48427  12 1
49  49388  12 1
50  49409  12 1
51  49414  12 2
52  50650  12 1
53  33914  14 1
 ... etc

I don't know why this is happening. Maybe it is because I still need to count the reads per gene? I thought that STARsolo performs both the mapping and the counting. If this is the reason, what should I do?

Thanks a lot!! :)

Star scRNA-seq Starsolo RNA-Seq • 3.9k views

ADD COMMENT • link 3.6 years ago by Rafael Soler ★ 1.2k

0


And what should the rownames be?

> assay(sce, "counts")
55487 x 992 sparse Matrix of class "dgCMatrix"
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]

ENSMUSG00000102693 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000064842 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000051951 1 . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000102851 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000103377 . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . ......
ENSMUSG00000104017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

 ..............................
 ........suppressing 915 columns and 55475 rows in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]

ENSMUSG00000095434 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000094431 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000094621 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000098647 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000096730 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000095742 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

ADD REPLY • link 3.6 years ago by Rafael Soler ★ 1.2k

0


Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to the original question.

ADD REPLY • link 3.6 years ago by GenoMax 141k

score 2 · Accepted Answer · 2020-09-21

Hello Rafa,

I am not an expert on STARsolo, but looking at the rownames of your SingleCellExperiment, it seems that reads were counted against mouse genes (Ensembl gene IDs):

rownames(55487): ENSMUSG00000102693 ENSMUSG00000064842 ... ENSMUSG00000096730 ENSMUSG00000095742

I am not sure why the matrix is in this form; maybe it is due to the summary function?

55487 x 737280 sparse Matrix of class "dgCMatrix", with 5113008 entries 
        i   j x
1    2681   1 1
2   26019   1 1
3   30593   1 1
4   30624   1 1
5   30756   1 1

This is the sparse representation of your matrix, i.e. the row and column indices and the values of the non-zero entries. For example, in row 2681, column 1, the value is 1.
What happens if you run:

head(assay(sce, "counts"))

?
If it is not in its sparse matrix (dgCMatrix) representation, see ?Matrix::sparseMatrix to create the matrix from the non-zero entries.
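
To make the i / j / x layout concrete, here is a minimal sketch on a toy matrix. The values and the object names (trip, m, s, m2) are invented for illustration and are not taken from the neurons_900 data; only the Matrix package is assumed.

```r
library(Matrix)

# Toy 4 x 3 count matrix built from (row, column, value) triplets,
# the same i / j / x layout that summary() prints for a dgCMatrix.
# All values here are made up for illustration.
trip <- data.frame(i = c(1, 3, 4, 2),
                   j = c(1, 1, 2, 3),
                   x = c(1, 2, 1, 5))
m <- sparseMatrix(i = trip$i, j = trip$j, x = trip$x, dims = c(4, 3))

# summary() lists only the non-zero entries, one triplet per line
s <- summary(m)
print(s)

# Each line of that listing means m[i, j] == x
m[3, 1]   # 2

# The triplets are enough to rebuild the full matrix
m2 <- sparseMatrix(i = s$i, j = s$j, x = s$x, dims = dim(m))
identical(as.matrix(m), as.matrix(m2))   # TRUE
```

Reading the listing this way, a line such as "2681 1 1" means that gene row 2681 has a count of 1 in the first barcode column.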


2


3.6 years ago

ATpoint 82k

It seems to me that you read the entire set of barcodes from the 737k whitelist into your sce. I use neither STARsolo nor CellRanger (Alevin for the win ;-) ), but maybe you selected the wrong directory? The number of rows looks fine, but 737k columns must be wrong. You selected the folder raw; is there a second folder, something like filtered, where the empty barcodes have been removed?
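
A quick way to see how many barcode columns are actually empty is to sum the counts per column. The sketch below uses a toy matrix with invented values and hypothetical names (raw, umi_per_barcode, nonempty). Dropping all-zero columns is only a sanity check and is not the same as the cell filtering that produces the filtered folder.

```r
library(Matrix)

# Toy "raw" matrix: 5 genes x 6 whitelist barcodes, where only
# barcodes bc2 and bc5 carry any counts (values are made up).
raw <- sparseMatrix(i = c(1, 2, 4, 3, 5),
                    j = c(2, 2, 2, 5, 5),
                    x = c(3, 1, 2, 4, 1),
                    dims = c(5, 6),
                    dimnames = list(paste0("gene", 1:5), paste0("bc", 1:6)))

# Total counts per barcode (column)
umi_per_barcode <- colSums(raw)
print(umi_per_barcode)

# Number of barcodes with no counts at all
sum(umi_per_barcode == 0)   # 4

# Keep only barcodes that have at least one count
nonempty <- raw[, umi_per_barcode > 0, drop = FALSE]
dim(nonempty)               # 5 2
```

On a real object the same idea would apply to assay(sce, "counts"); a raw matrix with 737280 columns is expected to consist mostly of barcodes with zero or near-zero totals.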

ADD COMMENT • link 3.6 years ago by ATpoint 82k

0


I have used the filtered folder, and now I have the correct number of cells! Thanks :) But the counts in the assay still look too small.

> sce
class: SingleCellExperiment 
dim: 55487 992 
metadata(1): Samples
assays(1): counts
rownames(55487): ENSMUSG00000102693 ENSMUSG00000064842 ... ENSMUSG00000096730 ENSMUSG00000095742
rowData names(3): ID Symbol NA
colnames(992): AAACCTGGTCTCGTTC AAACGGGAGCCACGTC ... TTTGGTTTCATGCATG TTTGTCACATCGGTTA
colData names(2): Sample Barcode
reducedDimNames(0):
spikeNames(0):
altExpNames(0):

> assay(sce, "counts")
55487 x 992 sparse Matrix of class "dgCMatrix"
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]

ENSMUSG00000102693 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000064842 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000051951 1 . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000102851 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000103377 . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . ......
ENSMUSG00000104017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

ADD REPLY • link 3.6 years ago by Rafael Soler ★ 1.2k

1


Not sure what you mean. Do you mean these dots? This is how the sparse matrix format (dgCMatrix) displays zeros. Nothing to worry about; it is a kind of compression. You should be good to go.
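
As a small illustration of the dots, the sketch below prints a toy sparse matrix and checks a few summary numbers. The matrix and its values are invented for the example; only the Matrix package is assumed.

```r
library(Matrix)

# Toy 6 x 5 sparse matrix with three non-zero entries (made-up values)
m <- sparseMatrix(i = c(3, 5, 5),
                  j = c(1, 5, 2),
                  x = c(1, 2, 1),
                  dims = c(6, 5))

# Printing shows "." wherever the stored value is zero
print(m)

m[1, 1]     # 0, displayed as "." in the printout
nnzero(m)   # 3 non-zero entries
sum(m)      # 4 counts in total

# Fraction of entries that are zero
1 - nnzero(m) / prod(dim(m))   # 0.9
```

scRNA-seq count matrices are typically very sparse, so a printout that is mostly dots with a few small integers is what one would expect to see.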

ADD REPLY • link 3.6 years ago by ATpoint 82k