I need help with STARsolo and scRNA-seq
10 months ago

Hi! My name is Rafa and I am a beginner in the world of scRNA-seq. I've been looking at workflows like https://scrnaseq-course.cog.sanger.ac.uk/website/index.html or https://broadinstitute.github.io/2019_scWorkshop/index.html#course-overview and I do not understand how the SCE object is created from the STARsolo alignment output.

I'm using the https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/neurons_900 dataset for practice, as it doesn't take up much memory and keeps wait times short. I'm analyzing it with STARsolo, using the following command:

STAR --genomeDir /home/victor/Escritorio/Curso_Single_Cell/indices/STAR \
  --runThreadN 16 \
  --readFilesIn neurons_900_fastqs/neurons_900_S1_L001_R2_001.fastq,neurons_900_fastqs/neurons_900_S1_L002_R2_001.fastq neurons_900_fastqs/neurons_900_S1_L001_R1_001.fastq,neurons_900_fastqs/neurons_900_S1_L002_R1_001.fastq \
  --soloType CB_UMI_Simple \
  --soloCBwhitelist /home/victor/Escritorio/Curso_Single_Cell/whitelist/737K-august-2016.txt \
  --outFileNamePrefix results/STAR/

After that, STARsolo returns raw and filtered output, each containing the matrix, barcodes and genes/features files. But when I load these three files and create an SCE object, the counts in the assay do not look correct.

> dir.name <- "/home/victor/Escritorio/Curso_Single_Cell/results/STAR/Solo.out/Gene/raw"
> list.files(dir.name)
[1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"  
> sce <- DropletUtils::read10xCounts(dir.name, col.names = TRUE)
> sce

class: SingleCellExperiment 
dim: 55487 737280 
metadata(1): Samples
assays(1): counts
rownames(55487): ENSMUSG00000102693 ENSMUSG00000064842 ... ENSMUSG00000096730 ENSMUSG00000095742
rowData names(3): ID Symbol NA
colnames(737280): AAACCTGAGAAACCAT AAACCTGAGAAACCGC ... TTTGTCATCTTTAGTC TTTGTCATCTTTCCTC
colData names(2): Sample Barcode
reducedDimNames(0):
spikeNames(0):
altExpNames(0):

> summary(assay(sce, "counts"))
55487 x 737280 sparse Matrix of class "dgCMatrix", with 5113008 entries 
        i   j x
1    2681   1 1
2   26019   1 1
3   30593   1 1
4   30624   1 1
5   30756   1 1
6   36144   1 1
7   38875   1 1
8   53732   1 1
9   46321   3 1
10  55399   5 1
11   4333   6 1
12   7768   6 1
13  10051   6 1
14  15470   6 1
15  25255   6 1
16  32249   6 1
17  33914   6 1
18  37100   6 1
19  40026   6 1
20  40180   6 1
21  41019   6 1
22  49661   6 1
23  49669   6 1
24  18081   7 1
25  16776   9 1
26  54018  11 1
27    272  12 1
28   9832  12 1
29  13560  12 1
30  14856  12 1
31  15490  12 1
32  18592  12 1
33  23950  12 1
34  25910  12 1
35  28138  12 1
36  28177  12 1
37  35881  12 1
38  36144  12 1
39  36692  12 1
40  37663  12 1
41  38459  12 1
42  39978  12 1
43  40156  12 1
44  41019  12 1
45  41030  12 1
46  43773  12 1
47  46411  12 2
48  48427  12 1
49  49388  12 1
50  49409  12 1
51  49414  12 2
52  50650  12 1
53  33914  14 1
 ... etc

I don't know why this is happening. Could it be because I still need to count the reads per gene? I thought that STARsolo performs both the mapping and the counting. If this is the reason, what should I do?

Thanks a lot!! :)

Star scRNA-seq Starsolo RNA-Seq • 650 views

And what should the rownames be?

> assay(sce, "counts")
55487 x 992 sparse Matrix of class "dgCMatrix"
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]

ENSMUSG00000102693 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000064842 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000051951 1 . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000102851 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000103377 . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . ......
ENSMUSG00000104017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

 ..............................
 ........suppressing 915 columns and 55475 rows in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]

ENSMUSG00000095434 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000094431 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000094621 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000098647 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000096730 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000095742 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original question

10 months ago
pacome.pr ▴ 100

Hello Rafa,

I am not an expert on STARsolo, but looking at the rownames of your SingleCellExperiment, it seems that reads were counted against the mouse gene annotation (the rownames are Ensembl mouse gene IDs):

rownames(55487): ENSMUSG00000102693 ENSMUSG00000064842 ... ENSMUSG00000096730 ENSMUSG00000095742

I am not sure why the matrix is shown in this form; maybe it is because of the summary function?

55487 x 737280 sparse Matrix of class "dgCMatrix", with 5113008 entries 
        i   j x
1    2681   1 1
2   26019   1 1
3   30593   1 1
4   30624   1 1
5   30756   1 1

This is the sparse representation of your matrix, i.e. the row index (i), column index (j) and value (x) of each non-zero entry. For example, in row 2681, column 1, the value is 1.
What happens if you run:

head(assay(sce, "counts"))

?
If it is not in its sparse matrix (dgCMatrix) representation, see ?Matrix::sparseMatrix for how to create the matrix from the non-zero entries.
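
To make the triplet idea concrete, here is a minimal sketch on a toy matrix. The indices and counts are made up and only the Matrix package is assumed. In the real output above, only 5,113,008 of the roughly 4.1 x 10^10 cells (55487 x 737280) are stored, which is why summary() lists them as i, j, x rows.

```r
library(Matrix)

# Toy triplets (row i, column j, count x); all values are invented.
i <- c(1, 3, 2, 3)
j <- c(1, 1, 2, 4)
x <- c(1, 2, 5, 1)

# Build a 3 x 4 sparse matrix; every cell not listed is an implicit zero.
m <- sparseMatrix(i = i, j = j, x = x, dims = c(3, 4))

# summary() gives back the stored triplets, one row per non-zero entry.
trip <- summary(m)
print(trip)

# A listed cell returns its stored value; an unlisted cell returns zero.
listed_value   <- m[3, 1]
unlisted_value <- m[1, 2]
print(c(listed_value, unlisted_value))
```

So a row such as "2681 1 1" in the summary output means that cell [2681, 1] holds a count of 1, and anything not listed is 0.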

10 months ago
ATpoint 52k

It seems to me that you read the entire set of barcodes from the 737k whitelist into your sce. I am neither a STARsolo nor a CellRanger user (Alevin for the win ;-) ), but maybe you selected the wrong directory? The number of rows looks fine, but 737k columns must be wrong. You selected the folder raw; is there a second folder, something like filtered, where the empty barcodes have been eliminated?
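
As a rough illustration of why raw has one column per whitelist barcode while filtered has far fewer, here is a sketch on a toy matrix. The matrix, the barcode names and the min_umis cutoff are all hypothetical, and a plain threshold like this is not the cell-calling method STARsolo itself applies; it only shows the idea of dropping barcodes with little or no signal.

```r
library(Matrix)

# Hypothetical toy "raw" matrix: 4 genes x 6 whitelist barcodes.
raw <- sparseMatrix(
  i = c(1, 2, 4, 1, 3, 2),
  j = c(1, 1, 1, 4, 4, 6),
  x = c(5, 3, 2, 7, 1, 1),
  dims = c(4, 6),
  dimnames = list(paste0("gene", 1:4), paste0("bc", 1:6))
)

# Total UMI count per barcode; most whitelist barcodes have none at all.
umi_per_barcode <- Matrix::colSums(raw)
print(umi_per_barcode)

# Assumed cutoff, for illustration only.
min_umis <- 5
filtered <- raw[, umi_per_barcode >= min_umis, drop = FALSE]
print(dim(filtered))
```

In practice it is simpler to point read10xCounts at the filtered directory that STARsolo already wrote next to raw.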


I have used the filtered folder, and now I have the correct number of cells! Thanks :) But the number of counts in the assay still looks too small.

> sce
class: SingleCellExperiment 
dim: 55487 992 
metadata(1): Samples
assays(1): counts
rownames(55487): ENSMUSG00000102693 ENSMUSG00000064842 ... ENSMUSG00000096730 ENSMUSG00000095742
rowData names(3): ID Symbol NA
colnames(992): AAACCTGGTCTCGTTC AAACGGGAGCCACGTC ... TTTGGTTTCATGCATG TTTGTCACATCGGTTA
colData names(2): Sample Barcode
reducedDimNames(0):
spikeNames(0):
altExpNames(0):

> assay(sce, "counts")
55487 x 992 sparse Matrix of class "dgCMatrix"
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]
   [[ suppressing 77 column names ‘AAACCTGGTCTCGTTC’, ‘AAACGGGAGCCACGTC’, ‘AAACGGGAGCGAGAAA’ ... ]]

ENSMUSG00000102693 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000064842 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000051951 1 . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000102851 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
ENSMUSG00000103377 . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . ......
ENSMUSG00000104017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

Not sure what you mean. Do you mean these dots? That is simply how this sparse matrix format (dgCMatrix) prints zero entries. Nothing to worry about, it is a kind of compression. You should be good to go.
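
If the printout alone is not convincing, the totals can be checked directly instead of reading the dots. The sketch below uses a made-up toy matrix as a stand-in for assay(sce, "counts") and assumes only the Matrix package.

```r
library(Matrix)

# Toy counts matrix standing in for assay(sce, "counts"); values are invented.
counts <- sparseMatrix(
  i = c(3, 5, 3, 5),
  j = c(1, 5, 12, 12),
  x = c(1, 2, 1, 1),
  dims = c(6, 12)
)

# Zeros are printed as "." in the console.
print(counts[, 1:5])

n_stored       <- nnzero(counts)               # number of non-zero entries
total_umis     <- sum(counts)                  # total counts in the matrix
umis_per_cell  <- Matrix::colSums(counts)      # counts per cell (column)
cells_per_gene <- Matrix::rowSums(counts > 0)  # cells in which each gene is detected

print(total_umis)
print(umis_per_cell)
```

Running the same four summaries on the real matrix gives a quick sense of whether the per-cell totals are in the range you expect for the dataset.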
