Hello all,
We have generated single-cell RNA-seq data for more than 1,500 cells of various cardiac cell types. The cells were isolated by FACS sorting and the cell types are unknown; they were loaded onto 20 96-well plates. Library preparation was done by SMART-seq2 on total RNA, and the samples were sequenced paired-end on an Illumina 2500. The experiment included ERCC spike-ins for normalization. We would like to filter and normalize the scRNA-seq data and then separate the cell types computationally (e.g. by PCA or t-SNE), so that we can count the cells of each type, compare their transcriptomes, and explore new cell subtypes.
We deduplicated the FASTQ reads and obtained gene-level counts with kallisto and tximport. In the raw (un-normalized) counts we see a negative correlation, both within each plate and across all plates, between the number of detected genes (genes with count > 0) and the library size, which we think is unexpected. The correlation coefficients range from -0.25 to -0.5, and a similar relationship holds between the total spike-in counts and the number of detected genes. Our library sizes range from roughly 200,000 to 2,000,000 reads. Has anyone encountered such a relationship before? We are worried it might affect the accuracy of the downstream spike-in normalization and differential expression, and we are not sure which type of normalization is best for our data.
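For reference, this is essentially how we computed the per-cell statistics and their correlation (a minimal Python sketch; the counts matrix here is simulated for illustration, whereas in practice it would be the gene-level counts exported from tximport):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy counts matrix, genes x cells (simulated stand-in for the real
# tximport gene-level counts, e.g. loaded from a CSV export).
counts = rng.negative_binomial(n=2, p=0.9, size=(2000, 300)).astype(float)

lib_size = counts.sum(axis=0)        # total counts per cell (library size)
detected = (counts > 0).sum(axis=0)  # genes with count > 0 per cell

# Pearson correlation between library size and number of detected genes;
# in our data this comes out between -0.25 and -0.5.
r = np.corrcoef(lib_size, detected)[0, 1]
print(f"correlation: {r:.2f}")
```

The same calculation restricted to cells from a single plate gives the within-plate coefficients.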
Thank you, Mike
Can you share the plot?
And can you check the most strongly expressed genes per cell? My first guess would be that many reads are being scavenged by ribosomal or other highly abundant transcripts, thereby reducing the diversity of detected transcripts despite an increased sequencing depth, but it would indeed be surprising if that were a consistent effect.