Hello all,
We have generated single-cell RNA-seq data for more than 1,500 cells of various cardiac cell types. The cells were isolated by FACS sorting and the cell types are unknown; they were loaded onto 20 96-well plates. Library preparation was done by SMART-seq2 on total RNA, and the samples were sequenced paired-end on an Illumina 2500. The experiment included ERCC spike-ins for normalization. We would like to filter and normalize the scRNA-seq data and then separate the cell types computationally (e.g. by PCA or t-SNE), so that we can count the cells of each type, compare their transcriptomes, and explore new cell subtypes.
We deduplicated the FASTQ reads and obtained gene-level counts with kallisto and tximport. In the raw (un-normalized) counts we see a negative correlation, both within each plate and across all plates, between the number of detected genes (genes with count > 0) and the library size, which we think is unexpected. The correlation coefficients range from -0.25 to -0.5, and a similar relationship holds between the total spike-in counts and the number of detected genes. Our library sizes range from roughly 200,000 to 2,000,000 reads. Has anyone encountered such a relationship before? We are worried it might affect the accuracy of the downstream spike-in normalization and differential expression, and we are not sure which type of normalization is best for our data.
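For reference, this is essentially how we computed the per-cell statistics and their correlation (a minimal Python sketch; the counts matrix here is simulated for illustration, whereas in practice it would be the gene-level counts exported from tximport):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy counts matrix, genes x cells (simulated stand-in for the real
# tximport gene-level counts, e.g. loaded from a CSV export).
counts = rng.negative_binomial(n=2, p=0.9, size=(2000, 300)).astype(float)

lib_size = counts.sum(axis=0)        # total counts per cell (library size)
detected = (counts > 0).sum(axis=0)  # genes with count > 0 per cell

# Pearson correlation between library size and number of detected genes;
# in our data this comes out between -0.25 and -0.5.
r = np.corrcoef(lib_size, detected)[0, 1]
print(f"correlation: {r:.2f}")
```

The same calculation restricted to cells from a single plate gives the within-plate coefficients.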
Thank you, Mike
Can you share the plot?
And can you check the most strongly expressed genes per cell? My first guess would be that many reads are being scavenged by ribosomal or other highly abundant transcripts, thereby reducing the diversity of detected transcripts despite an increased sequencing depth, but it would indeed be surprising if that were a consistent effect.