Hi everyone,
I've recently started analyzing single-cell RNA-seq data (with FASTQ files as a starting point) and so far I have used 10x Genomics data from their website.
Now I'm interested in using data generated by other protocols, specifically SMART-seq, because it is the most widely used full-length protocol (the two main paradigms being tag-based, like 10x, and full-length). However, I'm having trouble understanding the raw data, so I figured it would be worth discussing the differences between FASTQ files from 10x and SMART-seq. Both are sequenced on Illumina sequencers, which yield a different number of files depending on the model, but it's always one set of files. What about SMART-seq? Is that the protocol where there's one set of files for each cell?
To further complicate matters, I understand that full-length protocols such as SMART-seq2, unlike tag-based protocols, do not support UMIs, but SMART-seq3 does use UMIs. I also had the idea (I read it in some paper) that when you are sequencing full-length transcripts, having UMIs doesn't really change anything. So how does the analysis change between SMART-seq2 and SMART-seq3 to account for this?
Thank you!
Thank you, that was a great answer. So before SMART-seq3 the data was inflated, since without UMIs there was no way to correct for the PCR duplicates?
Correct, SMART-seq2 doesn't have UMIs, so there was no way to correct for PCR bias. This was why SMART-seq3 was developed.
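To make the UMI point concrete, here is a minimal toy sketch in Python, not how any particular pipeline does it. The read tuples, gene names, and function names (`count_reads`, `count_umis`) are made up for illustration; in practice the cell, gene, and UMI would come from aligned reads. It ignores sequencing errors in the UMIs, which real tools such as UMI-tools do try to handle.

```python
from collections import defaultdict

# Hypothetical toy data: one tuple per aligned read, as (cell, gene, UMI).
reads = [
    ("cell1", "GeneA", "AACG"),
    ("cell1", "GeneA", "AACG"),  # same UMI: treated as a PCR duplicate
    ("cell1", "GeneA", "AACG"),  # same UMI: treated as a PCR duplicate
    ("cell1", "GeneA", "TTGC"),  # different UMI: a second original molecule
    ("cell1", "GeneB", "GGAT"),
]

def count_reads(reads):
    """Count every read per (cell, gene); all that is possible without UMIs."""
    counts = defaultdict(int)
    for cell, gene, _umi in reads:
        counts[(cell, gene)] += 1
    return dict(counts)

def count_umis(reads):
    """Count distinct UMIs per (cell, gene), collapsing PCR duplicates."""
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}

read_counts = count_reads(reads)
umi_counts = count_umis(reads)
print(read_counts)  # GeneA is counted 4 times
print(umi_counts)   # GeneA is counted 2 times
```

Without UMIs, the three reads sharing `AACG` cannot be distinguished from three independent molecules, so GeneA gets 4 counts instead of 2.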
As for how big a difference PCR bias makes, that's a whole other discussion. All RNA-seq library preps introduce many sources of technical bias (PCR, length, coverage, capture, sequence-specific biases, etc.), and how these biases affect downstream analyses is an entire field of research on its own!