Question: GC content of 10x Genomics
1
gravatar for Assa Yeroslaviz
17 months ago by
Assa Yeroslaviz1.4k
Munich
Assa Yeroslaviz1.4k wrote:

I was wondering if the results of my fastqc run are to be expected under the circumstances. We are running some 10x Genomics (single-cell) data sets and get a strange GC content plot when running the fastqc tool

In the case of the image attached, the basic statistics tells me, the GC content should be around 41%. But even the theoretical distribution in the plot is around 51% while the true one is difficult to interpret.

So my questions are -

  1. can I still assume the data is good enough? It doesn't look like there is contamination in the fastq files.
  2. Can it be, that the reason for this behavior is the specific structure of the 10x Genomics sequences, including the barcodes and the UMI?

thanks

GC content plot

fastqc gc content 10x • 1.0k views
ADD COMMENTlink modified 5 weeks ago by Eugene A80 • written 17 months ago by Assa Yeroslaviz1.4k

I think the data should be fine. If this is for read 1 (which the cell barcodes/UMI's for RNAseq) I would not even do this analysis.

ADD REPLYlink written 17 months ago by genomax89k

Is this R1 or R2? For R1 fastqc is basically meaningless as you only scan the "technical sequences".

ADD REPLYlink modified 17 months ago • written 17 months ago by ATpoint38k

This is from the file of R2, not R1

ADD REPLYlink modified 17 months ago • written 17 months ago by Assa Yeroslaviz1.4k

There could be various things going on. There is strong 3'-bias to 10x scRNAseq (which FastQC has no way to understand). Not sure what to tell you. I have not tried to do FastQC with 10x data.

ADD REPLYlink modified 17 months ago • written 17 months ago by genomax89k

what better tools are there out there to test the QC of 10x genomics data sets?

ADD REPLYlink written 17 months ago by Assa Yeroslaviz1.4k

Once you do your 10x analysis check to see how many cells you have and the number of reads detected per cell. All those things are highlighted when you look at the set with loupe browser.

ADD REPLYlink written 17 months ago by genomax89k

I can't say I've looked at this specifically for 10X data, but in general FastQC reports for 3' RNA-seq datasets (single-cell like or the bulk stuff with a kit from Lexogen) tend to look rather odd. My suspicion has always been that this is due to the nature of the 3' end of transcripts.

ADD REPLYlink written 17 months ago by Devon Ryan96k

I'm not really sure what it means. What is different in the 3' end of the transcripts with the scRNA-Seq (or Lexogen kits)? Does it apply only for data from 10x Genomics or all other scRNA-Seq technologies as well?

why does this kind of bias influences the GC content only on the 3' end?

ADD REPLYlink modified 17 months ago • written 17 months ago by Assa Yeroslaviz1.4k

It will affect all kits and I suspect there is a GC bias at the 3' end (likely exacerbated by this being RNA-seq, so there are enrichments of a handful of sequences anyway).

ADD REPLYlink written 17 months ago by Devon Ryan96k

Hi, I also have some problems understanding Fastqc repots for scRNA data These are figures for R1 and R2, I can understand strange GC content/per base content, BUT I do not get the per base sequence quality and per tile sequence quality. The reports basically the same for all R1 files and fro different librarie preps, so that should mean that I just do not understand something simple here. Clarifications will be appreciated!

R1 QC

R1

R2 QC

R2

ADD REPLYlink written 5 weeks ago by Eugene A80
1

The R1 sequences should be either 26 or 28 bases long for 10x, depending on the chemistry. Therefore all what is after that are primer/adapter residues. But I think the cellranger can ignore them. see here

ADD REPLYlink written 5 weeks ago by Assa Yeroslaviz1.4k

I gave it a bit more thoughts and now confused even more (

here is the library pic. from 10x genomic website

2

R1 if I understand corretcly (and seems that I do not), should looks like: 5'-CellBarcode-UMI-polyA-cDNA-3' And R2: 5'-cDNA-polyA(depending on insert size and seq len)-3'

In my data I see increased A contetnt in R1, which I assume should be polyA/primers/barcodes traces? Why the seq quality is dropping for R1 and not for R2, is it because of the repeats? Actual meaningfull part of the read is thereefore only first ~30bp? Why these appear at 5', when the read was reverse complemented?

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Eugene A80

There is no need to do FastQC for 10x data. You should proceed with cellranger analysis.

ADD REPLYlink written 5 weeks ago by genomax89k

Yes, a run it through cellRanger, but I want to understand this QC report ;)

ADD REPLYlink written 5 weeks ago by Eugene A80
1

See: https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html (go to bottom of page)

Any time sequencer encounters stretches of same nucleotides software keeping track of the clusters starts having problems which results in reduction in Q scores and strange tile metrics.

ADD REPLYlink written 5 weeks ago by genomax89k

Thanks! And also R1 is written in fastq in reverse complement to the actual order of sequencing

ADD REPLYlink written 5 weeks ago by Eugene A80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1106 users visited in the last hour