4.6 years ago
Researcher
Hi all,
I have a question regarding the "Per Base Sequence Content" plot in FastQC:
In the fastqc documentation, it is written: "In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other."
But I don't understand why different positions in a read should follow the same pattern of base frequencies ("A/T/C/G"). I mean, they correspond to different positions in the genome, and it is normal that each position has a different base.
I would appreciate it if someone could help me, please.
Please post an example plot if you have one. See: How to add images to a Biostars post
As I understand it, taking DNA-seq as an example: with enough sequencing depth, the reads from a random library cover the whole genome evenly. The base content at each position of the reads is therefore the same, and equal to the base composition (GC content) of the whole genome.
So, for an ideal random library, the per base sequence content plot in the FastQC report shows four parallel lines (G=C, A=T).
Thanks for your reply, but what do you mean by G=C and A=T? I mean all the reads are coming from the same strand, right? so in a single strand, why should the per base content for A and T or G and C be equal?
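Taking DNA-seq as an example, we get reads from both strands. Every G on one strand is paired with a C on the other (and every A with a T), so when reads come from both strands in equal proportion, G=C and A=T. At each position of the read, the GC% would then (ideally) be identical and equal to the whole-genome GC%.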
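To make the both-strands argument concrete, here is a minimal simulation sketch in Python (standard library only). The toy genome, its skewed base weights, the read length and the number of reads are all made-up values for illustration, not real data. It samples reads either from one strand only or from both strands at random, and compares the per-position base fractions.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_reads(genome, n_reads, read_len, both_strands, rng):
    """Sample reads from random positions; optionally flip half to the other strand."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if both_strands and rng.random() < 0.5:
            read = reverse_complement(read)
        reads.append(read)
    return reads


def per_position_content(reads):
    """Fraction of A/C/G/T at each read position (all reads have equal length)."""
    content = []
    for pos in range(len(reads[0])):
        counts = Counter(read[pos] for read in reads)
        content.append({base: counts[base] / len(reads) for base in "ACGT"})
    return content


rng = random.Random(0)
# Hypothetical genome, deliberately skewed on the written strand:
# A 40%, C 15%, G 25%, T 20% (assumed values, for illustration only).
genome = "".join(rng.choices("ACGT", weights=[40, 15, 25, 20], k=200_000))

single = per_position_content(simulate_reads(genome, 20_000, 30, False, rng))
both = per_position_content(simulate_reads(genome, 20_000, 30, True, rng))

print("one strand, position 0:  ", single[0])
print("both strands, position 0:", both[0])
```

With these assumed weights, the one-strand reads should keep the skew of the written strand (A well above T), while the both-strand reads should come out at roughly A=T and G=C at every position, which is the parallel-lines picture described above.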
If the reads come from both strands, I agree with you, but this is not always the case, right?
Sure, it is not always the case for some libraries, and we should be very careful with such data.
In my experience with small RNA-seq, CLIP-seq, etc., the GC% is not identical across the read (in my own data); the per base content lines cross over in the FastQC report.
That should always be the case. Even in stranded sequencing. Unless you are using some method that is discarding one strand entirely.
So, you mean even in single-end sequencing, the reads come from both strands?
Yes, if by "both strands" you mean the two strands of double-stranded DNA. For DNA-seq, the insert fragment is strandless, so the sequencing reads cannot tell which strand is the "forward" strand of the DNA.
I don't know of any high-throughput library protocol that generates fragments from only the 'forward' or only the 'reverse' strand of the genome (and there is no need to do this).
thanks for your answer, yeah, I mean for DNA-Seq. but for alignment, we should separate the reads that are located in different strands, right? Otherwise, how can we align them?
Aligners automatically check for alignment on both strands by making a reverse complemented copy of the reference you provide. Since DNA sequence is always written in 5'-->3' order the reference you provide is considered forward/top strand.
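As a toy illustration of the idea, here is a sketch in Python. It is not how real aligners work internally (they use indexes and tolerate mismatches); it only does exact substring search, and it reverse-complements the read instead of the reference, which is equivalent for this purpose. The reference sequence and the function name `find_read` are made up for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_read(reference, read):
    """Return (position, strand) of the first exact match, or None.

    The reference is treated as the forward/top strand, written 5'-->3'.
    """
    pos = reference.find(read)
    if pos != -1:
        return pos, "+"
    # No hit on the forward strand: try the other strand.
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return pos, "-"
    return None


# Hypothetical 25 bp reference, for illustration only.
reference = "ATGGCGTACGTTAGCCTAGGATCCA"

forward_read = reference[3:11]                       # read from the top strand
reverse_read = reverse_complement(reference[12:20])  # read from the bottom strand

print(find_read(reference, forward_read))  # (3, '+')
print(find_read(reference, reverse_read))  # (12, '-')
```

Both reads are placed on the same reference coordinates; the strand is just reported alongside the position, so there is no need to separate the reads by strand beforehand.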
So in this case, as you said, for an ideal random library it should always be G=C and A=T. But in the FastQC documentation they provide an example of good Illumina data: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html#M4 which doesn't satisfy these conditions! So how is this data considered to have good per base content when A!=T and C!=G?
That graph plots A/C/T/G seen at a particular cycle in the sequencing. Depending on the GC% of the genome and the way genome got fragmented you would see variation (while we expect the fragmenting to be random there are likely biases depending on how the fragmentation was done and sequence itself). That said I have seen datasets where A/C/T/G curves almost perfectly overlapped.
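If it helps to see what "A/C/T/G seen at a particular cycle" means, here is a simplified sketch in Python of that kind of tally. It is not FastQC's actual code; the function name `per_cycle_base_percent` and the four toy reads are made up, and it assumes a plain four-line-per-record FASTQ.

```python
import io
from collections import Counter


def per_cycle_base_percent(fastq_handle):
    """Percent of A/C/G/T at each cycle (read position) of a 4-line FASTQ.

    Simplification: other symbols such as N are counted in the total,
    so the four percentages may sum to less than 100.
    """
    counts = []  # counts[i] is a Counter of the bases seen at cycle i
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        seq = line.strip().upper()
        while len(counts) < len(seq):
            counts.append(Counter())
        for i, base in enumerate(seq):
            counts[i][base] += 1
    result = []
    for cycle_counts in counts:
        total = sum(cycle_counts.values())
        result.append({b: 100.0 * cycle_counts[b] / total for b in "ACGT"})
    return result


# Four toy reads (hypothetical data) in FASTQ format.
toy_fastq = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nAGGT\n+\nIIII\n"
    "@r3\nTCGA\n+\nIIII\n"
    "@r4\nACCT\n+\nIIII\n"
)

content = per_cycle_base_percent(toy_fastq)
for cycle, percents in enumerate(content, start=1):
    print(cycle, percents)
```

For a real file you would pass an open file handle (for example from `open(...)` or `gzip.open(..., "rt")`) instead of the `io.StringIO` object.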
The example file contains only 250000 reads, which is not enough to randomly cover the whole genome.
Do try it yourself
I suggest downloading real datasets from NCBI/SRA or EBI and running fastqc yourself to check whether "G=C" or not. I also prepared an example from my own data (DNA-seq, 19M paired-end reads from fruit fly) for you.
Thank you so much for your detailed answer. I am new to sequencing, but I assumed that each fragment in DNA sequencing consists of two strands. Is that true or not?
You only sequence one strand at a time.
what about the paired-end reads? Are the mates in paired-end reads related to one strand or both?
Same fragment is sequenced from either end (5'--> 3') to generate the two reads you get. Reads are from opposite strands of that fragment.
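A small sketch in Python of how the two mates relate to one fragment. The fragment sequence, the read length and the function name `paired_end_reads` are made up for illustration, and which strand is called "top" here is an arbitrary choice.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment_top_strand, read_len):
    """Return (read1, read2) for one fragment, both written 5'-->3'.

    read1 is the 5' end of the top strand; read2 is the 5' end of the
    bottom strand, i.e. the reverse complement of the other end.
    """
    read1 = fragment_top_strand[:read_len]
    read2 = reverse_complement(fragment_top_strand)[:read_len]
    return read1, read2


# Hypothetical 25 bp fragment, for illustration only.
fragment = "ATGGCGTACGTTAGCCTAGGATCCA"
read1, read2 = paired_end_reads(fragment, 8)

print(read1)  # ATGGCGTA
print(read2)  # TGGATCCT
# Reverse-complementing read2 recovers the far end of the top strand.
print(reverse_complement(read2) == fragment[-8:])  # True
```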
Thank you so much for your help. the discussion here was so beneficial for me.
Dear @genomax, now another question comes to my mind: in the case of paired-end sequencing, do we know which strand read1 and read2 belong to, or do we just know that they are from opposite strands?
DNA is anti-parallel so the concept of strands is relative. One always sequences 5' --> 3' so whichever end sequences first (that would be the end where the p5 adapter ligated) becomes read 1.
thank you for your answer
For DNA-seq, we cannot tell which read belongs to which strand, because both the forward and the reverse strand of the DNA can be ligated to the P5 end of the library in the 5'-3' direction (this gives read 1 in paired-end sequencing).
But for strand-specific RNA sequencing, it is determined which strand is ligated to the P5 adapter (5'-3' direction), so we can tell which strand a read belongs to. For example, in a dUTP strand-specific RNA library, read2 is from the sense strand of the mRNA.
Thanks for your answer, so, this means in the case of DNA-seq, the reads in the file of read1s, can belong to different strands of DNA and the same for read2s, right?
Yes, you are right.
Does anyone else know the answer?
Yes, the insert fragment is double-stranded DNA, and each sequencing read (read1 or read2) comes from only one strand of it. You could read some Illumina library materials or watch their videos, for example:
thanks for your answer
Hello negin
It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/11701/per-base-sequence-content-in-fastqc
This is typically not recommended as it runs the risk of annoying people in both communities.
Thanks for your advice