Per Base Sequence Content
2.9 years ago
Negin ▴ 20

Hi all,

I have a question about the "Per Base Sequence Content" plot in FastQC:

In the fastqc documentation, it is written: "In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other."

But I don't understand why different positions in a read should show the same pattern of base frequencies (A/T/C/G). They correspond to different positions in the genome, and it is normal for each position to have a different base.

I would appreciate it if someone could help me, please.

sequencing fastqc genome sequence • 3.3k views

Please post an example plot if you have one. See: How to add images to a Biostars post

As I understand it, take DNA-seq as an example: with enough sequencing depth, reads from a random library cover the whole genome evenly. The base content at each read position is therefore the same and equals the base content of the whole genome, so the GC content at each position equals the genome-wide GC content.

So, for an ideal random library, the per base sequence content plot in the FastQC report shows four parallel lines (G=C, A=T).
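
A minimal simulation of this argument, as a sketch only: the toy genome, its base weights, the read count, and the helper names (`simulate_reads`, `per_position_content`) are all assumptions made up for illustration and are not taken from FastQC. Reads are drawn from random positions and from either strand with equal probability, and the base fractions are then tabulated per read position.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_reads(genome, n_reads, read_len, rng):
    """Draw reads from uniformly random positions, from either strand with equal probability."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if rng.random() < 0.5:
            read = reverse_complement(read)
        reads.append(read)
    return reads


def per_position_content(reads):
    """Fraction of A/C/G/T at each read position."""
    n = len(reads)
    content = []
    for column in zip(*reads):  # one column per read position
        counts = Counter(column)
        content.append({base: counts[base] / n for base in "ACGT"})
    return content


rng = random.Random(0)
# Hypothetical toy genome with a deliberately skewed top strand
# (A 35%, C 25%, G 15%, T 25%); real genomes are not generated like this.
genome = "".join(rng.choices("ACGT", weights=[35, 25, 15, 25], k=50_000))
reads = simulate_reads(genome, n_reads=20_000, read_len=30, rng=rng)
content = per_position_content(reads)

for pos in (0, 10, 20, 29):
    print(pos, {base: round(frac, 3) for base, frac in content[pos].items()})
```

Under these assumptions every position should come out close to the strand-averaged composition (about 30% A, 30% T, 20% G, 20% C), which is the "four parallel lines" picture.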

Thanks for your reply, but what do you mean by G=C and A=T? All the reads come from the same strand, right? So on a single strand, why should the per-base content of A and T, or of G and C, be equal?

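Taking DNA-seq as the example again: we get reads from both strands. Every A on one strand pairs with a T on the other (and every G with a C), so when reads from both strands are pooled, A and T occur equally often, as do G and C. At each read position the GC content is therefore (ideally) identical and equal to the genome-wide GC content.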
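
The base-pairing argument can be checked with a few lines of arithmetic. This is only an illustrative sketch; the sequence `top` is invented and deliberately skewed so that A and T differ on a single strand.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


top = "AAAAAGGGCT"                 # made-up strand: A=5, G=3, C=1, T=1
bottom = reverse_complement(top)   # the opposite strand, written 5'->3'

single = Counter(top)
pooled = Counter(top + bottom)     # bases seen when both strands are sampled equally

print("single strand:", dict(single))
print("both strands: ", dict(pooled))
```

On the single strand A and T differ, but in the pooled counts A equals T and G equals C exactly, because each base on one strand contributes its complement on the other.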

If the reads come from both strands, I agree with you, but this is not always the case, right?

Sure, it is not always the case for some libraries, and we should be very careful with such data.

In my experience with small RNA-seq, CLIP-seq, etc. (my own data), the GC content is not identical across the read, and the per-base content lines cross over in the FastQC report.

That should always be the case, even in stranded sequencing, unless you are using some method that discards one strand entirely.

So you mean that even in single-end sequencing, the reads come from both strands?

Yes, if by "both strands" you mean the two strands of double-stranded DNA. For DNA-seq, the insert fragment carries no strand information, so the sequencing reads cannot tell you which strand is the "forward" strand of the DNA.

I do not know of any high-throughput DNA library protocol that generates fragments from only the 'forward' or only the 'reverse' strand of the genome (and there is no need to do this).

Thanks for your answer. Yes, I mean for DNA-seq. But for alignment, we should separate the reads that come from different strands, right? Otherwise, how can we align them?

Aligners automatically check for alignment on both strands by making a reverse-complemented copy of the reference you provide. Since DNA sequence is always written in 5'-->3' order, the reference you provide is considered the forward/top strand.
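
As a toy illustration of checking both strands (not how any real aligner is implemented: real tools use indexes and tolerate mismatches), here is an exact-match search. The function name `find_read` and the sequences are hypothetical. This sketch reverse-complements the read rather than the reference, which gives the same answer for exact matches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_read(reference, read):
    """Return (leftmost position on the forward reference, strand) or None.

    Tries the read as given first, then its reverse complement.
    """
    pos = reference.find(read)
    if pos != -1:
        return pos, "+"
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return pos, "-"
    return None


reference = "GATTACAGGCTTCCAGT"         # made-up forward/top strand
print(find_read(reference, "TACAGG"))   # read from the top strand
print(find_read(reference, "GAAGCC"))   # read from the bottom strand
print(find_read(reference, "CCCCCC"))   # read that matches neither strand
```

The reads do not need to be separated by strand beforehand; the strand is an output of the search, reported alongside the position.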

So in this case, as you said, for an ideal random library it should always be G=C and A=T. But the FastQC documentation provides an example of good Illumina data: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html#M4 which doesn't satisfy these conditions! How, then, is this data considered to have good per-base content when A!=T and C!=G?

That graph plots the A/C/T/G fractions seen at a particular cycle of the sequencing run. Depending on the GC% of the genome and the way the genome was fragmented, you would see variation (while we expect fragmentation to be random, there are likely biases depending on how the fragmentation was done and on the sequence itself). That said, I have seen datasets where the A/C/T/G curves almost perfectly overlapped.
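
To see how a fragmentation preference can show up in this plot, here is a hedged toy model. The bias rule (start sites on A or T are kept only half of the time) is an invented assumption, not a description of any real fragmentation chemistry, and `biased_reads` is a hypothetical helper.

```python
import random


def biased_reads(genome, n_reads, read_len, rng, at_start_acceptance=0.5):
    """Sample reads whose start sites favor G/C.

    A start on A or T is accepted only with probability `at_start_acceptance`.
    """
    reads = []
    while len(reads) < n_reads:
        start = rng.randrange(len(genome) - read_len + 1)
        if genome[start] in "AT" and rng.random() > at_start_acceptance:
            continue  # reject this start site to mimic a cut-site preference
        reads.append(genome[start:start + read_len])
    return reads


def gc_fraction_per_cycle(reads):
    """GC fraction at each read position (sequencing cycle)."""
    return [sum(base in "GC" for base in column) / len(reads) for column in zip(*reads)]


rng = random.Random(1)
genome = "".join(rng.choices("ACGT", k=50_000))  # unbiased toy genome, about 50% GC
reads = biased_reads(genome, n_reads=20_000, read_len=20, rng=rng)
gc = gc_fraction_per_cycle(reads)

print("cycle 1 GC:", round(gc[0], 3))
print("cycle 10 GC:", round(gc[9], 3))
```

In this model only the first cycle is skewed toward G/C, while later cycles stay near the genome-wide value, so the lines diverge at the start of the read and then run flat.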

The example file contains 250,000 reads, which is not enough to randomly cover the whole genome.

Do try it yourself
I suggest downloading real datasets from NCBI SRA or EBI and running FastQC yourself to check whether G=C.

I also prepared an example from my own data (DNA-seq, 19M paired-end reads from fruit fly) for you.
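
A rough back-of-the-envelope comparison of the two read counts mentioned above. The read length (100 bp) and genome size (140 Mb) are assumed values chosen only for illustration; the thread does not state the read length, and the organism behind the FastQC example file is not given. The covered fraction uses the standard approximation 1 - exp(-c) for reads that start uniformly at random, where c is the mean coverage.

```python
import math


def mean_coverage(n_reads, read_len, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_len / genome_size


def expected_fraction_covered(coverage):
    """Approximate fraction of the genome hit by at least one read,
    assuming read starts are uniformly random."""
    return 1.0 - math.exp(-coverage)


READ_LEN = 100             # assumption, not stated in the thread
GENOME_SIZE = 140_000_000  # assumption, roughly fruit-fly sized

for n_reads in (250_000, 19_000_000):
    c = mean_coverage(n_reads, READ_LEN, GENOME_SIZE)
    print(f"{n_reads:>10} reads: coverage {c:.2f}x, "
          f"covered fraction {expected_fraction_covered(c):.3f}")
```

Under these assumed numbers, 250,000 reads touch only a small part of the genome, while 19M reads cover essentially all of it many times over, so the per-position fractions have a much better chance of converging to the genome-wide composition.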

Thank you so much for your detailed answer. I am new to sequencing, but I understood that each fragment in DNA sequencing consists of two strands. Is that true?

You only sequence one strand at a time.

What about paired-end reads? Do the mates in a read pair come from one strand or from both?

The same fragment is sequenced from each end (5'-->3') to generate the two reads you get. The two reads come from opposite strands of that fragment.
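
A small sketch of what this means for the sequences you see. The fragment and the helper `paired_end_reads` are made up for illustration; adapters, quality values, and overlapping mates are ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_len):
    """Return (read 1, read 2) for a fragment given as one strand, 5'->3'.

    Read 1 is the 5' end of the given strand; read 2 is the 5' end of the
    opposite strand, i.e. the reverse complement of the fragment's other end.
    """
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment)[:read_len]
    return read1, read2


fragment = "ATGGCGTACCTTGA"  # made-up insert
read1, read2 = paired_end_reads(fragment, read_len=5)
print(read1, read2)

# The same fragment entering in the other orientation swaps the two reads.
print(paired_end_reads(reverse_complement(fragment), read_len=5))
```

Reverse-complementing read 2 gives back the last bases of the strand that read 1 came from, which is why aligners place the mates on opposite strands facing each other.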

Thank you so much for your help. The discussion here has been very helpful for me.

Dear @genomax, another question now comes to mind: in paired-end sequencing, do we know which strand read 1 and read 2 belong to, or do we only know that they come from opposite strands?

DNA is anti-parallel, so the concept of strands is relative. One always sequences 5' --> 3', so whichever end is sequenced first (the end where the P5 adapter ligated) becomes read 1.

For DNA-seq, we cannot tell which strand a read belongs to, because either the forward or the reverse strand of the DNA can be attached to the P5 end of the library in the 5'-3' direction (this is read 1 in paired-end sequencing).

But for strand-specific RNA sequencing, which strand is ligated to the P5 adapter (5'-3' direction) is fixed by the protocol, so we can tell which strand a read belongs to. For example, in a dUTP strand-specific RNA library, read 2 comes from the sense strand of the mRNA.
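
The dUTP rule above can be written as a tiny lookup. This is a sketch under the stated convention only (read 2 sense, and therefore read 1 antisense, since the mates come from opposite strands); the function name `transcript_strand_dutp` is hypothetical, and other stranded protocols may use the opposite convention.

```python
def transcript_strand_dutp(mate, alignment_strand):
    """Infer the transcript strand from one aligned read of a dUTP library.

    mate: 1 or 2 (read 1 or read 2 of the pair)
    alignment_strand: "+" or "-" (genome strand the read aligned to)
    """
    if mate not in (1, 2):
        raise ValueError("mate must be 1 or 2")
    if alignment_strand not in ("+", "-"):
        raise ValueError("alignment_strand must be '+' or '-'")
    if mate == 2:
        # Read 2 has the same orientation as the mRNA (sense).
        return alignment_strand
    # Read 1 is antisense, so the transcript is on the other strand.
    return "-" if alignment_strand == "+" else "+"


print(transcript_strand_dutp(2, "+"))
print(transcript_strand_dutp(1, "+"))
```

No equivalent function exists for standard DNA-seq: there, read 1 aligning to "+" or "-" only reflects which way round the fragment happened to be attached.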

Thanks for your answer. So this means that in DNA-seq, the reads in the read 1 file can belong to different strands of the DNA, and the same goes for the read 2 file, right?

Does anyone else know the answer?

Yes, the insert fragment is double-stranded DNA, and each sequencing read (read 1 or read 2) comes from only one strand of it. You could look at some of Illumina's library preparation materials or videos. For example:

Hello negin

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/11701/per-base-sequence-content-in-fastqc

This is typically not recommended as it runs the risk of annoying people in both communities.
