Question: Per Base Sequence Content
Negin0 wrote:

Hi all,

I have a question regarding "Per Base Sequence Content" plot for "fastqc":

In the fastqc documentation, it is written: "In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other."

But I don't understand why different bases in a read should follow the same pattern of allele frequency ("A/T/C/G"). I mean, they are different positions in the genome, and it is normal for each position to have a different allele.

I would appreciate it if someone could help me, please.

ADD COMMENTlink modified 8 days ago by wm280 • written 8 days ago by Negin0

Please post an example plot if you have one. See: How to add images to a Biostars post

ADD REPLYlink written 8 days ago by genomax80k

To my understanding, taking DNA-seq as an example: with enough sequencing depth, the reads from a random library cover the whole genome evenly. So the base content at each position of the reads is the same, and it equals the base composition (GC content) of the whole genome.

So, for an ideal random library, the per base sequence content plot in the FastQC report shows four parallel lines (G=C, A=T).
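A small simulation may make this argument concrete. The sketch below is only an illustration under stated assumptions: a hypothetical toy genome whose top strand is deliberately skewed (A=35%, T=25%, C=25%, G=15%), reads drawn from uniformly random start positions, and each read taken from the top or bottom strand with equal probability. All function names and parameters are made up for this example.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def random_genome(length, rng):
    # Hypothetical toy genome: the top strand is skewed on purpose,
    # so A != T and C != G when only one strand is considered.
    return "".join(rng.choices("ACGT", weights=[0.35, 0.25, 0.15, 0.25], k=length))

def sample_reads(genome, n_reads, read_len, rng):
    """Draw reads from random positions, from either strand with probability 0.5."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if rng.random() < 0.5:
            read = reverse_complement(read)  # read from the bottom strand
        reads.append(read)
    return reads

def per_base_content(reads):
    """Percentage of A/C/G/T at each read position (reads of equal length)."""
    table = []
    for pos in range(len(reads[0])):
        counts = Counter(read[pos] for read in reads)
        total = sum(counts.values())
        table.append({b: 100.0 * counts[b] / total for b in "ACGT"})
    return table

rng = random.Random(42)
genome = random_genome(100_000, rng)
reads = sample_reads(genome, 20_000, 50, rng)
content = per_base_content(reads)

for pos in (0, 24, 49):
    row = content[pos]
    print(pos + 1, " ".join(f"{b}={row[b]:.1f}" for b in "ACGT"))
```

Under these assumptions every read position should show roughly A≈T≈30% and C≈G≈20%, even though the top strand alone is skewed, because half of the reads come from the complementary strand. Real libraries add fragmentation and priming biases that this toy model ignores.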

ADD REPLYlink written 8 days ago by wm280

Thanks for your reply, but what do you mean by G=C and A=T? I mean, all the reads are coming from the same strand, right? So in a single strand, why should the per base content for A and T, or for G and C, be equal?

ADD REPLYlink written 8 days ago by Negin0

Taking DNA-seq as an example, we get reads from both strands, so at each position of the read the GC% content would (ideally) be identical and equal to the GC% of the whole genome.

ADD REPLYlink written 8 days ago by wm280

If the reads come from both strands, I agree with you, but this is not always the case, right?

ADD REPLYlink written 7 days ago by Negin0

Sure, that is not always the case for some libraries, and we should be very careful with such data.

In my experience, in small RNA-seq, CLIP-seq, etc., the GC% content is not identical across the read (in my data); the per base content lines cross over each other in the FastQC report.

ADD REPLYlink written 7 days ago by wm280

That should always be the case. Even in stranded sequencing. Unless you are using some method that is discarding one strand entirely.

ADD REPLYlink written 7 days ago by genomax80k

So you mean that even in single-end sequencing, the reads come from both strands?

ADD REPLYlink written 6 days ago by Negin0

Yes, if the "both strands" you mention refers to the two strands of double-stranded DNA. For DNA-seq, the insert fragment is strandless, so the sequencing reads cannot tell which strand is the "forward" strand of the DNA.

I don't know of any high-throughput library protocol that generates fragments from only the 'forward' or only the 'reverse' strand of the genome (and there is no need to do this).

ADD REPLYlink written 6 days ago by wm280

Thanks for your answer. Yes, I mean for DNA-seq. But for alignment, we should separate the reads that are located on different strands, right? Otherwise, how can we align them?

ADD REPLYlink modified 5 days ago • written 5 days ago by Negin0

Aligners automatically check for alignment on both strands by making a reverse complemented copy of the reference you provide. Since DNA sequence is always written in 5'-->3' order the reference you provide is considered forward/top strand.
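As a rough illustration of the strand check described above, here is a minimal sketch of exact-match lookup on both strands. It is not how any particular aligner is implemented: real aligners use indexes and tolerate mismatches, and this sketch reverse complements the read rather than the reference, which gives the same answer for an exact match. The reference sequence and the function `find_exact` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_exact(reference, read):
    """Return (strand, 0-based position on the forward reference), or None."""
    pos = reference.find(read)
    if pos != -1:
        return "+", pos          # read matches the forward/top strand as given
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return "-", pos          # read came from the reverse/bottom strand
    return None

reference = "ATGGCGTACGTTAGCCTAGGATCCA"            # hypothetical toy reference
forward_read = reference[3:11]                      # read from the top strand
reverse_read = reverse_complement(reference[10:18]) # read from the bottom strand

print(find_exact(reference, forward_read))
print(find_exact(reference, reverse_read))
```

Both reads are placed on the same forward coordinates, and the strand is simply reported alongside the position, so the reads never have to be separated by strand before alignment.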

ADD REPLYlink modified 5 days ago • written 5 days ago by genomax80k

So in this case, as you said, for an ideal random library it should always be G=C and A=T. But in the FastQC documentation they provide an example of good Illumina data: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html#M4 which doesn't satisfy these conditions! How is this data considered to have good per base content when A!=T and C!=G?

ADD REPLYlink written 5 days ago by Negin0

That graph plots the A/C/T/G seen at a particular cycle in the sequencing. Depending on the GC% of the genome and the way the genome was fragmented, you would see variation (while we expect the fragmentation to be random, there are likely biases depending on how the fragmentation was done and on the sequence itself). That said, I have seen datasets where the A/C/T/G curves almost perfectly overlapped.

ADD REPLYlink written 5 days ago by genomax80k

The example file contains 250000 reads, which is not enough to randomly cover the whole genome.

Do try it yourself
I suggest downloading real datasets from NCBI/SRA or EBI and running FastQC yourself to check whether "G=C" or not.
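If you want to check the numbers by hand as well, the per-position percentages can be tallied directly from a FASTQ file. The sketch below is an approximation of the idea, not FastQC's own code. It assumes plain 4-line FASTQ records, and ignoring N calls in the denominator is a choice made for this example. The function name and the tiny in-memory example are hypothetical.

```python
import io
from collections import Counter

def per_base_content(fastq_lines):
    """Return a list of {base: percent} dicts, one per read position."""
    counts = []  # counts[i] tallies the bases seen at read position i
    for line_no, line in enumerate(fastq_lines):
        if line_no % 4 != 1:      # the sequence is line 2 of each 4-line record
            continue
        seq = line.strip().upper()
        while len(counts) < len(seq):
            counts.append(Counter())
        for i, base in enumerate(seq):
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c[b] for b in "ACGT")  # assumption: N calls are ignored
        table.append({b: (100.0 * c[b] / total if total else 0.0) for b in "ACGT"})
    return table

# Tiny made-up FASTQ, only to show the input and output shapes.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nAGGT\n+\nIIII\n"
    "@r3\nTCGA\n+\nIIII\n"
    "@r4\nACCT\n+\nIIII\n"
)
table = per_base_content(example)
for pos, row in enumerate(table, start=1):
    print(pos, row)
```

For a real dataset you could pass an open file handle instead of the in-memory example, for instance `gzip.open(path, "rt")` for a gzipped FASTQ, and then compare the A/T and C/G columns at each position.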

I also prepared an example from my data (DNA-seq, 19M paired-end reads from fruit fly) for you.

[Image: per base sequence content plot for the fruit fly DNA-seq data]

ADD REPLYlink modified 5 days ago • written 5 days ago by wm280

Thank you so much for your detailed answer. I am new to sequencing, but I inferred that each fragment in DNA sequencing consists of two strands. Is that true or not?

ADD REPLYlink written 4 days ago by Negin0

You only sequence one strand at a time.

ADD REPLYlink written 4 days ago by genomax80k

What about paired-end reads? Do the mates in a pair come from one strand or from both?

ADD REPLYlink written 4 days ago by Negin0

The same fragment is sequenced from both ends (each read 5'--> 3') to generate the two reads you get. The reads are from opposite strands of that fragment.
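A toy model of one double-stranded fragment may help picture this. The sketch below is a simplification with hypothetical names: the fragment is given as its top strand written 5'-->3', each read is the first `read_len` bases of one strand, and the flag `read1_from_top` stands in for whichever strand happens to be sequenced first.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment_top, read_len, read1_from_top=True):
    """Return (read1, read2) for one fragment; both reads are written 5'-->3'."""
    bottom = reverse_complement(fragment_top)   # the other strand, also 5'-->3'
    if read1_from_top:
        first, second = fragment_top, bottom
    else:
        first, second = bottom, fragment_top
    # Each read starts at the 5' end of its own strand.
    return first[:read_len], second[:read_len]

fragment = "ATGGCGTACGTTAGCCTAGGATCCA"   # hypothetical insert, top strand
r1, r2 = paired_end_reads(fragment, 8)
print("read1:", r1)
print("read2:", r2)
print("read2 reverse complemented:", reverse_complement(r2))
```

In this model read 1 matches the start of the top strand, while read 2 only matches the far end of the fragment after it is reverse complemented, which is what "opposite strands" means in practice. Flipping `read1_from_top` simply swaps the two reads.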

ADD REPLYlink written 4 days ago by genomax80k

Thank you so much for your help. The discussion here was very helpful for me.

ADD REPLYlink written 2 days ago by Negin0

Dear @genomax, now another question comes to my mind: in the case of paired-end sequencing, do we know which strand read1 and read2 belong to, or do we just know that they are from opposite strands?

ADD REPLYlink written 1 day ago by Negin0

Does anyone else know the answer?

ADD REPLYlink written 20 hours ago by Negin0

DNA is anti-parallel, so the concept of strands is relative. One always sequences 5' --> 3', so whichever end is sequenced first (that would be the end where the P5 adapter is ligated) becomes read 1.

ADD REPLYlink modified 12 hours ago • written 18 hours ago by genomax80k

For DNA-seq, we cannot tell which read belongs to which strand, because both the forward and the reverse strand of the DNA can be attached to the P5 end of the library in the 5'-3' direction (this gives read1 in paired-end sequencing).

But for strand-specific RNA sequencing, which strand is ligated to the P5 adapter (5'-3' direction) is determined by the protocol, so we can tell which strand a read belongs to. For example, in a dUTP strand-specific RNA library, read2 is from the sense strand of the mRNA.

ADD REPLYlink modified 6 hours ago • written 6 hours ago by wm280

Yes, the insert fragment is double-stranded DNA, and each sequencing read (read1 or read2) comes from only one strand of it. You could look at some Illumina library materials or videos, for example:

ADD REPLYlink modified 4 days ago • written 4 days ago by wm280

thanks for your answer

ADD REPLYlink written 1 day ago by Negin0

Hello negin

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/11701/per-base-sequence-content-in-fastqc

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink modified 8 days ago by Istvan Albert ♦♦ 83k • written 8 days ago by ATpoint31k

Thanks for your advice

ADD REPLYlink written 8 days ago by Negin0