Question: fastQC - case of the anomalous last base
3.2 years ago, wasphunter wrote:


I am not getting the best mapping rate (~60%) for my latest batch of sequences from a HiSeq run, following de novo assembly. I don't see much evidence of DNA contamination in the reads, so I've been looking elsewhere for a reason for the low mapping efficiency.

My sequences appear to have high quality scores throughout, except for the final base (Phred score below 30). I was able to remove these low-quality bases from a subset of the data using Trimmomatic. One thing persists, however: the FastQC "per base sequence content" report indicates that the base percentages at the last position diverge substantially from those elsewhere in the reads. For example, my average G% and C% hold steady at ~22% each throughout the reads but, at the final base, G% increases to ~25% and C% to almost 30%.

This observation differs from what I've seen in "normal" RNA-seq reads. Have you seen this, and can you explain the significance of this divergence? Thanks.

fastqc rna-seq • 1.1k views
modified 3.2 years ago by Grinch • written 3.2 years ago by wasphunter
3.2 years ago, Grinch wrote:

Hi, my first thought is: was your sample treated in any way, or do you have any overrepresented sequences? If the bias is only at the final base, then it's most probably due to overly aggressive adapter trimming, or to your trimming of the final bases. How long are your reads? Are they paired-end?

There are a number of common scenarios which would elicit a warning or error from this module:

  • Overrepresented sequences: If there is any evidence of overrepresented sequences such as adapter dimers or rRNA in a sample then these sequences may bias the overall composition and their sequence will emerge from this plot.

  • Biased fragmentation: Any library which is generated based on the ligation of random hexamers or through tagmentation should theoretically have good diversity through the sequence, but experience has shown that these libraries always have a selection bias in around the first 12bp of each run. This is due to a biased selection of random primers, but doesn't represent any individually biased sequences. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn't seem to adversely affect the ability to measure expression.

  • Biased composition libraries: Some libraries are inherently biased in their sequence composition. The most obvious example would be a library which has been treated with sodium bisulphite which will then have converted most of the cytosines to thymines, meaning that the base composition will be almost devoid of cytosines and will thus trigger an error, despite this being entirely normal for that type of library.

  • If you are analyzing a library which has been aggressively adapter trimmed then you will naturally introduce a composition bias at the end of the reads as sequences which happen to match short stretches of adapter are removed, leaving only sequences which do not match. Sudden deviations in composition at the end of libraries which have undergone aggressive trimming are therefore likely to be spurious. [1]
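
To make concrete what this module plots, here is a minimal Python sketch of a per-position base composition calculation. It is not FastQC's actual implementation; the function name `per_base_content` and the toy reads are made up for illustration. The point to notice is that each position's percentages are computed only from reads long enough to reach that position, so after trimming, the last position reflects only the reads that were left untrimmed.

```python
from collections import Counter


def per_base_content(reads):
    """Return one {base: percent} dict per read position.

    Hypothetical helper for illustration only. Percentages at a position
    use only the reads that are long enough to have a base there.
    """
    counts = []  # counts[i] tallies the bases seen at position i
    for read in reads:
        for i, base in enumerate(read):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1

    content = []
    for tally in counts:
        total = sum(tally.values())  # includes N or other symbols, if any
        content.append({b: 100.0 * tally[b] / total for b in "ACGT"})
    return content


# Toy input: the third read is one base shorter, as if it had been trimmed.
reads = ["ACGT", "ACGA", "ACG"]
content = per_base_content(reads)
for pos, percents in enumerate(content, start=1):
    print(pos, percents)
```

In the toy input, the fourth position is computed from two reads only, because the shortened read no longer contributes to it.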
modified 3.2 years ago • written 3.2 years ago by Grinch

Thank you for your input.

My RNA-seq data consist of 125 bp paired-end reads.

I should have stated more clearly that the "errant" bases are at the 3' end of the reads, not the 5'. Apologies for not including an image earlier:

[Image: FastQC plot, sequence content across all bases]

I believe I did a good job of removing adapter sequences as nothing comes up on the "overrepresented sequences" report. Since the problem I'm concerned with is at the 3' end of paired-end reads, it isn't clear that biased fragmentation could account for the observation (although that could explain the observations at the 5' end of the reads in the image above).

So, does this new information give you any more ideas or have I missed something? Thanks again.

modified 3.2 years ago • written 3.2 years ago by wasphunter

That profile looks typical of data which has been trimmed for adapter sequences. Did you use cutadapt or trim_galore for the adapter trimming? Normally when I use trim_galore, the amount of overlap used to detect adapter contamination is a single base pair. This causes the last base of the reads to have a funny base content percentage. You can remove this by increasing the amount of overlap required to detect the adapter. Either way this is not the source of your low mapping rate.
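
The effect of a one-base overlap can be illustrated with a toy simulation in Python. This is a rough sketch under stated assumptions, not a model of trim_galore, cutadapt or Trimmomatic: `trim_3prime` and `last_base_content` are hypothetical helpers, the reads are uniformly random, the adapter string is assumed to be the start of the common Illumina adapter, and only exact suffix matches are trimmed (real trimmers also allow mismatches and internal matches).

```python
import random

# Assumption: first 13 bases of the common Illumina adapter sequence.
ADAPTER = "AGATCGGAAGAGC"
READ_LEN = 50


def trim_3prime(read, adapter, min_overlap):
    """Remove the longest read suffix that exactly matches an adapter prefix.

    Simplified, hypothetical trimmer: a suffix is removed only if it is at
    least min_overlap bases long.
    """
    longest = min(len(adapter), len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def last_base_content(reads, length):
    """Base percentages at the final position, using full-length reads only."""
    full = [r for r in reads if len(r) == length]
    return {b: 100.0 * sum(r[-1] == b for r in full) / len(full) for b in "ACGT"}


random.seed(0)
reads = ["".join(random.choice("ACGT") for _ in range(READ_LEN))
         for _ in range(20000)]

results = {}
for min_overlap in (1, 3, 5):
    trimmed = [trim_3prime(r, ADAPTER, min_overlap) for r in reads]
    results[min_overlap] = last_base_content(trimmed, READ_LEN)
    print(min_overlap, results[min_overlap])
```

With a minimum overlap of 1, every read that happens to end in the adapter's first base loses that base, so that base disappears from the last position of the full-length reads and the other three are inflated. Requiring a longer overlap makes chance matches rare and the last-base composition returns to roughly uniform. As far as I know, the relevant setting is `--stringency` in trim_galore and `-O`/`--overlap` in cutadapt, but check the documentation for your version.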

modified 3.2 years ago • written 3.2 years ago by James Ashmore

Thanks for your comments, James.

I used Trimmomatic for the adapter removal in both single and palindromic modes.

I'm not sure which software was used to remove the index adapters (I'm still checking with my sequencing company). The data came to me with almost all reads 125 bases in length. The FastQC image I linked in the comment above is similar to the original at the 5' and 3' ends, but differs in that I managed to remove a peak around the 40-50 base range that was due to Illumina sequences in some of the reads (~0.1%). I did this trimming with Trimmomatic, using a special primer file that allows for both single-end and palindromic trimming modes.

modified 3.2 years ago • written 3.2 years ago by wasphunter