Question

MultiQC output: opposing output

0

Entering edit mode

13 months ago

S ▴ 10

Hi,

I'm working with RAD genomic data (produced through genotype-by-sequencing). Two panels in my MultiQC document seem to contradict one another, the "Trimmed Sequence Lengths (3')" and the "Sequence Length Distribution" panels. I expect a lot of duplication because of the GBS protocol.

Trimmed Sequence Length panel

In the Trimmed Sequence Length panel (multicolored), there are peaks around 70, 90, and 110 bp. When I used Trim Galore (a Cutadapt wrapper), I indicated a quality score (q) of 0, so there should have been no quality trimming, only removal of the adapters. Is this plot indicating that some reads were trimmed >50 bp? Trimming both Illumina adapters should have removed <40 bp.

Sequence Length Distribution panel

In the Sequence Length Distribution panel (orange), I think the peaks are more normal- 30, 50, 70 bp.

Can anyone provide any ideas about what happened along the way?

Thanks so much!

cutadapt fastqc multiQC genotype-by-sequencing GBS • 1.3k views

ADD COMMENT • link updated 12 months ago by Phil Ewels ★ 1.4k • written 13 months ago by S ▴ 10

0

Entering edit mode

If the Sequence length distribution data is post trimming (cutadapt is pre-trimming) and if you expect ~40 bp to be removed then the peaks seems to have shifted correctly in trimmed data?

ADD REPLY • link 13 months ago by GenoMax 142k

0

Entering edit mode

Thanks for your reply! I was under the impression that the Trimmed Sequence Length plot is showing the number of nucleotides that are trimmed per read, which would imply that it's trimming, in some cases, >100 bp, which doesn't make sense if I'm only asking it to trim ~40 bp total, per read. What do you mean by "cutadapt is pre-trimming"?

ADD REPLY • link 13 months ago by S ▴ 10

0

Entering edit mode

I wonder, does the x axis label "Length trimmed (bp)" mean the base pair at which the read was trimmed? Which I think is similar to your idea?

ADD REPLY • link 13 months ago by S ▴ 10

0

Entering edit mode

I don't use cutadapt so I am not so familiar with its metrics. I thought that the plot you were showing us in figure 1 was data prior to trimming. Is that not the case?

Trimmed Sequence Length plot is showing the number of nucleotides that are trimmed per read

This must show the plot of reads (length) that remains after trimming. Not the bases that got trimmed.

ADD REPLY • link 13 months ago by GenoMax 142k

0

Entering edit mode

Frankly, I'm not certain what the "Trimmed Sequence Length" plot is showing. I can't figure out another explanation besides that, as you suggested, this plot shows read lengths before trimming, and the other plot shows lengths after trimming. However, there's a huge peak around 140 in the second plot that I can't explain...

ADD REPLY • link 13 months ago by S ▴ 10

score 1 · Answer 1 · 2023-04-04

1

Entering edit mode

13 months ago

Phil Ewels ★ 1.4k

Yes, this plot shows the length of removed sequence (see explanation text in the MultiQC plot and cutadapt documentation).

Trimming both Illumina adapters should have removed <40 bp.

..which would imply that it's trimming, in some cases, >100 bp, which doesn't make sense if I'm only asking it to trim ~40 bp total, per read.

This is where I think the confusion lies. You're _not_ only asking it to trim ~40bp total per read, you're asking it to only trim adapters. Adapter sequence can be anywhere in the read, and cutadapt will remove that and anything that follows.

Consider a simplified example for a 30bp read with a 5bp adapter (X):

GAGCXXXXXACTACGTACGTACAGCGGATA
    ^^^^^

Cutadapt would remove the adapter and everything after it, leaving just GAGC. It would have trimmed 26bp even though the adapter is only 5bp long.

If you have super short inserts then it's entirely likely that your 140bp reads could be 10bp sequence, 40bp adapter and 90bp of N's (or other random low quality base calls).

ADD COMMENT • link 13 months ago by Phil Ewels ★ 1.4k

0

Entering edit mode

Thanks so much for this, Phil. I didn't think that adapter sequence can be anywhere in the read. I thought it's always at the end of the read, since the GBS technique we used 1) cuts up the DNA at specific places using a cutter enzyme and then 2) attaches the adapters to those cut sites. Is there something I'm missing?

My reads were (fortunately) very high quality so there's not much chance the reads would be majority Ns. Can you please give another example for what could've happened?

ADD REPLY • link 13 months ago by S ▴ 10

0

Entering edit mode

illumina libraries are never as perfect as we imagine in our nice drawings with clean lines :) Adapter dimers, PCR product concatamers etc. I also suspect that it's more than 40bp adapter - your sequencing primer might be 40bp, but don't forget that there's the p5 / p7 adapters and all the other stuff that makes SBS work. Also, once a sequencing cluster runs out of bases, it can easily get light bleed from adjacent clusters, which can come through both as base calls instead of Ns. Short insert sizes with read-through can give you all kinds of rubbish.

ADD REPLY • link 13 months ago by Phil Ewels ★ 1.4k

0

Entering edit mode

If you're not sure what's happening, I would recommend run FastQC before and after trimming (I think TrimGalore runs FastQC for you in fact) and then scoop both sets of results into MultiQC. Make sure to separate them by running the module twice with path_filters (see the MultiQC docs with exactly this example).

This way you can easily see library read length before and after trimming, for the avoidance of any doubt.

ADD REPLY • link 13 months ago by Phil Ewels ★ 1.4k

0

Entering edit mode

Hi Phil- thank you again for your efforts to help!

I've spent the last few hours trying to get my head around this and am still struggling. Here's what I believe represents my reads:

[Illumina adapter: 13 bp]+[Barcode: ~4 bp]+[DNA: up to ~120 bp if sequencing 150 bp reads]+[Illumina adapter: 13 bp]

(here's a link to a video outlining what these reads should look like)

Then, it was sequenced on an Illumina HiSeq with 150bp reads. So if there are two adapters (13bp) and one barcode (4-8bp, 5' end) and the fragment could be at most 150 bp long, if each oligo stuck to the flow cell with the first (5') adapter, how could more than (150-13=137 bp) be trimmed from the ends?

And since I have a peak at 110 bp, does that mean that a lot of my fragments are 150-110=40 bp - the other adapter/barcode combo (17 bp) = ~23 bp long? Same for peaks at 70 and 90 bp (~150-70-17=63 bp and ~150-90-17=43 bp)?

These numbers are very close to the peaks in the second figure... a good thing?

ADD REPLY • link 13 months ago by S ▴ 10

0

Entering edit mode

DNA: up to ~120 bp if sequencing 150 bp reads

I'm not sure that this is right. The insert length is not affected by the sequencing length, it's purely determined by the library prep. The DNA insert could be 500bp, the 150bp reads just means that you'll only get 150bp of it from each end.

The link you posted is a rather simplified diagram. Remember that illumina sequencing typically actually has 4 reads for example (depending on instrument). The details will depend on prep type and sequencing type, but I'm pretty sure that you'll end up with more than 13bp on either end. See a diagram from illumina here.

I'd recommend backing up a step and taking a look at the gels of the library prep (eg. bioanalyser / fragment analyser etc). My suspicion is that insert lengths will be rather short and a bit peaky / lumpy - hopefully corresponding a little to your MultiQC plot. This should help to interpret the numbers.

ADD REPLY • link 12 months ago by Phil Ewels ★ 1.4k