Question: General question about batch effect, read trimming and what to do when the adapter trimming step is not working appropriately.
10 days ago, Mozart (110) wrote:

Hello everyone, I have a huge dataset with a bunch of human samples to analyse. Of course, I ran into trouble because the samples come from different donors, and when I run a PCA on those samples, it's a bit dodgy. They cluster according to their condition, but I am not sure how I am supposed to deal with this batch effect. A while ago, I used the SVA package, but I wasn't happy with it.

A related problem is probably that my samples are not trimmed consistently. I have had a lot of problems with the facility that generated these FASTQ files because sometimes they provide trimmed samples and sometimes they don't (this whole dataset comes from different batches/years). Thus, my questions:

  1. Don't you think that all of my samples, to generate useful data, must be processed in exactly the same way (e.g. same SLIDINGWINDOW, LEADING, TRAILING, MINLEN)? I am quite confused about this.
  2. What if, by any chance, I trim an already-trimmed file?
  3. When I try to trim my samples, I don't manage to remove the adapter contamination. According to my beloved MultiQC report, there's a huge Nextera transposase sequence contamination that Trimmomatic can't remove, even when selecting specific adapters.

Yours, M

modified 4 days ago • written 10 days ago by Mozart (110)

"when I PCA those samples, it's a bit dodgy"

How was the PCA done, and how were data normalization/regularization performed?

"They cluster according to their condition"

Isn't that expected, as this is the biological difference?

"problem with the facility that generated these fastq files because sometimes they provide me trimmed samples"

It is very uncommon for facilities to provide adapter-trimmed samples. Do you really mean trimmed or demultiplexed?

As for the questions 1-3:

  1. Yes, data should be uniformly processed, but re-trimming a dataset is probably not harmful, as there should be little effect if the adapter sequence is indeed no longer present.
  2. See 1).
  3. Did you provide the correct adapter sequence? See, for example, code on the web. If the sequence persists, something in your command is wrong. Can you share some command lines?
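To illustrate point 2, here is a minimal sketch of why a second trimming pass should change little once the adapter is gone. It is a deliberate simplification: `trim_adapter` is a hypothetical helper that only cuts at an exact, full-length adapter match, whereas real trimmers (Trimmomatic, cutadapt) also handle mismatches and partial matches at the read end. The sequence used is the Nextera transposase sequence as listed in FastQC's adapter list; the example read is made up.

```python
# Nextera transposase sequence, as listed in FastQC's adapter list.
NEXTERA = "CTGTCTCTTATACACATCT"


def trim_adapter(read: str, adapter: str = NEXTERA) -> str:
    """Cut the read at the first exact occurrence of the adapter.

    Simplified stand-in for a real trimmer: exact matches only,
    no mismatches, no partial matches at the 3' end.
    """
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


# Made-up read: 20 bases of insert, then adapter read-through.
raw = "ACGTTGCAGGATCCATGCAA" + NEXTERA + "GGGGGG"

once = trim_adapter(raw)    # first pass removes adapter and everything after it
twice = trim_adapter(once)  # second pass finds nothing left to remove

print(once == twice)
```

Quality-based steps (sliding window, minimum length) can still drop or shorten a few extra reads on a second pass if the settings differ, which is one more reason to process all batches with the same parameters.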
modified 10 days ago • written 10 days ago by ATpoint (15k)

As a small addition, generate a FastQC report for each sample before and after trimming. Afterwards, run MultiQC on the reports.

Then you'll see the differences in adapter content, read length, etc.
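For a quick look without the full reports, the same two quantities can be approximated directly from a FASTQ file. This is only a rough sketch, not a replacement for FastQC/MultiQC: `summarize_fastq` is a hypothetical helper, it assumes standard four-line FASTQ records, and it only counts exact occurrences of one adapter (the Nextera transposase sequence as listed by FastQC).

```python
import gzip
from collections import Counter

# Nextera transposase sequence, as listed in FastQC's adapter list.
NEXTERA = "CTGTCTCTTATACACATCT"


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def summarize_fastq(path, adapter=NEXTERA):
    """Count reads, reads containing the adapter, and read lengths.

    Assumes four lines per record (header, sequence, '+', qualities).
    """
    n_reads = 0
    n_adapter = 0
    lengths = Counter()
    with open_fastq(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 != 1:  # only the sequence line of each record
                continue
            seq = line.strip()
            n_reads += 1
            lengths[len(seq)] += 1
            if adapter in seq:
                n_adapter += 1
    return {"reads": n_reads, "with_adapter": n_adapter, "lengths": dict(lengths)}
```

Calling it on the same sample before and after trimming, e.g. `summarize_fastq("/Users/FASTQ/sample1_R1_001.fastq.gz")`, gives a crude before/after comparison. A file in which every read has the same length has most likely not been trimmed yet, which may help to sort out which deliveries from the facility were already processed.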

written 9 days ago by michael.ante (3.2k)

Thanks, ATpoint, for your question. I am judging the PCA according to someone else's analysis; I haven't had the chance to get to that point myself yet. By the way, I guess there is very little variation amongst the different samples.

Anyway, I solved the issue, but, as you can see below, I am not sure whether I should use the paired or the unpaired files after trimming.

written 4 days ago by Mozart (110)

I have recently used Trimmomatic to remove Nextera transposase sequences, so it is probably just a matter of providing the correct sequence.

written 8 days ago by kristoffer.vittingseerup (1.7k)

Agreed. The standard tools (I use cutadapt) all perform more or less equally well, and if trimming does not work, it is 99.9% of the time a user-induced problem (wrong commands, wrong adapter sequences provided, etc.).

modified 8 days ago • written 8 days ago by ATpoint (15k)

8 days ago, colindaven (1.2k), Hannover Medical School, wrote:

Try alternative trimmers too. Besides the standard Trimmomatic, I use fastp and ea-utils fastq-mcf for tricky samples.

I also use multiple rounds of trimming, e.g. to remove adapters from tricky short sequences such as miRNAs or amplicons.

Multiple rounds of FastQC and MultiQC are also necessary.

written 8 days ago by colindaven (1.2k)

8 days ago, Biogeek (350) wrote:

I'd recommend using BBDuk from the BBTools suite by Brian Bushnell. It ships with an extensive adapters.fa file containing all publicly available adapter sequences. Just an idea? It is surprising how often people use Trimmomatic without the correct adapter sequence .fa file. Admittedly, I made that mistake once myself before I realised. The performance of BBDuk is supposedly superior to Trimmomatic.

Once you've tried BBDuk, report back the QC results. The log output will also tell you the percentage of adapter sequence detected and removed.

written 8 days ago by Biogeek (350)

4 days ago, Mozart (110) wrote:

Thanks to all of you for the useful replies. Here is the command I am using:

java -jar /Users/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 -threads 4 \
  /Users/FASTQ/sample1_R1_001.fastq.gz /Users/FASTQ/sample1_R2_001.fastq.gz \
  /Users/FASTQ/sample1_R1_paired.fastq.gz /Users/FASTQ/sample1_R1_unpaired.fastq.gz \
  /Users/FASTQ/sample1_R2_paired.fastq.gz /Users/FASTQ/sample1_R2_unpaired.fastq.gz \
  ILLUMINACLIP:/Users/Trimmomatic-0.39/adapters/NexteraPE-PE.fa:value:value:value \
  SLIDINGWINDOW:value:value LEADING:value TRAILING:value MINLEN:value

It seems to work now, because, to be honest, I slightly changed the command. Looking at the QC report again, it seems I managed to remove the adapter contamination.

At the end of this process, I should use the paired files for the downstream analysis, right?
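
For reference, a small sketch of how a paired-end trimmer distributes reads over the four output files named in the command above. It assumes the behaviour documented for Trimmomatic's PE mode: a pair goes to the two "paired" files only if both mates survive, and a lone survivor goes to the "unpaired" file of its own side. `split_pairs` is a hypothetical helper, and survival is reduced to a minimum-length check for illustration.

```python
def split_pairs(pairs, min_len):
    """Sort already-trimmed read pairs into the four PE outputs.

    pairs   : list of (r1, r2) sequences after trimming
    min_len : reads shorter than this are discarded (stand-in for MINLEN)
    """
    out = {"R1_paired": [], "R1_unpaired": [], "R2_paired": [], "R2_unpaired": []}
    for r1, r2 in pairs:
        keep1 = len(r1) >= min_len
        keep2 = len(r2) >= min_len
        if keep1 and keep2:
            # Both mates survive: the pair stays together and in the same order.
            out["R1_paired"].append(r1)
            out["R2_paired"].append(r2)
        elif keep1:
            out["R1_unpaired"].append(r1)  # mate R2 was dropped
        elif keep2:
            out["R2_unpaired"].append(r2)  # mate R1 was dropped
        # If neither survives, the pair is discarded entirely.
    return out


example = split_pairs(
    [("ACGTACGT", "TTGCAAGC"), ("ACGTACGT", "AC"), ("A", "TTGCAAGC"), ("A", "C")],
    min_len=4,
)
print({name: len(reads) for name, reads in example.items()})
```

The two paired files always hold the same number of reads in the same order, which is what paired-end aligners and quantifiers expect as input; the unpaired files hold the leftovers whose mate was discarded.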



written 4 days ago by Mozart (110)
Powered by Biostar version 2.3.0