Does this adapter content from FastQC warrant trimming prior to alignment?
4 days ago
curious ▴ 890

Hi, I am looking to align some human short-read WGS data with BWA-MEM prior to variant calling. I have 62 samples, each with 8 pairs of FASTQs (992 FASTQs total). I ran FastQC/MultiQC on all of them and got this result from MultiQC:

enter image description here

For comparison, here is an individual FastQC report:

enter image description here

Spot-checking a handful of individual FastQC reports, it seems that the flat universal adapter line hovering around 2% and the poly-A line creeping up over the length of the read are pretty typical. Does this require trimming prior to alignment?

According to this, the threshold for a warning is 5% and for a failure is 10%, but I am just wondering what the general practice is.

Thanks!

fastqc bwa-mem multiqc • 5.5k views

I would trim it. Two percent is not much, but with WGS, say you have 400 million reads per sample or so, it's still millions of reads that theoretically could contaminate variant calls. Do it once, properly, and never care about it again. That said, we are blessed with a free-to-use, powerful HPC at our university, so all we have to do is wait for the job to complete. If you're on a budget, say you need to pay for HPC or a cloud service, you might want to skip it, but I would still feel safer trimming.
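
To put a rough number on "millions of reads", here is a minimal sketch of the arithmetic. The 400 million reads and the 2% adapter fraction are the ballpark figures from this thread, not measurements, and `reads_with_adapter` is just an illustrative name.

```python
def reads_with_adapter(total_reads, adapter_fraction):
    """Estimate how many reads carry adapter sequence.

    total_reads: number of reads in the sample (ballpark figure)
    adapter_fraction: fraction of reads flagged with adapter, e.g. 0.02 for 2%
    """
    return round(total_reads * adapter_fraction)


# Ballpark values from the discussion: ~400 million reads, ~2% adapter content.
estimate = reads_with_adapter(400_000_000, 0.02)
print(f"{estimate:,} reads with adapter")
```

With those assumed inputs, the estimate comes to about 8 million reads per sample.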

4 days ago
GenoMax 154k

If you choose to do scanning/trimming, it is an investment in time, but it will ensure that no extraneous sequence is present.

That said, aligners should soft-clip the adapter sequence at the time of alignment, so technically you do not need to trim the data.

Note: Those sequences that show adapter starting at cycle 1 are likely all adapter dimers and have no inserts.


Thank you. The adapter sequences are specified as CTGTCTCTTATACACATCT+ATGTGTATAAGAGACA in the SampleSheet.csv file from the sequencing run, so I ran cutadapt on a single sample:

cutadapt \
  -a CTGTCTCTTATACACATCT -A ATGTGTATAAGAGACA \
  --cores 4 \
  -o {output.r1} -p {output.r2} \
  {input.r1} {input.r2} 

The resulting log file claims:

Total read pairs processed:         68,426,019
  Read 1 with adapter:               1,034,464 (1.5%)
  Read 2 with adapter:               1,216,834 (1.8%)
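
As a quick sanity check on those percentages, here is a small sketch that recomputes them from the three counts in the log above. Nothing else is assumed; `percent_of_pairs` is just an illustrative helper name.

```python
def percent_of_pairs(count, total_pairs):
    """Return count as a percentage of total_pairs."""
    return 100.0 * count / total_pairs


# Counts copied from the cutadapt log above.
total_pairs = 68_426_019
read1_with_adapter = 1_034_464
read2_with_adapter = 1_216_834

r1_pct = percent_of_pairs(read1_with_adapter, total_pairs)
r2_pct = percent_of_pairs(read2_with_adapter, total_pairs)
print(f"Read 1 with adapter: {r1_pct:.1f}%")
print(f"Read 2 with adapter: {r2_pct:.1f}%")
```

Both values round to the figures cutadapt reports (1.5% and 1.8%), which is in the same range as the roughly 2% adapter line in the FastQC plots.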

But I see little impact going from the before FastQC:

enter image description here

to the after FastQC:

enter image description here

Are there other sequences I should be trimming? The kit is Illumina DNA RNA UDI SetA Tagmentation DNA PCRFree, so I am not sure whether there would be primer dimer sequences to remove, given that it is a PCR-free kit, unless I am misunderstanding (I am new).


I am not sure if there would be primer dimer sequences

I should have said adapter dimers above. Correction made.


Just adding some detail here that may help draw connections with other posts. I did a bunch of sleuthing, selectively trimming different things, and I think the persistent band from position 1 to 150 is polyG. Later I found these posts from Brian Bushnell describing in great detail an artefactual band like the one I observe, due to the NovaSeq X (the machine we are using):

Using BBtools to remove polyGs in Illumina data

New Illumina error mode, new BBTools release (39.09) to deal with it

I think the band that starts creeping up around 60 bp is residual adapter (the center that generates our FASTQs had clipped already), but I can eliminate that band entirely with cutadapt targeting the adapters.

The band that rises steadily between 20 and 150 is at least partially poly-N; the rest might be poly-A, but that has been harder to show.

enter image description here
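
For anyone wanting to check the tail bands described above directly in the reads, here is a rough Python sketch that counts how many reads end in a run of G, A, or N. It is only an illustration: the function names, the 10-base minimum run length, and the assumption of a simple four-lines-per-record FASTQ are all my own choices, not something taken from FastQC or BBTools.

```python
import gzip
from itertools import islice


def trailing_run(seq, base):
    """Length of the uninterrupted run of `base` at the 3' end of `seq`."""
    return len(seq) - len(seq.rstrip(base))


def read_sequences(path, limit=None):
    """Yield sequence lines from a FASTQ file (plain or .gz).

    Assumes exactly four lines per record; `limit` caps the number of reads.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Sequence is the second line of every four-line record.
        seq_lines = islice(handle, 1, None, 4)
        for line in islice(seq_lines, limit):
            yield line.strip().upper()


def tail_fractions(seqs, bases="GAN", min_len=10):
    """Fraction of reads whose 3' end is a run of at least `min_len` of each base."""
    counts = {b: 0 for b in bases}
    total = 0
    for seq in seqs:
        total += 1
        for b in bases:
            if trailing_run(seq, b) >= min_len:
                counts[b] += 1
    return {b: (counts[b] / total if total else 0.0) for b in bases}


# Tiny in-memory example; on real data one would pass
# read_sequences("sample_R1.fastq.gz", limit=1_000_000) instead (hypothetical path).
example = ["ACGT" + "G" * 12, "ACGTACGT", "ACGT" + "A" * 10, "ACGTNNN"]
print(tail_fractions(example))
```

Running this on a subsample before and after each trimming step would show which homopolymer tails a given step actually removes. It only looks at 3' tails, so it will not catch runs in the middle of a read.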
