Does this adapter content from FastQC warrant trimming prior to alignment?
2 days ago
curious ▴ 890

Hi, I am looking to align some human short-read WGS data with BWA-MEM prior to variant calling. I have 62 samples, each with 8 pairs of FASTQ files (992 FASTQ files in total). I ran FastQC/MultiQC on all of them and got this result from MultiQC:

[image: MultiQC adapter content plot]

For comparison here is an individual fastqc:

[image: adapter content plot from an individual FastQC report]

Spot-checking a handful of individual FastQC reports, it seems that the flat universal adapter line hovering around 2% and the poly-A line creeping up over the length of the read are pretty typical. Does this require trimming prior to alignment?

According to this, the threshold for a warning is 5% and for a failure is 10%, but I am wondering what the general practice is.
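As a small illustration of the thresholds quoted above (warning at 5%, failure at 10%), the check could be sketched as follows. The function name `adapter_status` and the example percentages are made up for illustration, and the strict "greater than" comparison at the boundaries is an assumption of this sketch, not something taken from FastQC itself.

```python
def adapter_status(max_adapter_pct, warn=5.0, fail=10.0):
    """Classify the highest adapter percentage seen at any read position.

    warn/fail default to the 5% and 10% thresholds quoted in the post;
    the strict '>' comparison is an assumption of this sketch.
    """
    if max_adapter_pct > fail:
        return "fail"
    if max_adapter_pct > warn:
        return "warn"
    return "pass"


# Hypothetical values: ~2% is roughly the flat adapter line described above.
for pct in (2.0, 6.5, 12.0):
    print(pct, adapter_status(pct))
```

By this reading, a flat line around 2% sits below the warning level, which is why the question is about general practice and not about the QC flag.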

Thanks!

fastqc bwa-mem multiqc • 3.8k views

I would trim it. Two percent is not much, but with WGS, say you have 400 million reads or so per sample, that is still millions of reads that could theoretically contaminate variant calls. Do it once, properly, and never worry about it again. That said, we are utterly spoiled with a free-to-use, powerful HPC at our university, so all we have to do is wait for the job to complete. If you are on a budget, say you need to pay for HPC or a cloud service, you might want to skip it, but I would still feel safer trimming.
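To put a rough number on that, here is a back-of-the-envelope sketch. The 400 million reads per sample is the hypothetical figure from this comment, the 2% is the approximate adapter level from the question, and the helper name `reads_with_adapter` is invented for illustration.

```python
def reads_with_adapter(total_reads, adapter_fraction):
    """Estimate how many reads carry adapter sequence."""
    return round(total_reads * adapter_fraction)


total_reads = 400_000_000   # hypothetical reads per sample
adapter_fraction = 0.02     # roughly 2% adapter content

n = reads_with_adapter(total_reads, adapter_fraction)
print(f"{n:,} reads with adapter")
```

Under these assumptions that comes to about 8 million reads per sample.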

2 days ago
GenoMax 154k

If you choose to do scanning/trimming, it is an investment in time, but it will ensure that no extraneous sequence is present.

That said, aligners should soft-clip the adapter sequence at the time of alignment, so technically you do not need to trim the data.
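Soft-clipped bases show up as `S` operations in the CIGAR field of a SAM record, so one way to see how much is being clipped is to tally them. The following is a minimal sketch in plain Python. The function name `soft_clipped_bases` and the CIGAR strings are made up for illustration, and whether a given adapter fragment actually gets clipped depends on the read and the aligner's scoring.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return (0, 0)
    lead = ops[0][0] if ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (lead, trail)


# Made-up CIGAR strings for illustration.
for cigar in ("151M", "120M31S", "5S140M6S"):
    print(cigar, soft_clipped_bases(cigar))
```

Note that the CIGAR is written in reference orientation, so for a read aligned to the reverse strand, adapter at the 3' end of the original read would appear as a leading clip.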

Note: Those sequences that show adapter starting at cycle 1 are likely all primer dimers and have no inserts.


Thank you. The adapter sequences are specified as CTGTCTCTTATACACATCT+ATGTGTATAAGAGACA in the SampleSheet.csv file from the sequencing run, so I ran cutadapt on a single sample:

cutadapt \
  -a CTGTCTCTTATACACATCT -A ATGTGTATAAGAGACA \
  --cores 4 \
  -o {output.r1} -p {output.r2} \
  {input.r1} {input.r2} 

The resulting log file claims:

Total read pairs processed:         68,426,019
  Read 1 with adapter:               1,034,464 (1.5%)
  Read 2 with adapter:               1,216,834 (1.8%)
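As a sanity check, the percentages can be recomputed from the raw counts in the log. This is a minimal sketch that parses only the three lines quoted above. The helper name `parse_counts` is invented for illustration, and the regular expression assumes exactly this "label: count" layout.

```python
import re

log = """\
Total read pairs processed:         68,426,019
  Read 1 with adapter:               1,034,464 (1.5%)
  Read 2 with adapter:               1,216,834 (1.8%)
"""


def parse_counts(text):
    """Map each 'label: count' line to an integer count."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?):\s+([\d,]+)", line)
        if m:
            counts[m.group(1)] = int(m.group(2).replace(",", ""))
    return counts


counts = parse_counts(log)
total = counts["Total read pairs processed"]
for label in ("Read 1 with adapter", "Read 2 with adapter"):
    print(f"{label}: {100 * counts[label] / total:.1f}%")
```

The recomputed values agree with the 1.5% and 1.8% that the log reports.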

But I see little difference between the FastQC report before trimming:

[image: FastQC adapter content plot before trimming]

and the one after trimming:

[image: FastQC adapter content plot after trimming]

Are there other sequences I should be trimming? The kit is Illumina DNA RNA UDI SetA Tagmentation DNA PCRFree, so I am not sure there would be primer-dimer sequences to remove, given that it is a PCR-free kit, unless I am misunderstanding (I am new).
