Question

Trimming FASTQs

0

10 months ago

DKA ▴ 40

Hello,

I would like to trim my FASTQ files, but I don't have the sequences for the barcodes and adaptors. Is there a reliable method to predict these sequences and efficiently trim them?

I came across the fastp tool, which offers automatic detection of adaptor sequences. After using fastp and running FASTQC, it appeared that the adaptor sequences were successfully trimmed.

To verify the accuracy of the trimming process, I compared the VCFs generated from my trimmed FASTQ files (trimmed using fastp and DRAGEN FASTQ Toolkit) with the VCFs generated from the trimmed FASTQ files provided by the sequencing company. A few hundred variant calls differed between the sets of trimmed FASTQ files. Is that expected?

Thanks

NGS FASTQ trimming • 1.1k views

updated 10 months ago by Darked89 4.6k • written 10 months ago by DKA ▴ 40

1

Your samples must already be demultiplexed, so at this point you do not need the barcodes/indexes. The index sequences should be in the FASTQ headers as long as standard procedures were used for demultiplexing.
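
To illustrate where the index usually sits, here is a minimal Python sketch. It assumes Illumina-style (Casava 1.8+) headers of the form "@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index" and a plain 4-line-per-record FASTQ; the function names are made up for this example, and other header layouts will simply return nothing.

```python
from collections import Counter
from typing import Iterable, Optional


def index_from_header(header: str) -> Optional[str]:
    """Return the index field of an Illumina-style FASTQ header, or None."""
    # The part after the first space is assumed to be "read:filtered:control:index".
    parts = header.strip().split(" ")
    if len(parts) < 2:
        return None
    fields = parts[1].split(":")
    if len(fields) < 4 or not fields[3]:
        return None
    return fields[3]


def count_indexes(lines: Iterable[str]) -> Counter:
    """Tally index strings over the header lines of a 4-line-per-record FASTQ."""
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:  # every 4th line, starting at the first, is a header
            idx = index_from_header(line)
            if idx:
                counts[idx] += 1
    return counts
```

Passing an open file handle (or a gzip text handle) to count_indexes gives a quick tally of the indexes present in a file.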

Aligners will soft-clip adapter sequences, so unless the one you used does not do that, there should not be a major concern. A few hundred variants can easily differ between the two analysis streams simply because of differences in the software and settings used, and the stochastic nature of NGS alignments.
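
If you want to see which calls differ rather than just how many, a rough Python sketch like the one below can help. It assumes uncompressed, tab-separated VCF text and keys each call on (CHROM, POS, REF, ALT); the function names are hypothetical. It does no normalization, so the same indel written in two ways will be counted as different, which a dedicated tool such as bcftools isec run on normalized files would handle better.

```python
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]


def load_variants(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    variants: Set[Variant] = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # split multi-allelic records
            variants.add((chrom, int(pos), ref, allele))
    return variants


def compare_callsets(a: Set[Variant], b: Set[Variant]) -> Dict[str, Set[Variant]]:
    """Split two call sets into shared and set-specific variants."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}
```

Looking at where the set-specific calls fall (low coverage, read ends, repeats) usually says more than the raw count.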

10 months ago by GenoMax 141k

0

I guess the easiest option would be to contact the company's representative and ask for the adaptor and barcode sequences. After all, you will need some information about the library preparation for an eventual publication. What kind of method/library prep are you working with? I suppose some kind of multi-step barcoding PCR with highly multiplexed amplicon sequencing?

If you can't get hold of the barcode sequences, you might be able to derive them through an exploratory k-mer analysis. I prefer the BBMap toolkit for this kind of task, but it has a wealth of tools that might be a bit overwhelming to use for the first time. You could, for example, start by masking the adaptors with bbduk.sh and then run commonkmers.sh, or use seal.sh to "map" to an artificial reference that contains simulated barcodes.
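
To give an idea of what an exploratory k-mer analysis looks like, here is a small pure-Python sketch. It is not what the BBMap tools do internally; it only counts, for each k-mer, the fraction of reads that contain it, and the function name and defaults are my own choices. Genomic repeats and low-complexity sequence will also rank highly, so the top hits are candidates to inspect, not confirmed adaptors or barcodes.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def overrepresented_kmers(reads: Iterable[str], k: int = 12, top: int = 10) -> List[Tuple[str, float]]:
    """Return the `top` k-mers ranked by the fraction of reads containing them."""
    counts: Counter = Counter()
    n_reads = 0
    for seq in reads:
        seq = seq.upper()
        n_reads += 1
        # Use a set so each k-mer is counted at most once per read.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return [(kmer, c / n_reads) for kmer, c in counts.most_common(top)]
```

On real data you would run this on a subsample (say, the first few hundred thousand reads) to keep memory use reasonable.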

10 months ago by Matthias Zepper 4.5k

score 0 · Answer 1 · 2023-06-14

Apart from the suggestions above, you could also randomly check the variants that map to ClinVar to see whether you have missed any pathogenic variants. Trimmomatic also has options to check/trim adapters. Irrespective of this, as GenoMax suggests, you shouldn't worry too much: adapter bases are synthetic and wouldn't ideally be part of native genomic variants.

score 0 · Answer 2 · 2023-06-14

If you really want to go ballistic, you can get a comprehensive set of adaptors from UniVec, select the non-vector entries, and use seqkit locate to search your reads for the patterns in that file.
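
For a feel of what such a pattern search does, here is a toy Python version. It is not seqkit's implementation, only exact matching on both strands with 0-based start positions, and the function names and the example adaptor are my own illustration.

```python
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def locate_patterns(read: str, patterns: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Return (pattern name, strand, 0-based start) for every exact match in `read`."""
    hits: List[Tuple[str, str, int]] = []
    read = read.upper()
    for name, pattern in patterns.items():
        for strand, query in (("+", pattern.upper()), ("-", revcomp(pattern))):
            start = read.find(query)
            while start != -1:
                hits.append((name, strand, start))
                start = read.find(query, start + 1)  # allow overlapping matches
    return hits


# Example: the start of the common Illumina TruSeq adaptor as a single pattern.
adaptors = {"truseq": "AGATCGGAAGAGC"}
print(locate_patterns("ACGTTGCAAGATCGGAAGAGC", adaptors))
```

Real adaptor remnants are often partial or contain sequencing errors, so exact matching like this will under-report compared with a tool that tolerates mismatches.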

But unless you have some specific application in mind (like mapping miRNAs?), it is overkill with modern read aligners.