Question

identify and remove adapter sequence

2.2 years ago

BeeWork ▴ 10

Hi all,

I am trying to identify the adapter sequences in my ATAC-seq data. To do this, I ran the fastq file through FastQC, hoping the adapter sequence would be picked up and shown in the report.

In the report, no sequences were listed in the overrepresented sequences section, but the adapter content graph indicated that the reads contain Nextera Transposase Sequence. This confused me, as I expected the adapter sequence (Nextera Transposase Sequence) to be picked up in the overrepresented sequences section. Or am I wrong?

I searched online and found the Nextera Transposase Adapters in an Illumina document:

Read 1
5’ TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
Read 2
5’ GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG

Does anyone know if the above sequences are the correct adapter sequences? Thanks

trimming adapter sequence fastqc ATAC • 2.5k views

score 2 · Answer 1 · 2022-01-27

For the Overrepresented Sequences module, FastQC compares the sequences against contaminant_list.txt, which does not include the Nextera Transposase adapter sequences. FastQC still reports Nextera Transposase in the adapter content graph because that module performs a separate k-mer search using a different list of adapter sequences.

You can find more details in the FastQC documentation.

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/10%20Adapter%20Content.html
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Overrepresented%20Sequences.html

score 1 · Answer 2 · 2022-01-27

... I expected the adapter sequence (Nextera Transposase Sequence) to be picked up in the overrepresented sequences section. Or am I wrong?

You should not necessarily expect that. For the adapter content graph, the adapter (or more precisely, a 12-bp fragment of the adapter) is explicitly searched for in your reads. In contrast, the overrepresented sequences module lists every sequence that makes up more than 0.1% of the total, compared over the first 50-75 nt. The two analyses are quite different: adapter contamination can be present in fewer than 0.1% of reads, or only in "read-through" reads (so not over the full read length), and in both cases it will not show up as an overrepresented sequence. In addition, Nextera adapters are not in the default contaminant list of FastQC.
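
To make the difference concrete, here is a toy Python sketch. It is not FastQC's actual implementation. The helper names, the simulated reads, the 20% read-through rate and the 50-nt prefix are all assumptions made up for illustration. One function counts reads that contain the 12-mer anywhere, the other counts identical read starts and keeps those above 0.1% of the total.

```python
import random
from collections import Counter

ADAPTER_KMER = "CTGTCTCTTATA"        # 12-mer searched for in every read
READTHROUGH = "CTGTCTCTTATACACATCT"  # reverse complement of the shared 19-nt adapter end
READ_LEN = 75                        # assumed read length for the simulation


def random_seq(rng, n):
    """Random DNA string of length n (stands in for a genomic insert)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def adapter_fraction(reads, kmer=ADAPTER_KMER):
    """Fraction of reads that contain the k-mer at any position."""
    return sum(kmer in read for read in reads) / len(reads)


def overrepresented(reads, prefix_len=50, threshold=0.001):
    """Read prefixes that account for more than `threshold` of all reads."""
    counts = Counter(read[:prefix_len] for read in reads)
    total = len(reads)
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}


# Simulate 5000 reads; every fifth read has a short insert, so the read
# runs through the insert into the adapter at a variable position.
rng = random.Random(0)
reads = []
for i in range(5000):
    if i % 5 == 0:
        insert = random_seq(rng, rng.randint(30, 60))
        read = (insert + READTHROUGH + random_seq(rng, READ_LEN))[:READ_LEN]
    else:
        read = random_seq(rng, READ_LEN)
    reads.append(read)

frac = adapter_fraction(reads)
hits = overrepresented(reads)
print(f"reads containing the adapter k-mer: {frac:.1%}")
print(f"overrepresented sequences found: {len(hits)}")
```

In this simulation about one read in five contains the adapter k-mer, yet no read start is repeated often enough to pass the 0.1% cutoff, because the adapter always sits behind a different insert.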

Does anyone know if the above sequences are the correct adapter sequences?

They look correct. At least, the Nextera Transposase Sequence in FastQC is defined as the reverse complement of the last 12 nucleotides of Read 1.
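
If you want to check this yourself, a few lines of Python are enough. This is only a sketch: the two adapter strings are copied from the question, and `reverse_complement` is a small helper written here, not something from FastQC.

```python
# Adapter sequences as quoted in the question (5' to 3').
READ1 = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
READ2 = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


for name, adapter in (("Read 1", READ1), ("Read 2", READ2)):
    # Last 12 nt of the adapter, reverse-complemented.
    print(name, reverse_complement(adapter[-12:]))
```

Both adapters end in the same 12 nt, so both lines print CTGTCTCTTATA. You can compare that against the Nextera Transposase Sequence entry in the adapter list shipped with your FastQC version.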