Trimming P5 and P7 dual index adapters (and other QC)
3.5 years ago
willnotburn ▴ 50

I am trying to trim adapters from raw NovaSeq reads. Here is the FastQC report for the raw sequences:

Raw sequences checked by FastQC. Reverse only is shown.

Problems

I need to trim adapters. Does the rest of the report offer any clues as to what's wrong with the run? Is it overall bad?

The sequencing facility told me the following:

Platform: TruSeq

Kit: Swift Accel-NGS 2S DNA library prep kit

Adapters

  • P5: 5' AATGATACGGCGACCACCGAGATCTACAC[i5index]ACACTCTTTCCCTACACGACGCTCTTCCGATCT
  • P7: 5' CAAGCAGAAGACGGCATACGAGAT[i7index]GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

    My understanding is that the i5 and i7 indices are hexamers, i.e. NNNNNN.

I used AdapterRemoval with two approaches (neither worked well):

  • Used the facility-supplied adapters, inserting an NNNNNN hexamer in place of each index

    AdapterRemoval --threads 40 --file1 sample_R1.fastq.gz --file2 sample_R2.fastq.gz --adapter1 AATGATACGGCGACCACCGAGATCTACACNNNNNNACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter2 CAAGCAGAAGACGGCATACGAGATNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

    After trimming using facility-supplied P5 and P7. Reverse only is shown.

  • Used AdapterRemoval's --identify-adapters mode, which gave me a P5 sequence that partially matched what the facility said (above) but a totally different P7 sequence. I also tried trimming with the automatically detected adapters, as follows.

    AdapterRemoval --threads 40 --file1 sample_R1.fastq.gz --file2 sample_R2.fastq.gz --adapter1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG --adapter2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

    After trimming using automatically detected adapters. Reverse only is shown.
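One thing worth checking in the two commands above is orientation. The facility's P5/P7 strings are written as the adapter oligos, whereas a trimmer generally needs the adapter as it appears at the 3' end of a read, which is the reverse complement. As far as I understand, AdapterRemoval's --adapter1 and --adapter2 expect the read orientation. The Python sketch below suggests that the auto-detected sequences are the reverse complements of the facility-supplied ones (fully for P7, and for the 3' part of P5). The helper name `revcomp` and the variable names are my own, not part of any tool.

```python
def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

# Facility-supplied oligos from the post, index written as NNNNNN.
P5_3PRIME = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # part of P5 after the i5 index
P7 = ("CAAGCAGAAGACGGCATACGAGAT"
      "NNNNNN"
      "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT")

# Sequences used in the second command (automatically detected adapters).
DETECTED_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG"
DETECTED_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

# Read 1 runs into the P7 side of the fragment, read 2 into the P5 side.
print(revcomp(P7) == DETECTED_R1)                  # True
print(DETECTED_R2.startswith(revcomp(P5_3PRIME)))  # True
```

If that reading is right, the first command was searching for sequences that never occur in that orientation in the reads, which would explain why it trimmed so little.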

What am I doing wrong? What should I be doing?

sequencing trimming QC • 4.1k views

Use the instructions here to add images: How to add images to a Biostars post

I suggest that you try bbduk.sh (GUIDE) for adapter removal. There is a core sequence common to all adapters (before the index), so as long as you find it and trim everything to the right of it, you will remove the adapter sequences.

Thanks, genomax! Are the images not showing up? I just clicked the add image button and linked these from my public Google Drive. I will read the instructions.

I can only see images of the report categories that are in the left-hand column of the FastQC report.

Oh, that was the intention. I wanted to limit the post size (i.e. avoid overwhelming readers).

Without seeing the actual plots we can't really help you. A red "X" in FastQC only means that the value is outside an expected interval (the defaults are set for genomic sequencing). These "failures" have to be interpreted in the context of the type of data one is analyzing.

Raw sequences before trimming. This analysis was run on a concatenated set of all reverse reads, which is why there are so many sequences in Basic Statistics.

[Images: FastQC plots for the raw reads: tile scores, quality scores, sequence content, GC content, N content, length distribution, duplication levels, overrepresented sequences and adapters]

Sequences trimmed using the facility-supplied adapters with an NNNNNN hexamer inserted for the spacer (first approach in the OP). This analysis was run on the reverse reads of just one sample, which is why there are fewer sequences in Basic Statistics.

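[Images: FastQC plots for the trimmed reads: basic statistics, per-tile quality, quality scores, sequence content, GC content, N content, length distribution, duplication levels, overrepresented sequences and adapters]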

Hi GenoMax, for adapter removal and filtering, is it enough to provide only the core sequence rather than the whole adapter sequence? Are the sequences below the core sequence common to all adapters that you are referring to? They are described by Illumina for their TruSeq kits and are also found in the adapter file of Trimmomatic. Thanks.

Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Providing the core sequence is adequate. Once trimming programs match this sequence they will remove all sequence to the 3'-side of that match.
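The idea of "match the core, then drop everything 3' of it" can be sketched in a few lines of Python. This is a simplified stand-in that uses exact string matching only. It is not how bbduk.sh or Trimmomatic work internally, since real trimmers tolerate mismatches and use k-mer or alignment-based matching. The function name `trim_adapter` and the `min_overlap` parameter are hypothetical.

```python
CORE_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # Read 1 core sequence quoted above

def trim_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Cut the read at the first exact adapter match, or at a partial
    adapter that runs off the 3' end of the read."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]  # drop the adapter and everything 3' of it
    # Otherwise look for the longest read suffix that equals an adapter prefix.
    shortest = max(min_overlap, 1)
    for n in range(min(len(adapter) - 1, len(read)), shortest - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read  # no adapter found, read is returned unchanged

insert = "ACGTACGTTTGACCA"
print(trim_adapter(insert + CORE_R1 + "CATCACGATCTCG", CORE_R1))  # ACGTACGTTTGACCA
print(trim_adapter(insert + CORE_R1[:8], CORE_R1))                # ACGTACGTTTGACCA
```

Because the index and the rest of the adapter lie 3' of the core, they are removed along with it, which is why the core alone is sufficient.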

Thanks for confirming. Is it necessary to provide the reverse complement of the core adapter sequence as well?

3.5 years ago
GenoMax 141k

It looks like not all of your adapter sequence has been trimmed. Please try bbduk.sh, which I mentioned above. It is a Java-based program and should run on any OS. It comes with a file containing all commonly used adapter sequences (you will find adapters.fa in the resources folder). You can then run the program as follows:

bbduk.sh -Xmx2g in1=R1.fastq.gz in2=R2.fastq.gz out1=trimmed_R1.fastq.gz out2=trimmed_R2.fastq.gz ktrim=r k=23 tbo tpe ref=adapters.fa

If you have multiple cores available, add the threads=N option to the command above.

Thanks, genomax! bbduk worked at removing adapters. I am still getting a weird GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG overrepresented sequence that occurs 61132 times! The stats for overrepresented sequences are the same as in the output shown in the picture in the above post (where I used AdapterRemoval)

In addition, I ran two extra steps for quality trimming (qtrim=r trimq=10) and phiX removal (ref=phix174_ill.ref.fa.gz k=31) as per the bbduk manual. Still no luck with removing the crazy poly-G, which is probably affecting the GC content.

Interestingly, this only happens on the reverse reads. The forward reads are fine.

Poly-G reads come from clusters with no signal, which a 2-color sequencer calls as G. This must be data from a 2-color sequencer. You can also remove the poly-G's by using option trimpolyg=0.

bbduk.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=R1_clean.fastq.gz out2=R2_clean.fastq.gz qtrim=r trimq=10 minlen=100 trimpolyg=0

This did not seem to remove the poly-G.

I could try literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Would that be good practice?

My apologies. Should have said trimpolyg=15 or you can also try filterpolyg=5. Using literal=GGGGGGGG should work as well.

Unless you are going to do de novo work you can probably ignore the poly-G. Those reads should not align to anything.
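To make the length-threshold idea concrete, here is a minimal Python sketch of trimming a poly-G tail only when the run is long enough. It illustrates the concept and is not BBDuk's implementation. The function name `trim_polyg_tail` and the parameter `min_len` are hypothetical, and real tools would also need to tolerate the occasional non-G base call inside the run.

```python
def trim_polyg_tail(read: str, min_len: int = 15) -> str:
    """Remove a trailing run of G's if it is at least min_len bases long."""
    stripped = read.rstrip("G")
    run_length = len(read) - len(stripped)
    return stripped if run_length >= min_len else read

print(trim_polyg_tail("ACGTTCA" + "G" * 20))  # ACGTTCA
print(trim_polyg_tail("ACGTTCAGGG"))          # ACGTTCAGGG (run too short, kept)
print(len(trim_polyg_tail("G" * 50)))         # 0
```

A read that is poly-G from start to finish ends up empty, so a minimum-length filter such as minlen=100 in the command above would then discard it.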

Actually, this is for de novo assembly. I'm using metaSPAdes for that, and it explicitly recommends careful pre-processing. I will update on what works.

Setting trimq=20 took care of the poly G. And filterpolyg=5 also worked. Thanks!

I think I'll go with a simple trimq=20. When both options are set, filterpolyg=5 removes additional sequences, but I'm not sure that's beneficial. Maybe a string of five Gs can have a natural source?
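On the question of whether five Gs in a row can occur naturally, a rough back-of-the-envelope estimate suggests they can, purely by chance. The sketch below assumes independent positions with equal base frequencies (real metagenomes, especially GC-rich ones, will differ) and a read length of 150, which is my assumption and not stated in the thread. Exactly where filterpolyg looks for the run is worth checking in the BBDuk usage text for your version, so both cases are shown.

```python
def p_leading_run(k: int, p_base: float = 0.25) -> float:
    """Probability that a read starts with k copies of one given base."""
    return p_base ** k

def expected_runs_anywhere(read_len: int, k: int, p_base: float = 0.25) -> float:
    """Expected number of length-k windows in a read made only of that base."""
    windows = max(read_len - k + 1, 0)
    return windows * p_base ** k

print(p_leading_run(5))                # 0.0009765625 (about 1 read in 1024)
print(expected_runs_anywhere(150, 5))  # 0.142578125
```

Under these assumptions roughly 0.1% of genuine reads would begin with GGGGG, so a threshold of 5 is expected to discard some real sequence in a large data set.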
