Question: Trimming P5 and P7 dual index adapters (and other QC)
Asked 4 days ago by willnotburn (United States, Michigan State University):

I am trying to trim adapters from raw NovaSeq sequences. Here is the FastQC report for the raw reads:

Raw sequences checked by FastQC. Reverse only is shown.


Problems

I need to trim adapters. Does the rest of the report offer any clues as to what's wrong with the run? Is it overall bad?

Sequencing facility told me the following

Platform: TruSeq

Kit: Swift Accel-NGS 2s DNA library prep kit

Adapters

  • P5: 5' AATGATACGGCGACCACCGAGATCTACAC[i5index]ACACTCTTTCCCTACACGACGCTCTTCCGATCT
  • P7: 5' CAAGCAGAAGACGGCATACGAGAT[i7index]GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

    My understanding is the i5 and i7 indices are hexamers i.e. NNNNNN

I used AdapterRemoval with two approaches (neither worked well)

  • Used facility-supplied adapters inserting NNNNNN hexamer for the spacers

    AdapterRemoval --threads 40 --file1 sample_R1.fastq.gz --file2 sample_R2.fastq.gz --adapter1 AATGATACGGCGACCACCGAGATCTACACNNNNNNACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter2 CAAGCAGAAGACGGCATACGAGATNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

    After trimming using facility-supplied P5 and P7. Reverse only is shown.


  • Used identify_adapters in AdapterRemoval, which gave me P5 that partially matched what the facility said (above) but totally different P7 sequences. Using the automatically detected adapters, I also tried trimming as follows.

    AdapterRemoval --threads 40 --file1 sample_R1.fastq.gz --file2 sample_R2.fastq.gz --adapter1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG --adapter2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT


    After trimming using automatically detected adapters. Reverse only is shown.
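One thing worth checking here, offered as a sketch and not as a confirmed diagnosis: the facility lists the adapter oligos 5' to 3', whereas read-through into an adapter appears in a read as the reverse complement of the oligo at the opposite end of the fragment. The snippet below (plain Python; the helper name `reverse_complement` is introduced only for illustration) compares the two sets of sequences quoted above.

```python
# Complement table for DNA; N maps to N so the index placeholder survives.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (hypothetical helper)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Facility-supplied oligos, with NNNNNN standing in for the index.
FACILITY_P5 = "AATGATACGGCGACCACCGAGATCTACAC" + "NNNNNN" + "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
FACILITY_P7 = "CAAGCAGAAGACGGCATACGAGAT" + "NNNNNN" + "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

# Sequences used in the second AdapterRemoval command (automatically detected).
DETECTED_ADAPTER1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG"
DETECTED_ADAPTER2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

p7_rc = reverse_complement(FACILITY_P7)
p5_rc = reverse_complement(FACILITY_P5)

# The detected adapter1 is the facility P7 read in the opposite orientation.
print(p7_rc == DETECTED_ADAPTER1)
# The detected adapter2 starts with the reverse complement of the P5 segment
# that sits next to the insert (the 33 bases after the index).
print(DETECTED_ADAPTER2.startswith(p5_rc[:33]))
```

Both comparisons hold for the sequences in this post, so the "totally different" detected P7 appears to be the same adapter written in the orientation in which it occurs in the reads. If that reading is right, the first command supplied the adapters in an orientation that does not occur at the 3' ends of the reads, which may be why so little was trimmed.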

What am I doing wrong? What should I be doing?

Tags: sequencing, qc, trimming

Use the instructions here to add images: How to add images to a Biostars post

I suggest that you try bbduk.sh (GUIDE) for adapter removal. All the adapters share a common core sequence (before the index), so as long as you find that core and trim everything to its right, you will remove the adapter sequence.
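The "find the core and trim everything to its right" idea can be sketched in a few lines of Python. This is a toy with exact matching only, not how bbduk.sh works internally: real trimmers tolerate mismatches and partial adapter matches at the end of a read. The core used here, `AGATCGGAAGAGC`, is simply the shared first 13 bases of the two automatically detected adapters quoted in the question, and `trim_at_core` is a hypothetical name.

```python
# Shared start of both read-orientation adapters quoted in the question.
ADAPTER_CORE = "AGATCGGAAGAGC"


def trim_at_core(read: str, core: str = ADAPTER_CORE) -> str:
    """Cut the read at the first exact occurrence of the adapter core.

    Everything from the core to the 3' end is dropped; reads without the
    core are returned unchanged.
    """
    pos = read.find(core)
    return read if pos == -1 else read[:pos]


insert = "ACGTACGTAC"                      # made-up genomic insert
read = insert + ADAPTER_CORE + "ACACGTCTGAACTCC"  # insert followed by read-through
print(trim_at_core(read))
print(trim_at_core(insert))
```

Because the index lies to the right of the core, its actual sequence never needs to be known for this kind of trimming.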

written 4 days ago by genomax

Thanks, genomax! Are the images not showing up? I just clicked the add image button and linked these from my public Google Drive. I will read the instructions.

modified 4 days ago • written 4 days ago by willnotburn

I can only see the images of the report categories that are in the left-hand column of the FastQC report.

written 4 days ago by genomax

oh, that was the intention. I wanted to limit post size (i.e. prevent overwhelming).

written 4 days ago by willnotburn

Without seeing the actual plots we can't really help you. A red "X" in FastQC only means that a value falls outside a preset interval (the defaults are set for genomic sequencing). These "failures" have to be interpreted in the context of the type of data being analyzed.

written 4 days ago by genomax

Raw sequences before trimming. This analysis was run on a concatenated set of all reverse seqs, that's why there are lots of them in Basic Statistics.

[Images: FastQC plots for the raw reverse reads: tile scores, quality scores, sequence content, GC content, N content, length distribution, duplication levels, overrepresented sequences and adapter content]

Trimmed sequences using facility-supplied adapters inserting NNNNNN hexamer for the spacer (first approach in OP). This analysis was run on just one sample of reverse seqs, that's why there are fewer seqs in Basic Statistics.

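[Images: FastQC plots for the trimmed reverse reads: basic statistics, per-tile scores, quality scores, sequence content, GC content, N content, length distribution, duplication levels, overrepresented sequences and adapter content]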

written 4 days ago by willnotburn
Answer, 4 days ago, by genomax (United States):

It looks like not all of your adapter sequence has been trimmed. Please try using bbduk.sh, which I mentioned above. It is a Java-based program and should run on any OS. It comes with a file containing all commonly used adapter sequences (you will find the adapters.fa file in the resources folder). You would then run the program as

bbduk.sh -Xmx2g in1=R1.fastq.gz in2=R2.fastq.gz out1=trimmed_R1.fastq.gz out2=trimmed_R2.fastq.gz ktrim=r k=23 tbo tpe ref=adapters.fa

If you have multiple cores available add threads=N option to the command above.
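For running the same command over many samples, a small wrapper can help. The following is only a sketch: `bbduk_adapter_trim_cmd` is a hypothetical helper, it uses exactly the options shown in the command above plus the optional threads=N, and it assumes bbduk.sh is on the PATH and adapters.fa is reachable at the given path.

```python
import shlex


def bbduk_adapter_trim_cmd(r1, r2, out1, out2, adapters="adapters.fa", threads=None):
    """Build the bbduk.sh argument list from the answer above (hypothetical helper)."""
    cmd = [
        "bbduk.sh", "-Xmx2g",
        f"in1={r1}", f"in2={r2}",
        f"out1={out1}", f"out2={out2}",
        "ktrim=r", "k=23", "tbo", "tpe",
        f"ref={adapters}",
    ]
    # threads=N is only added when the caller asks for it.
    if threads is not None:
        cmd.append(f"threads={threads}")
    return cmd


cmd = bbduk_adapter_trim_cmd(
    "R1.fastq.gz", "R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
    threads=8,
)
print(shlex.join(cmd))
# To execute instead of printing: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems when sample names contain unusual characters.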

modified 4 days ago • written 4 days ago by genomax

Thanks, genomax! bbduk worked at removing adapters. I am still getting a weird GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG overrepresented sequence that occurs 61132 times! The stats for overrepresented sequences are the same as in the output shown in the picture in the above post (where I used AdapterRemoval)

In addition, I ran two extra steps for quality trimming (qtrim=r trimq=10) and phiX removal (ref=phix174_ill.ref.fa.gz k=31) as per the bbduk manual. Still no luck with removing the crazy poly-G, which is probably affecting the GC content.

Interestingly, this only happens on the reverse reads. The forward reads are fine.

modified 3 days ago • written 3 days ago by willnotburn

Poly-G runs come from clusters with no signal: on a 2-color sequencer the absence of signal is called as G, so this must be data from such an instrument. You can also remove the poly-Gs by using the option trimpolyg=0.

written 3 days ago by genomax

bbduk.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=R1_clean.fastq.gz out2=R2_clean.fastq.gz qtrim=r trimq=10 minlen=100 trimpolyg=0

This did not seem to remove the poly-G.

I could try literal= GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Would that be good practice?

written 3 days ago by willnotburn

My apologies. Should have said trimpolyg=15 or you can also try filterpolyg=5. Using literal=GGGGGGGG should work as well.

Unless you are going to do de novo work you can probably ignore the poly-G. Those reads should not align to anything.
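To make the idea of a length threshold concrete, here is a minimal Python sketch of trimming a poly-G tail only when it is at least a given length. It is not BBDuk's implementation: the function name `trim_poly_g_tail` is hypothetical, the default of 15 simply mirrors the value suggested above, and mismatches inside the run and base qualities are ignored.

```python
def trim_poly_g_tail(seq: str, min_run: int = 15) -> str:
    """Remove a run of G at the 3' end if it is at least min_run long."""
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    # Short runs of G are left alone, since they can be genuine sequence.
    return stripped if run_length >= min_run else seq


print(trim_poly_g_tail("ACGT" + "G" * 20))  # long no-signal tail is removed
print(trim_poly_g_tail("ACGTGGG"))          # short run is kept
print(repr(trim_poly_g_tail("G" * 50)))     # an all-G read becomes empty
```

A read that is trimmed to nothing, like the 50-G sequence reported above, would then be dropped by any minimum-length filter such as minlen.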

modified 3 days ago • written 3 days ago by genomax

Actually, this is for de novo assembly. For that, I'm using metaSPAdes, and it explicitly recommends careful pre-processing. I will update on what works.

written 3 days ago by willnotburn

Setting trimq=20 took care of the poly G. And filterpolyg=5 also worked. Thanks!
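A plausible reason, not verified in this thread, is that the no-signal G calls carry quality scores that sit between the two thresholds, so trimq=10 keeps them and trimq=20 removes them. The Python sketch below illustrates that with a deliberately simple rule (drop 3' bases while their Phred score is below the threshold). BBDuk's own quality trimming is more sophisticated, the function names are hypothetical, and the quality values are invented for illustration.

```python
def phred33(qual: str) -> list:
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(c) - 33 for c in qual]


def quality_trim_right(seq: str, qual: str, min_q: int = 20):
    """Drop bases from the 3' end while their quality is below min_q."""
    scores = phred33(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: 6 good bases (Q37, 'F') followed by 10 G calls at Q12 ('-').
seq = "ACGTAC" + "G" * 10
qual = "F" * 6 + "-" * 10

print(quality_trim_right(seq, qual, min_q=10))  # Q12 >= 10: tail survives
print(quality_trim_right(seq, qual, min_q=20))  # Q12 < 20: tail is trimmed
```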

written 3 days ago by willnotburn

I think I'll go with a simple trimq=20. When both are flagged, filterpolyg=5 removes additional sequences, but I'm not sure that's beneficial. Maybe a string of 5Gs can have a natural source?
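On whether five Gs in a row can arise naturally, a back-of-the-envelope calculation suggests they can. The Python sketch below assumes independent bases with a 25% chance of G, which is a simplification since real genomes are neither uniform nor independent. Which part of the read filterpolyg inspects is not established in this thread, so both a fixed-position run and a run anywhere in the read are computed. The 150 bp read length and the function name are assumptions.

```python
def prob_contains_g_run(read_len: int, run_len: int, p_g: float = 0.25) -> float:
    """Probability that a random read contains at least run_len consecutive Gs.

    state[j] is the probability that no full run has been seen so far and the
    read currently ends in exactly j consecutive Gs.
    """
    state = [1.0] + [0.0] * (run_len - 1)
    for _ in range(read_len):
        new_state = [0.0] * run_len
        new_state[0] = sum(state) * (1.0 - p_g)   # a non-G base resets the run
        for j in range(run_len - 1):
            new_state[j + 1] = state[j] * p_g     # a G base extends the run
        # Mass leaving state[run_len - 1] on a G has completed a run and is dropped.
        state = new_state
    return 1.0 - sum(state)


# Chance that one specific stretch of 5 bases (e.g. the first 5) is all G.
print(0.25 ** 5)
# Chance that a 150-base read contains GGGGG somewhere.
print(prob_contains_g_run(150, 5))
```

Under this model a run of five Gs is far too common by chance for a threshold of 5 to isolate no-signal artifacts, which supports preferring a larger threshold or the quality-based trimming.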

written 3 days ago by willnotburn