Question: Please help me with adapter-trimming

I received FASTQ files from our sequencing core. They said the data had been demultiplexed, but when I ran FastQC I could still see adapter content (figure attached).

[Figure: FastQC adapter content plot]

The FASTQ files have names like this:

_TAGTCTTG_S7_L001_R1_001.fastq.gz
_TAGTCTTG_S7_L001_R2_001.fastq.gz

I also received some files whose contents I am not sure about (I guess they are the index reads):

TAGTCTTG_S7_L001_I1_001.fastq.gz
TAGTCTTG_S7_L001_I2_001.fastq.gz

zcat TAGTCTTG_S7_L001_I1_001.fastq.gz | head

@someinfo:1:1101:15235:1340 1:N:0:TAGTCTTGAT+TCTTTCCC
TAGTCTTGAT
+
CCDDDFFFFF
@someinfo:1:1101:15815:1395 1:N:0:TAGTCTTGAT+TCTTTCCC
TAGTCTTGAT
+
CCCCCFFFFF
@someinfo:1:1101:15719:1398 1:N:0:TAGTCTTGAT+TCTTTCCC
TAGTCTTGAT

When I look into the actual FASTQ file, I am not sure whether it contains both the index and the adapter (the core said they had demultiplexed it):

zcat _TAGTCTTG_S7_L001_R1_001.fastq.gz | head

@someinfo:1:1101:15235:1340 1:N:0:TAGTCTTGAT+TCTTTCCC
TGGGGCCTTAGTAAATGTGCCTGTGTGTGGGTCTCGGTCCAACACAGTTGATGTACATCTGTTTACCTGTTATAGTTGCAAGTTGTTCAGGCTGACATTGCTGTCGTTCACCCGACAAACACTGACTTCTACACCGGTGGTGAAGTAGGTAATGCGAGCTGGGTGCTGCCGAGTGTGTGTGTGCATGCTCAGCCGGCCGCGCAGACAGCTTGATCCTCTGACAGCTACGCAGATCGGAAGAGCACACGTC
+
DDCDDDCDFFFFGGGGGGGGGGHHHHHHHGGGHHHHGGGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHGGGHHHHHGGGGGHHHHHHHHHHHHHHHHHGGFGGGGHHHHHHHHHHHHHGGGGGHHGHGGHHHHGGGGHHHGHHGHHHHGHHHHHHHHGGGGGGAGGGGGGGGGGGFFFFFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFE
@someinfo:1:1101:15815:1395 1:N:0:TAGTCTTGAT+TCTTTCCC
TGGGGCCTTAGTAAATGTGCCTGTGTGTGGGTCTCGGTCCAACACAGTTGATGTACATCTGTTTACCTGTTATAGTTGCAAGTTGTTCAGGCTGACATTGCCTCGACAGTGATGCTGTCGTTCACCCGACAAACACTGACTTCTACACCGGTGGTGAAGTAGGTAATGCGAGCTGGGTGCTGCCGAGTGTGTGTGTGCATGCTCAGCCGGCCGCGCAGACAGCTTGATCCTCTGACAGCTACGCAGATCG
+
CCCCCCCCFFFFGGGGGGGGGGHHHHHHHGGGGHHHGGGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHGGGHHHHHGGGGGHHHHHGGHHHHHGHHHHGGGGGGGFFGFHHGHHHHGHGGGGGHGGFEGHHHHG-CCGHHGHHHHHHHHGHGHHHGGGGGGGGGGFFFFFFFFFFFFFFFFEFFFFFFFFFFF?DFFFF
@someinfo:1:1101:15719:1398 1:N:0:TAGTCTTGAT+TCTTTCCC
TGGGGCCTTAGTAAATGTGCCTGTGTGTGGGTCTCGGTCCAACACAGTTGATGTACATCTGTTTACCTGTTATAGTTGCAAGTTGTTCAGGCTGACATTGCCTCGATCGACAGTGATGCTGTCGTTCACCCGACAAACACTGACTTCTACACCGGTGGTGAAGTAGGTAATGCGAGCTGGGTGCTGCCGAGTGTGTGTATGCATGCTCAGCCGGCCGCGCAGACAGCTTGATCCTCTGACAGCTACGCAG

I did know about this and went ahead and aligned the reads. Here is a snapshot of how the alignments look in IGV (4 samples, paired-end on a MiSeq, 2x250), sorted by base, with "show soft clips" enabled in the preferences (as suggested by someone from the core).

[Figure: IGV snapshot]

How can I remove them without losing any information from the actual reads?

modified 9 weeks ago by Brian Bushnell • written 9 weeks ago by novicebioinforesearcher

Clearly, your DNA library prep was not optimal. I am not sure what's going on in your IGV images, but it's very obvious from your first (% adapter) graph that the insert size was too short compared to read length.

Your IGV images look like amplicon data. Can you describe this in more detail? Did you authorize the sequencing center to PCR-amplify your DNA sample? There's no way such a high proportion of reads would have the exact same start site without amplification. Considering that none of the reads you posted agree with the reference, it looks bad. How did you align the reads?

Also, the specific reference would be helpful here... and, what you are trying to do is also always useful information.

I encourage you to post an insert-size histogram and detail the platform and read length used. I'm guessing you ran 2x250bp on a MiSeq, but it's not really possible to tell from what you posted.

Also:

bbmap.sh ref=ref.fasta in1=r1.fq in2=r2.fq mhist=mhist.txt qhist=qhist.txt qahist=qahist.txt ihist=ihist.txt bhist=bhist.txt covhist=covhist.txt lhist=lhist.txt

Posting those results would be useful, along with the screen output.

modified 9 weeks ago • written 9 weeks ago by Brian Bushnell

Apologies for the incomplete information. Yes, these were PCR amplicons that were sequenced. I aligned the reads using bwa mem. We were trying to induce a deletion and check whether it worked by sequencing exon 6 of a particular gene.

written 9 weeks ago by novicebioinforesearcher

Oh... if you're looking for a somewhat long deletion, I suggest you try aligning with BBMap; it's very good at capturing those within the alignment of a read.

written 9 weeks ago by Brian Bushnell

Sure, some additional info about the experiment: we are attempting to detect indels in a panel of clones resulting from a CRISPR-targeted deletion. Regions around the target were PCR-amplified to produce a roughly 150 bp amplicon, which was then sequenced as a PE250 run.
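
As a rough back-of-the-envelope check of why so much adapter shows up, here is a minimal sketch. It assumes the sequenced insert is simply the ~150 bp amplicon and ignores any primer or linker bases; the function name is made up for illustration.

```python
def adapter_bases_per_read(read_length, insert_length):
    """Number of bases a read runs past the end of the insert.

    When the read is longer than the insert, the sequencer reads
    through the insert and into the adapter on the far side.
    """
    return max(0, read_length - insert_length)


# PE250 run on a roughly 150 bp amplicon (values taken from the post above)
read_through = adapter_bases_per_read(read_length=250, insert_length=150)
print(read_through)
```

With these numbers each 250 bp read would be expected to contain on the order of 100 bases of adapter (and whatever follows it), which is consistent with FastQC flagging adapter content in a large share of reads.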

written 9 weeks ago by novicebioinforesearcher

You can detect the adapter sequences and trim them like this:

bbmerge.sh in1=r1.fastq.gz in2=r2.fastq.gz outa=adapters.fa
bbduk.sh in1=r1.fastq.gz in2=r2.fastq.gz out=trimmed.fq.gz ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe

Then map the trimmed (interleaved) reads and you'll get better results.
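
To make the idea of right-end adapter trimming concrete, here is a simplified sketch in Python. It is not how BBDuk works internally (BBDuk matches k-mers and tolerates mismatches); it only does an exact match of the adapter, or of a truncated adapter prefix at the 3' end of the read. The adapter string is the standard Illumina TruSeq read-1 adapter; the function name and the `min_overlap` default are assumptions chosen for illustration.

```python
# Start of the Illumina TruSeq adapter as seen at the 3' end of read 1
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def trim_adapter_3prime(read, adapter=ADAPTER, min_overlap=5):
    """Cut the read at the first position where the adapter begins.

    Handles both a full adapter inside the read and a truncated adapter
    running off the end of the read (at least min_overlap bases long).
    Exact matching only, so sequencing errors in the adapter are missed.
    """
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:]
        n = min(len(tail), len(adapter))
        if tail[:n] == adapter[:n]:
            return read[:start]
    return read


# Tails of two R1 reads from the question: one ends with 20 adapter bases,
# the other with only the first 6.
full = trim_adapter_3prime("CCTCTGACAGCTACGC" + "AGATCGGAAGAGCACACGTC")
partial = trim_adapter_3prime("GACAGCTACGC" + "AGATCG")
print(full, partial)
```

Everything to the right of the match is discarded, which is the same effect `ktrim=r` asks for. A real trimmer should still be used on the data, because short exact matches like the 6-base case can also occur by chance in genuine sequence.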

written 9 weeks ago by Brian Bushnell

lshepard wrote:

Hi, I would recommend using a program such as trimmomatic to remove your adapter sequences. It handles paired-end reads quite well.

written 9 weeks ago by lshepard

Thanks, I will look into it. Where can I find the adapter sequences that have been highlighted in FastQC? And what about the second set of files, which I am guessing are the index reads?

written 9 weeks ago by novicebioinforesearcher

trim_galore is a wrapper around Cutadapt and FastQC that will automatically detect and remove common (including Illumina) adapter sequences: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

written 9 weeks ago by fanli.gcb

I certainly recommend removing adapters in all cases, but when >30% of reads have adapter sequence, that indicates a major problem in sequencing. I'd reject the data and have it sequenced correctly.
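
For a quick independent estimate of how many reads carry adapter, something like the following sketch could be used. It assumes a plain 4-line-per-record FASTQ and counts reads containing the 13-base prefix common to Illumina TruSeq adapters; the function name and the tiny in-memory example are made up for illustration, and the number will not exactly match the FastQC plot, which is computed differently.

```python
import io

# First 13 bases shared by the Illumina TruSeq adapters
ADAPTER_SEED = "AGATCGGAAGAGC"


def adapter_fraction(fastq_lines, seed=ADAPTER_SEED):
    """Fraction of reads whose sequence contains the adapter seed."""
    total = hits = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each 4-line record is the sequence
            total += 1
            if seed in line:
                hits += 1
    return hits / total if total else 0.0


# Three toy records; only the first contains the seed.
example = io.StringIO(
    "@r1\nACGTTGCAAGATCGGAAGAGCACACG\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFF\n"
    "@r2\nACGTTGCAACGTTGCAACGTTGCAAC\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFF\n"
    "@r3\nTTGACCATTGACCATTGACCATTGAC\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFF\n"
)
fraction = adapter_fraction(example)
print(fraction)
```

For a real file, the same function can be fed an open handle, for example `gzip.open("reads_R1.fastq.gz", "rt")` (the file name here is a placeholder).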

written 9 weeks ago by Brian Bushnell

genomax wrote:

You will not easily spot adapter sequences by eye in the actual reads; you need a scan/trim program to look for them. I recommend bbduk.sh from the BBMap suite, which ships with a comprehensive set of commonly used commercial adapter sequences (the adapters.fa file in the resources directory of the BBMap distribution).

TAGTCTTGAT+TCTTTCCC are the index/tag read sequences. IndexRead1+IndexRead2 is how they are represented in the fastq read headers. You also have separate files with the index read sequences (I1 and I2 files).
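
To illustrate how the header encodes this, here is a small sketch that splits one of the headers from the question into its parts. It assumes the usual Illumina layout of the comment field (read number, filter flag, control number, index sequence, separated by colons); the function name and the dictionary keys are made up for illustration.

```python
def parse_illumina_header(header):
    """Split an Illumina FASTQ header into read name and index sequences."""
    name, comment = header.lstrip("@").split(" ", 1)
    # Comment field layout: <read>:<is filtered>:<control number>:<index>
    read_number, is_filtered, control_number, index = comment.split(":", 3)
    # Dual indexes are written as IndexRead1+IndexRead2
    index1, _, index2 = index.partition("+")
    return {
        "name": name,
        "read": int(read_number),
        "filtered": is_filtered == "Y",
        "index1": index1,
        "index2": index2,
    }


info = parse_illumina_header(
    "@someinfo:1:1101:15235:1340 1:N:0:TAGTCTTGAT+TCTTTCCC"
)
print(info["index1"], info["index2"])
```

The first index recovered this way is the same sequence that appears as the read in the I1 file shown in the question, so the index bases live in the header and the I1/I2 files rather than in the R1/R2 reads themselves.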

modified 9 weeks ago • written 9 weeks ago by genomax