Question: Sudden quality drop in the middle of HiSeq R1 reads but not in R2?
2
gravatar for sentausa
3.7 years ago by
sentausa630
France
sentausa630 wrote:

Dear all,

I'm analyzing eukaryotic metatranscriptomics HiSeq (2x100 bp) data from soil (the cDNAs are from eukaryotic poly-A mRNAs). The FastQC report of the R1 reads shows that there is a sudden sequence quality drop around the 50th nucleotide, even some unknown bases (Ns), and high Kmer content (see here). However, this drop in quality and unknown bases are absent in the R2 reads (there).

I wonder if it is normal to have something like this, and if it is not, what should I do to improve the read quality?

 

ADD COMMENTlink modified 3.4 years ago by chen1.7k • written 3.7 years ago by sentausa630
4
gravatar for Devon Ryan
3.7 years ago by
Devon Ryan86k
Freiburg, Germany
Devon Ryan86k wrote:

This is likely due to a bubble going through as the image was being acquired. This actually used to be fairly common. Whether this will be an issue or not will depend a bit on exactly what you want to do with the data.

ADD COMMENTlink written 3.7 years ago by Devon Ryan86k
Thanks for answering. Basically we'd like to know which genes are active and from which organism. So, I'm thinking to map the reads to some eukaryotic genomes that we have. Do you think it may cause a problem?
ADD REPLYlink written 3.7 years ago by sentausa630

This should not be an issue for count based analysis.

ADD REPLYlink written 3.7 years ago by geek_y8.8k

And what about paired-read merging and finally assembly? Do you think it'll be problematic to merge the paired-reads and later to do assembly?

ADD REPLYlink written 3.7 years ago by sentausa630

Conceptually, if the overlap between two reads extends beyond the base where the quality is low, it would create a problem as the low quality base will be a mismatch. But you have to run it check how the results look.

ADD REPLYlink written 3.7 years ago by geek_y8.8k

But now I wonder if the paired-read merging is really necessary. Any idea?

ADD REPLYlink written 3.7 years ago by sentausa630

Further question: is the high Kmer content coming from the same technical problem? I wonder how the reads that are supposed to be coming from total mRNAs (which I supposed are quite different from one another) have similar sequences around their 50th nucleotide. Thanks again.

ADD REPLYlink written 3.7 years ago by sentausa630

Quite possibly, since it occurs in the same spot. I expect that the 7mers are just what the base caller called due to the bubble (or whatever) being there. You might subset the file to look at only the tiles without the artifact and see if the jump in 7-mers in the middle of reads then goes away. Another possibility is that this is from a highly conserved region and there's biased fragmentation.

ADD REPLYlink written 3.7 years ago by Devon Ryan86k
1

And how to do "subset the file to look at only the tiles without the artifact"?

ADD REPLYlink written 3.7 years ago by sentausa630
1

By tile name. This is one of those poorly documented things, unfortunately. Let's take a look at a typical read name:

@HWI-ST1140:171:C5TKCACXX:1:1101:1212:2202 1:N:0:CGATGT
CTCATTTTCCGTGATTTTCATTTTTCTCGCCATATTCCAGGTCCTTCAGT
+
=@@FFFFFGHHACFHEHHEEGGHHIGGI=FHEAADHGEIJGDDGGGAFG0

The HWI-ST1140 part is a machine ID, 171 is the run number, C5TKCACXX is a flow cell ID, 1 is the lane number and 1101 the tile on the lane. It's this tile number that we want to subset by. If you look at the fastQC output, you'll see that only tiles 2101 and up are affected. So something like

zgrep -E -A 3 ":1:1[0-9]{3}:" file.fastq.gz | gzip > file.subset.fastq.gz

should extract all reads not on those tiles. You can then run that through fastQC and see what you get. If some of the k-mers aberrations go away then you know the cause.

ADD REPLYlink written 3.7 years ago by Devon Ryan86k

Actually, I guess this isn't so poorly documented.

ADD REPLYlink written 3.7 years ago by Devon Ryan86k

Ah, OK. I'l try it. Never knew about this before. Thanks!

ADD REPLYlink written 3.7 years ago by sentausa630
1

If you want even more obscure knowledge (I'd do well at bioinformatics pub quiz), the individual numbers in 1101 or 2308 also have meaning. The first digit (1 or 2) denotes the side of the flow cell (the camera just uses depth of field to distinguish between the sides). The second (1-3 I think) denotes the swath, and the last two the tile number within that swath. Currently there are 16 tiles per swath, though you'll see fewer (8 I think) in older datasets.

ADD REPLYlink written 3.7 years ago by Devon Ryan86k
2

I wonder how you acquired such knowledge....

ADD REPLYlink written 3.7 years ago by sentausa630

I wonder that too sometimes!

ADD REPLYlink written 3.7 years ago by Devon Ryan86k

Seriously, how did you learn all of these? Did you follow a course somewhere? Did you get it from books? Journal articles?

ADD REPLYlink written 3.7 years ago by sentausa630
1

If you hang around forums like this one for a while you'll pick up a lot of arcane knowledge.
 

ADD REPLYlink written 3.7 years ago by Devon Ryan86k

This is interesting. How do we subset data that is not effected ? based on x,y coordinates of clusters on read name ? 

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by geek_y8.8k

I would just use the tile number, since XY coordinates are within tiles (I think, don't quote me on that).

ADD REPLYlink written 3.7 years ago by Devon Ryan86k
4
gravatar for chen
3.4 years ago by
chen1.7k
OpenGene
chen1.7k wrote:

A tool in github can remove these bubble effects

https://github.com/OpenGene/AfterQC

Automatic Filtering, Trimming, Error Removing and Quality Control for Fastq Data
Currently it supports Illumina 1.8 or newer format

AfterQC does following tasks automatically:

    Filters reads with too low quality, too short length or too many N
    Filters reads with abnormal PolyA/PolyT/PolyC/PolyG sequences
    Does per-base quality control and plots the figures
    Trims reads at front and tail, according to QC results
    For pair-end sequencing data, AfterQC automatically corrects low quality wrong bases in overlapped area of read1/read2
    Detects and eliminates bubble artifact caused by sequencer due to fluid dynamics issues
    Single molecule barcode sequencing support: if all reads have a single molecule barcode (see duplex sequencing), AfterQC shifts the barcodes from the reads to the fastq query names
    Support both single-end sequencing and pair-end sequencing data
ADD COMMENTlink modified 2.3 years ago • written 3.4 years ago by chen1.7k
1

Looks interesting, is there any documentation on how it works (just to save me going through the code)? In particular, information on how it does bubble detection would be nice.

ADD REPLYlink written 3.4 years ago by Devon Ryan86k

Emm.. no any document right now.

Just started writing a small paper, will share it to u when it is done :)

ADD REPLYlink written 3.4 years ago by chen1.7k

Thanks for sharing. I'm analyzing public metagenomics data and encountered sudden drop in quality in 50th and 76th bases in both read1 and read2. After reading this thread I thought it is a bubble effect. So, I tried to apply AfterQC to fix that. It improved quality for sure, but the effect is still present, additionally, it seems that AfterQC doesn't trim nextera transposase sequence (should it?).

I've never encountered bubble before, so maybe it is not bubble at all. Could be great if you can advice smth.

I've analyzed raw fastq with fastqc and combined the results with multiqc, then applied afterqc, again analyzed with fastqc and combined with multiqc (to compare with previous). Check the screenshots.

  1. multiqc_afterqc_adapters
  2. multiqc_afterqc_quality
  3. multiqc_raw_quality

multiqc_afterqc_adapters multiqc_afterqc_quality multiqc_raw_quality

ADD REPLYlink modified 19 months ago • written 19 months ago by aln170

Hi @aln. I am seeing a similar pattern of sudden drops in sequence quality near the middle of my reads for my hiseq 4000 data. Have you had any luck yet with figuring this out? The only difference for my data is a single "blip" / decrease near the center of reads, rather than two as you have above.

Thanks.

ADD REPLYlink modified 19 months ago • written 19 months ago by davis.todt0

In my case it wasn't a bubble, but it could be in your case, so check with AfterQC. If you also have strange adapters content, which starts very early, please, read this issue - https://github.com/OpenGene/AfterQC/issues/16. It may have smth to do with this - https://www.ncbi.nlm.nih.gov/pubmed/24523726.

In the end I processed the data with AfterQC and additinally trimmed adapters with trim_galore (--nextera flag), while filtering for short read. I've got affordable quality, though the gaps are still present, but with quality score > 20.

ADD REPLYlink written 19 months ago by aln170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1339 users visited in the last hour