Question

Quality Control For Illumina Transcriptome Reads

2

14.1 years ago

Plantae ▴ 390

We have sequenced the transcriptome of one species using an Illumina GA II; reads are 50 bp on average. Using a sliding-window method, reads are trimmed according to their base qualities: e.g., with a 4-base sliding window, if the average quality value for the window is lower than 20, that stretch of sequence is trimmed out of the original read.

This method generates trimmed reads of varying length, from 1 bp to 50 bp.

My question is: should I filter out reads that are too short, for example, exclude all reads shorter than 20 bp from further analysis (mapping to a reference genome)?

If so, which length cutoff should be used?
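For readers unfamiliar with the approach, here is a minimal Python sketch of one common sliding-window variant. It is an assumption about how such a trimmer behaves, not the exact tool used here: it scans from the 5' end and cuts the read at the start of the first window whose mean quality falls below the threshold. The function name `sliding_window_trim` and its parameters are hypothetical, and qualities are assumed to be already decoded to integer Phred scores.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=20):
    """Cut a read at the first window whose mean quality is below min_mean_q.

    seq   : read sequence as a string
    quals : list of integer Phred scores, one per base (already decoded)
    Returns the kept (sequence, qualities) prefix.
    Reads shorter than the window are returned unchanged (a design choice).
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean_q:
            # Drop this window and everything 3' of it.
            return seq[:start], quals[:start]
    return seq, quals


# Hypothetical 10-base read whose quality collapses at the 3' end.
trimmed_seq, trimmed_quals = sliding_window_trim(
    "ACGTACGTAC", [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
)
print(trimmed_seq, len(trimmed_seq))
```

Depending on where the quality drops, the surviving prefix can be anywhere from 0 bases up to the full read length, which is why a separate length filter comes up.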

illumina trimming length • 5.1k views

updated 14.1 years ago by Darked89 4.7k • written 14.1 years ago by Plantae ▴ 390

score 2 · Answer 1 · 2011-05-22

14.1 years ago

Bioquant ▴ 160

You could first run FastQC on your data. This will give you a boxplot of quality scores for each base position. From this you can get a rough estimate of the read length over which quality stays good. In my experience, reads should be at least 25 bp long to get reasonably good results.

0

Yes, I used FastQC to view the data, but after trimming, all reads seem to be good (low-quality bases have been trimmed out, leaving only high-quality bases). The problem is that we should filter out reads that are too short, but I have no idea how to set this length cutoff.

14.1 years ago by Plantae ▴ 390

score 0 · Answer 2 · 2011-05-22

Maybe you should consider using prinseq (http://sourceforge.net/projects/prinseq/files/standalone/). You can filter out reads with low quality, filter out reads shorter than a chosen number of bp, and trim poly-A tails from the 5' and 3' ends. If you do not want to filter on quality (since you have already done that), you can use prinseq just to filter out reads that are too short. I'd recommend using reads at least 25 bp long for downstream analysis.
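If you just want to see what a minimum-length filter does, a rough Python sketch is below. This is not prinseq's implementation; the helper names `read_fastq` and `filter_by_length` are made up for illustration, the FASTQ is assumed to have exactly four lines per record, and the 25 bp default only mirrors the rule of thumb above.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    handle = iter(handle)  # works for open files and for lists of lines
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = record
        yield header, seq, qual


def filter_by_length(records, min_len=25):
    """Keep only records whose sequence has at least min_len bases."""
    for header, seq, qual in records:
        if len(seq) >= min_len:
            yield header, seq, qual


# Usage with a hypothetical file name:
# with open("trimmed_reads.fastq") as fh:
#     for header, seq, qual in filter_by_length(read_fastq(fh), min_len=25):
#         print(header, seq, "+", qual, sep="\n")
```

Running it with a few different `min_len` values and counting the survivors is a cheap way to see how much data each cutoff would cost you.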

Ilia

score 0 · Answer 3 · 2011-05-23

You don't necessarily need a hard cutoff for length. But (of course) as your reads get shorter, they are more likely to map to multiple places in the genome, which will likely slow down alignment. Instead, you could keep only the hits that map uniquely to a single location and discard the rest after alignment.
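As a rough illustration of post-alignment filtering, here is a Python sketch that works on SAM-format lines. It assumes your aligner writes SAM and that a mapping quality (MAPQ) threshold is an acceptable proxy for "uniquely mapped"; how MAPQ relates to uniqueness differs between aligners, so the `min_mapq=20` default and the function name `keep_confident_hits` are assumptions, not a standard.

```python
def keep_confident_hits(sam_lines, min_mapq=20):
    """Yield (read_name, reference, position, mapq) for confidently mapped reads.

    Skips header lines, unmapped reads (FLAG bit 0x4), and alignments whose
    MAPQ is below min_mapq.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped
        if mapq >= min_mapq:
            yield fields[0], fields[2], int(fields[3]), mapq


# Usage with a hypothetical file name:
# with open("aligned.sam") as fh:
#     for hit in keep_confident_hits(fh):
#         print(hit)
```

Check your aligner's documentation for how it reports multi-mapping reads before settling on a threshold.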

score 0 · Answer 4 · 2011-05-23

Illumina qualities are not reliable: I have had 96 bp RNA-Seq reads with 96 Bs as the quality string that still mapped to the genome with few mismatches. It depends on the application, but since more uniquely mapped reads are usually better for RNA-Seq mapping, I would not spend too much time trying to get a "perfect quality" set. Map them first with something that can trim the unmapped ends, check whether the mappings are unique, and go from there.

I guess that here and there you may see some reads mapping to a "wrong" paralogue (i.e. a wrong base matching the non-expressed gene better), but if you are doing this primarily for de novo gene annotation, it is still OK.