Quality Control For Illumina Transcriptome Reads
11.0 years ago
Plantae ▴ 390

We have sequenced the transcriptome of one species using an Illumina GA II; reads are 50 bp on average. Using a sliding-window method, reads are trimmed according to their base qualities: e.g., with a 4-base sliding window, if the average quality value for a window is lower than 20, that stretch of sequence is trimmed out of the original read.

This method generates trimmed reads of varying length, from 1 bp to 50 bp.
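For readers who want to see the idea concretely, here is a minimal sketch of one common variant of sliding-window trimming. It assumes the read is cut at the first window whose mean quality drops below the threshold and that everything from that point to the 3' end is discarded; the actual tool used here may behave differently. The function name and parameters are made up for illustration.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=20):
    """Cut a read at the first window whose mean quality is below min_mean_q.

    seq   -- read sequence as a string
    quals -- list of integer Phred scores, one per base
    Reads shorter than the window are returned unchanged (an assumption).
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean_q:
            # Keep only the bases before the first failing window.
            return seq[:start], quals[:start]
    return seq, quals


# Example: quality drops sharply in the last four bases.
trimmed_seq, trimmed_quals = sliding_window_trim(
    "ACGTACGTAC", [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
)
```

Note that the cut position depends on where the window mean first falls below 20, so a few low-quality bases can survive at the 3' end of the trimmed read.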

My question is: should I filter out reads that are too short, for example by excluding all reads shorter than 20 bp from further analysis (mapping to a reference genome)?

If so, which length cutoff should be used?

illumina trimming length • 3.9k views
11.0 years ago
Bioquant ▴ 160

You could first run FastQC on your data. This will give you a boxplot of quality scores for each base position. From this you can get a rough estimate of the read length over which the quality scores remain good. In my experience, reads should be at least 25 bp long to get reasonably good results.


Yes, I used FastQC to view the data, but after trimming all reads seem to be good (low-quality bases have been trimmed out, leaving only high-quality bases). The problem is that we should filter out reads that are too short, but I have no idea how to set this length cutoff.

11.0 years ago
Zhidkov ▴ 580

Maybe you should consider using prinseq (http://sourceforge.net/projects/prinseq/files/standalone/). You can filter out reads with low quality, filter out reads that are shorter than 'integer' bp, and trim poly-A tails from the 5' and 3' ends. If you do not want to filter on quality (since you have done that already), you can use prinseq just to filter out short reads. I'd recommend using reads at least 25 bp long for downstream analysis.
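If you just want to see what a length filter does before reaching for a tool, here is a small Python sketch. This is not prinseq's implementation: the function names are hypothetical, it assumes plain four-line FASTQ records, and the 25 bp default is simply the cutoff suggested in this thread.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def filter_by_length(records, min_len=25):
    """Keep only records whose sequence is at least min_len bases long."""
    for header, seq, qual in records:
        if len(seq) >= min_len:
            yield header, seq, qual


# Toy input: one 32 bp read and one 12 bp read.
EXAMPLE_FASTQ = (
    "@r1\n" + "ACGT" * 8 + "\n+\n" + "I" * 32 + "\n"
    "@r2\n" + "ACGT" * 3 + "\n+\n" + "I" * 12 + "\n"
)
kept = list(filter_by_length(read_fastq(io.StringIO(EXAMPLE_FASTQ)), min_len=25))
```

With real data you would pass an open file handle instead of the `io.StringIO` object and write the kept records back out in FASTQ format.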

Ilia

11.0 years ago
brentp 24k

You don't necessarily need to do a hard cutoff for length. But (of course) as your reads get shorter, they are more likely to map to multiple places in the genome. This will likely slow down alignment. But, you could keep only hits that map uniquely to a single location and discard the rest following alignment.
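A rough sketch of that post-alignment filter, working directly on SAM text lines, might look like the following. It assumes that a mapping quality (MAPQ) threshold is an acceptable proxy for "maps uniquely"; how MAPQ relates to uniqueness depends on the aligner, and the threshold of 20 and the function name are only illustrative.

```python
def keep_confident_hits(sam_lines, min_mapq=20):
    """Yield header lines plus primary alignments with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # pass SAM header lines through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: mapping quality
        if flag & 0x4:
            continue  # read is unmapped
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary alignment
        if mapq < min_mapq:
            continue  # ambiguous placement under this threshold
        yield line


# Toy SAM records: a confident hit, a MAPQ 0 hit, and an unmapped read.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\n",
    "r1\t0\tchr1\t100\t60\t25M\t*\t0\t0\t" + "A" * 25 + "\t" + "I" * 25 + "\n",
    "r2\t0\tchr1\t500\t0\t25M\t*\t0\t0\t" + "C" * 25 + "\t" + "I" * 25 + "\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t" + "G" * 25 + "\t" + "I" * 25 + "\n",
]
confident = list(keep_confident_hits(EXAMPLE_SAM))
```

Comparing how many short reads survive this filter against how many long reads do is one way to judge whether a length cutoff is needed at all.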

11.0 years ago
Darked89 4.2k

Illumina qualities are not reliable: I got 96 bp RNA-Seq reads with 96 Bs as the quality string that still mapped to the genome with few mismatches. It depends on the application, but since more uniquely mapped reads are usually better for RNA-Seq mapping, I would not spend too much time trying to get a "perfect quality" set. Map them first with something that can trim the unmapped ends, check whether the mappings are unique, and go from there.
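To put the "96 Bs" in context, here is a tiny sketch of how a quality string decodes under the two common ASCII offsets. The post does not say which encoding those reads used, so Phred+64 is an assumption here. Under Phred+64, 'B' decodes to Q2, which the Illumina 1.5 pipeline used as a flag for a low-quality read segment rather than as a calibrated error probability.

```python
def decode_quals(qual_string, offset):
    """Convert an ASCII-encoded FASTQ quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in qual_string]


all_b = "B" * 96
as_phred64 = decode_quals(all_b, offset=64)  # older Illumina encoding (assumed)
as_phred33 = decode_quals(all_b, offset=33)  # Sanger / current Illumina encoding

print(set(as_phred64))  # {2}
print(set(as_phred33))  # {33}
```

So a read that is all 'B' would be removed entirely by a mean-quality-20 window if the file is Phred+64, even though the bases themselves may align well.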

I guess that here and there you may see some reads mapping to a "wrong" paralogue (i.e., a wrong base call makes the read match the non-expressed gene better), but if you are doing this primarily for de novo gene annotation it is still OK.
