Question

Trimming reads based on quality (Phred) scores

0

5.6 years ago

c.clarido ▴ 110

Here I have a few reads with their corresponding quality scores. As I understand it, I translated the quality characters from this FASTQ file with ord() in Python and converted them to 0 and 1: if ord(char) <= 53 the base gets a 1, otherwise a 0. With this method I got all 0's, so does that mean that none of the reads require trimming? This is just test data, however. I have a much bigger FASTQ file; what if that file gives me something like 111111001101111000000000000...? Is there a rule or condition I should follow to decide when to trim the ends of a read?

(PS: It's a school project in which we need to understand how trimming works before using an existing tool.)

@HWI-EAS384_0000:2:1:1444:905#0/1
NTGTAAAGTTCGATGAGTATTTGCTTTATGGGAGAAATATCCAGCGTTTAGAAAATGTAATTTCAAGGTTACAAC
+HWI-EAS384_0000:2:1:1444:905#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:1629:903#0/1
NCAACACTTTCTGAATATGCCTTCAAAACGTGTATCATGTTGATAAATGCAATATTCCATTTCCCAACAGTGACT
+HWI-EAS384_0000:2:1:1629:903#0/1
BGGKOIJIKJ[YY[Y__________BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:1838:908#0/1
NATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACAAG
+HWI-EAS384_0000:2:1:1838:908#0/1
BKKQKNQNNLWWXWWYYYYYYYYYYXXXXX[[[[[VVVNVTTWRRYYYYY_____BBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:2067:910#0/1
NGAAATTTACAAAGAAGAACACGTAATATATTCATAAACGGGGAATTTTCATCAATGGAGACAAAAAATGTCGAC
+HWI-EAS384_0000:2:1:2067:910#0/1
BIIEENNJJN____YIJLKOQQTTNQWNTN_____YYY[Y____W[[Y[[___W_BBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:2279:904#0/1
NAATCGTTCTGTTAAATCAATATTCATAAAAGGCACAAATTCATTATCGTTAATTTTTGAACTATGAAGTAATAC
+HWI-EAS384_0000:2:1:2279:904#0/1
BJJNNWWTQT_____WWWWRVTWVWY[YTYOOVVVQQNNQ_____NOROOLIJJQ____Y___W_YWYYYVPVTT
@HWI-EAS384_0000:2:1:2329:907#0/1
NCAGACAGTTCCTTATTTCTGTTCGACTGACTGAAAATTGACTTTTCTACTAGATTTTTCTAATACTTAACTTTG
+HWI-EAS384_0000:2:1:2329:907#0/1
BKHOGJINQLYYYYYYYQQY_____TVVVVXXXRVIJNLK_____YYQQYTPTMT[Y[[[QQ______Y______
@HWI-EAS384_0000:2:1:2464:909#0/1
NTTTAGCCTGGCCCATGGTTCCCAAAAAGCAATACAAAGCTTGGGTCAACTCCAGCCCAGGGTGACCAGAACCCC
+HWI-EAS384_0000:2:1:2464:909#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-EAS384_0000:2:1:2603:919#0/1
NTCGTTGCACCATTGCTTTTTGAAAAAGAATGAGTCGACTTTACGAGTTCAATTTAAAGCACAAATTTTTGCACA
+HWI-EAS384_0000:2:1:2603:919#0/1
BRRRRVVWTV_V_____________WVWQQQ________Y_____PVVVWIKQKJXRVXX___V_[[[[[_____
@HWI-EAS384_0000:2:1:2755:912#0/1
NCGAGGGGAAAGGATAAGAAACTTGATCTCACGCCGGAGAAAATAGCAGCCCAGGCTTTTGTCATCTATTTCGGT
+HWI-EAS384_0000:2:1:2755:912#0/1
BQLLNROMJP_____YY[[[QQ___BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
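The thresholding described in the question can be sketched roughly as follows. This is only an illustration: `quality_mask` and its `threshold` parameter are hypothetical names, and the example string is a shortened, made-up quality line in the style of the reads above. Every quality character in these reads has an ASCII code of at least 66 ('B'), so a cutoff of 53 flags nothing. A cutoff of 53 ('5') corresponds to Q20 only under Phred+33 encoding; assuming these reads are Phred+64 (a guess based on the character range), Q20 would be 'T', ASCII code 84.

```python
def quality_mask(quality: str, threshold: int = 53) -> str:
    """Return one flag per base: '1' if ord(char) <= threshold (low quality), else '0'."""
    return "".join("1" if ord(ch) <= threshold else "0" for ch in quality)


# Made-up quality string modelled on the reads in the question.
qual = "BGGKOIJIKJ[YY[Y" + "_" * 10 + "B" * 5

print(min(ord(ch) for ch in qual))   # lowest ASCII code in the string ('B')
print(quality_mask(qual))            # cutoff 53: nothing is flagged
print(quality_mask(qual, 84))        # cutoff 84 (Q20 if the data is Phred+64)
```

With the cutoff at 84 the same string is no longer all zeros, which suggests the all-0 result comes from the choice of cutoff rather than from the reads being uniformly good.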

reads trimming phredscore qualityscore • 2.8k views

ADD COMMENT • link 5.6 years ago by c.clarido ▴ 110

1

One approach you can use is a sliding window: move a window of N nucleotides (say 5) along the read, and as soon as the average quality inside the window drops below a cutoff M, trim the read from that point on. This avoids 'internal' trimming when just a single base has a lower quality. Trimmomatic also offers this as an option.
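A minimal sketch of the sliding-window idea, for learning purposes only. It is a simplification and not Trimmomatic's actual implementation; the function name `sliding_window_trim`, the default cutoff of 20, and the Phred+64 offset are all assumptions made for illustration, and the example reads are made up.

```python
from typing import Tuple


def sliding_window_trim(seq: str, quality: str, window: int = 5,
                        cutoff: float = 20, offset: int = 64) -> Tuple[str, str]:
    """Cut the read at the start of the first window whose mean quality < cutoff.

    Reads shorter than the window are returned unchanged.
    """
    # Decode each quality character to a numeric score (offset is assumed).
    scores = [ord(ch) - offset for ch in quality]
    for start in range(len(scores) - window + 1):
        mean_quality = sum(scores[start:start + window]) / window
        if mean_quality < cutoff:
            return seq[:start], quality[:start]
    return seq, quality


# Made-up read: ten good bases ('h' = 40 under Phred+64), then five poor ones ('B' = 2).
print(sliding_window_trim("ACGTACGTACGTACG", "h" * 10 + "B" * 5))

# A single poor base inside a good read does not trigger trimming.
print(sliding_window_trim("ACGTACGTACG", "hhhhhBhhhhh"))
```

In the first example the read is cut where the window average first falls below the cutoff; in the second, the one low-quality base is outvoted by its neighbours and the read is kept whole.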

ADD REPLY • link 5.6 years ago by WouterDeCoster 47k

0

Why not use an existing tool on your original FASTQ files?

ADD REPLY • link 5.6 years ago by lieven.sterck 15k

0

It's a school project in which we need to understand how trimming works before using an existing tool.

ADD REPLY • link 5.6 years ago by c.clarido ▴ 110

0

I don't see the point of switching from Phred quality to your 0 or 1 quality score. If you want to trim your sequences, you can use a dedicated tool such as fastp.

ADD REPLY • link 5.6 years ago by Bastien Hervé 5.3k

0

It's a school project in which we need to understand how trimming works before using an existing tool.

ADD REPLY • link 5.6 years ago by c.clarido ▴ 110

0

Quality scores can be encoded in a few different ways, depending on how old the data is. Are you using a simple rule where, as soon as you encounter a flagged (low-quality) base, you trim the rest of the read to the end, or are you going to use something more sophisticated like a sliding-window average?
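Both points can be sketched in a few lines. The two common encodings are Phred+33 and the older Phred+64, and the helper below guesses between them from the lowest character seen; it is a rough heuristic (a small sample of uniformly high-quality Phred+33 reads would be misclassified), not a definitive detector. `guess_offset` and `trim_at_first_low` are hypothetical names, the cutoff of 20 is an assumption, and the example strings are made up.

```python
from typing import Iterable, Tuple


def guess_offset(quality_strings: Iterable[str]) -> int:
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen.

    Heuristic: characters below ';' (ASCII 59) do not occur in the +64 encodings.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


def trim_at_first_low(seq: str, quality: str, cutoff: int = 20,
                      offset: int = 64) -> Tuple[str, str]:
    """Simple rule: cut the read at the first base whose score is below cutoff."""
    for i, ch in enumerate(quality):
        if ord(ch) - offset < cutoff:
            return seq[:i], quality[:i]
    return seq, quality


# Characters in the range 'B'..'_' suggest an offset of 64.
print(guess_offset(["BGGKOIJIKJ[YY[Y_____BBBBB"]))

# A '#' (ASCII 35) can only come from an offset of 33.
print(guess_offset(["IIII5555#"]))

# One poor base ('B' = 2 under Phred+64) is enough to cut the read here.
print(trim_at_first_low("ACGTACGT", "hhhhhBhh"))
```

The last example shows the drawback of the simple rule: a single low-quality base in the middle discards everything after it, which is what a sliding-window average is meant to avoid.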

ADD REPLY • link 5.6 years ago by GenoMax 141k