Question

How To Understand The Encoding And Trim Their Adapters

10.3 years ago

bmechuangye ▴ 20

Hi,

I am puzzled about how to convert this data to Sanger (Phred+33) encoding and how to trim the adapter sequence. I ran FastQC to look for overrepresented sequences but found nothing useful, and neither the paper nor the NCBI GEO record (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37686) gives any information on this.

$ head -n8 SRR493015.fastq
@SRR493015.1 HWI-ST667_0105:1:1101:1543:1997 length=37
CCCCCTGGGCCTCTCTGTAGGCACCATCAATCTGATC
+SRR493015.1 HWI-ST667_0105:1:1101:1543:1997 length=37
FFFFFHFHHHJIJIJIIHJJJJIJJJJJJJIJJJIIJ
@SRR493015.2 HWI-ST667_0105:1:1101:2733:1999 length=37
TCGTACGACTCTTATCTCTGTAGGCACCATCAATCTG
+SRR493015.2 HWI-ST667_0105:1:1101:2733:1999 length=37
FDFDFHHHHGIIGGIHJJJJJIEHEHGGHHIIJJJIG

$ awk 'NR % 4 ==0' SRR493015.fastq |python guess-encoding.py 
# reading qualities from stdin
no encodings for range: (43, 74)

Could you give me some advice? Thanks.

Do you want to convert the FASTQ file to Sanger (Phred+33) encoding?

10.3 years ago by Varun Gupta ★ 1.3k

Yes. More importantly, I want to remove the adapters, but I don't know the adapter sequences.

10.3 years ago by bmechuangye ▴ 20

score 2 · Answer 1 · 2014-01-17

FastQC will also predict the encoding of the data at the beginning of its report. There are many tools for trimming/removing adaptors. I like Trimmomatic. It handles paired-end reads well and will trim adaptors, but you need to know which platform generated the sequences, because the adaptor sequences differ between platforms.

score 2 · Answer 2 · 2014-01-17

10.3 years ago

bmechuangye ▴ 20

Thanks.

Yes, FastQC reports the encoding of the data as Sanger / Illumina 1.9. Ian, should I set -phred33 or -phred64 when using Trimmomatic?

score 1 · Answer 3 · 2014-01-16

The odds are good that either one line is malformed or you concatenated two files with different encodings. What that error is telling you is that the quality characters have a minimum ASCII value of 43 (meaning either Sanger or Illumina 1.8+ encoding) and a maximum ASCII value of 74 (meaning either Solexa, Illumina 1.3+, or Illumina 1.5+ encoding). No single encoding known to the script covers both ends of that range, so it isn't accepted as valid. I should note that the snippet you posted looks like Illumina 1.3+ encoding (that's also what the guess-encoding.py script returns). You might gradually increase the number of reads fed to the Python script until you hit this error. Then you'll know where in your FASTQ file the problem read(s) occur.
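The incremental search described above can also be automated. Here is a minimal sketch (not part of guess-encoding.py) that tracks the running minimum and maximum ASCII value of the quality lines and reports the first record at which no encoding in a given table fits. The names `matching_encodings`, `first_inconsistent_record` and `OLD_RANGES` are hypothetical. The table holds three entries copied from the RANGES shown in this thread, deliberately leaving out Illumina-1.8 to mimic the older script, and it assumes plain four-line FASTQ records with no line wrapping.

```python
import io

# Illustrative table only: three entries copied from the RANGES in this
# thread, without Illumina-1.8, to mimic the behaviour of the older script.
OLD_RANGES = {
    'Sanger': (33, 73),
    'Solexa': (59, 104),
    'Illumina-1.3': (64, 104),
}


def matching_encodings(lo, hi, ranges):
    """Return the names of encodings whose ASCII range contains [lo, hi]."""
    return sorted(name for name, (rmin, rmax) in ranges.items()
                  if rmin <= lo and hi <= rmax)


def first_inconsistent_record(handle, ranges):
    """Return (record_number, lo, hi) for the first record at which the
    running min/max quality values fit no encoding, or None if all fit."""
    lo, hi = 255, 0
    for line_no, line in enumerate(handle, start=1):
        if line_no % 4 != 0:          # the quality string is every 4th line
            continue
        codes = [ord(c) for c in line.rstrip('\n')]
        if not codes:                 # skip empty quality lines
            continue
        lo, hi = min(lo, min(codes)), max(hi, max(codes))
        if not matching_encodings(lo, hi, ranges):
            return line_no // 4, lo, hi
    return None


# Made-up two-record input: the second quality line contains '+' (ASCII 43)
# and 'J' (ASCII 74), reproducing the (43, 74) range from the error message.
demo = io.StringIO(
    "@r1\nACGT\n+\nFFHH\n"
    "@r2\nACGT\n+\n+FHJ\n"
)
print(first_inconsistent_record(demo, OLD_RANGES))  # → (2, 43, 74)
```

On a real file you would pass an open file handle instead of the `io.StringIO` demo, for example `open('SRR493015.fastq')`.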

Edit: Actually, that script seems to have been written before Illumina 1.8 was introduced. If you edit the definition of RANGES at the top to be as follows, it'll work:

RANGES = {
    'Sanger': (33, 73),
    'Solexa': (59, 104),
    'Illumina-1.3': (64, 104),
    'Illumina-1.5': (66, 104),
    'Illumina-1.8': (33, 94)
}

I also corrected the Illumina-1.5 definition, which was wrong, though there's no practical reason to differentiate 1.3 from 1.5 (or Sanger from 1.8, for that matter).
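As a quick sanity check of the edited table, the sketch below tests the reported range (43, 74) against it and decodes the first quality string from the question with an offset of 33. The helper names `candidates` and `phred_scores` are hypothetical and are not part of guess-encoding.py.

```python
# RANGES as edited above (ASCII values of the quality characters).
RANGES = {
    'Sanger': (33, 73),
    'Solexa': (59, 104),
    'Illumina-1.3': (64, 104),
    'Illumina-1.5': (66, 104),
    'Illumina-1.8': (33, 94),
}


def candidates(lo, hi, ranges=RANGES):
    """Return the encodings whose ASCII range contains [lo, hi]."""
    return sorted(name for name, (rmin, rmax) in ranges.items()
                  if rmin <= lo and hi <= rmax)


def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores for the given offset."""
    return [ord(c) - offset for c in quality]


# The range reported in the error message now matches exactly one entry.
print(candidates(43, 74))        # → ['Illumina-1.8']

# First quality string from the question, decoded as Phred+33.
first_quality = "FFFFFHFHHHJIJIJIIHJJJJIJJJJJJJIJJJIIJ"
scores = phred_scores(first_quality)
print(min(scores), max(scores))  # → 37 41
```

With an offset of 64, the file-wide minimum of 43 would decode to 43 - 64 = -21, which is outside every range in the table. That points to Phred+33 for this file.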