How to trim ngs sequences according to the phred score using python/biopython
4
0
Entering edit mode
7.0 years ago

For fastq file, .letter_annotations['phred_quality'] will give you the quality score of each base in your sequence.

For example:

@M02339:45:000000000-A8L62:1:1101:14225:1785 1:N:0:5
GTTAGCACCCGATGTCTGTCTCCCGTGATTGCACTCTTCGGTATTCGGAGTTTGCTATGGCGGGGGAATCTGCAATAGACCCCCCAACCATGACAGGGCTCTACCCCCGAAGGGGAGACACGAGGCACTACCTAAATAG
+
CCCDCFFFCCCCGGGGGGGGGGHHGHGGHHHHHHHHHHHHGGHGHHHHGGGHHHHHHHHHHGGEG//EG3B44FG433FGHHGGGG/<A=?DDFGF///=/?11FGHG->-@D.@.-..:/.-:@@CF/0:FFFFG0;:

It gave:

[34, 34, 34, 35, 34, 37, 37, 37, 34, 34, 34, 34, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 38, 39, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 39, 38, 39, 39, 39, 39, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 36, 38, 14, 14, 36, 38, 18, 33, 19, 19, 37, 38, 19, 18, 18, 37, 38, 39, 39, 38, 38, 38, 38, 14, 27, 32, 28, 30, 35, 35, 37, 38, 37, 14, 14, 14, 28, 14, 30, 16, 16, 37, 38, 39, 38, 12, 29, 12, 31, 35, 13, 31, 13, 12, 13, 13, 25, 14, 13, 12, 25, 31, 31, 34, 37, 14, 15, 25, 37, 37, 37, 37, 38, 15, 26, 25]

I intend to trim the sequence from 3 consecutive Q<20 bases. How do I write the biopython program?

Thanks

next-gen Forum biopython • 4.9k views
ADD COMMENT
0
Entering edit mode

pl: What I want to do is to keep the sequence part with high scores [34, 34, 34, 35, 34, 37, 37, 37, 34, 34, 34, 34, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 38, 39, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 39, 38, 39, 39, 39, 39, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 36, 38, 14, 14, 36, 38, 18, 33, 19, 19, 37, 38] and delete the rest part.

ADD REPLY
2
Entering edit mode
7.0 years ago
Peter 5.9k

Given a SeqRecord in the variable record, get the scores using scores = record.letter_annotations['phred_quality'] and find the index of the three poor quality scores, say i, then trim the SeqRecord object to the point using record[:i] and write that trimmed record out.

See also the trimming examples in the Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec353

ADD COMMENT
0
Entering edit mode
7.0 years ago
Ram 33k

Why reinvent the wheel? Why not use SeqTK/Trimmomatic?

EDIT: I see the problem now - these algorithms might not allow trimming in such ups-and-downs Q-score cases. Just to be sure, have you explored all QC/trimming software available?

EDIT: Trimmomatic has a SLIDINGWINDOW option that might be useful.

ADD COMMENT
0
Entering edit mode
7.0 years ago

Thank you all for the helpful answers. I tried the following lines, but they didn't work well. It trimmed one sequence many times at different low quality spots (3 consecutive Q<20 bases). I don't know how to improve it.

from Bio import SeqIO

trimseq=[]

for rec in SeqIO.parse('C:\work in SF\short.fastq','fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    for i in range(len(read_quals)):
        if read_quals[i]<=20 and read_quals[i+1]<=20 and read_quals[i+2]<=20:
           trimseq.append(rec[:i])
        else:
            trimseq.append(rec)
ADD COMMENT
1
Entering edit mode

You need to break out of your for loop the first time the conditions are fulfilled - add break after trimseq.append(rec[:i]). Alternatively, this is a bit more streamlined:

read_quals = rec.letter_annotations['phred_quality']
count = 0
for i, q in enumerate(read_quals):
    if q <= 20:
        count += 1
        if count == 3:
            break 
    elif count:
        count = 0
print(read_quals[:i-2])
ADD REPLY
0
Entering edit mode

Thanks a lot for your reply. That did help. But when there are more than one sequences. It didn't work.

for rec in SeqIO.parse('C:\...\shortone.fastq','fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count=0
    for i, q in enumerate(read_quals):
        if q<=20:
            count +=1
            if count==3:
                break
                    print read_quals[:i-2]
        elif count:
            count=0
            print read_quals

shortone.fastq is the following:

@M02339:45:000000000-A8L62:1:1101:14225:1785 1:N:0:5
GTTAGCACCCGATGTCTGTCTCCCGTGATTGCACTCTTCGGTATTCGGAGTTTGCTATGGCGGGGGAATCTGCA
ATAGACCCCCCAACCATGACAGGGCTCTACCCCCGAAGGGGAGACACGAGGCACTACCTAAATAG
+
CCCDCFFFCCCCGGGGGGGGGGHHGHGGHHHHHHHHHHHHGGHGHHHHGGGHHHHHHHHHHGGEG//E
G3B44FG433FGHHGGGG/<A=?DDFGF///=/?11FGHG->-@D.@.-..:/.-:@@CF/0:FFFFG0;:
@M02339:45:000000000-A8L62:1:1101:11582:2214 1:N:0:5
TTCCAGCACCGGGCAGGCGTCAGCCCCTATACTTCACCTTTCGGTTTCGCAGAGACCTGTGTTTTTGATAAACAGTC
GCTTGGGCCTATTCACTGCGGCCTACTCTCGTAGGCACCCCTTCTCCCGAAGTTACGGGGTCATTTTGCCGAGTTCCT
TAACGATGGTTCTCTCGCTCACCTTAGAATTCTCTTCCCAACTACCTGTGTCGGTTTGCGGTACGGGCACCTCTTATCTCG
ATAGCGGCTTTTCTT
+
CCCCDFFFFDEEGGGGGGGGGGGHHGGHGHHHHHHHHHHHHHGGGGGHGGGGGHHHGHHHHGHHHGGGHHHHH
HHHHGGGHGGHHHGHHHHHHHHHGGGGGHHHHHHGGGGGHHHGGGGHHHHHHGGGGGHHHGGGGGGHHHHHHH
HG-AGGHHHHHHHGGHGGHHHHHHGGGGGGGGGGGGGGGGGGGHGGGGGGGGGGGGFFFFFFFFFFFHFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
ADD REPLY
1
Entering edit mode

Sorry that wasn't really complete code, it was just a quick thing to help you on your way. You don't actually say what doesn't work, though I assume it's just down to a strange flow through the code. Hard to tell, as the indentation is very strange - Python wouldn't let you indent the print statement after the break without throwing an error, so I assume this isn't how you have it on your machine. In any case, being after the break, that line will never print anything anyway. Something like the following should be more complete - you can replace the print statements at the bottom with something useful.

for rec in SeqIO.parse(filename, 'fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count = 0
    for i, q in enumerate(read_quals):
        if q <= 20:
            count += 1
            if count == 3:
                break
        elif count:
            count = 0
    if count != 3:
        print read_quals
    else:
        print read_quals[:i-2]
ADD REPLY
0
Entering edit mode
7.0 years ago

Final solution by george.ry. It works very well! Thanks a lot.

from Bio import SeqIO
trimseq=[]
for rec in SeqIO.parse('C:\work in SF\short.fastq', 'fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count = 0
    for i, q in enumerate(read_quals):
        if q <= 20:
            count += 1
            if count == 3:
                break
        elif count:
            count = 0
    if count != 3:
        trimseq.append(rec)
    else:
        trimseq.append(rec[:i-2])

print "Saved %i reads" % len(trimseq)
output_handle=open("C:\work in SF\qtrim_seqs.fastq", "w")
SeqIO.write(trimseq, output_handle, "fastq")
output_handle.close()
ADD COMMENT

Login before adding your answer.

Traffic: 1581 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6