Question: How to trim ngs sequences according to the phred score using python/biopython
0
gravatar for linlinli99013
5.0 years ago by
United States
linlinli9901310 wrote:

For fastq file, .letter_annotations['phred_quality'] will give you the quality score of each base in your sequence.

For example:

@M02339:45:000000000-A8L62:1:1101:14225:1785 1:N:0:5
GTTAGCACCCGATGTCTGTCTCCCGTGATTGCACTCTTCGGTATTCGGAGTTTGCTATGGCGGGGGAATCTGCAATAGACCCCCCAACCATGACAGGGCTCTACCCCCGAAGGGGAGACACGAGGCACTACCTAAATAG
+
CCCDCFFFCCCCGGGGGGGGGGHHGHGGHHHHHHHHHHHHGGHGHHHHGGGHHHHHHHHHHGGEG//EG3B44FG433FGHHGGGG/<A=?DDFGF///=/?11FGHG->-@D.@.-..:/.-:@@CF/0:FFFFG0;:

It gave:

[34, 34, 34, 35, 34, 37, 37, 37, 34, 34, 34, 34, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 38, 39, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 39, 38, 39, 39, 39, 39, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 36, 38, 14, 14, 36, 38, 18, 33, 19, 19, 37, 38, 19, 18, 18, 37, 38, 39, 39, 38, 38, 38, 38, 14, 27, 32, 28, 30, 35, 35, 37, 38, 37, 14, 14, 14, 28, 14, 30, 16, 16, 37, 38, 39, 38, 12, 29, 12, 31, 35, 13, 31, 13, 12, 13, 13, 25, 14, 13, 12, 25, 31, 31, 34, 37, 14, 15, 25, 37, 37, 37, 37, 38, 15, 26, 25]

I intend to trim the sequence from 3 consecutive Q<20 bases. How do I write the biopython program?

Thanks

biopython next-gen forum • 3.7k views
ADD COMMENTlink modified 4.9 years ago • written 5.0 years ago by linlinli9901310

pl: What I want to do is to keep the sequence part with high scores [34, 34, 34, 35, 34, 37, 37, 37, 34, 34, 34, 34, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 38, 39, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 39, 38, 39, 39, 39, 39, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 38, 38, 36, 38, 14, 14, 36, 38, 18, 33, 19, 19, 37, 38] and delete the rest part.

ADD REPLYlink written 5.0 years ago by linlinli9901310
2
gravatar for Peter
5.0 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

Given a SeqRecord in the variable record, get the scores using scores = record.letter_annotations['phred_quality'] and find the index of the three poor quality scores, say i, then trim the SeqRecord object to the point using record[:i] and write that trimmed record out.

See also the trimming examples in the Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec353

ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by Peter5.8k
0
gravatar for RamRS
5.0 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

Why reinvent the wheel? Why not use SeqTK/Trimmomatic?

EDIT: I see the problem now - these algorithms might not allow trimming in such ups-and-downs Q-score cases. Just to be sure, have you explored all QC/trimming software available?

EDIT: Trimmomatic has a SLIDINGWINDOW option that might be useful.

ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by RamRS21k
0
gravatar for linlinli99013
5.0 years ago by
United States
linlinli9901310 wrote:

Thank you all for the helpful answers. I tried the following lines, but they didn't work well. It trimmed one sequence many times at different low quality spots (3 consecutive Q<20 bases). I don't know how to improve it.

from Bio import SeqIO

trimseq=[]

for rec in SeqIO.parse('C:\work in SF\short.fastq','fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    for i in range(len(read_quals)):
        if read_quals[i]<=20 and read_quals[i+1]<=20 and read_quals[i+2]<=20:
           trimseq.append(rec[:i])
        else:
            trimseq.append(rec)

ADD COMMENTlink written 5.0 years ago by linlinli9901310
1

You need to break out of your for loop the first time the conditions are fulfilled - add 'break' after 'trimseq.append(rec[:i])'. Alternatively, this is a bit more streamlined:

read_quals = rec.letter_annotations['phred_quality']
count = 0
for i, q in enumerate(read_quals):
    if q <= 20:
        count += 1
        if count == 3:
            break 
    elif count:
        count = 0
print(read_quals[:i-2])
ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by george.ry1.1k

Thanks a lot for your reply. That did help. But when there are more than one sequences. It didn't work.

for rec in SeqIO.parse('C:\...\shortone.fastq','fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count=0
    for i, q in enumerate(read_quals):
        if q<=20:
            count +=1
            if count==3:
                break
                    print read_quals[:i-2]
        elif count:
            count=0
            print read_quals

shortone.fastq is the following:

@M02339:45:000000000-A8L62:1:1101:14225:1785 1:N:0:5
GTTAGCACCCGATGTCTGTCTCCCGTGATTGCACTCTTCGGTATTCGGAGTTTGCTATGGCGGGGGAATCTGCA

ATAGACCCCCCAACCATGACAGGGCTCTACCCCCGAAGGGGAGACACGAGGCACTACCTAAATAG
+
CCCDCFFFCCCCGGGGGGGGGGHHGHGGHHHHHHHHHHHHGGHGHHHHGGGHHHHHHHHHHGGEG//E

G3B44FG433FGHHGGGG/<A=?DDFGF///=/?11FGHG->-@D.@.-..:/.-:@@CF/0:FFFFG0;:

@M02339:45:000000000-A8L62:1:1101:11582:2214 1:N:0:5
TTCCAGCACCGGGCAGGCGTCAGCCCCTATACTTCACCTTTCGGTTTCGCAGAGACCTGTGTTTTTGATAAACAGTC

GCTTGGGCCTATTCACTGCGGCCTACTCTCGTAGGCACCCCTTCTCCCGAAGTTACGGGGTCATTTTGCCGAGTTCCT

TAACGATGGTTCTCTCGCTCACCTTAGAATTCTCTTCCCAACTACCTGTGTCGGTTTGCGGTACGGGCACCTCTTATCTCG

ATAGCGGCTTTTCTT
+
CCCCDFFFFDEEGGGGGGGGGGGHHGGHGHHHHHHHHHHHHHGGGGGHGGGGGHHHGHHHHGHHHGGGHHHHH

HHHHGGGHGGHHHGHHHHHHHHHGGGGGHHHHHHGGGGGHHHGGGGHHHHHHGGGGGHHHGGGGGGHHHHHHH

HG-AGGHHHHHHHGGHGGHHHHHHGGGGGGGGGGGGGGGGGGGHGGGGGGGGGGGGFFFFFFFFFFFHFFFFFF

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

 

 

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by linlinli9901310
1

Sorry that wasn't really complete code, it was just a quick thing to help you on your way.  You don't actually say what doesn't work, though I assume it's just down to a strange flow through the code.  Hard to tell, as the indentation is very strange - Python wouldn't let you indent the print statement after the break without throwing an error, so I assume this isn't how you have it on your machine.  In any case, being after the break, that line will never print anything anyway.  Something like the following should be more complete - you can replace the print statements at the bottom with something useful.

for rec in SeqIO.parse(filename, 'fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count = 0
    for i, q in enumerate(read_quals):
        if q <= 20:
            count += 1
            if count == 3:
                break
        elif count:
            count = 0
    if count != 3:
        print read_quals
    else:
        print read_quals[:i-2]
 
ADD REPLYlink written 4.9 years ago by george.ry1.1k
0
gravatar for linlinli99013
4.9 years ago by
United States
linlinli9901310 wrote:

Final solution by george.ry. It works very well! Thanks a lot.

from Bio import SeqIO
trimseq=[]
for rec in SeqIO.parse('C:\work in SF\short.fastq', 'fastq'):
    read_quals = rec.letter_annotations['phred_quality']
    count = 0
    for i, q in enumerate(read_quals):
        if q <= 20:
            count += 1
            if count == 3:
                break
        elif count:
            count = 0
    if count != 3:
        trimseq.append(rec)
    else:
        trimseq.append(rec[:i-2])

print "Saved %i reads" % len(trimseq)
output_handle=open("C:\work in SF\qtrim_seqs.fastq", "w")
SeqIO.write(trimseq, output_handle, "fastq")
output_handle.close()

 

ADD COMMENTlink written 4.9 years ago by linlinli9901310
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1667 users visited in the last hour