Question: (Closed) How to print stats from Python filtering script
0
gravatar for arabhorsegal18
3.9 years ago by
United States
arabhorsegal180 wrote:

Hello all,

I have the following python script which will filter paired-end fastq read data based on nucleotide complexity with a user provided percentage (as a way of removing reads with lots of monomeric or dimeric repeats). The script works, but I would to generate some simple statistics, the number of read pairs input and the number of read pairs retained, and print them. I was trying to use .count but kept receiving error messages.

Any ideas or suggestions would be welcomed. As you can probably tell my Python skills are less than stellar, I'm still a beginner and feeling kind of stuck.

Thanks for any help,

~Ana


#!/usr/bin/env python

from Bio import SeqIO

from Bio import SeqRecord

import sys

import argparse

def open_either(file, mode='r'):
    '''open either compressed or not compressed files'''
    
    if file.endswith('.gz'):
        import gzip
        return gzip.open(file, mode)
        
    else:
    
        return open(file, mode)
        
        
def calc_percentA(List):

    BaseList = ['A']
    for Base in BaseList:
        Percent = 100 * List.count(Base) / float(len(List))
        
        PercentA = float(Percent)
        
    return PercentA  

def calc_percentC(List):
    
    BaseList = ['C']
    for Base in BaseList:
        Percent = 100 * List.count(Base) / float(len(List))
    
        PercentC = float(Percent)
    
    return PercentC        
    
def calc_percentG(List):

    BaseList = ['G']    
    for Base in BaseList:
        Percent = 100 * List.count(Base) / float(len(List))
    
        PercentG = float(Percent)
        
    return PercentG
    
def calc_percentT(List):
    
    BaseList = ['T']    
    for Base in BaseList:
        Percent = 100 * List.count(Base) / float(len(List))
    
        PercentT = float(Percent)
        
    return PercentT
    
        
if __name__ == '__main__' :

    parser = argparse.ArgumentParser(description='Filter paired-end read quality based on percent base composition.',)    
    
    parser.add_argument('-1', '--fastq1',
                                            help='The input read1 fastq file.',
                                            required=True)

    parser.add_argument('-2', '--fastq2',
                                            help='The input read2 fastq file.',
                                            required=True)
                                                                                    
    parser.add_argument('-p', '--minPercent',
                                            help='Any base must account for this percent of read. Or read will be discarded.',
                                            type=float,
                                            default=5.0)

    parser.add_argument('-o1', '--outFile1',
                                            help='fastq1 output file.',
                                            required=True)
                                            
                                            
    parser.add_argument('-o2', '--outFile2',
                                            help='fastq2 output file.',
                                            required=True)     
                                            
    args = parser.parse_args()
                                                

    fastqFormat = 'fastq'
    

    fastq1handle = open_either(args.fastq1, 'rU')
    fastq2handle = open_either(args.fastq2, 'rU')

    fastq1out = open_either(args.outFile1, 'wb')
    fastq2out = open_either(args.outFile2, 'wb')
    
    fastq1SeqIO = SeqIO.parse(fastq1handle, fastqFormat)
    fastq2SeqIO = SeqIO.parse(fastq2handle, fastqFormat)
    
    
    while True:
        
        try:
            
            fastq1record = fastq1SeqIO.next()
            fastq2record = fastq2SeqIO.next()
            
        except:
        
            break
            
        Seq_ListR1 = fastq1record.seq
        Seq_ListR2 = fastq2record.seq        
    
    
        if (calc_percentA(Seq_ListR1) < args.minPercent) or (calc_percentA(Seq_ListR2) < args.minPercent):
            continue
            
        if (calc_percentC(Seq_ListR1) < args.minPercent) or (calc_percentC(Seq_ListR2) < args.minPercent):
            continue
            
        if (calc_percentG(Seq_ListR1) < args.minPercent) or (calc_percentG(Seq_ListR2) < args.minPercent):
            continue            
        
        if (calc_percentT(Seq_ListR1) < args.minPercent) or (calc_percentT(Seq_ListR2) < args.minPercent):
            continue
                
        
        fastq1out.write(fastq1record.format(fastqFormat))
        fastq2out.write(fastq2record.format(fastqFormat))
        
        
    
    fastq1out.close()
    fastq2out.close()
        
        
    print "Filtering complete!"
                                                       
ADD COMMENTlink written 3.9 years ago by arabhorsegal180

Hello arabhorsegal18!

We believe that this post does not fit the main topic of this site.

This is a question about writing code, and not about bioinformatics and therefore it would be best posted at StackOverflow

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

ADD REPLYlink written 3.9 years ago by Daniel Swan13k
1

I think that this question has enough connection to bioinformatics to be kept open. Certainly it is about writing code but with a connection to bioinformatics it is in my opinion ok, and we shouldn't be that strict. What do you think? 

ADD REPLYlink written 3.9 years ago by Michael Dondrup46k

I'm struggling to convince myself that this is anything more than a question about code, and I don't think that's what we're here for. I see this as analogous to queries like "I'm running <bioinformatics command line app x> and getting  a bash error 'Permission denied'" - these are procedural non-bioinformatics questions even if it happens to be about a bioinformatics tool. I'm a big fan of using the existing Stack* forums where appropriate. I don't doubt the ability of the community to help, it's about focus (for me!).

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Daniel Swan13k

I agree. This is about adding a loop counter, not about debugging a bioinformatics problem. I think the best approach would be to close it, but provide a pointer in the closing comments, such as "look to stack overflow for help with loop counters"

ADD REPLYlink written 3.9 years ago by RamRS21k

Please remember to provide the error messages and your modifications as well. 

ADD REPLYlink written 3.9 years ago by Michael Dondrup46k
0
gravatar for Ashutosh Pandey
3.9 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

You will need a counter between your if statement and continue statement. That way you can count the reads that failed the if statements. Alternatively, you can use a counter to count reads that have passed all the if conditions and before they are written to output files.  

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Ashutosh Pandey11k

Implement both so you can also obtain total count (of all reads). And please, Ashutosh, do not refer to the if statement as a "loop". It's a conditional construct, not an iterative construct.

ADD REPLYlink written 3.9 years ago by RamRS21k

Ram, Please feel free to modify or rectify my posts in future. I have taken care it for now.

ADD REPLYlink written 3.9 years ago by Ashutosh Pandey11k

That would be rude, Ashutosh, and besides, it wouldn't serve the purpose. A lot of Indian schools refer to conditional constructs as loops, and it is passed on to the students. I have encountered this quite a bit personally, and do my best to give the correct usage when I see it.

ADD REPLYlink written 3.9 years ago by RamRS21k
Please log in to add an answer.
The thread is closed. No new answers may be added.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2006 users visited in the last hour