Question

Counting reads and bases from a list of fastq files

0

Entering edit mode

5.6 years ago

caro-ca ▴ 20

Hi! I trimmed my Illumina short reads, forward and reverse, by using Trimmomatic. The Trimmomatic's outputs were: paired_1 - unpaired_1, and paired_2 - unpaired_2.fastq.gz files. I want to know how big was the impact of trimming by counting the number of reads and bases of each file in my directory. I had made a script to count the number of bases and reads for each file in my directory; however, I have problems in if __name__=='__main__'. When I do the for loop I don't know the order of the files that will be run, how can I make it to call the files by the order I see from the screen? Additionally, I also need help with correcting the script as I don't get any stdout. Thank you in advance for your help.

#!/usr/bin/env python

from sys import argv 
import os 

def get_num_bases(file_content):
    total = []
    for linenumber, line in enumerate(file_content):
        mod=linenumber%4

        if mod==0: 
            ID = line.strip()[1:]
            #print(ID)

        if mod==1: 

            seq = line.strip()
            counting = 0
            counting += seq.count("T")+ seq.count("A") + seq.count("C") + seq.count("G")
            total.append(counting)
            allbases = sum(total)
            print("Number of bases are: " , allbases)

def get_num_reads(file_content):

    total_1 = []

    for line in file_content: 
        num_reads = 0 
        num_reads += content.count(line)  
        total_1.append(num_reads)
        print("Number of reads are: ", sum(total_1)/int(4))


if __name__=='__main__':
    path = os.getcwd()
    dir_files = os.listdir(path)
    list_files = []
    for file in dir_files:
        if file.endswith("fastq.gz"):
            if file not in list_files:
                file_content = open(file, "r").readlines() 
                list_files.append(file)
                print("This is the filename: ", file, get_num_bases(file_content), get_num_reads(file_content))

python count_reads count_bases • 1.1k views

ADD COMMENT • link updated 5.6 years ago by cpad0112 21k • written 5.6 years ago by caro-ca ▴ 20

1

Entering edit mode

Use seqkit stats. It's easy to get stats on each fastq file. Biopython libraries might be there for fastq stats. Use them.

ADD REPLY • link 5.6 years ago by cpad0112 21k