Question: Counting reads and bases from a list of fastq files
0
gravatar for caro-ca
10 months ago by
caro-ca20
caro-ca20 wrote:

Hi! I trimmed my Illumina short reads, forward and reverse, by using Trimmomatic. The Trimmomatic's outputs were: paired_1 - unpaired_1, and paired_2 - unpaired_2.fastq.gz files. I want to know how big was the impact of trimming by counting the number of reads and bases of each file in my directory. I had made a script to count the number of bases and reads for each file in my directory; however, I have problems in if __name__=='__main__'. When I do the for loop I don't know the order of the files that will be run, how can I make it to call the files by the order I see from the screen? Additionally, I also need help with correcting the script as I don't get any stdout. Thank you in advance for your help.

#!/usr/bin/env python

from sys import argv 
import os 

def get_num_bases(file_content):
    total = []
    for linenumber, line in enumerate(file_content):
        mod=linenumber%4

        if mod==0: 
            ID = line.strip()[1:]
            #print(ID)

        if mod==1: 

            seq = line.strip()
            counting = 0
            counting += seq.count("T")+ seq.count("A") + seq.count("C") + seq.count("G")
            total.append(counting)
            allbases = sum(total)
            print("Number of bases are: " , allbases)

def get_num_reads(file_content):

    total_1 = []

    for line in file_content: 
        num_reads = 0 
        num_reads += content.count(line)  
        total_1.append(num_reads)
        print("Number of reads are: ", sum(total_1)/int(4))


if __name__=='__main__':
    path = os.getcwd()
    dir_files = os.listdir(path)
    list_files = []
    for file in dir_files:
        if file.endswith("fastq.gz"):
            if file not in list_files:
                file_content = open(file, "r").readlines() 
                list_files.append(file)
                print("This is the filename: ", file, get_num_bases(file_content), get_num_reads(file_content))
ADD COMMENTlink modified 10 months ago by cpad011214k • written 10 months ago by caro-ca20
1

Use seqkit stats. It's easy to get stats on each fastq file. Biopython libraries might be there for fastq stats. Use them.

ADD REPLYlink written 10 months ago by cpad011214k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1044 users visited in the last hour