I'm looking for an easy and efficient way to get basic information from a given fastq file programmatically, like number of sequences, average sequence length, average GC%... I adapted this code but it's actually very slow for big fastq files. I was considering parsing FastQC's results but there must be an easier way? I've googled it for hours but I can't find anything.
Thanks a lot!
I forgot to mention I'm trying to put that into a python function, so here's what I've got right now:
def fastq_stats(fastq_file): os.system("reformat.sh in=" + fastq_file + " gchist=temp") with open('temp', 'r') as f: first_line = f.readline() gc_rate = float(first_line.split()) os.system("rm temp") read_count = int(sum(1 for line in open(fastq_file))/4) return (read_count, gc_rate)
It's a bit ugly imo, but it will do for now. I'm still interested by better options! I'm coding in python and R mostly, and this is part of a pipeline that generates FastQC reports automatically, hence my quest for a ready-to-use FastQC parser.