Question: Fastqc result collection script
1
gravatar for Shicheng Guo
18 months ago by
Shicheng Guo4.3k
Shicheng Guo4.3k wrote:

Hi All,

I guess lot of you would like to use fastqc to check the fastq/bam/sam files after you get the data. However, sometime, we would have 100+ samples/files, fastqc would provided a report for each file. Can you share a script (perl, python?) to collect all the statistics from all the files? 

 

Thanks. 

fastqc • 763 views
ADD COMMENTlink modified 5 weeks ago by h.mon5.8k • written 18 months ago by Shicheng Guo4.3k
2
gravatar for pengchy
18 months ago by
pengchy330
China/Beijing/CAS
pengchy330 wrote:

Hi,

I have written the following perl script to deal with the fastqc results summary.txt files, both for PE and SE. the output contain the reads number, total bases, GC content, Q20, Q30. The insert size is provided in the first parameter file by yourself. I have tested this script widely, but you still use it at your own risk.

 

use strict;
use warnings;

my $usage="$0 <listf> <outf>\n
<listf>  the fastQC output summary file list in the format:
\t\tSampleID<tb>InsertSize<tb>fastqc1<tb>fastqc2, the InsertSize and fastqc2 is omitted for SE
<outf>   the statistics result file


\n";

die $usage if @ARGV<2;

my($listf,$outf)=@ARGV;

my $wc = `cat $listf |head -n1 |sed 's/\\t/\\n/g'|wc -l`;chomp $wc;
my $ps;
if($wc > 2){
        $ps = "PE";
}else{
        $ps = "SE";
}

my(@info,%res);
open I,$listf;
open O,">",$outf;
if($ps eq "PE"){
        print O "Sample\tInsertSize\tReadLength\tGC%\tRawReads\tRawBases\tQ20%\tQ30%\n";
}else{
        print O "Sample\tReadLength\tGC%\tRawReads\tRawBases\tQ20%\tQ30%\n";
}
while(<I>){
        my ($total,$ratio,%posNum,$readLen,%res,$readsNum1,$readsLen1,$fastqcRes);
        chomp;
        $fastqcRes=$_;
        @info=split /\s+/;
        $res{InsertSize}=$info[1];
        my $samp=$info[0];
        if($ps eq "PE"){
                open I2,$info[2];
                open I3,$info[3];
        }else{
                open I2,$info[1];
        }
        while(<I2>){
                chomp;
                if(/Total\sSequences\s(\d+)/){
                        $total=$1;
                        $posNum{20}=$total;
                        $posNum{30}=$total;
                        $res{RawReads}= $total;
                        $readsNum1=$1;
                }
                if(/Sequence\slength\s(\d+)[-_](\d+)/){
                        warn "can not correctly statistic the base number, because the reads length is not the same, use the shortest one\n$_\n";
                }
                if(/Sequence\slength\s(\d+)/){
                        $res{ReadLength}=$1."_".$1 if $ps eq "PE";
                        $res{ReadLength}=$1 if $ps eq "SE";
                        $res{RawBases}=$res{RawReads} * $1;
                        $readsLen1=$1;
                }
                if(/^\%GC\s+(\d+)/){
                        $res{GC}=$1;
                }
                @info=split /\s+/;
                if(/#Quality/){
                        while(<I2>){
                                chomp;@info=split /\s+/;
                                last if $info[0]>29;
                                $posNum{20} -= $info[1] if $info[0] <20;
                                $posNum{30} -= $info[1] if $info[0] < 30;
                        }
                        $res{Q20}=sprintf("%.2f",$posNum{20}/$total*100);
                        $res{Q30}=sprintf("%.2f",$posNum{30}/$total*100);
                        last;
                }
        }
        if($ps eq "PE"){
        while(<I3>){
                chomp;
                @info=split /\s+/;

                if(/Total\sSequences\s(\d+)/){
                        warn "paired reads with different reads number\n$fastqcRes\n" if $readsNum1 != $1;
                        $total=$1;
                        $posNum{20}=$total; $posNum{30}=$total;
                        $res{RawReads} += $total;
                }
                if(/Sequence\slength\s(\d+)[-_](\d+)/){
                        warn "can not statistic the base number, because the reads length is not the same in one fastq file\n$fastqcRes\n";
                }
                if(/Sequence\slength\s(\d+)/){
                        warn "paired reads with different reads length\n$fastqcRes\n" if $readsLen1 != $1;
                        $res{RawBases} += $total * $1;
                }

                if(/^\%GC\s+(\d+)/){
                        $res{GC}=$res{GC}."_".$1;
                }
                if(/#Quality/){
                        while(<I3>){
                                chomp;@info=split /\s+/;
                                last if $info[0]>29;
                                $posNum{20} -= $info[1] if $info[0] <20;
                                $posNum{30} -= $info[1] if $info[0] < 30;
                        }
                        $res{Q20}=$res{Q20}."_".sprintf("%.2f",$posNum{20}/$total*100);
                        $res{Q30}=$res{Q30}."_".sprintf("%.2f",$posNum{30}/$total*100);
        #print O "Sample\tInsertSize\tReadLength\tGC%\tRawReads\tRawBases\tQ20%\tQ30%\n";
                        print O "$samp\t$res{InsertSize}\t$res{ReadLength}\t$res{GC}\t$res{RawReads}\t$res{RawBases}\t$res{Q20}\t$res{Q30}\n";
                        last;
                }
        }}
        if($ps eq "SE"){
                print O "$samp\t$res{ReadLength}\t$res{GC}\t$res{RawReads}\t$res{RawBases}\t$res{Q20}\t$res{Q30}\n";
        }

}
close O;
close I;

 

ADD COMMENTlink modified 18 months ago • written 18 months ago by pengchy330
2
gravatar for h.mon
5 weeks ago by
h.mon5.8k
Brazil
h.mon5.8k wrote:

This post is quite old, but in my view MultiQC is always worth mentioning, and it has a FastQC module.

ADD COMMENTlink written 5 weeks ago by h.mon5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 941 users visited in the last hour