Question: Mean and SD read length from a range of fastq files
2
2
Entering edit mode
4.6 years ago
eoin ▴ 30

Hi all,

I'm trying to write some code to generate mean read length data from a range of fastq files. awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' HG1.fastq > readLength.txt

i've got as far as here from looking through other posts and trying to improve but i'm stuck on a couple of things. This command only works on a single file and will report the length of each read within that file separately.

I want to run a single command so the mean and Standard Dev of read lengths from all .fastq files within a folder are reported in a single .txt file, one sample per line. I gues SD might be difficult to calculate in a command so even just the mean read length.

e.g.the first 5 files in my folder are: ru1.fastq ru2.fastq hg3.fastq hg25.fastq ru7.fastq

obviously i'm a bit of a novice at this so all help would be appreciated !!

thanks a lot

fastq awk sed sequencing • 6.1k views
0
Entering edit mode

Hello I'm so grateful for your help. I tried running your pipeline on a metagenomic file in the fastq format, and I got a very high stddev. Can you help me with this? total 84855052 avg=160.729787 stddev=2220.571952

8
Entering edit mode
4.6 years ago

Using awk:

awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sq/n-m*m);}' *.fastq  ADD COMMENT 0 Entering edit mode thank you for your answer Pierre!! This code will give us the total number of reads, their mean, and SD. But i think I wasn't clear in my post; what I need is to get the mean and SD read length from each individual .fastq file within in the directory. So an example of the output • File Name Mean SD • ru1.fastq 235.4 -- 31.3 • hg25.fastq 241.5 -- 35.5 and so on sorry if I wasn't clear earlier and thank you so much for your response!! ADD REPLY 2 Entering edit mode for F in *.fasq do echo "$F   "
awk '(... same... script)' \$F
done

0
Entering edit mode

Perfect. thank you so much. I had tried a loop but left out echo

thanks a lot

0
Entering edit mode

0
Entering edit mode

Hello I'm so grateful for your help. I tried running your pipeline on a metagenomic file in the fastq format, and I got a very high stddev. Can you help me with this? total 84855052 avg=160.729787 stddev=2220.571952

0
Entering edit mode

The given algorithm is producing the variance instead the standard deviation. In awk, sqrt(x) gives the square root of x. So, you may change the last printf line by something like this:

printf("total %d avg=%f stddev=%f\n",n,m,sqrt(sq/n-m*m));

0
Entering edit mode
25 days ago
Ric • 0

seqtk comp FASTQ_FILE_HERE |cut -f 2 > temp.txt

and then

datamash mean 1 sstdev 1 < temp.txt