Question

Question: Mean and SD read length from a range of fastq files

2

Entering edit mode

7.1 years ago

eoin ▴ 30

Hi all,

I'm trying to write some code to generate mean read length data from a range of fastq files. awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' HG1.fastq > readLength.txt

i've got as far as here from looking through other posts and trying to improve but i'm stuck on a couple of things. This command only works on a single file and will report the length of each read within that file separately.

I want to run a single command so the mean and Standard Dev of read lengths from all .fastq files within a folder are reported in a single .txt file, one sample per line. I gues SD might be difficult to calculate in a command so even just the mean read length.

e.g.the first 5 files in my folder are: ru1.fastq ru2.fastq hg3.fastq hg25.fastq ru7.fastq

obviously i'm a bit of a novice at this so all help would be appreciated !!

thanks a lot

fastq awk sed sequencing • 9.3k views

ADD COMMENT • link updated 2.6 years ago by Ric • 0 • written 7.1 years ago by eoin ▴ 30

0

Entering edit mode

Hello I'm so grateful for your help. I tried running your pipeline on a metagenomic file in the fastq format, and I got a very high stddev. Can you help me with this? total 84855052 avg=160.729787 stddev=2220.571952

ADD REPLY • link 5.6 years ago by savscosta • 0

0

Entering edit mode

2.6 years ago

Ric • 0

seqtk comp FASTQ_FILE_HERE |cut -f 2 > temp.txt

and then

datamash mean 1 sstdev 1 < temp.txt

ADD COMMENT • link 2.6 years ago by Ric • 0

score 8 · Accepted Answer · 2017-03-23

8

Entering edit mode

7.1 years ago

Pierre Lindenbaum 161k

Using awk:

awk 'BEGIN { t=0.0;sq=0.0; n=0;} ;NR%4==2 {n++;L=length($0);t+=L;sq+=L*L;}END{m=t/n;printf("total %d avg=%f stddev=%f\n",n,m,sq/n-m*m);}'  *.fastq

ADD COMMENT • link 7.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

thank you for your answer Pierre!!

This code will give us the total number of reads, their mean, and SD. But i think I wasn't clear in my post;

what I need is to get the mean and SD read length from each individual .fastq file within in the directory. So an example of the output

File Name Mean SD
ru1.fastq 235.4 -- 31.3
hg25.fastq 241.5 -- 35.5

and so on

sorry if I wasn't clear earlier and thank you so much for your response!!

ADD REPLY • link 7.1 years ago by eoin ▴ 30

3

Entering edit mode

for F in *.fasq
do 
echo  "$F   "
awk '(... same... script)' $F
done

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Perfect. thank you so much. I had tried a loop but left out echo

thanks a lot

ADD REPLY • link 7.1 years ago by eoin ▴ 30

0

Entering edit mode

please mark this question as answered (green tick on the left)

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hello I'm so grateful for your help. I tried running your pipeline on a metagenomic file in the fastq format, and I got a very high stddev. Can you help me with this? total 84855052 avg=160.729787 stddev=2220.571952

ADD REPLY • link 5.6 years ago by savscosta • 0

0

Entering edit mode

The given algorithm is producing the variance instead the standard deviation. In awk, sqrt(x) gives the square root of x. So, you may change the last printf line by something like this:

printf("total %d avg=%f stddev=%f\n",n,m,sqrt(sq/n-m*m));

ADD REPLY • link 5.1 years ago by guillermo.luque.ds ▴ 40