Question: Read count from a batch of fastq.gz files
0
gravatar for Mitra
11 days ago by
Mitra0
Mitra0 wrote:

Hi, How can we do Read count from a batch of fastq.gz files.

Obviously if we unzip it then it is easy by grep.

But what is the best way to do without unzipping.

I tried this for a batch of files:

for i in `ls  RIF*.fastq.gz | sort`; do zcat $i | echo $((`wc -l`/4)) ; done

Gives me result:

133408
133408
127328
127328
124862
124862
156815
156815
146333
146333
123459
123459
159771
159771
126039
126039
167112
167112
97320
97320

But When I unzip those same files and run grep with the identifier like

for i in `ls  *.fastq | sort`; do grep "@M03691" $i |wc -l; done

I get true result as

132637
132637
126435
126435
123616
123616
156347
156347
145592
145592
123059
123059
158654
158654
125626
125626
166377
166377
97159
97159

Which is clearly a mismatch. Can anybody please suggest me what i am missing out in my first try. Thanks, Mitra

sequencing next-gen • 144 views
ADD COMMENTlink modified 11 days ago by Devon Ryan70k • written 11 days ago by Mitra0
1

If you have zcat then you must have zgrep. Use zgrep -c "^@Sequencer_ID" file.fq.gz to get your counts.

ADD REPLYlink modified 11 days ago • written 11 days ago by genomax33k

Great works thanks..don't know why zgrep didn't come to my mind.

But still I really wonder why wc -l`/4 didn't work. Definitely all these fastq files have 4 lines.

ADD REPLYlink written 11 days ago by Mitra0
1
gravatar for Devon Ryan
11 days ago by
Devon Ryan70k
Freiburg, Germany
Devon Ryan70k wrote:

Apparently not all of your read names start with M03691.

ADD COMMENTlink written 11 days ago by Devon Ryan70k

Which should not happen for a file for one sample (unless multiple sequencers are involved) :)

ADD REPLYlink modified 11 days ago • written 11 days ago by genomax33k

So far my money is on that as being the most likely explanation.

ADD REPLYlink written 11 days ago by Devon Ryan70k

No ..thats not the fact ..and yes they all named as M03691 as much I can see.

ADD REPLYlink written 11 days ago by Mitra0

Either they don't all start with that or you have cases where each entry isn't 4 lines. There are no other possibilities.

ADD REPLYlink written 11 days ago by Devon Ryan70k

If this dataset was trimmed on the sequencer then there may be some entries with no sequence. Just a guess.

ADD REPLYlink written 11 days ago by genomax33k

BTW, wc -l / 4 ends up truncating the division. It'd be interesting to not round and see if your files have line numbers that aren't multiples of 4.

ADD REPLYlink written 11 days ago by Devon Ryan70k

Also when I am checking with one file I get exact match..but somehow above command not working..

 grep "@M" RIFA3D26_S20_L001_R2_001.fastq |wc -l
145592
grep "@M0369" RIFA3D26_S20_L001_R2_001.fastq |wc -l
145592

This confirms that all the reads have same name.

Also with line count for single file matches.

cat RIFA3D26_S20_L001_R2_001.fastq |  echo $((`wc -l`/4))
145592

Just still leaves me in confusion why line count don't work within zip files.

ADD REPLYlink written 10 days ago by Mitra0

Your question isn't why line counting doesn't work (it does), but rather why you get different numbers. That's a rather important distinction.

ADD REPLYlink written 10 days ago by Devon Ryan70k

Yes I described above two different ways of Read count from a batch of fastq.gz files ..out of which $((wc -l/4)) don't work (rather I must say not getting expected or correct result) when applied to zip files. So I assume this is something to do with the zip files ..and try to understand how the wc works in zip files. Thanks Mitra

ADD REPLYlink written 10 days ago by Mitra0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 753 users visited in the last hour