Do we have any easy, fast way to know how many sequences contained in paired-end fastq.gz file? One simple way I think to calculate is to count the # of lines in fastq file divided by 4. (Four line for one sequence)
But I'm afraid it'll be quite slow for the calculation due to the large size of fastq.gz
What about the sequences in .sai/.sam file? (Because I want to guarantee the quality by comparing sequences in fastq file, and how many sequences been processed in .sai/.sam file)
The only way I can think of, is that we have the error report of .sai/.sam file generation, in which we get the information like "10000000 sequences have been processed". I can print this line by python, but how can I print out this number out of the line?
To Ale: thx, but what's the rationale for your first command? I mean about ^@ Just look at the fastq.gz file below, where is ^@ ? I see many @, but not ^
@IL11_266:1:67:0:838/1 AAAAAAAAAAAAAAAAAAAAANNANNNANANAAAAA + @@AAAAAAA@AAAAAABAAAB&&B&%%B%>%=<>>> @IL11_266:1:67:3:594/1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAC + @@AAAAAAA@AAA?AA@A@AB?AB@AAB7>&=><A2
Edit again: Also I try both of the commands, and find the result is 2-fold difference. I compared with .sai file error report, which said 853813 sequences have been processed, then should say 'zgrep' command double counts.
[x@hpcc01 bwa2]$ zcat ERR002356_1.recal.fastq.gz | echo $((`wc -l`/4)) 853813 [x@hpcc01 bwa2]$ zgrep -c '^@' ERR002356_1.recal.fastq.gz 1643596
EDIT: THX. But I'm not dealing with one file. Plz look at below:
#!/usr/bin/python import os,sys,re import gzip import commands path = "/home/xug/nearline" for file in os.listdir(path): if re.match('.*\.recal.fastq.gz', file): fullpath = os.path.join(path, file) result = commands.getoutput('zcat fullpath |wc -l') numseqs = int(result)/4.0 print numseqs
Problem is, I define 'fullpath' here for all fastq files, but after being put under ' ', seems this fullpath doesn't work...How can I solve this problem? thx
Problem solves....fullpath is variable. ..so 'zcat '+fullpath+' |wc -l'