How is a fastq.gz file formatted?
3
0
4 months ago
MobiusT ▴ 10

What is the format of a fastq.gz file, and how often does a read (sequence) line occur in it? I am trying to find the read lengths in a fastq.gz file and then calculate the mean read length. This is what I do:

zcat dataset015.fastq.gz | head -50| awk ' { s+=1; sum += length($_); printf length($_) " , " }END{avg = sum/s; }'

The file is from a nanopore dataset, and the command prints:

163 , 29 , 1 , 29 , 163 , 43 , 1 , 43 , 171 , 1034 , 1 , 1034 , 163 , 70 , 1 , 70 , 162 , 295 , 1 , 295 , 163 , 1270 , 1 , 1270 , 163 , 61 , 1 , 61 , 162 , 18 , 1 , 18 , 171 , 973 , 1 , 973 , 170 , 489 , 1 , 489 , 169 , 2203 , 1 , 2203 , 170 , 741 , 1 , 741 , 171 , 9799

However, I know that some of these numbers are not read lengths but the lengths of other fields. How can I select only the read lengths and skip the other fields?

fastq nanopore sequence length • 404 views
4
4 months ago

You can use SeqKit to calculate the mean read length. See the usage examples in the SeqKit documentation.

seqkit stats dataset015.fastq.gz

Sample output:

file                 format  type  num_seqs    sum_len  min_len  avg_len  max_len
dataset015.fastq.gz  FASTQ   DNA      2,500    567,516      226      227      229
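
For reference, the numbers in the question follow the four-line FASTQ record layout: a header line starting with @ (163 characters in the first record), the sequence (29), a separator line holding just + (1), and the quality string (29, always the same length as the sequence). If every record really is exactly four lines, which is an assumption here (wrapped, multi-line FASTQ would break it), a minimal awk sketch can keep only the sequence lines. The file example.fastq.gz and the helper name mean_read_length are made up for illustration.

```shell
# Hypothetical two-record FASTQ file, written only so the example is self-contained.
printf '@read1 some=info\nACGTACGT\n+\nIIIIIIII\n@read2 some=info\nACGT\n+\nIIII\n' | gzip > example.fastq.gz

# mean_read_length is an illustrative helper name, not part of any tool.
# NR % 4 == 2 keeps only line 2 of each 4-line record, i.e. the sequence.
mean_read_length() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n++; sum += length($0) } END { if (n) printf "%.2f\n", sum / n }'
}

mean_read_length example.fastq.gz
```

To list the individual read lengths instead of the mean, the same filter can print length($0) for each matching line.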
3
4 months ago
GenoMax 123k

You can use testformat2.sh from the BBMap suite to get a range of statistics, including read lengths, plus a guess at the source organism.

$ testformat2.sh nanopore.fq.gz qin=33
Format          fastq
Compression     gz
Interleaved     false
MaxLen          87431
MinLen          134
AvgLen          2049.26
StdevLen        6646.89
ModeLen         240
QualOffset      33
NegativeQuals   0

Content         Nucleotides
Type            DNA
Reads           4000
-JunkReads      0
-ChastityFail   0
-BadPairNames   0

Bases           8197035
-Lowercase      0
-Uppercase      8197035
-Non-Letter     0
-FullyDefined   8197035
-No-call        0
-Degenerate     0
-Gap            0
-Invalid        0

GC              0.435
-GCMedian       40.625
-GCMode         41.675
-GCSTDev        9.087

Cardinality     7804442
Organism        Oryza sativa Japonica Group
TaxID           39947
Barcodes        60
ZMWs            0

QErrorRate      6.368%
-QAvgLog        11.96
-QAvgLinear     20.00
-qMinUncalled   999
-qMaxUncalled   -999
-qMinCalled     1
-qMaxCalled     90
-TrimmedAtQ5    0.30%
-TrimmedAtQ10   13.11%
-TrimmedAtQ15   81.31%
-TrimmedAtQ20   95.24%