How is a fastq.gz file formatted?
2.6 years ago
MobiusT ▴ 20

What is the format of a fastq.gz file, and at what frequency do reads occur in the file? I am trying to find the read lengths in a fastq.gz file and then calculate the mean read length. What I do is:

zcat dataset015.fastq.gz | head -50| awk ' { s+=1; sum += length($_); printf length($_) " , " }END{avg = sum/s; }'

This belongs to a nanopore dataset and it prints

163 , 29 , 1 , 29 , 163 , 43 , 1 , 43 , 171 , 1034 , 1 , 1034 , 163 , 70 , 1 , 70 , 162 , 295 , 1 , 295 , 163 , 1270 , 1 , 1270 , 163 , 61 , 1 , 61 , 162 , 18 , 1 , 18 , 171 , 973 , 1 , 973 , 170 , 489 , 1 , 489 , 169 , 2203 , 1 , 2203 , 170 , 741 , 1 , 741 , 171 , 9799

However, I know some of these are not reads but other info fields. How can I select only the read lengths and skip the other fields?

fastq nanopore sequence length • 1.1k views
2.6 years ago

You can use SeqKit to calculate the mean read length. See the usage examples in the SeqKit documentation.

seqkit stats dataset015.fastq.gz

Sample output:

file                 format  type  num_seqs    sum_len  min_len  avg_len  max_len
dataset015.fastq.gz  FASTQ   DNA      2,500    567,516      226      227      229
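
If you would rather stay with a plain awk pipeline like the one in the question, here is a minimal sketch. It assumes the standard four-line FASTQ layout (header, sequence, `+`, quality), which is consistent with the repeating groups of four in the output above, for example 163, 29, 1, 29. It would not work on files whose sequences are wrapped over several lines. The file name `demo.fastq.gz` is hypothetical and is created here only so the example runs on its own.

```shell
# Build a tiny two-record FASTQ file so the example is self-contained
# (demo.fastq.gz is a made-up name, not the dataset from the question).
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n' | gzip -c > demo.fastq.gz

# Assumption: each record is exactly 4 lines (header, sequence, '+', quality).
# NR % 4 == 2 keeps only the sequence line of each record.
mean_len=$(gzip -dc demo.fastq.gz |
  awk 'NR % 4 == 2 { n++; sum += length($0) } END { if (n) printf "%.2f", sum / n }')

echo "mean read length: $mean_len"
```

To list the individual read lengths instead of the mean, the same filter can be used with `awk 'NR % 4 == 2 { print length($0) }'`.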
2.6 years ago
GenoMax 149k

You can use a tool from the BBMap suite to get a bunch of stats, the identity of the organism, etc.

$ nanopore.fq.gz   qin=33
Format          fastq
Compression     gz
Interleaved     false
MaxLen          87431
MinLen          134
AvgLen          2049.26
StdevLen        6646.89
ModeLen         240
QualOffset      33
NegativeQuals   0

Content         Nucleotides
Type            DNA
Reads           4000
-JunkReads      0
-ChastityFail   0
-BadPairNames   0

Bases           8197035
-Lowercase      0
-Uppercase      8197035
-Non-Letter     0
-FullyDefined   8197035
-No-call        0
-Degenerate     0
-Gap            0
-Invalid        0

GC              0.435
-GCMedian       40.625
-GCMode         41.675
-GCSTDev        9.087

Cardinality     7804442
Organism        Oryza sativa Japonica Group
TaxID           39947
Barcodes        60
ZMWs            0

QErrorRate      6.368%
-QAvgLog        11.96
-QAvgLinear     20.00
-qMinUncalled   999
-qMaxUncalled   -999
-qMinCalled     1
-qMaxCalled     90
-TrimmedAtQ5    0.30%
-TrimmedAtQ10   13.11%
-TrimmedAtQ15   81.31%
-TrimmedAtQ20   95.24%
