Question: Sequence Length Distribution From A Fastq File
11
gravatar for deepthithomaskannan
6.5 years ago by
Canada
deepthithomaskannan280 wrote:

Hi all,

Any script available for doing sequence length disrtibution of fastq file?

Thanks, Deepthi

sequence fastq length • 39k views
ADD COMMENTlink modified 6 months ago by LuckyLuck0 • written 6.5 years ago by deepthithomaskannan280
5

I'm surprised anyone answered this, given that you present no evidence of having looked for a solution yourself.

ADD REPLYlink written 6.5 years ago by Neilfws48k
13

Maybe just let the community decide on whether a question is worthy of answering or not (unless of course it is completely unsuitable for Biostar). To most of us who have been at this a while his question is trivial, but to him... Lets not make people feel bad about asking questions.

ADD REPLYlink written 6.5 years ago by Ian5.5k
2

I don't care if questions seem "trivial". I do care about questions that begin "is there any script" - it simply implies that they have not even tried a Web search before asking.

ADD REPLYlink written 6.5 years ago by Neilfws48k

Good point: I'm as a consequence surprised by myself...

ADD REPLYlink written 6.5 years ago by Manu Prestat3.9k

Embrace the non-Googlers of Biostars

ADD REPLYlink written 6.3 years ago by Dan D6.9k

Hello,

I am trying to do something similar, were I want to determine both the read length and how many reads are in my fastq file. Here is my code:

gunzip -c SRR1060507_1.fastq.gz|awk 'NR%4==2{printlength($0)}'|uniq -c

But I keep getting the following error:

awk: cmd. line:1: (FILENAME=- FNR=2) fatal: function `printlength' not defined

I'm not sure what I've done incorrectly? I also tried Frederic's code above and although I got that to run, its not exactly the output I'm seeking, I should be returning something like 2420797 100

Any help would be super appreciated!

ADD REPLYlink written 21 months ago by oars160
1

Between print and length should be at least one space. If you want to compute the number of reads simultaneously, use something like Frèdèric's approach

awk 'NR%4 == 2 {lengths[length($0)]++ ; counter++} END {for (l in lengths) {print l, lengths[l]}; print "total reads: " counter}' file.fastq
ADD REPLYlink modified 21 months ago • written 21 months ago by michael.ante3.5k

Thanks! This was a winner.

ADD REPLYlink written 21 months ago by oars160
32
gravatar for Frédéric Mahé
6.5 years ago by
France, Montpellier, CIRAD
Frédéric Mahé3.0k wrote:

Here is a solution using awk:

awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' file.fastq

It reads like this: every second line in every group of 4 lines (the sequence line), measure the length of the sequence and increment the array cell corresponding to that length. When all lines have been read, loop over the array to print its content. In awk, arrays and array cells are initialized when they are called for the first time, no need to initialize them before.

ADD COMMENTlink modified 6 months ago by RamRS24k • written 6.5 years ago by Frédéric Mahé3.0k
1

Thanks for teaching me the NR%4 line grouping method: pretty useful indeed when dealing with fastq files.

ADD REPLYlink modified 6 months ago by RamRS24k • written 6.5 years ago by Manu Prestat3.9k
16
gravatar for bastienllamas
6.3 years ago by
bastienllamas160
bastienllamas160 wrote:

Another awk solution, very similar to Frédéric's:

cat reads.fastq | awk '{if(NR%4==2) print length($1)}' | sort -n | uniq -c > read_length.txt

And to quickly obtain a graph in R:

reads<-read.csv(file="read_length.txt", sep="", header=FALSE)
plot (reads$V2,reads$V1,type="l",xlab="read length",ylab="occurences",col="blue")
ADD COMMENTlink modified 6 months ago by RamRS24k • written 6.3 years ago by bastienllamas160
2

no need to use cat

ADD REPLYlink written 6.3 years ago by PoGibas4.8k
10
gravatar for Jordan
6.5 years ago by
Jordan1.1k
Pittsburgh
Jordan1.1k wrote:

You can use FASTQC. It gives you all the details you need, including the quality, overrepresented sequences, enriched k-mers etc.

ADD COMMENTlink modified 6 months ago by RamRS24k • written 6.5 years ago by Jordan1.1k
6
gravatar for Brian Bushnell
2.6 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

Another possibility, from BBMap:

readlength.sh in=reads.fq out=histogram.txt

The default is 10bp bins with a max of 80kbp, but those can be configured (run the shellscript with no arguments for details). It's very fast, and handles fastq/fasta/sam/bam; raw/gzip/bzip2.

ADD COMMENTlink written 2.6 years ago by Brian Bushnell16k

Really useful! Only if I want to parse the output (e.g. pipe it into grep) I have to first suppress the stderr message: Processed XY reads. Time: Z seconds. Might it be more convenient to have this info also in stdout?

ADD REPLYlink written 10 weeks ago by al-ash110
2
gravatar for Buffo
2.6 years ago by
Buffo1.7k
Buffo1.7k wrote:

Save it as fastq_lenght.py and run it :)

#!/usr/bin/env python
#-*- coding: UTF-8 -*-
import sys
######################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python fastq_lenght.py file.fastq 
result: .txt file same name as input name plus "_lenghts.txt" 
------------------------------------------------------------------------------------
'''
######################################################################################

if len(sys.argv) != 2:
    print syntax
    sys.exit()

######################################################################################

fastq_file = open(sys.argv[1], 'r')
prefix = sys.argv[1].split('.')[0]
outfile = open(prefix + '_' + 'lenghts.txt', 'w')
seq = ''
for line in fastq_file:
    line = line.rstrip('\n')
    if line.startswith('@'):
        if seq:
            outfile.write(name + '\t' + str(len(seq)) + '\n')
            seq = ""
        name = line 
    else:
        seq = line 
outfile.write(name + '\t' + str(len(seq)) + '\n')
fastq_file.close()
outfile.close()

print '\n' + '\t' + 'File: ' + prefix + '_' + 'lenghts.txt has been created...'
ADD COMMENTlink modified 2.3 years ago • written 2.6 years ago by Buffo1.7k
2
gravatar for Manu Prestat
6.5 years ago by
Manu Prestat3.9k
Marseille, France
Manu Prestat3.9k wrote:

You can do this using fastx toolkit, especially, you might be interested in this program:

fastx_nucleotide_distribution_graph.sh

and maybe:

fastx_quality_stats
fastq_quality_boxplot_graph.sh
ADD COMMENTlink modified 6 months ago by RamRS24k • written 6.5 years ago by Manu Prestat3.9k
1

i think this does not provide length distribution information

ADD REPLYlink written 4.0 years ago by Malcolm.Cook1.1k
1
gravatar for Ashutosh Pandey
6.5 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

Use this post as a help Fastq Quality Read and Score Length Check and apply a few basic unix commands like sort and uniq to get the frequency.

ADD COMMENTlink modified 6 months ago by RamRS24k • written 6.5 years ago by Ashutosh Pandey11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1843 users visited in the last hour