Question: Counting number of reads and bases fastaQ with python
0
gravatar for Toto26
5 months ago by
Toto2610
Toto2610 wrote:

Hi all, does someone know a way to count a many base are in a fastaQ file and how many read as well? (R1 and R2 fastaQ files from illumina). And if it is possible to do it with python? Thanks for you help :)

biopython python fastaq • 403 views
ADD COMMENTlink modified 5 months ago by drkennetz350 • written 5 months ago by Toto2610
2

Many other ways described here: https://bioinformatics.stackexchange.com/questions/935/fast-way-to-count-number-of-reads-and-number-of-bases-in-a-fastq-file

This is about as easy as it can get: (reformat.sh from BBMap suite).

$ reformat.sh in=reads.fq
Input:                          4000000 reads           297852680 bases

Time:                           2.339 seconds.
ADD REPLYlink modified 5 months ago • written 5 months ago by genomax57k
0
gravatar for finswimmer
5 months ago by
finswimmer6.2k
Germany
finswimmer6.2k wrote:

Hello Toto26,

here is a naiv python implementation:

And here another one using unix standard commands:

$ zcat your.fastq.gz|paste - - - -|cut -f2|wc -c -l|awk -v OFS="\n" '{print "reads: "$1, "bases: "$2-$1}'

What the command is doing is:

  • zcat decompress the gziped fastq file
  • paste puts 4 lines in a row (delimited by tab) as a typically fastq file have 4 lines per read.
  • cut extract the second column as this column contains now the sequence of the read
  • wc counts the total number of characters (= total number of bases + line breaks) and the total number of lines (= total number of reads)
  • we can use awk to subtract the number of lines from the number of characters calculated by wc, as wc counts the line breaks as a characters

Edit:

I knew that loops in python are slow, but it's amazing how slow. For 1.5Gb large file with 31434394 reads and 2355832933 bases the python solution above needs around 2 minutes to finish. Using zcat etc. runs in 30 second. So for testing I modified the python code in that way, that it reads from stdin serving the result of zcat your.fastq.gz|paste - - - -, split the line and calculate read length. Now even python runs in 30 seconds.

More speed is possible with using pigz instead of zcat. The runtime goes down to 23 seconds.

fin swimmer

ADD COMMENTlink modified 5 months ago • written 5 months ago by finswimmer6.2k
0
gravatar for drkennetz
5 months ago by
drkennetz350
drkennetz350 wrote:

This should give you what you are looking for:

#!/usr/bin/env python

import sys
fastq = sys.argv[1]
with open(fastq, 'r') as f:
    reads = []
    for line in f:
        if line.startswith("@"):
            reads.append(line)
print("The total number of reads is: ", len(reads))
totalReads = len(reads)
bases = []
with open(fastq, 'r') as f:
    for line in f:
        if line.startswith("A") or line.startswith("T") or line.startswith("C") or line.startswith("G"):
            numBases = line.split()
            bases.append(numBases)
flatList = [item for sublist in bases for item in sublist] # split a list of lists into a single list
total = []
for i in flatList:
    totalBases = list(i)
    count = 0
    count += (len(totalBases))
    total.append(count)
allBases = sum(total)
print("The total number of bases is:", allBases)
avgReadLength = allBases/totalReads
print("The avg read length is:", avgReadLength)

To run the code:

$ python fastqcount.py youfastqfile.fastq

The first block of "with open" code opens your fastq read only and checks to see if the line in the file starts with @ (every header line in fastq files starts with @. If it does, it appends it to a list called reads, then we just print the length of the list which gives the number of reads. We then initialize a new list called bases. We use the if statement to see if the line starts with a base (A,T,C,G) and if it does it adds the read sequence to a list called bases. But this list is currently a list of lists, with each list in the main list being a read sequence. We split this into 1 flat list, so we no longer have a list of lists, but rather a list of sequences. We then initialize a list called total, and iterate through each sequence of the list called flatList. To split the big list called flatList into a list of individual bases we use "totalBases = list(i)", then we get the length of each sequence. We do this by initializing a variable called count, and for every sequence in totalBases, we get the length and add it to count. We then put all of the different sequence lengths into the list of total with the "total.append(count)". Lastly, we define a variable called allBases and we just sum the list total. I stuck the average read length function on there as well in case you wanted that. I saw some answers above, but if you are asking this question you are probably pretty new to python. This may help you understand what each step above is doing without using too many built-in functions.

ADD COMMENTlink modified 5 months ago • written 5 months ago by drkennetz350

Hello drkennetz,

good work for taking the time to describe what your code is doing. It animate me to add some comments to my post. Keep on writing!

Please let me comment on two problems I see in your code:

  1. You assume that the fastq file is not gz compressed. That is possible, of course. But most of the time a user have a compressed version of it.
  2. You are trying to identify the lines that hold sequences by looking whether the lines starts with A, T, C or G. But it is also possible that a sequence have an N. Also the quality encoding line have these characters. So it could happend that you count this line as well.

fin swimmer

ADD REPLYlink modified 5 months ago • written 5 months ago by finswimmer6.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1067 users visited in the last hour