Question: Use Python To List Filename With Specific Extensions.
1
8.1 years ago by
Bioscientist1.7k
Bioscientist1.7k wrote:

How can I use Python to list the filenames of all FASTQ files? Should I use os.listdir()? If so, how do I restrict it to FASTQ files?

Also, after this I want to do some further analysis on these files, e.g. zcat filenames.recal.fastq.gz | wc

How can I do such things using Python?

Thanks!

Edit: I'm writing the Python myself. My script looks like this:

#!/usr/bin/python

import os, sys,re,gzip

path = "/home/xxxx/Downloads"

for file in os.listdir(path):
  if re.match('.*\.recal.fastq.gz', file):
    text = gzip.open(file,'r').read()
    word_list = text.split()
    number = word_list.count('J') + 1
    if number != 0:
      print file

Finding the fastq.gz files works, but then I get this error:

Traceback (most recent call last):
  File "try.py", line 9, in <module>
    text = gzip.open(file,'r').read()
  File "/usr/lib/python2.7/gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "/usr/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
IOError: [Errno 2] No such file or directory: 'ERR001274_1.recal.fastq.gz'

I think there's something wrong with the gzip call. Also, why can't I open ERR001274? It DOES exist. Any ideas? Thanks!

python • 45k views
modified 8.1 years ago by Giovanni M Dall'Olio26k • written 8.1 years ago by Bioscientist1.7k
1

Try glob: import glob; print glob.glob("*.fastq"). You might also clarify your question: why not just do it on the command line?

written 8.1 years ago by brentp23k
1

Command-line alternative: http://www.infoanda.com/resources/find.htm

written 8.1 years ago by Michael Schubert6.9k

Your Python code is not showing up correctly formatted; please indent it with four spaces. See: http://meta.stackoverflow.com/questions/22186/how-do-i-format-my-code-blocks

written 8.1 years ago by Jeroen Van Goey2.2k

You need to provide the full path to the file on line 3 of your loop. The file variable in your loop is just a string holding the file name within the Downloads folder, not the full path to it.

written 8.1 years ago by David W4.7k
8
8.1 years ago by
Jeroen Van Goey2.2k
Ghent, Belgium
Jeroen Van Goey2.2k wrote:

To find files with a specific extension, use glob.

import glob
import gzip

filenames = glob.glob('*.fastq.gz')

for filename in filenames:
    with gzip.open(filename) as f:
        data = f.read()
        number_of_characters = len(data)
        # the last line usually has no '\n' so add 1 to count
        number_of_lines = data.count('\n') + 1
        number_of_words = len(data.split())
        print "%d %d %d %s" % (number_of_lines, number_of_words, 
                               number_of_characters, filename)

Note that this naive Python implementation, which reads each whole file into memory at once, is many times slower than the command-line version. To speed up the Python version, read the file in blocks of 1 MB:

import glob
import gzip

BLOCK_SIZE = 2**20
filenames = glob.glob('*.fastq.gz')

for filename in filenames:
    # reset the counters for each file so the printed totals are per file
    number_of_characters = number_of_lines = number_of_words = 0
    with gzip.open(filename) as f:
        for block in iter(lambda: f.read(BLOCK_SIZE), ""):
            number_of_characters += len(block)
            number_of_lines += block.count('\n')
            # note: a word that straddles two blocks is counted twice
            number_of_words += len(block.split())
        print "%d %d %d %s" % (number_of_lines, number_of_words,
                               number_of_characters, filename)
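
For readers on Python 3, here is a minimal sketch of the same block-wise idea. It is not the code from this answer: the function name gzip_wc is hypothetical, the file is opened in binary mode (so the end-of-file sentinel is b"" rather than ""), and a small flag is added so that a word split across two blocks is counted only once.

```python
import glob
import gzip

BLOCK_SIZE = 2**20  # 1 MiB per read, same block size as in the answer


def gzip_wc(path, block_size=BLOCK_SIZE):
    """Return (lines, words, bytes) for the decompressed content of a gzip file.

    Hypothetical helper, roughly comparable to `zcat path | wc`.
    """
    lines = words = size = 0
    in_word = False  # True when the previous block ended in the middle of a word
    with gzip.open(path, "rb") as handle:
        # Binary reads return bytes in Python 3, so iteration stops at b"".
        for block in iter(lambda: handle.read(block_size), b""):
            size += len(block)
            lines += block.count(b"\n")
            pieces = block.split()
            words += len(pieces)
            # The first piece continues a word begun in the previous block,
            # so it must not be counted a second time.
            if in_word and pieces and not block[:1].isspace():
                words -= 1
            in_word = not block[-1:].isspace()
    return lines, words, size


if __name__ == "__main__":
    for filename in glob.glob("*.fastq.gz"):
        print("%d %d %d %s" % (gzip_wc(filename) + (filename,)))
```

Counting bytes rather than decoded characters matches what plain wc reports for ASCII data such as FASTQ.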
modified 8.1 years ago • written 8.1 years ago by Jeroen Van Goey2.2k
1
8.1 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

Hey,

This is really a 'pure' Python question, and probably better asked at Stack Overflow or similar. But this answer is easy enough. Using a list comprehension:

import os

fastq = [f for f in os.listdir('.') if f.endswith('.fastq')]

EDIT: forgot the other bit of your question. You should be able to work this one out - look at the gzip module to read your files, then loop through the lines (I presume you want wc -l) either using count += 1 for each line or enumerate() to get a counter running.
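
The two hints above (filter with endswith, then count lines with enumerate) could be put together roughly as follows. This is only a sketch for Python 3; the helper names list_with_suffix and count_lines are made up for illustration and are not part of any library.

```python
import gzip
import os


def list_with_suffix(directory, suffix):
    """Return the sorted names in `directory` that end with `suffix`."""
    return sorted(name for name in os.listdir(directory) if name.endswith(suffix))


def count_lines(path):
    """Count the lines of a gzip-compressed file by iterating over it.

    Unlike `wc -l`, a final line without a trailing newline is counted too.
    """
    count = 0
    with gzip.open(path, "rb") as handle:
        # enumerate keeps a running counter; start=1 makes it the line number.
        for count, _line in enumerate(handle, start=1):
            pass
    return count


if __name__ == "__main__":
    for name in list_with_suffix(".", ".fastq.gz"):
        print(count_lines(name), name)
```

Iterating line by line keeps memory use small, since only one line is held at a time.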

modified 8.1 years ago • written 8.1 years ago by David W4.7k
0
8.1 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

First: do not use 'file' as a variable name. In Python 2, 'file' is the name of a built-in type; if you shadow it you can get weird behavior.

Second: you need to provide the correct path to the gzip file, concatenating the value of the path variable.

Third: it is better to use the glob module as suggested in another answer.

import os, sys,re,gzip

path = "/home/xxxx/Downloads"

for filename in os.listdir(path):       # Do not use 'file' as a variable name
  if re.match('.*\.recal.fastq.gz', filename):
    text = gzip.open(path + '/' + filename,'r').read() # You need to attach 'path' to the file name
    word_list = text.split()
    number = word_list.count('J') + 1
    if number != 0:
      print filename
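
As a rough Python 3 sketch of the second and third points combined: glob can be given a pattern that already includes the directory, so every match comes back as a path that gzip.open can use directly. The function names find_recal_fastq and count_token are hypothetical, and the directory is the placeholder path from the question.

```python
import glob
import gzip
import os


def find_recal_fastq(directory):
    """Return full paths of the *.recal.fastq.gz files directly inside `directory`."""
    # glob.escape protects any glob metacharacters in the directory name itself.
    pattern = os.path.join(glob.escape(directory), "*.recal.fastq.gz")
    return sorted(glob.glob(pattern))


def count_token(path, token):
    """Count how often `token` occurs as a whitespace-separated word in the file."""
    # Assumes ASCII content, which is the usual case for FASTQ files.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return handle.read().split().count(token)


if __name__ == "__main__":
    # Placeholder directory taken from the question; adjust to your own setup.
    for full_path in find_recal_fastq("/home/xxxx/Downloads"):
        if count_token(full_path, "J") > 0:
            print(full_path)
```

Because each match already contains the directory, the "No such file or directory" error from the question does not occur even when the script is run from another folder.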
written 8.1 years ago by Giovanni M Dall'Olio26k