How to automatically decompress bz2 fastq files from different directories into one directory
9.3 years ago
sentausa ▴ 650

Dear all,

(This question is more about data/file management in UNIX/Mac OS X, but the data are fastq files anyway, so I'm asking it here.)

I have many fastq files scattered in different directories (compressed directories as bz2 or tar.bz2 and non-compressed directories) and I'd like to collect these fastq files in one directory while changing the filenames according to their original directory names. So, my files look a bit like this:

>parentdirectory
->this
-->data
--->sample1
---->target.fastq.bz2
---->anotherfastq.fastq.bz2
---->anotherfile.bz2
--->sample2
---->target.fastq.bz2
---->anotherfastq.fastq.bz2
---->anotherfile.bz2
->that
-->those
--->data.tar.bz2
>anotherdirectory
->data
-->sample_a
--->target.fastq.bz2
--->anotherfastq.fastq.bz2
--->anotherfile.bz2
-->sample_b
--->target.fastq.bz2
--->anotherfastq.fastq.bz2
--->anotherfile.bz2

Please note that the compressed data.tar.bz2 archive contains the same structure as the other, non-compressed data directories. My goal is to collect those target.fastq files, uncompressed, in one directory while replacing "target" in each filename with the name of its parent directory ("sample1", "sample2", "sample_a", etc.).

Any idea how to do that automatically/programmatically?

Thank you in advance for your kind help!

fastq • 5.8k views

I suspect that Ram's answer will be the simplest in the long term (it'll probably take a bit of playing to get exactly what you want). The alternative is to just code a short little script in bash/python/perl/whatever to walk the directory structure and extract/rename as needed.

Thanks. Could you please point me in the right direction if I want to write a script in Python to do that? Thanks again.

You'll need to import the os and likely the glob modules in Python (both are in the standard library, so there's nothing to install). The steps would then generally be as follows:

  1. List all subdirectories with os.listdir()
    • Iterate over those
  2. For each directory iterate over its subdirectories until you either find a subdirectory with compressed fastq files or one without subdirectories.
    • If you find a directory without fastq files or subdirectories then skip it.
    • Otherwise, continue
  3. If you find fastq files, the containing directory is a sample name.
    • Call bunzip2 with either the subprocess module or, more simply, os.system(), and have the fastq file extracted to the target directory with a new name (I would prepend the sample name, but you can use any naming scheme you like).

You could also read the tar file directly in Python (with the tarfile module), but that might prove to be a bit more work.
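
The steps above could be sketched roughly as follows. This is only one possible reading of them, not a tested pipeline: it swaps the external bunzip2 call for Python's built-in bz2 module, uses os.walk() instead of nested os.listdir() calls, and assumes the output should be named after the immediate parent directory (e.g. sample1.fastq). The function name collect_fastq and its parameters are made up for illustration.

```python
import bz2
import os
import shutil


def collect_fastq(root, out_dir, wanted="target.fastq.bz2"):
    """Decompress every `wanted` file found under `root` into `out_dir`.

    Each output file is named after the directory that contained it,
    so sample1/target.fastq.bz2 becomes out_dir/sample1.fastq.
    (Hypothetical helper; the naming scheme is an assumption.)
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if wanted not in filenames:
            continue  # no target file here; other .bz2 files are ignored
        # The containing directory is treated as the sample name.
        sample = os.path.basename(os.path.abspath(dirpath))
        dest = os.path.join(out_dir, sample + ".fastq")
        if os.path.exists(dest):
            # Two samples with the same directory name would clash.
            raise FileExistsError(dest)
        source = os.path.join(dirpath, wanted)
        # Stream the decompressed data so large files never sit in memory.
        with bz2.open(source, "rb") as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(dest)
    return written
```

Something like collect_fastq("parentdirectory", "/target/dir") would then be called once per top-level directory. Files inside data.tar.bz2 are not touched by this sketch.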

For Python, you'll need to use File IO, basic file ops with os and some tar operations with the tarfile module.
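
For the data.tar.bz2 case, a minimal sketch with the tarfile module might look like this. It assumes the archive holds the same sampleX/target.fastq.bz2 layout described in the question, so each matching member is itself bzip2-compressed and needs a second decompression step. The function name extract_targets_from_tar is hypothetical.

```python
import bz2
import os
import posixpath
import shutil
import tarfile


def extract_targets_from_tar(tar_path, out_dir, wanted="target.fastq.bz2"):
    """Pull every `wanted` member out of a .tar.bz2 archive, decompress it,
    and write it to out_dir/<parent directory name>.fastq.
    (Hypothetical helper; the archive layout is an assumption.)
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with tarfile.open(tar_path, "r:bz2") as archive:
        for member in archive:
            # Member names inside a tar archive always use "/" separators.
            if not member.isfile() or posixpath.basename(member.name) != wanted:
                continue
            sample = posixpath.basename(posixpath.dirname(member.name))
            if not sample:
                continue  # file sits at the archive root, so no sample name
            dest = os.path.join(out_dir, sample + ".fastq")
            # extractfile() gives a file object for the still-compressed member;
            # bz2.open() accepts a file object as well as a filename.
            with bz2.open(archive.extractfile(member), "rb") as src, \
                    open(dest, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(dest)
    return written
```

Nothing is unpacked to disk except the final fastq files, which avoids extracting the whole archive first.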


Thank you for the endorsement. I do suspect you meant to say short term, no? Py/Perl scripts are always better for long term :)

Oh, indeed! I need to follow the advice on the coffee mug in my office that translates to, "Don't say anything before the first cup of coffee!".

I'm on my second cup - had an unusually early first cup, guess that helped with the scripting :)

9.3 years ago
Ram 43k

find . -name "*.fastq.bz2" -exec tar -C /target/directory -xjf {} \;

EDIT: Updated with (many) fixes to script

This should help:

$ find . -name "target.fastq.bz2" -print | \
  rev | \
  awk -F "/" '
    BEGIN {OFS="\t"}
    { print $0,$2 }
  ' | \
  rev | \
  awk -F "\t" '
    { print "bzip2 -dc \""$2"\" > \"/target/dir/"$1".fastq\"" }
  ' > toExec.sh

$ chmod u+x toExec.sh && sh toExec.sh

Explanation:

  1. Find all target.fastq.bz2 files and print their full path
  2. Pick the immediate parent directory name from the full path and pair it with that path
  3. Craft a command (to decompress into the target directory under the parent directory's name) from the two components picked up
  4. Write all decompression commands to a shell script
  5. Execute created shell script
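
For anyone who would rather build the command list in Python, here is a small sketch of the same generate-then-run idea. It assumes the .fastq.bz2 files are plain bzip2 streams (not tar archives), so it emits bzip2 -dc lines, and it uses shlex.quote() so that paths containing spaces survive the shell. The name build_commands and the /target/dir default are placeholders.

```python
import os
import shlex


def build_commands(paths, out_dir="/target/dir"):
    """Return one `bzip2 -dc SRC > DEST` shell line per input path.

    DEST is out_dir/<parent directory name>.fastq.
    (Hypothetical helper; nothing is executed here.)
    """
    commands = []
    for path in paths:
        # Immediate parent directory of the file, e.g. "sample1".
        sample = os.path.basename(os.path.dirname(os.path.abspath(path)))
        dest = os.path.join(out_dir, sample + ".fastq")
        commands.append(
            "bzip2 -dc {} > {}".format(shlex.quote(path), shlex.quote(dest))
        )
    return commands
```

Printing the returned lines first and eyeballing them is a cheap dry run before writing them to a script and executing it.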

That's getting to be a pretty impressive one (well, two, but who's counting) liner!

Thanks for the quick answer, but won't it include also the anotherfastq.fastq.bz2 files? I don't want those; I need only the target.fastq.bz2 files.

<content moved to Answer>

I got errors:

tar: Error opening archive: Failed to open 'target.fastq.bz2'

Ah, yes. I apologize. There's a tiny problem we need to circumvent. Hold on while I update the command.

Thanks a lot!

I got errors from tar (unrecognized format), but I think I've fixed it using bzip2. It's still running, but I believe I've got it. Thank you!

Glad it works. You're very welcome!
