Question: How to automatically decompress bz2 fastq files from different directories into one directory
sentausa (640), France, wrote 4.6 years ago:

Dear all,

(This question is more about data/file management on UNIX/Mac OS X, but the data are fastq files, so I'm asking it here.)

I have many fastq files scattered in different directories (compressed directories as bz2 or tar.bz2 and non-compressed directories) and I'd like to collect these fastq files in one directory while changing the filenames according to their original directory names. So, my files look a bit like this:

>parentdirectory
->this
-->data
--->sample1
---->target.fastq.bz2
---->anotherfastq.fastq.bz2
---->anotherfile.bz2
--->sample2
---->target.fastq.bz2
---->anotherfastq.fastq.bz2
---->anotherfile.bz2
->that
-->those
--->data.tar.bz2
>anotherdirectory
->data
-->sample_a
--->target.fastq.bz2
--->anotherfastq.fastq.bz2
--->anotherfile.bz2
-->sample_b
--->target.fastq.bz2
--->anotherfastq.fastq.bz2
--->anotherfile.bz2

Please note that the compressed data.tar.bz2 directory contains the same structure as the other non-compressed data directories. My goal is to collect those target.fastq files uncompressed in one directory while changing the filename text "target" into the name of the corresponding parent directory ("sample1", "sample2", "sample_a", etc.).

Any idea how to do that automatically/programmatically?

Thank you in advance for your kind help!

bz2 fastq • 3.8k views
modified 4.6 years ago by RamRS (22k) • written 4.6 years ago by sentausa (640)

I suspect that Ram's answer will be the simplest in the long term (it'll probably take a bit of playing to get exactly what you want). The alternative is to just code a short little script in bash/python/perl/whatever to walk the directory structure and extract/rename as needed.

written 4.6 years ago by Devon Ryan (91k)

Thanks. Could you please point me in the right direction if I want to write a script in Python to do that? Thanks again.

written 4.6 years ago by sentausa (640)

You'll need to import the os and likely the glob modules in Python (these should already be available, so there's nothing to install). The steps would then generally be as follows:

  1. List all subdirectories with os.listdir()
    • Iterate over those
  2. For each directory iterate over its subdirectories until you either find a subdirectory with compressed fastq files or one without subdirectories.
    • If you find a directory without fastq files or subdirectories then skip it.
    • Otherwise, continue
  3. If you find fastq files, the containing directory is a sample name.
    • Call bunzip2 with either the subprocess module or, more simply, os.system() and have the fastq file extracted to the target directory with a new name (I would prepend the sample name, but you can use any naming scheme you like).

You could also read the tar file directly in Python (with the tarfile module), but that might prove to be a bit more work.
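A minimal sketch of the walk-and-extract steps above, assuming Python 3 and using the standard bz2 module in place of an external bunzip2 call. The function name collect_fastq and the "<sample>.fastq" naming scheme are only illustrations, not something from the original post; adapt them as needed.

```python
import bz2
import os
import shutil


def collect_fastq(roots, dest, target="target.fastq.bz2"):
    """Decompress every `target` file found under `roots` into `dest`.

    Each output file is named after the directory that contained it,
    e.g. .../sample1/target.fastq.bz2 -> dest/sample1.fastq
    (hypothetical naming scheme).
    """
    os.makedirs(dest, exist_ok=True)
    written = []
    for root in roots:
        # os.walk visits every subdirectory, so no manual recursion is needed.
        for dirpath, _dirnames, filenames in os.walk(root):
            if target not in filenames:
                continue  # no target file here, skip this directory
            sample = os.path.basename(os.path.abspath(dirpath))
            out_path = os.path.join(dest, sample + ".fastq")
            if os.path.exists(out_path):
                # Two samples with the same directory name would collide.
                raise FileExistsError(out_path)
            src_path = os.path.join(dirpath, target)
            # Stream the decompressed data so large files never sit in memory.
            with bz2.open(src_path, "rb") as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(out_path)
    return written
```

With the layout from the question, collect_fastq(["parentdirectory", "anotherdirectory"], "/target/dir") would write sample1.fastq, sample2.fastq, sample_a.fastq and sample_b.fastq, and ignore the anotherfastq.fastq.bz2 and anotherfile.bz2 files. It does not look inside data.tar.bz2.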

written 4.6 years ago by Devon Ryan (91k)

For Python, you'll need to use File IO, basic file ops with os and some tar operations with the tarfile module.

Sources:

https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files

http://stackoverflow.com/a/2759130/1394178

https://docs.python.org/2/library/tarfile.html
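
For the data.tar.bz2 case, here is a rough sketch with the tarfile module, assuming Python 3 and assuming the fastq files inside the archive are themselves still bz2-compressed (the question says the archive has the same structure as the other directories). The function name extract_targets_from_tar and the "<sample>.fastq" output names are hypothetical.

```python
import bz2
import os
import posixpath
import shutil
import tarfile


def extract_targets_from_tar(tar_path, dest, target="target.fastq.bz2"):
    """Pull every `target` member out of a .tar.bz2 and decompress it into `dest`.

    Output files are named after the member's parent directory
    (hypothetical naming scheme), e.g. data/sample_x/target.fastq.bz2
    -> dest/sample_x.fastq
    """
    os.makedirs(dest, exist_ok=True)
    written = []
    with tarfile.open(tar_path, "r:bz2") as archive:
        for member in archive.getmembers():
            # Member names inside a tar archive always use "/" separators.
            if not member.isfile() or posixpath.basename(member.name) != target:
                continue
            sample = posixpath.basename(posixpath.dirname(member.name))
            out_path = os.path.join(dest, sample + ".fastq")
            # extractfile() gives a file-like object, so nothing is unpacked to disk.
            compressed = archive.extractfile(member)
            # Assumption: the member is itself a bz2 stream, so decompress it again.
            with bz2.open(compressed, "rb") as src, open(out_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(out_path)
    return written
```

Reading members this way avoids unpacking the whole archive just to get at the few target files.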

written 4.6 years ago by RamRS (22k)

Thank you for the endorsement. I do suspect you meant to say short term, no? Py/Perl scripts are always better for long term :)

written 4.6 years ago by RamRS (22k)

Oh, indeed! I need to follow the advice on the coffee mug in my office that translates to, "Don't say anything before the first cup of coffee!".

written 4.6 years ago by Devon Ryan (91k)

I'm on my second cup - had an unusually early first cup, guess that helped with the scripting :)

written 4.6 years ago by RamRS (22k)

RamRS (22k), Houston, TX, wrote 4.6 years ago:
find . -name "*.fastq.bz2" -exec tar -C /target/directory -xjf {} \;

EDIT: Updated with (many) fixes to script

This should help:

$ find . -name "target.fastq.bz2" -print | rev | awk -F "/" 'BEGIN{OFS="\t"} { print $0,$2 }' | rev | awk -F "\t" '{print "cp "$2" /target/dir/"$1" && tar -xjf /target/dir/"$1" -C /target/dir/" }' >> toExec.sh

$ chmod u+x toExec.sh && sh toExec.sh

Explanation:

  1. Find all target.fastq.bz2 files and print their full path
  2. Pick file name and immediate parent directory name from the full path
  3. Craft a command (to copy to target directory and rename) from the two components picked up
  4. Write all untar commands to a shell script
  5. Execute created shell script
modified 4.6 years ago • written 4.6 years ago by RamRS (22k)

That's getting to be a pretty impressive one (well, two, but who's counting) liner!

written 4.6 years ago by Devon Ryan (91k)

Thanks for the quick answer, but won't it include also the anotherfastq.fastq.bz2 files? I don't want those; I need only the target.fastq.bz2 files.

written 4.6 years ago by sentausa (640)

<content moved to Answer>

modified 4.6 years ago • written 4.6 years ago by RamRS (22k)

I got errors: tar: Error opening archive: Failed to open 'target.fastq.bz2'

written 4.6 years ago by sentausa (640)

Ah, yes. I apologize. There's a tiny problem we need to circumvent. Hold on while I update the command.

written 4.6 years ago by RamRS (22k)

Thanks a lot!

written 4.6 years ago by sentausa (640)

I got errors from tar (unrecognized format), but I think I've fixed it using bzip2. It's still running, but I believe I've got it. Thank you!

written 4.6 years ago by sentausa (640)

Glad it works. You're very welcome!

written 4.6 years ago by RamRS (22k)