Question

Have a question about run jellyfish in many directories with different number of zipped fasta files

0

Entering edit mode

2.9 years ago

schlogl ▴ 160

Hi there! I have a directory tree like this:

Data/genome_splitted/GeneraName/chromosomes/*.fna.gz

And each GeneraName/chromosomes directory has 1 to many gzip fasta files and I want/need to run jellyfish in my data . And I want to put the results(mer_counts.jf) in other directory:

Results/GeneraName/chromosomes/kmers3(if i am count words with length 3)

But jellyfish wont works in gzip files and then I need to decompress and pipe the file to jellyfish. The manual says I can use generators:

"How to read multiple files at once?

Often, jellyfish can parse an input sequence file faster than gzip or fastq-dump (to parse SRA files) can output the sequence. This leads to many threads in jellyfish going partially unused. Jellyfish can be instructed to open multiple file at once. For example, to read two short read archive files simultaneously:

jellyfish count -F 2 <(fastq-dump -Z file1.sra) <(fastq-dump -Z file2.sra) ...

Another way is to use "generators". First, create a file containing, one per line, commands to generate sequence. Then pass this file to jellyfish and the number of generators to run simultaneously. Jellyfish will spawn subprocesses running the commands passed and read their standard output for sequence. By default, the commands are run using the shell in the SHELL environment variable, and this can be changed by the -S switch. Multiple generators will be run simultaneously as specified by the -G switch. For example:

ls *.fasta.gz | xargs -n 1 echo gunzip -c > generators
jellyfish count -g generators -G 4 ...

The first command created the command list into the 'generators' file, each command unzipping one FASTA file in the current directory. The second command runs jellyfish with 4 concurrent generators."

A generator is a file like this:

gunzip -c species_strain1_complete_genome.fna.gz
gunzip -c species_strain3_complete_genome.fna.gz...

I worked out this script:

#!usr/bin/env bash

echo "Counting k-mers with JellyFish"

input='teste_name.txt'

while IFS= read -r line
do
  echo "Counting kmers in $line genomes"
  ls test/$line/chromosomes/*.fna.gz | xargs -n 1 echo gunzip -c > generators
  jellyfish count -g generators -m 4 -s 100M -o $line'_4'.jf 
done < "$input"

But I need improve it because I have different number os fasta files to read for each genera and by my understanding if I use generators I only count 1 fasta file by genera and that is not what I need. And if I use -G I think I need to anticipate the number of generators for each sub directory 'chromosome', if not I think the current count overwrite the previous count right?. There are any way I can improve this script? I have not many experience with bash so... any help would be great. Thanks. Paulo

bash • 994 views

ADD COMMENT • link 2.9 years ago by schlogl ▴ 160