Question: How can I run lastz on multiple files using a bash loop and an associative array?
9 months ago, SaltedPork wrote:

The lastz command will look like:

lastz reference.fasta sample.fasta --ambiguous=IUPAC --format=GENERAL > output.fasta

I need to run lastz once for every sample file, because each sample file has to be compared to its own individual reference (I have one reference for each sample).

I'm using find to get all of my 'sample' files, and will do the same for the reference files.

find . -type f -name 'sample.fasta' -not -path "*/temp/*"

The question is, how do I associate each sample file with each reference file, taking as input the output of the find command? Or is there a better way?


Based on your find command, it appears that your files are all in one local directory. Can a sample and its corresponding reference be told apart by file extension alone? If so, why not do:

find . -type f -name 'sample.fasta' -not -path "*/temp/*" > files
for i in $(sed 's/\.fasta$//' files); do lastz "$i.ref" "$i.fasta" --ambiguous=IUPAC --format=GENERAL > "${i}_out.fasta"; done
Reply written 9 months ago by genomax

9 months ago, mbens wrote:

It depends on the location of your reference files. Assuming reference and sample are located within one directory, you won't need associative arrays.

Example file structure:

├── a
│   ├── reference.fasta
│   └── sample.fasta
├── b
│   ├── reference.fasta
│   └── sample.fasta
└── c
    ├── reference.fasta
    └── sample.fasta

Collect paths to samples and references in separate files

find . -type "f" -name "sample.fasta" | sort > samples.list
find . -type "f" -name "reference.fasta" | sort > references.list

Create association sample - reference

paste -d ',' references.list samples.list > file_association.csv

Create commands

while IFS="," read -r col1 col2
do
    # write the output next to the sample file
    x=$(dirname "$col2")
    echo "lastz $col1 $col2 --ambiguous=IUPAC --format=GENERAL > $x/output.fasta"
done < file_association.csv
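
The loop above only prints the commands. A minimal sketch of a variant that can also run them follows; the function name run_from_csv and the DRY_RUN variable are illustrative assumptions, not part of the original answer. It reads file_association.csv on a separate file descriptor so that a command inside the loop cannot consume the list from standard input.

```shell
#!/usr/bin/env bash
# Sketch only: run_from_csv and DRY_RUN are hypothetical names.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 runs lastz.

run_from_csv() {
    local csv=$1 ref sample outdir
    # Read "reference,sample" pairs on file descriptor 3.
    while IFS=',' read -r -u 3 ref sample; do
        # Skip blank or incomplete lines.
        [[ -z $ref || -z $sample ]] && continue
        outdir=$(dirname "$sample")
        if [[ ${DRY_RUN:-1} -eq 1 ]]; then
            printf 'lastz %s %s --ambiguous=IUPAC --format=GENERAL > %s/output.fasta\n' \
                "$ref" "$sample" "$outdir"
        else
            lastz "$ref" "$sample" --ambiguous=IUPAC --format=GENERAL > "$outdir/output.fasta"
        fi
    done 3< "$csv"
}
```

Calling run_from_csv file_association.csv shows what would be executed; DRY_RUN=0 run_from_csv file_association.csv would then run lastz for each pair.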

Alternatively, create the commands with GNU parallel (--dry-run only prints them):

cat file_association.csv | parallel --dry-run --colsep ',' "lastz {1} {2} --ambiguous=IUPAC --format=GENERAL > {1//}/output.fasta"

Resulting commands

lastz ./a/reference.fasta ./a/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./a/output.fasta
lastz ./b/reference.fasta ./b/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./b/output.fasta
lastz ./c/reference.fasta ./c/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./c/output.fasta
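
The paste step relies on both sorted lists lining up line by line, so a directory that lacks one of the two files would shift every later pair. If that is a concern, an associative array keyed on the directory is one way to make the pairing explicit. A rough sketch, assuming bash 4 or newer (for local -A) and paths without newlines; the function name print_lastz_commands and the array name ref_for are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: print_lastz_commands and ref_for are hypothetical names.
# Needs bash >= 4 for associative arrays.

print_lastz_commands() {
    local root=${1:-.} path dir
    local -A ref_for=()   # key: directory, value: path of its reference

    # Index every reference by the directory that contains it.
    while IFS= read -r path; do
        ref_for["$(dirname "$path")"]=$path
    done < <(find "$root" -type f -name 'reference.fasta' | sort)

    # Look up the reference for each sample by its directory.
    while IFS= read -r path; do
        dir=$(dirname "$path")
        if [[ -n ${ref_for[$dir]:-} ]]; then
            echo "lastz ${ref_for[$dir]} $path --ambiguous=IUPAC --format=GENERAL > $dir/output.fasta"
        else
            # Report samples without a reference instead of mispairing them.
            echo "warning: no reference found for $path" >&2
        fi
    done < <(find "$root" -type f -name 'sample.fasta' | sort)
}
```

Running print_lastz_commands . from the top-level directory prints one command per sample that has a reference, and warns on standard error about any sample that does not.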

Wow, this is precisely what I want, and yes, moving the files to the same directory is a lot easier. Thank you so much!

Reply written 9 months ago by SaltedPork

9 months ago, cpad0112 wrote:
$ tree .
├── a
│   ├── reference.fa
│   └── sample.fa
├── b
│   ├── reference.fa
│   └── sample.fa
└── c
    ├── reference.fa
    └── sample.fa

3 directories, 6 files

Using GNU parallel (remove the --dry-run option to execute the commands):

$ ls -d */ | parallel --dry-run 'lastz ./{}reference.fa ./{}sample.fa --ambiguous=IUPAC --format=GENERAL > ./{}output.fa'

Commands to be executed (i.e., the dry-run output):

lastz ./a/reference.fa ./a/sample.fa --ambiguous=IUPAC --format=GENERAL > ./a/output.fa
lastz ./b/reference.fa ./b/sample.fa --ambiguous=IUPAC --format=GENERAL > ./b/output.fa
lastz ./c/reference.fa ./c/sample.fa --ambiguous=IUPAC --format=GENERAL > ./c/output.fa