Question: How can I run lastz on multiple files using a bash loop and an associative array?
16 days ago by
SaltedPork60 wrote:

The lastz command will look like:

lastz reference.fasta sample.fasta --ambiguous=IUPAC --format=GENERAL > output.fasta

I need to run lastz once for every sample file, because each sample file has to be compared to its own individual reference (I have one reference for each sample).

I'm using find to get all of my 'sample' files, and will do the same for the reference files.

find . -type f -name 'sample.fasta' -not -path "*/temp/*"

The question is, how do I associate each sample file with each reference file, taking as input the output of the find command? Or is there a better way?


Based on your find command, it appears that your files are all in one local directory. Can a sample and its corresponding reference be told apart by file extension alone? If so, why not do:

find . -type f -name 'sample.fasta' -not -path "*/temp/*" > files
while IFS= read -r f; do i=${f%.fasta}; lastz "$i.ref" "$i.fasta" --ambiguous=IUPAC --format=GENERAL > "${i}_out.fasta"; done < files
written 16 days ago by genomax40k
16 days ago by
mbens100 wrote:

It depends on the location of your reference files. Assuming reference and sample are located within one directory, you won't need associative arrays.

Example file structure:

├── a
│   ├── reference.fasta
│   └── sample.fasta
├── b
│   ├── reference.fasta
│   └── sample.fasta
└── c
    ├── reference.fasta
    └── sample.fasta

Collect paths to samples and references in separate files

find . -type "f" -name "sample.fasta" | sort > samples.list
find . -type "f" -name "reference.fasta" | sort > references.list

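Create the sample-reference association (this assumes both sorted lists contain exactly one entry per directory, so that line N of each list refers to the same directory)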

paste -d ',' references.list samples.list > file_association.csv
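
Because paste pairs lines purely by position, a missing reference or sample in one directory would silently shift every later pair. A quick sanity check along the following lines may be worth running before generating commands. This is only a sketch: check_pairs is a hypothetical helper name, and it assumes the comma-separated layout of file_association.csv shown above and paths without commas.

```shell
#!/usr/bin/env bash
# Sketch: confirm that each line of the association file pairs a reference
# and a sample that live in the same directory.
# check_pairs is a hypothetical helper, not part of lastz or any other tool.
check_pairs() {
    local csv=$1 ref sample status=0
    while IFS=',' read -r ref sample; do
        # An empty field means one list was shorter than the other.
        if [ -z "$ref" ] || [ -z "$sample" ]; then
            echo "incomplete pair: '$ref' <-> '$sample'" >&2
            status=1
        elif [ "$(dirname "$ref")" != "$(dirname "$sample")" ]; then
            echo "mismatch: $ref <-> $sample" >&2
            status=1
        fi
    done < "$csv"
    return "$status"
}

# Example usage (left commented out so the sketch has no side effects):
# check_pairs file_association.csv && echo "all pairs OK"
```

The function returns a non-zero status if any line looks wrong, so it can gate the command-generation step.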

Create commands

# Read each "reference,sample" pair and print the matching lastz command
while IFS="," read -r col1 col2; do
    x=$(dirname "$col2")
    echo "lastz $col1 $col2 --ambiguous=IUPAC --format=GENERAL > $x/output.fasta"
done < file_association.csv

Or, using GNU parallel (remove --dry-run to actually execute the commands)

cat file_association.csv | parallel --dry-run --colsep ',' "lastz {1} {2} --ambiguous=IUPAC --format=GENERAL > {1//}/output.fasta"

Resulting commands

lastz ./a/reference.fasta ./a/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./a/output.fasta
lastz ./b/reference.fasta ./b/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./b/output.fasta
lastz ./c/reference.fasta ./c/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./c/output.fasta
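
Since the question asks specifically about associative arrays, here is a minimal sketch of that route for comparison. It requires bash 4 or newer and assumes, as in the layout above, that each sample.fasta has a reference.fasta next to it. The names ref_for, load_pairs and print_commands are hypothetical. Like the loop above, it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: requires bash >= 4 for associative arrays.
# ref_for, load_pairs and print_commands are hypothetical names.
declare -A ref_for   # key: path to a sample, value: path to its reference

load_pairs() {
    local root=${1:-.} sample
    # -print0 / read -d '' keeps paths containing spaces intact.
    while IFS= read -r -d '' sample; do
        # Assumption: the reference sits in the same directory as the sample.
        ref_for["$sample"]="$(dirname "$sample")/reference.fasta"
    done < <(find "$root" -type f -name 'sample.fasta' -not -path '*/temp/*' -print0)
}

print_commands() {
    local sample ref
    for sample in "${!ref_for[@]}"; do
        ref=${ref_for[$sample]}
        # Skip samples whose expected reference file does not exist.
        [ -f "$ref" ] || { echo "no reference for $sample" >&2; continue; }
        echo "lastz $ref $sample --ambiguous=IUPAC --format=GENERAL > $(dirname "$sample")/output.fasta"
    done
}

load_pairs .
print_commands
```

Note that the iteration order of an associative array is not guaranteed, so the commands may come out in a different order than with the sorted lists.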

Wow, this is precisely what I want, and yes, moving the files to the same directory is a lot easier. Thank you so much!

written 16 days ago by SaltedPork60
16 days ago by
cpad01124.1k wrote:
$ tree .
├── a
│   ├── reference.fa
│   └── sample.fa
├── b
│   ├── reference.fa
│   └── sample.fa
└── c
    ├── reference.fa
    └── sample.fa

3 directories, 6 files

Using GNU parallel (remove the --dry-run option to execute the commands):

$ ls -d */ | parallel --dry-run 'lastz ./{}reference.fa ./{}sample.fa --ambiguous=IUPAC --format=GENERAL > ./{}output.fa'

Commands to be executed (i.e. the dry-run output):

lastz ./a/reference.fa ./a/sample.fa --ambiguous=IUPAC --format=GENERAL > ./a/output.fa
lastz ./b/reference.fa ./b/sample.fa --ambiguous=IUPAC --format=GENERAL > ./b/output.fa
lastz ./c/reference.fa ./c/sample.fa --ambiguous=IUPAC --format=GENERAL > ./c/output.fa
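
If GNU parallel is not available, the same per-directory pairing can be sketched as a plain shell loop. This is an illustration rather than a drop-in replacement: run_all and the DRY_RUN variable are hypothetical names, and the loop runs the jobs one after another instead of in parallel.

```shell
#!/usr/bin/env bash
# Sketch: loop over subdirectories of the current directory without GNU parallel.
# run_all and DRY_RUN are hypothetical names; DRY_RUN defaults to 1 (print only).
run_all() {
    local dir
    for dir in */; do
        dir=${dir%/}   # strip the trailing slash added by the glob
        # Skip directories that lack either file.
        [ -f "$dir/reference.fa" ] && [ -f "$dir/sample.fa" ] || continue
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "lastz ./$dir/reference.fa ./$dir/sample.fa --ambiguous=IUPAC --format=GENERAL > ./$dir/output.fa"
        else
            lastz "./$dir/reference.fa" "./$dir/sample.fa" --ambiguous=IUPAC --format=GENERAL > "./$dir/output.fa"
        fi
    done
}

run_all
```

Calling it as DRY_RUN=0 run_all would execute lastz for each directory instead of printing the commands.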