Question: How can I run lastz on multiple files using a bash loop and an associative array?
SaltedPork wrote (6 months ago):

The lastz command will look like:

lastz reference.fasta sample.fasta --ambiguous=IUPAC --format=GENERAL > output.fasta

I need to run lastz once for every sample file, because each sample file has to be compared to its own individual reference (I have one reference for each sample).

I'm using find to get all of my 'sample' files, and will do the same for the reference files.

find . -type f -name 'sample.fasta' -not -path "*/temp/*"

The question is, how do I associate each sample file with each reference file, taking as input the output of the find command? Or is there a better way?


Based on your find command, it appears that your files are all in one local directory. Is there a way to tell a sample (and its corresponding reference) apart by file extension alone? If so, why not do:

find . -type f -name 'sample.fasta' -not -path "*/temp/*" > files
while IFS= read -r f; do i=${f%.fasta}; lastz "$i.ref" "$i.fasta" --ambiguous=IUPAC --format=GENERAL > "${i}_out.fasta"; done < files
(comment by genomax, written 6 months ago)
mbens wrote (6 months ago):

It depends on the location of your reference files. Assuming each reference and its sample are located in the same directory, you won't need associative arrays.

Example file structure:

├── a
│   ├── reference.fasta
│   └── sample.fasta
├── b
│   ├── reference.fasta
│   └── sample.fasta
└── c
    ├── reference.fasta
    └── sample.fasta

Collect paths to samples and references in separate files

find . -type "f" -name "sample.fasta" | sort > samples.list
find . -type "f" -name "reference.fasta" | sort > references.list

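Create the sample-reference association (paste joins line N of one list with line N of the other, so both lists must sort into the same directory order):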

paste -d ',' references.list samples.list > file_association.csv
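
Because paste pairs lines purely by position, a missing reference.fasta or sample.fasta in one directory would silently shift every later pair. A minimal sketch of a sanity check, assuming the "reference,sample" column order used above; the function name check_pairs is made up for illustration:

```shell
#!/usr/bin/env bash
# check_pairs (hypothetical helper): report rows of a "reference,sample" CSV
# whose two paths do not sit in the same directory.
check_pairs() {
    local csv=$1 ref sample bad=0
    while IFS=',' read -r ref sample; do
        if [ "$(dirname "$ref")" != "$(dirname "$sample")" ]; then
            echo "mismatch: $ref <-> $sample" >&2
            bad=1
        fi
    done < "$csv"
    # Exit status 0 only if every row paired files from the same directory.
    return "$bad"
}
```

It could be called as check_pairs file_association.csv before generating any commands.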

Create commands

while IFS="," read -r col1 col2
do
    x=$(dirname "$col2")
    echo "lastz $col1 $col2 --ambiguous=IUPAC --format=GENERAL > $x/output.fasta"
done < file_association.csv

Or, using GNU parallel (remove --dry-run to actually execute the commands):

cat file_association.csv | parallel --dry-run --colsep ',' "lastz {1} {2} --ambiguous=IUPAC --format=GENERAL > {1//}/output.fasta"

Resulting commands

lastz ./a/reference.fasta ./a/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./a/output.fasta
lastz ./b/reference.fasta ./b/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./b/output.fasta
lastz ./c/reference.fasta ./c/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./c/output.fasta
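
If the references live under a different root, sorting two find outputs no longer guarantees that they line up, and the associative array from the question title becomes useful. A minimal sketch, assuming bash 4 or later and assuming a sample and its reference share the name of their parent directory; the function name print_lastz_commands and the two-root layout are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical layout: <sample_root>/<name>/sample.fasta and
# <ref_root>/<name>/reference.fasta, matched on <name>.
print_lastz_commands() {
    local sample_root=$1 ref_root=$2 path key
    local -A ref_for=()

    # Map each parent-directory name to its reference file.
    while IFS= read -r -d '' path; do
        key=$(basename "$(dirname "$path")")
        ref_for[$key]=$path
    done < <(find "$ref_root" -type f -name 'reference.fasta' -print0)

    # Look up the reference for every sample and print the command (dry run).
    while IFS= read -r -d '' path; do
        key=$(basename "$(dirname "$path")")
        if [[ -n ${ref_for[$key]:-} ]]; then
            echo "lastz ${ref_for[$key]} $path --ambiguous=IUPAC --format=GENERAL > $(dirname "$path")/output.fasta"
        else
            echo "no reference for $path" >&2
        fi
    done < <(find "$sample_root" -type f -name 'sample.fasta' -not -path '*/temp/*' -print0 | sort -z)
}
```

As with the echo loop above, this only prints the commands; replacing the echo with the actual lastz call would run the alignments.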

Wow, this is precisely what I want, and yes, moving the files to the same directory is a lot easier. Thank you so much!

(reply by SaltedPork, written 6 months ago)
cpad0112 wrote (6 months ago):

$ tree .
.
├── a
│   ├── reference.fa
│   └── sample.fa
├── b
│   ├── reference.fa
│   └── sample.fa
└── c
    ├── reference.fa
    └── sample.fa

3 directories, 6 files

Using GNU parallel (remove the --dry-run option to execute the commands):

$ ls -d */ | parallel --dry-run 'lastz ./{}reference.fa ./{}sample.fa --ambiguous=IUPAC --format=GENERAL > ./{}output.fa'

Commands to be executed (i.e. the dry-run output):

lastz ./a/reference.fa ./a/sample.fa --ambiguous=IUPAC --format=GENERAL > ./a/output.fa
lastz ./b/reference.fa ./b/sample.fa --ambiguous=IUPAC --format=GENERAL > ./b/output.fa
lastz ./c/reference.fa ./c/sample.fa --ambiguous=IUPAC --format=GENERAL > ./c/output.fa
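
If GNU parallel is not installed, the same one-directory-per-pair layout can be handled with a plain loop. A sketch under the assumption that each subdirectory holds reference.fa and sample.fa as in the tree above; directories missing either file are skipped, and the function name lastz_dry_run is invented for illustration:

```shell
#!/usr/bin/env bash
# Print one lastz command per subdirectory of the given directory (dry run).
lastz_dry_run() {
    local root=${1:-.} dir
    for dir in "$root"/*/; do
        # Skip directories that lack either input file (this also covers
        # the unexpanded pattern when there are no subdirectories at all).
        [ -f "${dir}reference.fa" ] && [ -f "${dir}sample.fa" ] || continue
        echo "lastz ${dir}reference.fa ${dir}sample.fa --ambiguous=IUPAC --format=GENERAL > ${dir}output.fa"
    done
}
```

Run from the top-level directory as lastz_dry_run . to print the commands; swapping the echo for the real lastz call would execute them one after another rather than in parallel.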