Question

Developing script that loops through multiple reference genomes...

0

Entering edit mode

3.8 years ago

Ada ▴ 10

I am in need developing a script that loops through multiple reference genomes, runs the alignment software for each genome (one by one), extracts and saves the results (only alignment %) from the sam file, and deletes the sam file after execution.

My input would be paired data.

alignment genome gene • 1.3k views

ADD COMMENT • link 3.8 years ago by Ada ▴ 10

0

Entering edit mode

bbsplit is meant to be used with multiple genomes at the same time.

bbsplit.sh in1=reads_R1.fq in2=reads_R2.fq  ref=ecoli.fa,salmonella.fa basename=out_%_#.fq outu=clean.fq

You could put a loop around that to deal with multiple samples:

  for i in *R1*.fq.gz
  do
    name = $(basename ${i} .fq.gz)
    bbsplit.sh in1=${name}_R1.fq.gz in2=${name}_R2.fq.gz  \
    ref=ecoli.fa,salmonella.fa,something.fa \ 
    basename=${name}_%_#.fq outu=${name}_clean.fq 
  done

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Entering edit mode

Do you mean I can do more than 1 paired-read at a time?

ADD REPLY • link 3.8 years ago by Ada ▴ 10

0

Entering edit mode

Not in one run. One sample at one time against multiple genomes. Doing more than 1 sample would not make logical sense. If you have access to a compute cluster you could start many such jobs in parallel.

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Entering edit mode

Interesting. But I would like to extract alignment stats from bbtools output. How do I go about doing so without producing .sam file (is that possible)?

ADD REPLY • link 3.8 years ago by Ada ▴ 10

1

Entering edit mode

By default bbsplit.sh will produce fastq files (it can do SAM/BAM files, but that is not recommended if you need to visualize the data). If you do not provide out* directive then you will only get the stats.

scafstats=<file>     Write statistics on how many reads mapped to which scaffold to this file.
refstats=<file>      Write statistics on how many reads were assigned to which reference to this file.
                     Unmapped reads whose mate mapped to a reference are considered assigned and will be counted.

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Entering edit mode

If I do one sample at a time, then is it really necessary to run a loop around multiple samples?

ADD REPLY • link 3.8 years ago by Ada ▴ 10

0

Entering edit mode

Does this loop run through the reference genomes one at a time or the samples 1 at a time? It looks like the latter.

ADD REPLY • link 3.8 years ago by Ada ▴ 10

1

Entering edit mode

The looping is using the samples. bbsplit.sh will align each read to each of to the reference genomes and will output reads to a file for the reference they best match. So, not exactly what you wanted, but likely still useful.

ADD REPLY • link 3.8 years ago by jared.andrews07 ★ 16k

0

Entering edit mode

Ada : Since you changed the content of the original post significantly, above set of comments now appear to be off-topic.

ADD REPLY • link 3.8 years ago by GenoMax 141k

0

Entering edit mode

in bash, using parallel:

$ tree
.
├── file1.fa
├── file2.fa
├── file3.fa
├── test1_R1.fastq
├── test1_R2.fastq
├── test2_R1.fastq
├── test2_R2.fastq
├── test3_R1.fastq
└── test3_R2.fastq

0 directories, 9 files

$ Reference_genomes=($(find . -type f -name "*.fa"))

parallel echo {} {=s/R1/R2/=} $(IFS=, ; echo "${Reference_genomes[*]}")  ::: *R1.fastq

test1_R1.fastq test1_R2.fastq ./file1.fa,./file2.fa,./file3.fa
test2_R1.fastq test2_R2.fastq ./file1.fa,./file2.fa,./file3.fa
test3_R1.fastq test3_R2.fastq ./file1.fa,./file2.fa,./file3.fa

ADD REPLY • link 3.8 years ago by cpad0112 21k

score 1 · Answer 1 · 2020-06-25

If you need to go through a set of reference genome one at one time you can use plain bbmap.sh from BBMap suite. Something like this would work (where your reference genomes are *.fa):

for y in *.fa 
  do
for i in *R1*.fq.gz
  do
    name = $(basename ${i} .fq.gz)
    bbmap.sh in1=${name}_R1.fq.gz in2=${name}_R2.fq.gz  \
    ref=${y} \ 
    other options for bbmap.sh
  done
done

If you do not provide out= directive then you will only get the stats. bbmap.sh writes stats to STDERR stream so capture that to get the info you need.