Question: how to de novo assemble a large number of bacterial genomes with SPAdes in Linux
haomingju wrote, 4 weeks ago:

Hi, I am new to sequencing data analysis. When I have fastq files for a single bacterium, I know how to assemble it with SPAdes, for example: "spades.py --pe1-1 name.fq.gz --pe1-2 name.fq.gz -o spades_test". But I don't know how to handle a large number of samples with one Linux command. For example, when I have 10 fastq datasets (name1~name10), I would rather not assemble them one by one by hand. Can you tell me how I can do this? Thanks!


Type "bash loop" into Google.

written 4 weeks ago by Rob

Take a look at bash for loops.

Just putting these commands in a loop is not going to make them go any faster. If you have access to a cluster, you could potentially use a for loop to submit 10 parallel SPAdes jobs; otherwise they will run one after the other.
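
For example, on a SLURM cluster such a submission loop could look like the sketch below (the resource options, sample names and read-file suffixes are placeholders to adapt to your own data):

for i in $(seq 1 10); do
    # each iteration submits one independent SPAdes job to the scheduler
    sbatch --cpus-per-task=8 --mem=16G \
        --wrap="spades.py -1 name${i}_R1.fq.gz -2 name${i}_R2.fq.gz -o spades_name${i}"
done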

written 4 weeks ago by genomax

Do you have access to an HPC or computing cluster? You should build up your skills and use submission scripts or pipelines to manage this.

written 4 weeks ago by Asaf


"spades.py --pe1-1 name.fq.gz --pe1-2 name.fq.gz -o spades_test": I guess this says that you have paired-end reads split across several files, but you have only one file per end, so I guess you can use -1 and -2 directly. There is also a problem with the naming in the OP: the same file, name.fq.gz, is given for both the forward and reverse reads.
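
For a single sample the simplified call would then look something like this (the _R1/_R2 file names are just placeholders for your actual read files):

spades.py -1 name_R1.fq.gz -2 name_R2.fq.gz -o spades_test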

written 4 weeks ago by cpad0112
meowz wrote, 4 weeks ago:

You can do it with the help of the shell:

#!/bin/bash
# Run SPAdes on every R1/R2 pair found in the fastq directory,
# writing one output directory per sample.

fastq_dir="/path/to/your/fastq/directory"

for r1 in "$fastq_dir"/*_R1.fastq.gz; do
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"       # matching reverse-read file
    sample=$(basename "$r1" _R1.fastq.gz)     # sample name without the suffix
    echo "Assembling $sample"
    spades.py --pe1-1 "$r1" --pe1-2 "$r2" -o "$fastq_dir/${sample}.out"
done
written 4 weeks ago by meowz

Although this solution works, it should be avoided. As someone who spent a lot of time doing such things, I can assure you that you will have to run this command more than once (a lot more, actually), with different parameters, different datasets, maybe combining two samples (did I remove adapters?), you get the idea. You'll end up hacking this bash script in some unknown location, unsure which version of it you used to generate the results, and when you write your manuscript you'll avoid sharing this code because it's, well, I'll say it: ugly.

What should you do? Make your results disposable. Save the input in a well-documented, backed-up location and use pipelines to run the analysis; you can either use flowcraft for metagenomics assembly or craft your own. I can't stress this enough: learn how to use a pipeline management system such as WDL, Nextflow or Snakemake. Choose one; it doesn't really matter which.

written 4 weeks ago by Asaf