Question: Automating a SNP calling workflow
zhou_1228 wrote, 15 months ago:

I wrote a program to automate the SNP calling process, but I am not sure whether it works since I am new to shell scripts. I would highly appreciate a review of the script. The script is below:

cd ~/results

mkdir -p bwa sam bam vcf

genome=GlycinMaxNewbwaidx.fa   # reference FASTA for freebayes (set to your reference path)

# loop over the forward-read files; the reverse mate is derived from the base name
for fq in ~/Tilling-by-seq/*_F.fastq
do
    echo "working with file ${fq}"

    base=$(basename "${fq}" _F.fastq)
    echo "base name is ${base}"

    fq_R=~/Tilling-by-seq/${base}_R.fastq
    F_bwa=bwa/${base}_F.sai
    R_bwa=bwa/${base}_R.sai
    sam=sam/${base}.sam
    bam=bam/${base}.bam
    sorted_bam=bam/${base}_sorted.bam
    vcf=vcf/${base}.vcf

    bwa aln -t 4 GlycinMaxNewbwaidx "${fq}" > "${F_bwa}"
    bwa aln -t 4 GlycinMaxNewbwaidx "${fq_R}" > "${R_bwa}"
    bwa sampe GlycinMaxNewbwaidx "${F_bwa}" "${R_bwa}" "${fq}" "${fq_R}" > "${sam}"
    samtools view -S -b "${sam}" > "${bam}"
    samtools sort "${bam}" -o "${sorted_bam}"
    samtools index "${sorted_bam}"
    freebayes -f "${genome}" -p 24 --use-best-n-alleles 4 --pooled-discrete "${sorted_bam}" > "${vcf}"
done

I think you should run what you wrote and post the resulting output/errors here alongside the code, so people can help. But first: note that you overwrite the variable fq in more than one place.

I would also like to introduce you to workflow management, for example Snakemake. There are other workflow management tools/languages as well that will help you be more productive.
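To give a feel for the syntax, here is a minimal Snakefile sketch (rule and file names are illustrative, not from the post): you declare each step's inputs and outputs, and Snakemake works out the execution order and re-runs only what is out of date.

```
# sort a BAM; "{s}" is a wildcard matched against the requested output file
rule sort_bam:
    input:  "bam/{s}.bam"
    output: "bam/{s}_sorted.bam"
    shell:  "samtools sort {input} -o {output}"

rule index_bam:
    input:  "bam/{s}_sorted.bam"
    output: "bam/{s}_sorted.bam.bai"
    shell:  "samtools index {input}"
```

Running `snakemake bam/sample1_sorted.bam.bai` would then execute both rules in the right order for that sample.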

— written 15 months ago by Medhat
msimmer92 wrote, 15 months ago:

Your question asks for general feedback on your code because you're new to shell scripts, rather than posing a specific question. I will therefore give some general tips based on that (which you may or may not already know). Nevertheless, as Medhat pointed out, in this forum it's better to post more specific coding errors/questions.

  • As @Medhat commented on your post, beyond the point about overwriting your variables, you could use workflow management programs that were designed specifically to guide you through this process of pipeline building. Snakemake, Ruffus, there are a bunch of them (see How To Organize A Pipeline Of Small Scripts Together? and the review paper: A review of bioinformatic pipeline frameworks. J Leipzig, Briefings in Bioinformatics, 2016, 1–7). In my lab, we haven't used these so far; we use text editors that let us visualize our pipelines nicely while we write, such as Visual Studio Code (my personal favorite), Emacs and Gedit depending on the person, and for now we're fine with this.

  • In my lab, at least, we consider it good practice to always wrap variable names in Bash in curly brackets (as you did with ${base}), because it prevents mistakes that can happen when a variable name is immediately followed by other characters. I recommend you do it, since I see you could benefit from that.
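A quick illustration of the failure mode this avoids (using a variable named base, as in the post):

```shell
base=sample1

# Without braces, bash parses this as a variable named "base_F",
# which is unset, so the result is just ".fastq":
echo "$base_F.fastq"

# With braces, the variable name ends at "}", giving the intended file name:
echo "${base}_F.fastq"     # sample1_F.fastq
```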

  • Taking automation to the next level. To create a shell script that is as autonomous as possible, try to implement the concept of arguments. The idea is to make the code as general as possible so it can be reused many times (avoid "hardcoding", that is, specific options, file paths or parameters typed directly into your code that won't work for other files/cases). In the world of pipeline development this is very important and will make your code mature significantly; the concept to read up on is Bash "positional parameters".
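As a sketch of what that could look like for a pipeline like yours (the script name, argument order and echo lines are made up for illustration):

```shell
#!/bin/bash
# run_snp_calling.sh — hypothetical wrapper; all paths arrive as arguments
# usage: ./run_snp_calling.sh <reads_dir> <reference_index> <out_dir>
reads_dir=$1
ref_index=$2
out_dir=$3

echo "reads:     ${reads_dir}"
echo "reference: ${ref_index}"
echo "output:    ${out_dir}"
# ... the alignment/calling steps from the post would go here,
# using these variables instead of fixed paths ...
```

The same script can then be reused for any dataset, e.g. `./run_snp_calling.sh ~/Tilling-by-seq GlycinMaxNewbwaidx ~/results`.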

  • The choice of programming language. Certain languages, such as Bash and Python, are good as the "glue" that puts everything together (in other words, for writing the principal code that then calls programs and sub-scripts). This main code is called, in programming jargon, a "wrapper". For bioinformatics analyses such as yours, R packages are often the gold standard for particular analyses, so you may well end up creating a wrapper plus one or more R scripts that the wrapper calls (note: R itself is not a good choice for the pipeline-making layer - it is good that you're using Bash). It sounds trivial, but I point this out because some people try to use R for everything because it's the first language they know (see How to write effective and stable bioinformatics pipeline in R ? for more opinions). Which language is best depends on you, your field of study, your project and what is used in your lab. For certain fields, like machine learning, you might need Python because of the size of certain data or calculations. For gluing together programs in an average bioinformatics pipeline, such as SNP calling or differential expression analysis, whichever language you feel comfortable with and that allows you to do neat work is fine (you'll develop your own style after a while :) and it will be shaped by these decisions). You may also face a new project and decide to learn a new language to solve that particular situation, so keep an open mind and stay minimally informed about the potential of other languages so you can recognize when you might need one.

  • Comment your code appropriately. Code is read by the computer, but also by humans. Even though your goal should be code that is as self-explanatory as possible, comments are useful both to yourself and to others, especially in bioinformatics pipelines that people from different backgrounds will probably use (not everyone can read code). I recommend looking at code and pipelines published by professionals to get an idea of the amount (and type) of comments that is reasonable (too much or too little can both be a problem).

Happy coding!



Powered by Biostar version 2.3.0