Question: Genomic Alignment And Snp/Indel Calling - My First Ever "Pipeline"
gravatar for Travis
9.6 years ago by
Travis2.8k wrote:

So I have finally generated some reads and run it through what I guess could be called a very rudimentary 'pipeline'. I generated a million paired end reads with wgsim then aligned with bwa, and used samtools/bcftools/vcftools. The commands ran like this:

bwa aln -t 10 hg19 -f test.read1.sai test.read1.fq
bwa aln -t 10 hg19 -f test.read2.sai test.read2.fq
bwa sampe -f test.sam hg19 test.read1.sai test.read2.sai test.read1.fq test.read2.fq
samtools view -S -b test.sam > wgsim100bpPE.bam
samtools sort test.bam test_sort
samtools mpileup -uf ../hg19/hg19.fa test_sort.bam > test.bcf
bcftools view -vcg test.bcf - > test.vcf

Now I know this is simplistic and I know other tools are available, and I will try more in the future. What I would like to know is if there are any other steps or parameters I should be using with this existing workflow to improve it? e.g. make it more efficient. Also, if I wanted to run 150 samples, would I just run this 150 times in a row, or would I want to do things differently.

Any help would be appreciated. And please be kind ;)

indel next-gen snp sequencing • 3.9k views
ADD COMMENTlink modified 4.9 years ago by Biostar ♦♦ 20 • written 9.6 years ago by Travis2.8k

Nice! I have almost the exact same chain for a germline analysis workflow.

ADD REPLYlink written 2.9 years ago by b10hazard30
gravatar for Swbarnes2
9.6 years ago by
Swbarnes21.5k wrote:

That's about what I've used, though I also throw in a rmdup after the sort stage, to get rid of PCR duplicates. I don't pre-process, but that's mostly because for my applications, people are much more concerned about failing to see real SNPs, than overestimating due to false positives.

I've also taken to using the -B on the mpileup, as I've seen sanger-verified SNPs not turn up with BAQ calculations were used, because the BAQ downgraded the hell out of the quality scores at those loci. And when I turned off the BAQ with -B, the SNPs popped up again.

You can pipe the sampe and first view stages together. I always used import instead of view, I'm not sure what the difference is, except that import requires the reference as an element in the command.

ADD COMMENTlink written 9.6 years ago by Swbarnes21.5k

Great comments - cheers!

ADD REPLYlink written 9.6 years ago by Travis2.8k
gravatar for Michael Dondrup
9.6 years ago by
Bergen, Norway
Michael Dondrup48k wrote:

Many people seem to use bwa and samtools mpileup in exactly this setting. There are many different aligners so it might make sense to test a few, like bowtie, novoalign, mosaic and bfast, however I would think, you will be quite fine using samtools and bwa. You might consider to add pre- and post-processing.

For pre-processing:

  • filter or clip reads on quality threshold
  • remove exact duplicate reads, if these are sequencing or PCR artifacts, they might create an artificially high coverage at some loci

this could be achieved by adding Fastx tools to your workflow

for postprocessing: - using varFilter -D100 however I have some doubts about how the -D parameter actually works.

In addition, in the documentation of samtools mpileup they show to use pipes (|) a lot. If you used that throughout your script, it might make the script just a little more efficient.

ADD COMMENTlink written 9.6 years ago by Michael Dondrup48k

Great input and help! Thanks!

ADD REPLYlink written 9.6 years ago by Travis2.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1254 users visited in the last hour