Question

Genomic Alignment And Snp/Indel Calling - My First Ever "Pipeline"

4

Entering edit mode

12.9 years ago

Travis ★ 2.8k

So I have finally generated some reads and run it through what I guess could be called a very rudimentary 'pipeline'. I generated a million paired end reads with wgsim then aligned with bwa, and used samtools/bcftools/vcftools. The commands ran like this:

bwa aln -t 10 hg19 -f test.read1.sai test.read1.fq
bwa aln -t 10 hg19 -f test.read2.sai test.read2.fq
bwa sampe -f test.sam hg19 test.read1.sai test.read2.sai test.read1.fq test.read2.fq
samtools view -S -b test.sam > wgsim100bpPE.bam
samtools sort test.bam test_sort
samtools mpileup -uf ../hg19/hg19.fa test_sort.bam > test.bcf
bcftools view -vcg test.bcf - > test.vcf

Now I know this is simplistic and I know other tools are available, and I will try more in the future. What I would like to know is if there are any other steps or parameters I should be using with this existing workflow to improve it? e.g. make it more efficient. Also, if I wanted to run 150 samples, would I just run this 150 times in a row, or would I want to do things differently.

Any help would be appreciated. And please be kind ;)

snp indel next-gen sequencing • 4.7k views

ADD COMMENT • link updated 8.2 years ago by Biostar 20 • written 12.9 years ago by Travis ★ 2.8k

0

Entering edit mode

Nice! I have almost the exact same chain for a germline analysis workflow.

ADD REPLY • link 6.2 years ago by b10hazard ▴ 30

score 4 · Answer 1 · 2011-05-28

That's about what I've used, though I also throw in a rmdup after the sort stage, to get rid of PCR duplicates. I don't pre-process, but that's mostly because for my applications, people are much more concerned about failing to see real SNPs, than overestimating due to false positives.

I've also taken to using the -B on the mpileup, as I've seen sanger-verified SNPs not turn up with BAQ calculations were used, because the BAQ downgraded the hell out of the quality scores at those loci. And when I turned off the BAQ with -B, the SNPs popped up again.

You can pipe the sampe and first view stages together. I always used import instead of view, I'm not sure what the difference is, except that import requires the reference as an element in the command.

score 3 · Answer 2 · 2011-05-28

Many people seem to use bwa and samtools mpileup in exactly this setting. There are many different aligners so it might make sense to test a few, like bowtie, novoalign, mosaic and bfast, however I would think, you will be quite fine using samtools and bwa. You might consider to add pre- and post-processing.

For pre-processing:

filter or clip reads on quality threshold
remove exact duplicate reads, if these are sequencing or PCR artifacts, they might create an artificially high coverage at some loci

this could be achieved by adding Fastx tools to your workflow

for postprocessing: - using vcfutils.pl varFilter -D100 however I have some doubts about how the -D parameter actually works.

In addition, in the documentation of samtools mpileup they show to use pipes (|) a lot. If you used that throughout your script, it might make the script just a little more efficient.