Question

Whole Genome Analysis Pipeline (Illumina)

12.0 years ago

NB ▴ 960

Hi,

I would like to know which algorithm is feasible for mapping human whole-genome sequences (Illumina), and what general pipeline is followed for variant calling in whole-genome analysis.

Thank you, Nandini

illumina pipeline • 10k views

ADD COMMENT • link updated 12.0 years ago by reshetovdenis • 0 • written 12.0 years ago by NB ▴ 960

score 1 · Answer 1 · 2012-04-26

12.0 years ago

Sean Davis 26k

You might take a look here for some ideas:

http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v3

If you have not done this yourself before, I highly suggest getting a collaborator to work with you on these data.

Yes, is there a bioinformatics group where you work? There are many things to consider, and it would help a lot if you could discuss this with people who have experience working with next-generation sequencing data.

ADD REPLY • link 12.0 years ago by Rubal7 ▴ 830

Thank you for your reply. I have worked on SOLiD whole-genome data before using BioScope, and I have now switched to Illumina (each sample has been sequenced on a flow cell with 7 or 8 lanes, with 2 reads each).
So I was wondering whether BWA -> base quality score recalibration -> local realignment -> MarkDuplicates -> variant calling is a good option.

ADD REPLY • link 12.0 years ago by NB ▴ 960

Local realignment should probably come after BWA and before marking duplicates and recalibration.

ADD REPLY • link 12.0 years ago by Sean Davis 26k
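
The ordering discussed in the two replies above can be written down as a small check. This is only a sketch of the ordering logic, not a pipeline: the stage names come from the thread, while the `violations` helper and the constraint list are illustrative. The relative order of marking duplicates and recalibration in `SUGGESTED` is an assumption read from the wording of the reply, which only fixes realignment relative to the other steps.

```python
# Stage order as proposed in the question.
PROPOSED = [
    "BWA alignment",
    "Base quality score recalibration",
    "Local realignment",
    "MarkDuplicates",
    "Variant calling",
]

# Stage order following the reply. Placing MarkDuplicates before
# recalibration is an assumption; the reply does not state it explicitly.
SUGGESTED = [
    "BWA alignment",
    "Local realignment",
    "MarkDuplicates",
    "Base quality score recalibration",
    "Variant calling",
]

# (earlier, later) pairs stated in the reply: realignment comes after BWA
# and before both marking duplicates and recalibration.
CONSTRAINTS = [
    ("BWA alignment", "Local realignment"),
    ("Local realignment", "MarkDuplicates"),
    ("Local realignment", "Base quality score recalibration"),
]


def violations(order, constraints):
    """Return the (earlier, later) pairs that the given order breaks."""
    position = {stage: index for index, stage in enumerate(order)}
    return [(a, b) for a, b in constraints if position[a] > position[b]]


if __name__ == "__main__":
    print("proposed order breaks:", violations(PROPOSED, CONSTRAINTS))
    print("suggested order breaks:", violations(SUGGESTED, CONSTRAINTS))
```

The proposed order breaks only one constraint (recalibration runs before local realignment), so moving recalibration later is enough to satisfy the reply.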

score 1 · Answer 2 · 2012-05-01

The GATK pipeline in the previous post is pretty good, but implementing the whole thing can be a bit painful, and it is CPU- and I/O-intensive. I've been using http://www.realtimegenomics.com a lot for our sequencing project (1200x coverage of the bovine genome). Their pipeline is a lot cleaner and more ergonomic (4 commands: format, map, coverage, and snp or cnv) and faster (5-10x on our cluster) than the BWA/GATK pipeline, while giving comparable results (both gave 99.6% concordance with SNP chip calls). Their documentation is pretty good too. Note that while they are commercial, there is a free license that's suitable for most research and commercial use on a small to medium scale, and they support their software very well.

The output from the RTG pipeline can also be fed into GATK if you want; you just need to filter the BAMs slightly.
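
For readers wondering what a concordance figure like the one above measures, here is a minimal sketch of one common way to compute it: the fraction of sites genotyped by both the chip and the sequencing pipeline where the two genotypes agree. The function name, the data layout (dictionaries keyed by chromosome and position), and the toy genotypes are all assumptions for illustration; this is not how the answer's 99.6% was necessarily calculated.

```python
def genotype_concordance(chip_calls, seq_calls):
    """Fraction of shared sites where both sources report the same genotype.

    Both arguments map (chromosome, position) to a genotype string such
    as "AG". Allele order is ignored, so "AG" and "GA" count as equal.
    """
    shared_sites = chip_calls.keys() & seq_calls.keys()
    if not shared_sites:
        raise ValueError("no sites are present in both call sets")
    matches = sum(
        1
        for site in shared_sites
        if sorted(chip_calls[site]) == sorted(seq_calls[site])
    )
    return matches / len(shared_sites)


if __name__ == "__main__":
    # Toy data, invented for the example.
    chip = {("chr1", 100): "AG", ("chr1", 200): "CC", ("chr2", 50): "TT"}
    seq = {
        ("chr1", 100): "GA",
        ("chr1", 200): "CC",
        ("chr2", 50): "CT",
        ("chr3", 1): "AA",  # not on the chip, so it is ignored
    }
    print(f"concordance: {genotype_concordance(chip, seq):.3f}")
```

Sites called by only one of the two sources are left out of the denominator here; counting them as mismatches instead would give a stricter figure.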

score 0 · Answer 3 · 2012-06-21

We just finished up our own automated pipeline, which uses BWA, GATK, ANNOVAR and samtools to process FASTQ through to annotated VCF. It was designed for our Illumina human whole-genome data, so it assumes paired-end data at the moment, but it might be of use. It can download and compile/install each of the components (except ANNOVAR, for which you'll need to give its maintainers your email address to get access) and allows a very high level of control over each of the programs via a single configuration file (which makes it easier to add data later on). It should run on PBS and SGE clusters as well as in serial, and helps ease the hassle of managing all of those jobs.

It's open source and pretty extendable, but we haven't really put much effort into documenting how to do that just yet :) But if you prefer another program for variant calls or alignment, you can probably reuse one of the templates to have it use the alternate program. There are instructions on doing just that in the user's guide.
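
To illustrate the general idea of driving every stage from a single configuration file with per-stage templates, here is a minimal sketch. It is not ASAP's actual configuration format: the section names, keys, and the `my_aligner` / `my_caller` commands with their flags are all hypothetical placeholders, and job submission to a scheduler is left out.

```python
import configparser
from string import Template

# Hypothetical configuration, invented for this example. Swapping in a
# different aligner or caller means editing only that stage's template.
CONFIG_TEXT = """
[run]
sample = sampleA

[align]
template = my_aligner --ref ${ref} --reads ${sample}.fastq --out ${sample}.bam
ref = genome.fa

[call]
template = my_caller --ref ${ref} --in ${sample}.bam --out ${sample}.vcf
ref = genome.fa
"""


def build_commands(config_text, stages=("align", "call")):
    """Expand each stage's command template with values from the config."""
    # Interpolation is disabled so configparser leaves ${...} untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    sample = parser["run"]["sample"]
    commands = []
    for stage in stages:
        params = dict(parser[stage])
        template = Template(params.pop("template"))
        commands.append(template.substitute(sample=sample, **params))
    return commands


if __name__ == "__main__":
    for command in build_commands(CONFIG_TEXT):
        print(command)
```

Keeping the commands as templates in the configuration, rather than in code, is what makes it cheap to rerun the same steps when more data is added later.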

Anyway, if you are interested, have a look at ASAP. If anyone has ideas or questions relating to ASAP, I'd be happy to answer them.

score 0 · Answer 4 · 2012-06-28

We've created a pipeline that calls SNPs and SVs. The results are presented to users in Excel tables with an effect annotation for each variant. Data about protein function, pathways and diseases is also presented. The pipeline integrates GATK best practices, Pindel, Ensembl variant effect prediction, PolyPhen and SIFT: http://code.google.com/p/ngs-pipeline/