Question

Generating Summary Statistics And Coverage Plots From Illumina Ga2 Data

4

Entering edit mode

14.7 years ago

David Quigley 11k

We have performed a targeted sequencing of a few megabases of human DNA on ~20 samples using Illumina GA2. I've done alignment and genotyping, but I know our targeting method (SureSelect) has coverage gaps where no reads will align. I would like to identify these gaps from the aligned reads. My end goals are:

1) Summary statistics of coverage depth for each sample

2) Identify regions where I have no coverage in any sample

3) Able to generate ad hoc plots of coverage depth using tools of my own choice

Before I put my nose down and start coding off of the SAM pileup format, has someone written a package (or function in a package) that solves this problem for me? I'd rather use someone else's debugged code than write my own. I'm not looking to install a web server and click around; my ideal output would be a matrix where each row is a locus and each column is a sample, with number of mapped reads as the matrix value.

next-gen sequencing visualization illumina • 4.9k views

ADD COMMENT • link updated 14.7 years ago by Aaron Statham ★ 1.1k • written 14.7 years ago by David Quigley 11k

Ram · Answer 1 · 2010-11-04

I'm most familiar with R/Bioconductor, and the general workflow I would use would be as follows (using the ShortRead, Rsamtools, rtracklayer and IRanges packages)

Firstly, run import() on a bed file of your SureSelect regions - this allows you to use them to find overlaps with your actual data later.

Use scanBam() to read in a BAM file of aligned reads, convert to a GRanges (or GRangesList for multiple samples) using as() and then call coverage() on each GRanges object - this gives you "times coverage" per basepair of the genome in a relatively compact run-length encoded format.

Edit: alternatively once you have GRanges objects you could just call countOverlaps() to count how many reads are in each of your SureSelect regions if you don't need basepair resolution information).

You can then look for regions with less than say 20x coverage using slice(lower=0, upper=20) and then find overlaps of these regions of low coverage with your SureSelect targets using findOverlaps().

Alternatively you could subset the genome wide coverage objects to the SureSelect target regions using Views() and then perform per-region coverage summarisations using viewMins(), viewsMaxs(), viewMeans() etc.

All up this shouldn't take more than a few hours start to finish for a first run through - Good luck!