Question: Generating Summary Statistics And Coverage Plots From Illumina Ga2 Data
gravatar for David Quigley
7.6 years ago by
David Quigley11k
San Francisco
David Quigley11k wrote:

We have performed a targeted sequencing of a few megabases of human DNA on ~20 samples using Illumina GA2. I've done alignment and genotyping, but I know our targeting method (SureSelect) has coverage gaps where no reads will align. I would like to identify these gaps from the aligned reads. My end goals are:

1) Summary statistics of coverage depth for each sample

2) Identify regions where I have no coverage in any sample

3) Able to generate ad hoc plots of coverage depth using tools of my own choice

Before I put my nose down and start coding off of the SAM pileup format, has someone written a package (or function in a package) that solves this problem for me? I'd rather use someone else's debugged code than write my own. I'm not looking to install a web server and click around; my ideal output would be a matrix where each row is a locus and each column is a sample, with number of mapped reads as the matrix value.

ADD COMMENTlink written 7.6 years ago by David Quigley11k
gravatar for Aaron Statham
7.6 years ago by
Aaron Statham1.1k
Aaron Statham1.1k wrote:

I'm most familiar with R/Bioconductor, and the general workflow I would use would be as follows (using the ShortRead, Rsamtools rtracklayer and IRanges packages)

Firstly, run import() on a bed file of your SureSelect regions - this allows you to use them to find overlaps with your actual data later.

Use scanBam() to read in a BAM file of aligned reads, convert to a GRanges (or GRangesList for multiple samples) using as() and then call coverage() on each GRanges object - this gives you "times coverage" per basepair of the genome in a relatively compact run-length encoded format.

Edit: alternatively once you have GRanges objects you could just call countOverlaps() to count how many reads are in each of your SureSelect regions if you don't need basepair resolution information).

You can then look for regions with less than say 20x coverage using slice(lower=0, upper=20) and then find overlaps of these regions of low coverage with your SureSelect targets using findOverlaps().

Alternatively you could subset the genome wide coverage objects to the SureSelect target regions using Views() and then perform per-region coverage summarisations using viewMins(), viewsMaxs(), viewMeans() etc.

All up this shouldn't take more than a few hours start to finish for a first run through - Good luck!

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Aaron Statham1.1k

Great steers, thanks very much.

ADD REPLYlink written 7.5 years ago by David Quigley11k

Great reply Aaron, could you provide some code to illustrate the process for those who are not familiar with R/BioC??

ADD REPLYlink written 6.9 years ago by Sakti340
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 637 users visited in the last hour