1000 genomes technical data: match exon capture probes to samples
6.2 years ago
eric.kern13 ▴ 230

I am studying how capture probe properties affect read depth in targeted exon sequencing. I am using data from the 1000 Genomes Project. My specific question: which exome sequencing/exon capture BAM files were generated using Nimblegen probes, and which used Agilent probes? 

I've read their FTP tutorial, their paper, and the supplementary info, and I've spent a lot of time on the FTP site. The closest I found was this README, which says which centers use which probe sets. (I can't find which centers made which BAMs, though.)

In general, switching between their paper, supplementary materials, README files, BAS files and tutorials makes it easy to leave gaps. This is a second-priority question, but in the future, is there a single resource that I can go to for technical questions about the 1000 Genomes project?

Thanks for your help.

next-gen • 1.8k views

This FAQ page provides a little more detail on the exome capture than the README you linked to.

6.2 years ago
Adam ★ 1.0k

I suggest you email info@1000genomes.org . They are pretty responsive. 


Thanks. I did that first. I wasn't sure how long it would take, so I posted here too.

6.2 years ago
eric.kern13 ▴ 230

In case anyone stumbles on this post, here is the impressively detailed answer from Holly Zheng-Bradley at 1000G. This paper may also be worth a look for people using 1000G data.

The 1000 Genomes Project exome sequence data were created by different
sequencing centres using different exome pulldown platforms. The centre
abbreviations and the pulldown platforms they used are listed below:

-- BGI: NimbleGen v1 2.1M_Human_Exome
-- BI/WUGSC: Agilent

To look for exome BAM files made from data created by a specific pulldown
platform, you may use our latest sequence index file as a starting point and
look for samples that have exome data produced by a specific sequencing centre.
If all exome data for a sample is produced by one centre, we know the exome
BAM file for that sample is based on data from the pulldown platform used by that
centre.

Run a command line like the one below:

$ less 20130502.phase3.analysis.sequence.index | grep exome | cut -f3,5,6,10,13,26 | sort -u | sort -k4 | less

……..
……..
ERR047782 1000 Genomes ACB exome sequencing BGI HG01990 ILLUMINA exome
ERR047783 1000 Genomes ACB exome sequencing BGI HG01990 ILLUMINA exome
ERR047784 1000 Genomes ACB exome sequencing BGI HG01990 ILLUMINA exome

Using HG01990 as an example, you can see that sample HG01990 was exome
sequenced by BGI (only), so the pulldown platform is NimbleGen v1
2.1M_Human_Exome. Of course you need to make sure HG01990 was not exome
sequenced by other centres (which shouldn't happen), because our sample level
BAMs are made by combining all available exome runs.
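The per-sample check described above (does all exome data for a sample come from one centre?) can be automated rather than read off by eye. The following is only a sketch under some assumptions: the column positions (6 = centre, 10 = sample, 26 = analysis group) are inferred from the `cut` command and output quoted above, the toy index rows other than HG01990/BGI are hypothetical, and filtering on column 26 is a stricter stand-in for the `grep exome` in the original command.

```shell
#!/bin/sh
# Toy stand-in for 20130502.phase3.analysis.sequence.index.
# Only columns 6 (centre), 10 (sample) and 26 (analysis group) are filled in;
# SAMPLE_A and SAMPLE_B are hypothetical sample names used for illustration.
index=$(mktemp)
make_row() {  # usage: make_row CENTRE SAMPLE ANALYSIS_GROUP
    awk -v c="$1" -v s="$2" -v g="$3" 'BEGIN {
        for (i = 1; i <= 26; i++) {
            v = "."
            if (i == 6)  v = c
            if (i == 10) v = s
            if (i == 26) v = g
            printf "%s%s", v, (i < 26 ? "\t" : "\n")
        }
    }'
}
{
    make_row BGI   HG01990  exome
    make_row BGI   HG01990  exome
    make_row BI    HG01990  "low coverage"   # not exome, so it is ignored
    make_row BI    SAMPLE_A exome
    make_row BGI   SAMPLE_B exome
    make_row WUGSC SAMPLE_B exome            # second centre, so flagged as mixed
} > "$index"

# Collect the distinct centres per sample among exome runs, then map a single
# centre to its pulldown platform using the centre list given in the answer.
result=$(awk -F '\t' '
    $26 == "exome" {
        key = $10 SUBSEP $6
        if (!(key in seen)) {
            seen[key] = 1
            n[$10]++
            centres[$10] = (centres[$10] == "" ? $6 : centres[$10] "," $6)
        }
    }
    END {
        for (s in n) {
            if (n[s] > 1)                                      p = "mixed"
            else if (centres[s] == "BGI")                      p = "NimbleGen"
            else if (centres[s] == "BI" || centres[s] == "WUGSC") p = "Agilent"
            else                                               p = "unknown"
            print s "\t" centres[s] "\t" p
        }
    }' "$index" | sort)

printf '%s\n' "$result"
rm -f "$index"
```

On the toy data this prints one tab-separated line per sample: HG01990 as BGI/NimbleGen, SAMPLE_A as BI/Agilent, and SAMPLE_B as mixed. To try it on the real index, point `index` at the downloaded file and drop the toy-data section.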

To get the exome BAM file for HG01990, you look into our alignment index file:


$ grep HG01990 20130502.phase3.exome.alignment.index | cut -f1

Some additional information: for downstream analysis, instead of using a separate
pulldown BED file for each platform, the project used Exome Project
Consensus BED files created by taking the union of the capture design
files used by the different production centres (BGI, BI, and WUGSC) and CCDS. This
version was built based on the GRCh37.1 (NCBI HG19) reference sequences.
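The alignment index lookup shown for HG01990 extends naturally to a whole list of samples, for example all samples found to be BGI-only. This is a rough sketch, not part of the original answer: the only thing taken from the answer is that column 1 of the alignment index holds the BAM path, and the paths and the sample names SAMPLE_A and SAMPLE_B in the toy index are hypothetical placeholders.

```shell
#!/bin/sh
# Toy stand-in for 20130502.phase3.exome.alignment.index.
# Column 1 is the BAM path (as in the grep/cut command above); the paths and
# the second column here are hypothetical placeholders.
aln_index=$(mktemp)
printf '%s\t%s\n' \
    'data/HG01990/exome_alignment/HG01990.example.bam'   'placeholder' \
    'data/SAMPLE_A/exome_alignment/SAMPLE_A.example.bam' 'placeholder' \
    'data/SAMPLE_B/exome_alignment/SAMPLE_B.example.bam' 'placeholder' \
    > "$aln_index"

# One sample name per line, e.g. the samples assigned to a single platform.
samples=$(mktemp)
printf '%s\n' HG01990 SAMPLE_A > "$samples"

# -F: treat sample names as fixed strings; -w: match whole words only, so a
# name cannot match inside a longer sample name; -f: read patterns from file.
bams=$(cut -f1 "$aln_index" | grep -w -F -f "$samples")

printf '%s\n' "$bams"
rm -f "$aln_index" "$samples"
```

With the real index, set `aln_index` to the downloaded file and fill `samples` with the sample names of interest; the result is one BAM path per line.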

