Difference between primary and capture targets
5.8 years ago
John ▴ 160

Hi,

Can anybody explain this description from the Roche site?

hg19_primary_targets.bed: This file contains the design primary target (unpadded) in hg19 coordinates and gene annotation in the 4th column.

hg19_capture_targets.bed: This file contains coordinates showing the probe footprint with no padding in hg19 coordinates.

Which BED file should I use for coverage analysis? And which should I use for variant calling?

Thank you for the explanation.

targets capture dna-seq • 4.3k views
3.5 years ago

I have struggled with this recently and am still lamenting that things are not standardized, at least terminology-wise. @Wouter has already explained well what is going on. I would just like to add, if I'm not mistaken (correct me if I am), that:

A BED file which holds the wishlist (based on a genome, e.g. GRCh38) of what you wish to capture from your DNA can be referred to as "primary target" (NimbleGen), "empirical target" (MedExome) or "regions" (Agilent). If you are using GATK4's CollectHsMetrics, this corresponds to the --TARGET_INTERVALS parameter.

The file which holds the actual probes that are expected to capture the primary target regions can be called "capture target" (NimbleGen, MedExome) or "covered" (Agilent). In the aforementioned CollectHsMetrics, for example, this corresponds to the --BAIT_INTERVALS parameter.

Illumina offers just one file, truseq-exome-targeted-regions-manifest-v1-2.bed (the link may be broken and the filename may have changed by the time you're reading this). The "probes" come in a separate txt file that is not directly usable: you would need to create a BED file from it and run Picard's BedToIntervalList tool on that.
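To make the mapping above concrete, here is a minimal Python sketch that only assembles the two command lines (it does not run them). It assumes the GATK4 wrapper `gatk` is on your PATH and uses the BedToIntervalList and CollectHsMetrics tools; all file names (sample.bam, hg19.fa, hg19.dict and the output names) are hypothetical placeholders, and the function names are mine.

```python
import shlex


def bed_to_interval_list_cmd(bed, sequence_dict, out):
    """Build a `gatk BedToIntervalList` call (BED -> Picard interval_list)."""
    return ["gatk", "BedToIntervalList",
            "-I", str(bed),
            "-O", str(out),
            "-SD", str(sequence_dict)]  # sequence dictionary of the reference


def collect_hs_metrics_cmd(bam, reference, baits, targets, out):
    """Build a `gatk CollectHsMetrics` call.

    baits   -> probe footprint ("capture target" / "covered")
    targets -> design wishlist ("primary target" / "regions")
    """
    return ["gatk", "CollectHsMetrics",
            "-I", str(bam),
            "-O", str(out),
            "-R", str(reference),
            "--BAIT_INTERVALS", str(baits),
            "--TARGET_INTERVALS", str(targets)]


# Hypothetical file names, for illustration only.
commands = [
    bed_to_interval_list_cmd("hg19_capture_targets.bed", "hg19.dict",
                             "baits.interval_list"),
    bed_to_interval_list_cmd("hg19_primary_targets.bed", "hg19.dict",
                             "targets.interval_list"),
    collect_hs_metrics_cmd("sample.bam", "hg19.fa", "baits.interval_list",
                           "targets.interval_list", "hs_metrics.txt"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

If you want to execute them, something like `subprocess.run(cmd, check=True)` for each entry should do, provided the sequence dictionary matches the reference your BAM was aligned to.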

To better understand how this ties in, you could read the documentation for BaitDesigner.

Feel free to correct me, but this is how I envision the bait/target relationship using some crude sketching.

5.8 years ago

hg19_primary_targets.bed is the file which was used by Roche to design the assay: what they had in mind to cover using this kit. This is the biologically relevant target region. (It's not guaranteed that this will work perfectly; the actual coverage may be a bit more or a bit less at certain locations.)

hg19_capture_targets.bed contains the intervals which are targeted directly by the assay. These intervals are covered one-to-one by the probes. This is the technically relevant target region. (However, the sequenced region will be bigger than this, since flanking sequences are sequenced as well. That's what they mean by padding.)

In my opinion you need to use hg19_primary_targets.bed for coverage analysis, because that's the aim of the assay: you want to know how well it performs at sequencing the target. I would also use this same interval file for variant calling, but with a generous padding (the -ip flag in GATK, e.g. 75 or 100). You don't want to miss a very interesting SNP just because it fell slightly outside your target region, right?
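To illustrate what padding does to the intervals, here is a small Python sketch that extends each BED interval by a fixed number of bases on both sides and merges whatever then overlaps or touches. It is only an approximation of what interval padding amounts to, not GATK's own code; the function name and the example coordinates are made up, and it does not clip at chromosome ends because it knows nothing about contig lengths.

```python
def pad_and_merge(intervals, padding):
    """Pad BED-style (chrom, start, end) intervals and merge overlaps.

    Coordinates are 0-based, half-open, as in BED.
    Starts are clipped at 0; ends are NOT clipped (contig lengths unknown).
    """
    padded = sorted(
        (chrom, max(0, start - padding), end + padding)
        for chrom, start, end in intervals
    )
    merged = []
    for chrom, start, end in padded:
        # Merge with the previous interval if on the same contig and
        # overlapping or directly adjacent.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up coordinates: two nearby exons on chr1 and one target on chr2.
targets = [("chr1", 100, 200), ("chr1", 250, 300), ("chr2", 10, 50)]
print(pad_and_merge(targets, 50))
# -> [('chr1', 50, 350), ('chr2', 0, 100)]
```

Note how the two chr1 targets fuse into one interval once padded: with 75-100 bp of padding, neighbouring exons often end up in a single calling region.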

Thank you so much for the reply and the nice explanation. When I check my BED files in IGV, I can see that the capture_targets are wider (almost everywhere) than the primary_targets. So I do not need to exclude any primer sequences? I am more familiar with amplicon sequencing.

Primer sequences? In capturing there is no such thing as primers besides those universal primer sequences which are attached by ligation and aren't sequenced.

Thank you for the explanation. I was not sure, because our lab people told me it was amplicon sequencing. So this is enrichment. I see :)

2.0 years ago
Ram 36k

I stumbled upon this question today while looking to understand various BED files that come with each Agilent SureSelect kit. Here's some relevant documentation from Agilent:

### BED files

The three BED-format track files that SureDesign creates for each custom SureSelect design are described below. You can import these files into a compatible genome browser to graphically view the locations of the tracks in the genome. For detailed information on the tracks and how they can help you analyze your design, see Design analysis using tracks.

[design ID]_Regions.bed - This BED file contains a single track of the target regions of interest that SureDesign used to select the probes. You can use this track to see the exact regions that the program was attempting to cover when selecting the probes.

[design ID]_Covered.bed - This BED file contains a single track of the genomic regions that are covered by one or more probes in the design. The fourth column of the file contains annotation information. You can use this file for assessing coverage metrics.

[design ID]_AllTracks.bed - This multitrack BED file includes the following tracks:

• The Target Regions track is identical to the track in the Regions BED file.

• The Covered probes track is identical to the track in the Covered BED file.

• The Missed Regions track contains any regions from the Target Regions track that are not included in the Covered probes track.

• The Probes track contains the regions of all probes in the design.
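The Missed Regions track described above is essentially "target regions minus covered regions". As a rough illustration (not Agilent's actual procedure), that subtraction could be sketched in Python like this; the function name and coordinates are invented for the example, and intervals are BED-style (0-based, half-open).

```python
def subtract_intervals(targets, covered):
    """Return the parts of `targets` not overlapped by any `covered` interval.

    Both arguments are lists of (chrom, start, end) tuples.
    """
    covered_by_chrom = {}
    for chrom, start, end in covered:
        covered_by_chrom.setdefault(chrom, []).append((start, end))
    for ivs in covered_by_chrom.values():
        ivs.sort()

    missed = []
    for chrom, start, end in sorted(targets):
        pos = start  # first base of the target not yet accounted for
        for c_start, c_end in covered_by_chrom.get(chrom, []):
            if c_end <= pos:
                continue          # covered interval lies before `pos`
            if c_start >= end:
                break             # covered interval lies after the target
            if c_start > pos:
                missed.append((chrom, pos, c_start))  # uncovered gap
            pos = max(pos, c_end)
            if pos >= end:
                break
        if pos < end:
            missed.append((chrom, pos, end))  # uncovered tail
    return missed


# Invented example: "Regions" vs. "Covered" intervals.
regions = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 0, 50)]
covered = [("chr1", 90, 150), ("chr1", 180, 320)]
missed = subtract_intervals(regions, covered)
print(missed)
# -> [('chr1', 150, 180), ('chr1', 320, 400), ('chr2', 0, 50)]
print(sum(end - start for _, start, end in missed), "bp missed")
# -> 160 bp missed
```

For real files a dedicated tool such as bedtools is the more practical choice; the point here is only to show what the track represents.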

### Text files

The three [sic] text files for a custom SureSelect design are described below. You can view these files in any text editor program (e.g., NotePad) or spreadsheet program (e.g., Excel). Any tables embedded in the text files are tab-delimited and contain column headers. Lines of text that start with a # character are comment lines.

[design ID]_Targets.txt - This file contains a list of the target identifiers that you entered when creating the design.

[design ID]_Report.txt - This file contains summary information on the design, the probes, the targets, and the parameters used to create the design.