Appropriate BED files from the library capture kit for computing on-target coverage of WES BAM files with Picard CollectHsMetrics
5.8 years ago
svlachavas ▴ 790

Dear Community,

Based on a WES project of cancer BAM files for variant calling, I'm currently trying to collect some specific quality metrics, namely the on-target coverage for each BAM file. After a brief search, I found that Picard can do this analysis with the following tool:

https://broadinstitute.github.io/picard/command-line-overview.html#CollectHsMetrics

In detail, based on the protocol library kit used for the experimental design [SureSelect Clinical Research Exome V2, Agilent Technologies], I used the following link:

https://earray.chem.agilent.com/suredesign/home.htm

and selected the same kit to download the necessary BED files: SureSelect Clinical Research Exome V2 (Design ID: S30409818). My main issue is the following:

The tool takes the following arguments:

java -jar picard.jar CollectHsMetrics \
      I=input_reads.bam \
      O=output_hs_metrics.txt \
      R=reference.fasta \
      BAIT_INTERVALS=bait.interval_list \
      TARGET_INTERVALS=target.interval_list

However, the download from Agilent contained the following files, each preceded by a common name prefix:

_AllTracks.bed, _Covered.bed, _Padded.bed, _Regions.bed and a file named Targets.txt

Thus:

  1. Which of the above files should I use with the arguments BAIT_INTERVALS and TARGET_INTERVALS, respectively?
  2. Or have I downloaded the wrong files, and should I look for them in other repositories?

Thank you in advance, and excuse me for any naive questions, but this is the first time I'm trying to compute on-target coverage!

Best,

Efstathios

NGS bam Picard WES • 6.2k views

I'm not sure how typical this situation is, where the Covered and Regions files contain exactly the same intervals. In case it is not the norm, I am supplementing the response from finswimmer with a couple of old posts for reference.

I use this prior post for reference on what the different Agilent bed files contain: Question: Human Exome Capture Library Coordinates Download

I use this prior post for reference on what the different companies call their bed files: Question: Difference between primary and capture targets

In the case at hand, where the Covered and Regions files are the same, the Bait and Target interval files could be set to either one. If the files were different, you would use Covered for Bait and Regions for Target.
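
To check whether two of the Agilent BED files really hold the same intervals, something like the following sketch could be used. It assumes whitespace-delimited BED files whose header lines start with `browser`, `track`, or `#`; the function names are made up for illustration and only the first three columns are compared.

```shell
# Print the interval lines of a BED file as sorted "chrom start end" triples.
# Header lines starting with "browser", "track", or "#" are skipped.
bed_intervals() {
    awk '!/^(browser|track|#)/ && NF >= 3 { print $1 "\t" $2 "\t" $3 }' "$1" \
        | LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}

# Succeed (exit status 0) when both BED files list the same intervals,
# regardless of line order or extra columns.
same_intervals() {
    a=$(bed_intervals "$1")
    b=$(bed_intervals "$2")
    [ "$a" = "$b" ]
}

# Total number of bases in the file, assuming the intervals do not overlap.
bed_total_bases() {
    bed_intervals "$1" | awk '{ n += $3 - $2 } END { print n + 0 }'
}
```

For example, `same_intervals covered.bed regions.bed && echo identical` prints `identical` only if the two files agree, and comparing `bed_total_bases` between the covered and padded files gives a quick sense of how much the padding adds.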


It would help to see the contents of some of these files.


Sure Kevin, I just made a Dropbox link to the compressed file from the above Agilent link:

https://www.dropbox.com/s/8ulh1o8hcib0mms/S30409818_hs_hg38.zip?dl=0

5.8 years ago

Hello,

the differences between the files are described in the header of each file. If I remember correctly, _Padded.bed is the same as _Regions.bed but has additional bases (20?) to the left and right of each interval. Decide for yourself whether you need this.

You should use the same BED file for both the BAIT_INTERVALS and TARGET_INTERVALS parameters.

fin swimmer
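
Note that CollectHsMetrics takes Picard interval_list files rather than BED files, so whichever BED file is chosen has to be converted first. The sketch below wraps the Picard tools CreateSequenceDictionary, BedToIntervalList, and CollectHsMetrics in small shell functions. The function names and the `PICARD_JAR` variable are assumptions for illustration, and the contig names in the BED file must match those in the reference the BAM files were aligned to.

```shell
# Path to the Picard jar; adjust to the local installation (assumed location).
PICARD_JAR=${PICARD_JAR:-picard.jar}

# Build the sequence dictionary that BedToIntervalList needs.
# Usage: make_dict reference.fasta reference.dict
make_dict() {
    java -jar "$PICARD_JAR" CreateSequenceDictionary R="$1" O="$2"
}

# Convert a BED file to Picard's interval_list format.
# Usage: bed_to_interval_list covered.bed reference.dict covered.interval_list
bed_to_interval_list() {
    java -jar "$PICARD_JAR" BedToIntervalList I="$1" SD="$2" O="$3"
}

# Run CollectHsMetrics with bait and target interval lists.
# Usage: hs_metrics input.bam reference.fasta bait.interval_list target.interval_list out.txt
hs_metrics() {
    java -jar "$PICARD_JAR" CollectHsMetrics \
        I="$1" R="$2" BAIT_INTERVALS="$3" TARGET_INTERVALS="$4" O="$5"
}
```

With the same BED file for both parameters, the converted list is simply passed twice, e.g. `hs_metrics input_reads.bam reference.fasta covered.interval_list covered.interval_list output_hs_metrics.txt`.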


Dear Fin,

thank you for your comments. I have checked each file and its description. However, if all_Tracks.bed is the same as covered.bed, and regions.bed is the same as covered.bed, why are they created as different files? And in the end, which specific file would you use for both the bait and target intervals? Is it the covered.bed file (= genomic regions covered by probes)?

Thank you in advance,

Efstathios


To be honest, I've never understood what Agilent is doing here. I also took another look at the files you linked to.

  • covered.bed and regions.bed are exactly the same
  • padded.bed extends the regions by 100 bases on each side
  • For all_Tracks.bed I can only guess: it probably contains the exon regions for all genes covered by this panel, while the panel itself will only cover the known coding regions.

You have to decide whether your analysis should cover only the exonic regions or also the neighboring intronic regions. For the former use covered.bed, for the latter padded.bed.

I'd prefer using covered.bed as the basis and adjusting the padding to 20 :)

fin swimmer
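
A minimal sketch of the "covered.bed plus 20 bases" idea, assuming a whitespace-delimited BED file with optional `browser`/`track` header lines. The function name `pad_bed` is made up for illustration. It clamps starts at 0 but does not clamp at chromosome ends and does not merge intervals that come to overlap after padding; `bedtools slop` with a genome file is one alternative that handles the chromosome-end clamp.

```shell
# Pad every interval of a BED file by a fixed number of bases on both sides.
# Usage: pad_bed covered.bed 20 > covered_pad20.bed
pad_bed() {
    awk -v pad="$2" 'BEGIN { OFS = "\t" }
        /^(browser|track|#)/ { next }      # drop header lines
        NF >= 3 {
            start = $2 - pad
            if (start < 0) start = 0       # BED starts are 0-based and never negative
            print $1, start, $3 + pad
        }' "$1"
}
```

The padded file can then be converted to an interval_list in the same way as the original BED file.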


Hello Sir, I was facing the same issue with selecting files for target coverage analysis, and your explanation really helped. Thanks a lot.


Thanks a lot, Fin, for the explanations! I really appreciate it! I will go for the exonic regions, as they are of main interest.

Efstathios
