Question: GRCh38 BAM with hg38 VCFs
6
gravatar for dmyersturnbull
4.7 years ago by
Stanford University
dmyersturnbull90 wrote:

I have a BAM file containing an alignment to the GRCh38 "analysis pipeline" set with decoys (from ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/ seqs_for_alignment_pipelines/).

I remapped NCBI's "resource bundle" VCF files to UCSC's hg38 using a combination of UCSC's chain files, CrossMap, and custom scripts. I want to run GATK indel realignment, base calibration, and variant calling on my BAM file using some of these remapped VCF files (e.g. -known on RealignerTargetCreator).

The BAM and VCF come from different assemblies (GRCh38 with decoys and hg38, respectively). I can change GRCh38 chromosome names to hg38 chromosome names (NCBI provides a map), but is this sufficient to ensure that tools interpret the VCF files' coordinates correctly?

next-gen genome • 5.2k views
ADD COMMENTlink modified 4.7 years ago by Denise - Open Targets5.0k • written 4.7 years ago by dmyersturnbull90
2
gravatar for Denise - Open Targets
4.7 years ago by
UK, Hinxton, EMBL-EBI
Denise - Open Targets5.0k wrote:

Unplaced: not known which chromosome the scaffold belongs to.

Unlocalized: although the chromosome within which the scaffold occurs is known, the scaffold's position or orientation is not known.

 

ADD COMMENTlink written 4.7 years ago by Denise - Open Targets5.0k
1
gravatar for Denise - Open Targets
4.7 years ago by
UK, Hinxton, EMBL-EBI
Denise - Open Targets5.0k wrote:

GRCh38 and hg38 are different names for the same (latest) human assembly. The former is the official name by the Genome Reference Consortium whereas the latter is the UCSC Genome Browser Assembly ID. So the coordinates should be exactly the same.

More info can be found GRCh37/38(NCBI) vs hg19/hg38(UCSC).

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Denise - Open Targets5.0k

Thanks.

The sequences for GRCh38 and hg38 are nearly identical. However, hg38 seems to have more 'N's and 'n's, at least. Second, the GRCh38 "analysis set" contains coordinates that hg38 (and GRCh38) doesn't. Third, the headers differ wildly:

GRCh38: >gi|568336022|gb|CM000664.2| Homo sapiens chromosome 2, GRCh38 reference primary assembly

hg38: >chr2

My concern is that to GATK RealignerTargetCreator I'm passing the GRCh38 analysis set using --reference_sequence and a BAM file aligned to this assembly using --input_file, but VCF files mapped to hg38 using --known. How does GATK know that '568336022' is the same as 'chr2'?

I confirmed that RealignerTargetCreator produces different target intervals if I remove that argument (I also reran with --known to confirm that it's completely deterministic). This is a great sign, but it doesn't prove that the VCF file is being treated correctly.

So my question is how to ensure that tools (especially GATK) treat the mixed references (GRCh38 analysis set and hg38) correctly. Is it enough to change 'chr2' to '568336022'?

ADD REPLYlink written 4.7 years ago by dmyersturnbull90
1
gravatar for Denise - Open Targets
4.7 years ago by
UK, Hinxton, EMBL-EBI
Denise - Open Targets5.0k wrote:

Hmm, now I'm curious. Got in touch with Kerstin Howe from GRC (that's the beauty of working on the same campus). Can you send me some examples so that I can pass them to her please? The headers are not a problem per se but the Ns are. 

ADD COMMENTlink written 4.7 years ago by Denise - Open Targets5.0k
1

Re your first issue: It seems that the UCSC download replaces all ambiguous bp with Ns. You are losing some information there, but as this only affects about 28 assembly components it probably is not a big problem. 

Second issue: I am assuming you are working with the GRCh38 analysis set that includes the alts (note there is no decoy for GRCh38) and are therefore observing that it contains sequence that is not included in any download of the primary assembly (= no alts) from UCSC or directly from the GRC. If you want to avoid this, please use the "no_alts" analysis set offered here. More details can be found here.

 

   

ADD REPLYlink written 4.7 years ago by Kerstin Howe10

Thanks! I'm glad that the masking isn't an issue.

For the second concern, I am using no_alts, but the readme mentions it includes an Epstein-Barr viral sequence as a decoy as well as unplaced and unlocalized scaffolds (aside: what's the difference between unplaced and unlocalized?). My understanding is that these sequences are not included in UCSC's hg38. Is that not correct?

ADD REPLYlink written 4.7 years ago by dmyersturnbull90

These links are dead. NCBI moved the content to another folder and declared it frozen. Check this.

 

ADD REPLYlink written 3.4 years ago by Jarretinha3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2155 users visited in the last hour