Question

unidentified hg19 reference file

0

Entering edit mode

7.4 years ago

michal.inbar • 0

Hi all, newbie question here:

we have a hg19 reference file which is sorted in an unusual manner (see the order of contigs below). It was downloaded a few years ago, and we have no info as to its origin.

we have 40 bam files already aligned to this reference (are also sorted in this manner).

for the next pipleline steps, we would like to switch to the UCSC official hg19 reference. now, when trying to use gatk tools with the UCSC reference, it gives an error (see below).

we could try using gatk's ReorderSam to solve this problem, but we have 2 concerns:

1) Is there a way to identify where the original reference came from, and to verify it is indeed an acceptable reference? the weird sorting makes us suspicious.

things i've checked: - contig name are the same in both references - samtools flagstat is almost identical (few reads apart) for 2 bam files aligned against each reference

2) is using gatk's ReorderSam an acceptable solution for our situation? is it "good practice" to switch from one reference to another (UCSC) in the middle of a pipeline?

We are trying to decide if realigning 40 bams is necessary, and would appreciate your thoughts.

following is GATK error message- see order of old reference:

##### ERROR MESSAGE: Input files reads and reference have incompatible contigs.
Please see https://www.broadinstitute.org/gatk/guide/article?id=63 for more infor
mation. Error details: The contig order in reads and reference is not the same;
to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328),  which describes reordering contigs in BAM and VCF files..
##### ERROR   reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, c
hr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20,
 chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_m
ann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg
9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl00019
3_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_ra
ndom, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chr
Un_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl00021
1, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_
gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random,
chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl0002
33, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl0002
41, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn
_gl000234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248
, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, c
hr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn
_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231, chrUn_gl000229
, chrM, chrUn_gl000226, chr18_gl000207_random]
##### ERROR   reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr
7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, ch
r19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random
, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr
6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ss
to_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_
gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_rando
m, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204
_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, ch
r19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl00021
1, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl00021
6, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl00022
1, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl00022
6, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl00023
1, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl00023
6, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl00024
1, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl00024
6, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

genome • 2.7k views

ADD COMMENT • link updated 7.4 years ago by Eric T. ★ 2.8k • written 7.4 years ago by michal.inbar • 0

1

Entering edit mode

If you don't have the original sequence data you could get the sequences out using bamtofastq and then realign the data. Trying to hunt down what reference was used may be an exercise in frustration/of questionable value.

ADD REPLY • link 7.4 years ago by GenoMax 141k

1

Entering edit mode

If you are (even if only a bit) uncertain about the quality or integrity of your reference, I would definitely re-align everything. In the worst case you encounter any trouble later in the project due to this reference but are held responsible for it, as you are the data analyst. Do yourself a favor and exclude these sources of trouble right away.

ADD REPLY • link 7.4 years ago by ATpoint 82k

0

Entering edit mode

thanks genomax and Alexander. we actually do have the original data, but it would take time to realign since we have many samples. samtools flagstat returned same results for data aligned with the suspicious reference and the UCSC reference. Isn't it enough to prove that the suspicious reference is legit?

ADD REPLY • link 7.4 years ago by michal.inbar • 0

1

Entering edit mode

It really depends on what you want to do next. Anytime a reference is in question (or not of defined provenance) all downstream analysis may be in doubt, especially if someone needs to/tries to replicate your analysis. Realigning the data will take time now but you won't need to worry about questions down the road.

ADD REPLY • link 7.4 years ago by GenoMax 141k

score 1 · Answer 1 · 2016-12-08

It looks like your odd hg19 is sorted by chromosome length. Someone may have downloaded the chromosome sequences individually and then concatenated them in order of decreasing size. (I vaguely recall the GATK resource bundle did it this way at one point.)

If you have the original and current hg19 FASTA sequence files, you can split both files into individual chromosomes (e.g. 'fasta-explode' in the Exonerate package, or in a script), then verify that each pair of chromosomes is identical, either with md5sum or just comparing sequence identify (slower, but should still take no more than a few minutes with Biopython's SeqIO module on a reasonably current laptop). Or just re-sort the original hg19 chromosomes into the same order as the current one, verify that line wrapping width is the same, and md5sum the whole files.

That being said, you might be better off re-running your current pipeline from scratch, including read mapping, using the most recent version of an appropriate aligner like BWA-MEM -- your existing BAMs may not have been aligned as well as they could be, and you might want to set read groups and other tags while you're at it.

There is no fixed SOP for genome analysis; each project is different. ReorderSam will correctly reorder the reference sequences in a SAM/BAM file, and that's a sensible thing to do if you've verified that the only difference between the references is the chromosome ordering and that the original read mapping was done the way you want it. But if you plan to include any more samples in your analysis, you'll be better off choosing your reference genome and alignment methods now, and keeping them fixed for the duration of the project, or until you're ready to validate an updated pipeline. And that may require repeating previous steps to ensure your methods are consistent across the project.