Question: How to determine reference genome that was used for alignment given BAM with obscure or no header
gravatar for
2.9 years ago by
United States
Irene@Sequencing.com220 wrote:

Is there any other way to determine for a BAM file what human reference genome was used for alignment if this info is not able to be ascertained from the header of the BAM file? (We have come across some human BAM files where the header does not contain enough info to determine the reference genome used for alignment.)

ADD COMMENTlink modified 2.9 years ago by Ryan Thompson3.3k • written 2.9 years ago by Irene@Sequencing.com220

It would be great if you can paste one such header. I would go by the name, number of chromosomes to know which organism's reference genome was used for the alignment. You can also pick up an aligned read from the BAM file and blat it against several organisms and check if the output coordinates from the blat match with coordinates specified in the BAM file. Of course, everything I mentioned will only work for known popular reference genomes and not for a very specific yeast of strain that was de novo assembled by a particular lab and used as a reference genome for some RNA-seq experiment. 

PS: OP edited the question and is interested in knowing which reference assembly was used for alignment rather than which organism ?

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Ashutosh Pandey11k

Thank you for your reply. It is a human genome (this is known and to clarify I just added the info to the original question).

Here is the header info:

@PG ID:bwa PN:bwa VN:0.7.9a-r786 CL:/mnt/isilon/illumina/bioinfo/bin/bwa mem -M -t 6 -w500 -O5 /mnt/isilon/illumina/reference/Panels/HGA-SURESELECTV4/HGA-SURESELECTV4.unmasked.fa /mnt/isilon/illumina/Samples/10/11/HGA-SURESELECTV4/Igel_639a_Lane0_CCTAATCC/data/11_HGA-SURESELECTV4_Igel_639a_Lane0_CCTAATCC_R1_.fq.ntrim.gz /mnt/isilon/illumina/Samples/10/11/HGA-SURESELECTV4/Igel_639a_Lane0_CCTAATCC/data/11_HGA-SURESELECTV4_Igel_639a_Lane0_CCTAATCC_R2_.fq.ntrim.gz
ADD REPLYlink modified 2.9 years ago by Pierre Lindenbaum107k • written 2.9 years ago by Irene@Sequencing.com220

As is suggested by others, run samtools view -H and give the output. A mapped BAM has to have @SQ lines. If the BAM generator allows to produce a BAM without @SQ, it is a bug and should get reported to the developers. Once you have @SQ, the easiest way to find the reference build is to match the chromosomes lengths. It works most of time. Some BAMs have md5sum. That will be more reliable.

ADD REPLYlink written 2.9 years ago by lh330k

It's weird that there are no SN tags in the header. SN or reference sequence name are required tags so every bam file should have them. It is an exome sequencing data because they used SureSelect V4 capture kit (Agilent) and they have used HGA-SURESELECTV4.unmasked.fa  as a reference genome. I would select a few aligned reads from the bam file and blat it against different versions of human genome at UCSC genome browser, and compare the coordinates in the bam file with the blat results. This should tell you about the reference assembly (hg19 or hg38) used. This approach may not distinguish between smaller subversion changes within a given assembly but it should be able to distinguish between hg18 and hg19 or hg19 and hg38.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Ashutosh Pandey11k

Where did you get these BAM files? Are they from published results? Or collaborators? Best bet would be reading the paper or asking.

Anyway, by the command-line from the BAM file, I would guess reads were aligned to a fasta obtained somehow from Agilent SureSelectV4, maybe just the probes, maybe to the exons on which the probes match. But it is just a wild guess, really.

ADD REPLYlink written 2.9 years ago by h.mon14k

The BAM files were uploaded by one of our users (it is from clinical exome sequencing). Due to the number of BAM files that are being uploaded each day, we are seeing a lot of variations in the headers. Because of this, we are focused on automating a way to determine the reference genome in cases like this (where the header doesn't provide enough info, the header may have been obscured for some reason or the header may even be missing) as opposed to asking the lab that produced the file.

So far it sounds like using BLAT may provide a way to automate the determination of the major reference genome used for the alignment so we'll be looking into this further (how to automate this using BLAT).

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by Irene@Sequencing.com220

For a mapped BAM, @SQ info can't be missing, or the file is corrupted.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by lh330k

Well then, I would suggest to force the Users to report the genome versions they used to generated the BAM files. Or even let them upload the reference in cases it is not a common version. You can develop a script based on my answer to check for the common versions. If Users tend to use many different versions, it is actually best to make a guideline of which version to use. If not possible, in special, Users have to report genome version and upload reference file. You can make a sample submission page to help with this. 

ADD REPLYlink written 2.9 years ago by biocyberman650
gravatar for biocyberman
2.9 years ago by
biocyberman650 wrote:

This require some manual work:

1. List reference sequence names and sizes from the bam file:

samtools view -H  mybam.BAM |grep '^@SQ'>chromosomes.txt

2. Compare chromosome names and sizes with that of various versions of human genome:

hg18, GRCh36, hg19, GRCh37, hg38, GRCh38

Some clues:

  • Each version has different sizes of many chromosomes. So chromosome sizes will help to differentiate major releases.
  • hg* line (for the lack of better term in my head right now) names chromosomes starting with 'chr', whereas GRCh* line does with only chromosome number: '1' is equivalent with chr1. So chromosome naming helps differentiate line.

For calculation of chromosome length from a fasta file (i.e. of the human genome), install bioawk and run this:

bioawk -c fastx '{print $name"\t"length($seq)}' hg19.fa



ADD COMMENTlink written 2.9 years ago by biocyberman650
gravatar for Ryan Thompson
2.9 years ago by
Ryan Thompson3.3k
TSRI, La Jolla, CA
Ryan Thompson3.3k wrote:

If you know that your BAM file represents RNA-seq or exome data, you can count the number of reads overlapping exon coordinates for every relevant genome build and see which one gives the maximum percent of reads overlapping gene exons. The genome build with the highest % overlap is the correct one. You can do likewise for any other type of data that is enriched for a known set of genomic regions (e.g. promoters, miRNA, etc.).

Another option is to simply extract the reads and re-align them yourself, and either compare to see which genome build yields the most similar alignments, or just forget about the original alignments and use your new ones instead.

ADD COMMENTlink written 2.9 years ago by Ryan Thompson3.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1550 users visited in the last hour