9.7 years ago
win ▴ 910

Hi all, hope someone can help. I will be getting several BAM files from users, and they may not tell us which reference file to use. Is there any way to look into a BAM file and know for sure, or some way that I could infer which reference to use?

For example, I believe that 1000 Genomes uses its own reference file, whereas Illumina used UCSC FASTA files?

Any ideas?

Thanks, A

Uhm... yell at those users? Apply the LART until they provide the required information?

I have a really difficult time seeing why it's necessary to accommodate lusers who don't even know what their own files contain.

What is a reference file, why do you need it for a BAM, and where did Illumina use UCSC FASTA files?

1000 Genomes uses the GRCh37 reference (still a FASTA file, though), which should have identical coordinates for autosomes as UCSC hg19. The chromosome naming conventions differ though, and there are some unplaced contigs/scaffolds that differ between the two as well, I think.

9.7 years ago
matted 7.6k

That's an unfortunate situation, but maybe unavoidable sometimes.

You will get chromosome names and lengths in the header of the BAM (samtools view -H test.bam).

Good pipelines will put in optional fields that describe each reference well, e.g. from a 1000 Genomes BAM:

@SQ     SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128     UR:        AS:NCBI37       SP:Human

Minimal pipelines will just have the length of each chromosome:

@SQ     SN:chr1     LN:249250621
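If you have many BAMs to check, it may help to pull the @SQ fields into something you can compare programmatically. Here is a minimal Python sketch, assuming you have already captured the header text (for example from samtools view -H). The function name parse_sq_lines is made up for illustration. It splits on any whitespace so that headers pasted like the ones above also parse, which means a tag value containing spaces would be cut short.

```python
def parse_sq_lines(header_text):
    """Return one dict per @SQ line, keyed by the two-letter tag (SN, LN, M5, ...)."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD, @RG, @PG and anything else
        fields = {}
        # Real SAM headers are tab-separated; splitting on any whitespace
        # is an assumption made here so that pasted examples work too.
        for token in line.split()[1:]:
            tag, sep, value = token.partition(":")
            if sep:
                fields[tag] = value  # value may be empty, e.g. a blank UR
        records.append(fields)
    return records


if __name__ == "__main__":
    example = (
        "@SQ\tSN:1\tLN:249250621\tM5:1b22b98cdeb4a9304cb5d48026a85128\tAS:NCBI37\tSP:Human\n"
        "@SQ\tSN:chr1\tLN:249250621\n"
    )
    for rec in parse_sq_lines(example):
        print(rec.get("SN"), rec.get("LN"), rec.get("AS", "(no AS tag)"))
```

When the AS or M5 tags are present they are the most direct evidence of the reference; SN and LN are the fallback.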

In the worst case, you can infer the reference from the chromosome names (and number of chromosomes), and the assembly version from the chromosome sizes. I think the sizes differ by at least a few bases between versions, e.g. from hg17 to hg18 to hg19. If for some reason they don't, you can look at reads around sites where the candidate references differ and see which allele is called as matching the reference.
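A rough Python sketch of that name-and-size inference, looking only at chromosome 1. The function name guess_reference and the lookup table are illustrative assumptions. Only the 249250621 length comes from the headers shown above; the other two lengths are quoted from memory and should be checked against the published chromosome sizes before you rely on them.

```python
# chr1 length -> assembly. Only the GRCh37/hg19 value appears in the
# headers above; the other entries are from memory, so verify them.
CHR1_LENGTHS = {
    247249719: "NCBI36 / hg18",
    249250621: "GRCh37 / hg19",
    248956422: "GRCh38 / hg38",
}


def guess_reference(sq_records):
    """Guess (assembly, naming style) from an iterable of (name, length) pairs."""
    lengths = {name: int(length) for name, length in sq_records}
    for name in ("chr1", "1"):
        if name in lengths:
            assembly = CHR1_LENGTHS.get(lengths[name], "unknown")
            if name.startswith("chr"):
                style = "UCSC-style (chr prefix)"
            else:
                style = "Ensembl/NCBI-style (no prefix)"
            return assembly, style
    # No recognisable chromosome 1: probably not human, or unusual naming.
    return "unknown", "unknown"


if __name__ == "__main__":
    print(guess_reference([("chr1", 249250621)]))
    print(guess_reference([("1", "249250621")]))
```

A single chromosome is a weak test. Comparing every name and length against a candidate reference's index would be more convincing, and it still cannot tell apart two references that share coordinates but differ in sequence.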

This is all pretty nasty though. You could also realign the reads to a reference that you choose and know, by handing the BAM directly to bwa (other aligners accept BAM input by now as well).


