Reference File From Bam
9.7 years ago
win ▴ 910

Hi all, I hope someone can help. I will be getting several BAM files from users, and they may not tell us which reference file to use. Is there any way to look into a BAM file and know for sure which reference was used, or at least some way to infer it?

For example, I believe that 1000 Genomes uses their own reference file, whereas Illumina used UCSC FASTA files?

Any ideas?

Thanks, A

reference file bam • 6.6k views

Uhm... yell at those users? Apply the LART until they provide the required information?

I have a really difficult time seeing why it's necessary to accommodate lusers who don't even know what their own files contain.


What is a reference file, why do you need it for a BAM, and where did Illumina use UCSC FASTA files?


1000 Genomes uses the GRCh37 reference (still a FASTA file, though), which should have identical coordinates for autosomes as UCSC hg19. The chromosome naming conventions differ, though, and there are some unplaced contigs/scaffolds that differ between the two as well, I think.

9.7 years ago
matted 7.6k

That's an unfortunate situation, but maybe unavoidable sometimes.

You will get chromosome names and lengths in the header of the BAM (samtools view -H test.bam).

Good pipelines will put in optional fields that describe each reference well, e.g. from a 1000 Genomes BAM:

@SQ     SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human


Minimal pipelines will just have the length of each chromosome:

@SQ     SN:chr1     LN:249250621
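
As a rough sketch of how the two kinds of header line above could be read programmatically, the following Python pulls the `@SQ` records out of the header text and turns each one into a dictionary keyed by its two-letter tag. The function names `read_header` and `parse_sq_lines` are made up for this illustration, and `read_header` assumes `samtools` is on the PATH.

```python
import subprocess


def read_header(bam_path):
    """Return the header text of a BAM by running `samtools view -H`."""
    result = subprocess.run(
        ["samtools", "view", "-H", bam_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_sq_lines(header_text):
    """Return one dict per @SQ line, e.g. {"SN": "chr1", "LN": 249250621}."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Real SAM headers are tab-separated; fall back to any whitespace
        # for headers pasted from a web page, where tabs become spaces.
        fields = line.split("\t") if "\t" in line else line.split()
        record = {}
        for field in fields[1:]:
            # Split on the first colon only, so URLs in UR: stay intact.
            tag, sep, value = field.partition(":")
            if sep:
                record[tag] = value.strip()
        if "LN" in record:
            record["LN"] = int(record["LN"])
        records.append(record)
    return records
```

Something like `parse_sq_lines(read_header("test.bam"))` would then show at a glance whether the optional `M5`, `UR` and `AS` tags are present or only `SN` and `LN`.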


In the worst case, you can infer the naming convention of the reference from the chromosome names (and the number of sequences), and the assembly version from the sizes. I think the lengths differ between assemblies, e.g. from hg17 to hg18 to hg19 (chr1 is 247,249,719 bp in hg18 but 249,250,621 bp in hg19). If for some reason they don't, you can look at reads around sites where the references differ and see which allele is counted as matching the reference.
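
The inference from names and sizes could be sketched roughly as below. This is only an illustration: `guess_reference` and the `CHR1_LENGTHS` table are hypothetical names, the table covers chromosome 1 only, and its values should be checked against the official chromosome size files before relying on them.

```python
# Length of chromosome 1 in a few human assemblies.
# Assumption: verify these against the official chrom.sizes files.
CHR1_LENGTHS = {
    247249719: "hg18 / NCBI36",
    249250621: "hg19 / GRCh37",
    248956422: "hg38 / GRCh38",
}


def guess_reference(lengths):
    """Guess (naming style, assembly) from a {sequence name: length} dict.

    Either element is None when it cannot be determined.
    """
    if "chr1" in lengths:
        naming, chr1_length = "chr-prefixed (UCSC style)", lengths["chr1"]
    elif "1" in lengths:
        naming, chr1_length = "unprefixed (Ensembl/NCBI style)", lengths["1"]
    else:
        return None, None
    # Unknown lengths map to None rather than to a wrong guess.
    return naming, CHR1_LENGTHS.get(chr1_length)


print(guess_reference({"1": 249250621, "2": 243199373}))
print(guess_reference({"chr1": 247249719}))
```

A match on chromosome 1 alone cannot tell apart variants of the same assembly (for example plain GRCh37 versus a build with extra decoy sequences), so the total number of `@SQ` lines and any `M5` checksums are worth comparing as well.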

This is all pretty nasty, though. You could also realign the reads to a reference that you choose and know, by handing the BAM directly to bwa (other aligners take BAM input directly by now as well).