Question

Reference File From Bam

3

11.5 years ago

win ▴ 970

Hi all,

Hope someone can help. I will be getting several BAM files from users, and they may not tell us which reference file to use. Is there any way to look into a BAM file and know for sure, or at least infer, which reference was used?

For example, I believe that 1000 Genomes uses its own reference file whereas Illumina used UCSC FASTA files?

Any ideas?

Thanks,
A

bam • 7.9k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 11.5 years ago by win ▴ 970

2

Uhm... yell at those users? Apply the LART until they provide the required information?

I have a really difficult time seeing why it's necessary to accommodate lusers who don't even know what their own files contain.

ADD REPLY • link 11.5 years ago by Marvin ▴ 890

0

What is a reference file, why do you need it for a BAM, and where did Illumina use UCSC FASTA files?

ADD REPLY • link 11.5 years ago by Sukhi Singh 11k

0

1000 Genomes uses the GRCh37 reference (still a FASTA file, though), which should have identical coordinates for autosomes as UCSC hg19. The chromosome naming conventions differ though, and there are some unplaced contigs/scaffolds that differ between the two as well, I think.

ADD REPLY • link 11.5 years ago by Matt Shirley 10k

score 6 · Answer 1 · 2012-10-24

That's an unfortunate situation, but maybe unavoidable sometimes.

You will get chromosome names and lengths in the header of the BAM (samtools view -H test.bam).

Good pipelines will put in optional fields that describe each reference well, e.g. from a 1000 Genomes BAM:

@SQ     SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human

Minimal pipelines will just have the length of each chromosome:

@SQ     SN:chr1     LN:249250621
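
To compare headers from many BAMs, it can help to turn the @SQ lines into dictionaries. The following is a minimal sketch, assuming the header text has already been dumped with samtools view -H; the function name parse_sq_lines and the sample header are made up for illustration, and fields are split on any whitespace so that pasted headers with spaces instead of tabs still parse.

```python
def parse_sq_lines(header_text):
    """Return one dict of tag -> value per @SQ line in a SAM/BAM header."""
    records = []
    for line in header_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue
        # Split on the first colon only: values such as UR contain colons.
        records.append(dict(f.split(":", 1) for f in fields[1:] if ":" in f))
    return records


# Illustrative header text (hypothetical, shortened from the examples above).
header = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\tM5:1b22b98cdeb4a9304cb5d48026a85128\tAS:NCBI37\tSP:Human\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
)

for rec in parse_sq_lines(header):
    # Optional tags (AS, M5, UR, SP) may be missing in minimal headers.
    print(rec["SN"], rec["LN"], rec.get("AS", "-"), rec.get("M5", "-"))
```

When the M5 tag is present, it is the strongest clue, since it is a checksum of the reference sequence itself rather than just a name or length.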

In the worst case, you can infer the reference source from the chromosome names (and number of chromosomes) and the assembly version from the sizes. The lengths differ between assembly versions, e.g. from hg17 to hg18 to hg19 (chr1 is 247,249,719 bases in hg18 and 249,250,621 in hg19). If for some reason they don't, you can look at reads covering sites where the candidate references differ and see which allele is treated as matching the reference.
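
A rough sketch of that worst-case inference, using only the name and length of chromosome 1. The function name guess_reference and the lookup table are illustrative assumptions: the table covers three human assemblies only and should be checked against the chromosome sizes of the references you actually hold.

```python
# chr1 lengths for a few human assemblies; an illustrative table only,
# verify it against the size files of your own candidate references.
CHR1_LENGTHS = {
    247249719: "hg18 / NCBI36",
    249250621: "hg19 / GRCh37",
    248956422: "hg38 / GRCh38",
}


def guess_reference(seq_lengths):
    """Guess (assembly, naming style) from a dict of @SQ SN -> integer LN."""
    if "chr1" in seq_lengths:
        style, length = "UCSC-style names (chr1)", seq_lengths["chr1"]
    elif "1" in seq_lengths:
        style, length = "Ensembl/1000 Genomes-style names (1)", seq_lengths["1"]
    else:
        # Not a human-like header, or chromosome 1 is absent.
        return None, None
    return CHR1_LENGTHS.get(length, "unknown assembly"), style


print(guess_reference({"chr1": 249250621}))
print(guess_reference({"1": 249250621}))
```

Both calls report the same assembly but different naming styles, which is exactly the hg19 versus GRCh37 situation: same chr1 coordinates, different FASTA files.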

This is all pretty nasty though. You could also realign the reads to a known reference of your choice by handing the BAM directly to bwa (other aligners accept BAM input by now as well).