Question

Using 2 different references in my pipeline

0

7.5 years ago

michal.devir • 0

Hi,

I am running some analyses on BAM files that I downloaded from ICGC. I don't have the original FASTQ files, just the BAMs. The BAMs were aligned against a different reference than the one I use in my pipeline: I use hg19 from UCSC, and although I am not sure which reference was used for the BAMs, I think it is the Ensembl reference. The result is a different naming convention ('1' vs 'chr1', 'GL000241.1' vs 'chrUn_gl000241') and a different order of contigs. This causes problems, for example when working with GATK. What is the right way to handle such a situation?

Thanks, Michal.

genome next-gen ICGC • 1.7k views

updated 7.5 years ago by igor 13k • written 7.5 years ago by michal.devir • 0

1

Download the files your pipeline needs for that other reference, or extract the reads from the BAM files and map them against your reference. I don't think there is an easy way out here.

7.5 years ago by Zaag ▴ 860

0

Thanks a lot. I had hoped there was a simple way to do that, but I guess I'll have to work hard for it... :-/

7.5 years ago by michal.devir • 0

1

It is additional work but not necessarily hard :)

Hopefully your BAM has both mapped and unmapped reads. Otherwise you are missing part of the original data.
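One way to check this is to look at the output of `samtools idxstats`, which prints one tab-separated line per contig (name, length, mapped count, unmapped count) plus a final `*` line for unmapped reads without coordinates. Below is a minimal sketch that totals those columns. The function name `summarize_idxstats` and the read counts in the example are made up for illustration, and it assumes you have already captured the idxstats text from an indexed BAM.

```python
def summarize_idxstats(text):
    """Return (mapped, unmapped) totals from `samtools idxstats` output.

    Each line has four tab-separated columns:
    contig name, contig length, mapped reads, unmapped reads.
    The last line ("*") holds unmapped reads that have no coordinates.
    """
    mapped = 0
    unmapped = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _name, _length, n_mapped, n_unmapped = line.split("\t")
        mapped += int(n_mapped)
        unmapped += int(n_unmapped)
    return mapped, unmapped


# Illustrative input only; the counts are invented.
example = (
    "1\t249250621\t1000\t5\n"
    "GL000241.1\t42152\t10\t0\n"
    "*\t0\t0\t250\n"
)
mapped, unmapped = summarize_idxstats(example)
print(f"mapped={mapped} unmapped={unmapped}")
```

If the unmapped total is zero, the BAM was probably filtered to mapped reads only, and re-aligning it will not recover the reads that were dropped.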

ADD REPLY • link 7.5 years ago by GenoMax 142k

score 2 · Answer 1 · 2016-11-01

Try CrossMap to convert between different references: http://crossmap.sourceforge.net/

You may still run into problems with GATK because it needs not only the same contig names but also the same contig order, so you may also need to sort all the files again. You could also run into issues with alternate contigs if some of the files have them and others do not.
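To see how far apart a BAM and a reference are before handing them to GATK, you can compare the `@SQ` lines of the BAM header (for example from `samtools view -H`) with the contig list of the reference (for example the first column of its `.fai` index). The following is only a sketch: `sq_names` and `compare_contigs` are hypothetical helper names, and the contig lists in the example are illustrative rather than the real hg19 order.

```python
def sq_names(sam_header_text):
    """Return contig names from the @SQ lines of a SAM header, in header order."""
    names = []
    for line in sam_header_text.splitlines():
        if line.startswith("@SQ"):
            # Fields look like "SN:chr1" or "LN:249250621".
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
            names.append(fields["SN"])
    return names


def compare_contigs(bam_names, ref_names):
    """Return (names missing from the reference, whether shared contigs share an order)."""
    ref_set = set(ref_names)
    bam_set = set(bam_names)
    missing = [n for n in bam_names if n not in ref_set]
    shared_in_bam_order = [n for n in bam_names if n in ref_set]
    shared_in_ref_order = [n for n in ref_names if n in bam_set]
    return missing, shared_in_bam_order == shared_in_ref_order


# Illustrative header and reference order, not taken from real files.
header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:chr2\tLN:243199373\n"
    "@SQ\tSN:chrM\tLN:16571\n"
)
reference_order = ["chrM", "chr1", "chr2"]

missing, same_order = compare_contigs(sq_names(header), reference_order)
print("missing from reference:", missing)
print("same order:", same_order)
```

Here the names all match but the order does not, which is the situation in which the files would still need to be re-sorted.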

score 1 · Answer 2 · 2016-11-01

Two solutions: if you really want to use the hg19 reference and not the Ensembl reference, convert the BAM to FASTQ and perform the alignment against your preferred reference. Alternatively, change your annotation to the Ensembl annotation. It is best to use a matching reference and annotation. You can probably try some nasty hacks such as changing the chromosome identifiers, but that will not make you happy in the long run.
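For the first solution, a rough sketch of the BAM to FASTQ to re-alignment route is shown here as a small Python helper that only builds the shell commands and prints them instead of running them. Several things are assumptions and not from this thread: paired-end reads, `bwa mem` as the aligner, a reference that has already been indexed with `bwa index`, and the file names. The helper name `realign_commands` is hypothetical. Read groups from the original BAM are not carried over by this sketch, and GATK will want them.

```python
import shlex


def realign_commands(bam, reference_fasta, out_prefix):
    """Build (not run) shell commands to re-align a paired-end BAM.

    Assumes samtools and bwa are installed and the reference is bwa-indexed.
    """
    bam_q = shlex.quote(bam)
    ref_q = shlex.quote(reference_fasta)
    r1 = shlex.quote(f"{out_prefix}_R1.fastq")
    r2 = shlex.quote(f"{out_prefix}_R2.fastq")
    out_bam = shlex.quote(f"{out_prefix}.bam")
    return [
        # Group mates together, then write read 1 and read 2 to separate files;
        # singletons and unpaired reads are discarded here.
        f"samtools collate -u -O {bam_q} | "
        f"samtools fastq -1 {r1} -2 {r2} -0 /dev/null -s /dev/null -n -",
        # Align against the preferred reference and coordinate-sort the result.
        f"bwa mem {ref_q} {r1} {r2} | samtools sort -o {out_bam} -",
    ]


# Hypothetical file names, printed as a dry run.
for command in realign_commands("sample.bam", "hg19.fa", "sample_hg19"):
    print(command)
```

Printing the commands first makes it easy to review them, and to add options such as threads or a read-group string, before running anything on large ICGC files.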