Comments On This Feature: Alignment Coordinate Projection From 1.Bam To 2.Bam To Create 3.Bam?
2
5
10.5 years ago

After gathering some comments about this in the previous post, I would like to know where people think this feature would best fit:

Given a BAM file 'short-A' of short reads aligned against the thousands of assembled contigs of genome A, one wants to project the entries in 'short-A' onto the chromosomes of genome B. Genomes A and B are too divergent to align the short reads from A directly to genome B, but most of the thousands of assembled contigs from genome A can be aligned to genome B, producing a 'contigs-A' file against genome B. The method would place each of the millions of short reads in my 'short-A' BAM file in genome B coordinates, using the 'contigs-A' alignments as the guide: each position in a read's CIGAR string is projected from the coordinate system of one alignment set to that of the other.

From what I've found googling around, I think this could go into:

• Samtools
• Bamtools
• any other option?

It would look something like:

coordinateprojection -a readsA-contigsA.bam -b contigsA-chrsB.bam -o readsA-chrsB.bam


Just as an example dataset, here are the mouse read sets used for contig assembly, piled up against the mouse contigs; shown is just the whole gene body of mouse PAX2, PAX5 and PAX8. There are also the mouse assembled contigs aligned to human using lastz with chaining.

And this would be run as something like:

coordinateprojection -a readsMousecontigsMouse.bam -b contigsMousechrsHuman.bam -o readsMousechrsHuman.bam


Again, this is not BED-like interval information but actual nucleotide-by-nucleotide alignment coordinates, so it would involve projecting each CIGAR string position by position, including insertions and deletions.
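To make the position-by-position projection concrete, here is a minimal Python sketch of composing two CIGAR strings. It is not an existing samtools or bamtools feature. The function names (`parse_cigar`, `query_to_ref`, `project_read`) are made up for illustration, and the sketch assumes forward-strand alignments, 0-based positions, only the M, I, D and S operations, and a contig alignment whose CIGAR accounts for every contig base.

```python
import re
from itertools import groupby

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs; only M, I, D, S are handled."""
    if not re.fullmatch(r"(\d+[MIDS])+", cigar):
        raise ValueError(f"unsupported CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDS])", cigar)]

def query_to_ref(ref_start, cigar):
    """For each query base, its 0-based reference position, or None if unaligned."""
    mapping, ref = [], ref_start
    for n, op in parse_cigar(cigar):
        if op == "M":
            mapping.extend(range(ref, ref + n))
            ref += n
        elif op == "D":
            ref += n                      # reference bases with no query base
        else:
            mapping.extend([None] * n)    # I or S: query bases with no reference base
    return mapping

def project_read(read_start, read_cigar, contig_start, contig_cigar):
    """Project one read from contig coordinates to chromosome coordinates.

    Returns (chrom_start, cigar), or None if no read base reaches the chromosome.
    """
    contig_to_chrom = query_to_ref(contig_start, contig_cigar)
    ops, first, last = [], None, None
    contig_pos = read_start
    for n, op in parse_cigar(read_cigar):
        for _ in range(n):
            if op in "IS":                # read base absent from the contig
                ops.append(op)
                continue
            target = contig_to_chrom[contig_pos]
            contig_pos += 1
            if target is None:            # contig base absent from the chromosome
                if op == "M":
                    ops.append("I")
                continue
            if first is None:
                if op == "D":
                    continue              # no deletions before the first aligned base
                first = target
            elif target > last + 1:       # chromosome bases skipped by the contig
                ops.extend("D" * (target - last - 1))
            ops.append(op)
            last = target
    if first is None:
        return None
    # Drop deletions left dangling after the last aligned base.
    end = len(ops) - ops[::-1].index("M")
    ops = ops[:end] + [o for o in ops[end:] if o != "D"]
    return first, "".join(f"{len(list(g))}{o}" for o, g in groupby(ops))

print(project_read(40, "20M", 1000, "50M10D50M"))  # (1040, '10M10D10M')
print(project_read(45, "20M", 1000, "50M5I45M"))   # (1045, '5M5I10M')
```

The per-base expansion keeps the logic easy to check at the cost of speed. Read bases that fall in a contig insertion are reported as I even at the read ends, where a real tool would probably soft-clip instead.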

Looking forward to suggestions.

My idea right now is to add this feature as a patch against samtools, since it would be easy to make use of its ftp/http fetching capabilities. Given that database providers like Ensembl/UCSC could easily produce speciesAspeciesB.bam ftp dumps from their pre-computed genomic alignments, this would make the projection method very simple and straightforward for users.

bam samtools liftover • 2.7k views
2

Surely this is something you should write yourself? It's rather niche.

0

can you add a link to the previous discussion?

0

@Aaron Statham: thanks for the comment. My idea right now is to add this feature as a patch against samtools, since it would be useful to make use of its ftp/http fetching abilities. Given that database providers like Ensembl/UCSC could easily produce speciesAspeciesB.bam ftp dumps from their pre-computed genomic alignments, this would make the projection method very simple and straightforward for users.

0

I think the most difficult point in this scenario is the treatment of gaps and mismatches in the alignments between the genomes. How should, e.g., a mouse-to-human alignment like 100M20I/D100M (>90% identity) be treated? If you used the tool to project coverage from mouse-aligned reads to human chromosomes, what coverage should be assigned to the gaps? Is it possible to come up with a sensible definition?
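One possible convention, sketched below in Python, is to mark human bases that have no mouse counterpart as undefined rather than as zero coverage, and to drop mouse bases that have no human counterpart. This is only an illustration of the choice being asked about, not a recommendation. The function name `project_coverage` and the `gap_value` parameter are made up, and the CIGAR is read with mouse as the query and human as the reference.

```python
import re

def project_coverage(coverage, cigar, gap_value=None):
    """Project per-base coverage from the query genome onto the reference genome.

    coverage: depth per query (e.g. mouse) base covered by the alignment.
    cigar:    query-vs-reference CIGAR using only M, I and D.
    Returns one value per reference (e.g. human) base spanned by the alignment.
    """
    if not re.fullmatch(r"(\d+[MID])+", cigar):
        raise ValueError(f"unsupported CIGAR: {cigar!r}")
    projected, i = [], 0
    for n, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(n)
        if op == "M":
            projected.extend(coverage[i:i + n])   # aligned bases keep their depth
            i += n
        elif op == "I":
            i += n                                # query-only bases are dropped
        else:
            projected.extend([gap_value] * n)     # reference-only bases: undefined
    return projected

deletion_case = project_coverage([7] * 200, "100M20D100M")
insertion_case = project_coverage([7] * 220, "100M20I100M")
print(len(deletion_case), deletion_case[99:101])   # 220 [7, None]
print(len(insertion_case))                         # 200
```

Passing `gap_value=0` instead would treat the gap as genuinely uncovered, which is a different biological claim; the choice probably depends on the downstream analysis.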

5
10.5 years ago

Perhaps I am being naive, but it seems that this is exactly the sort of thing that the UCSC liftover tool is meant to handle. Liftover is not restricted to different genome builds of the same species. It can be used to project mouse alignments to human; the projections are limited to the species for which the UCSC folks have already generated "chain" files.

So, a basic recipe for your problem using a mix of UCSC liftover and BEDTools might be:

1. Convert your 'short-A' BAM file to BED format (bamToBed).
2. Use the liftover tool to project these BED features to genomeB.
3. Convert the projected BED features to BAM (bedToBam) format.
4. Sort and index the projected BAM.
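
The four steps above can be written down as a small Python helper that only builds the shell commands, without running them. The helper name `liftover_recipe` and all file names are hypothetical. `bedtools bamtobed` and `bedtools bedtobam` are the current spellings of bamToBed and bedToBam, and the chain file and genome-sizes file are assumed to be supplied by the user.

```python
import shlex

def liftover_recipe(bam, chain, genome_sizes, prefix):
    """Return the shell commands for the BAM -> BED -> liftOver -> BAM recipe."""
    q = shlex.quote
    bed = f"{prefix}.bed"
    lifted = f"{prefix}.lifted.bed"
    unmapped = f"{prefix}.unmapped.bed"
    unsorted_bam = f"{prefix}.lifted.bam"
    sorted_bam = f"{prefix}.lifted.sorted.bam"
    return [
        # 1. BAM alignments to BED intervals
        f"bedtools bamtobed -i {q(bam)} > {q(bed)}",
        # 2. Project the intervals to genome B; failures go to the unmapped file
        f"liftOver {q(bed)} {q(chain)} {q(lifted)} {q(unmapped)}",
        # 3. Projected intervals back to BAM (needs genome B chromosome sizes)
        f"bedtools bedtobam -i {q(lifted)} -g {q(genome_sizes)} > {q(unsorted_bam)}",
        # 4. Sort and index
        f"samtools sort -o {q(sorted_bam)} {q(unsorted_bam)}",
        f"samtools index {q(sorted_bam)}",
    ]

commands = liftover_recipe("short-A.bam", "mm_to_hg.chain", "human.sizes", "reads")
print("\n".join(commands))
```

The resulting BAM records carry only interval coordinates, not the original read-level alignment detail.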
2

ah right, that level of detail would certainly be lost.

2

I suspect this is a case where you will need to write your own tool...

0

I don't think converting to BED would allow me to project the CIGAR strings position by position, including insertions and deletions.
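A small Python illustration of why the interval form loses this detail: two reads with different CIGAR strings can collapse to the same BED interval. The helper `bed_interval` and the contig name are made up for the example.

```python
import re

def bed_interval(chrom, start, cigar):
    """Return the (chrom, start, end) interval a read would occupy in BED form."""
    # Only operations that consume the reference contribute to the interval length.
    ref_len = sum(
        int(n)
        for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "MDN=X"
    )
    return chrom, start, start + ref_len

ungapped = bed_interval("contig_17", 100, "50M")
gapped = bed_interval("contig_17", 100, "20M5I10M2D18M")
print(ungapped, gapped, ungapped == gapped)
```

Both reads cover reference positions 100 to 150, so once they are reduced to intervals there is no way to tell which one carried the insertion and the deletion.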

0

I've edited my question, I think it may be clearer now. Thx!

0

I kind of feel this should be in samtools or bamtools or one of these, so I want to make sure it makes sense as a feature overall.

0

@aaronQuinlan, I've rewritten my question as a request for comments.

4
10.5 years ago

This would probably be best as a stand-alone tool for now. Let it evolve until it is stable and broadly applicable. Later you can make a GitHub pull request offering project-coordinates for inclusion in whichever toolset is appropriate at the time.

Bamtools and Samtools are both MIT-licensed, so you may want to start with that license too, so that you can merge code at any time with little or no discussion.