Change Coordinates -A 1.Bam -B 2.Bam -O 3.Bam?
4
4
Entering edit mode
12.8 years ago

I've got a BAM file 'short-A' of my short reads against the thousands of assembled contigs of my genome A, and I now would like to project the entries in BAM file 'short-A' onto the chromosomes in genome B. Genome A and genome B are too different to align the short reads from A straight to genome B, but close enough that I can align most of the thousands of assembled contigs from genome A to genome B, producing a 'contigs-A' file against genome B. So I now want to place each of the millions of short reads in my 'short-A' BAM file in the coordinates for genome B using the 'contigs-A' alignments as the guide.

Is there a standard way of doing that?

I don't know if it makes sense to have this in samtools or somewhere else, but I imagine it could be something like:

coordinateprojection -a readsAcontigsA.bam -b contigsAchrsB.bam -o readsAchrsB.bam

Just as an example dataset, here is the mouse readsets used for assembly piled up against the mouse contigs (here just the whole gene body of mouse PAX2, PAX5 and PAX8), and the mouse assembled contigs aligned to human using lastz:

ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/readsMousecontigsMouse.bam ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/readsMousecontigsMouse.bam.bai

ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/contigsMousechrsHuman.bam ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/contigsMousechrsHuman.bam.bai

And I would like to have a tool that does something like:

coordinateprojection -a readsMousecontigsMouse.bam -b contigsMousechrsHuman.bam -o readsMousechrsHuman.bam

Cheers

bam liftover samtools • 5.3k views
ADD COMMENT
3
Entering edit mode

Why not re-align the reads to the new reference?

ADD REPLY
3
Entering edit mode

What do you use for genome A vs genome B alignment? What is the format of the output? Not sure about your genome, but often one can get MAF file where multiple regions of A map to multiple regions of B.

ADD REPLY
3
Entering edit mode

what do you mean by 'projection of coordinates' to the new contigs and in what way is that different than mapping the reads to the new contigs?

ADD REPLY
1
Entering edit mode

this would be easier if there were clear rules about dealing with ambiguities - multiple hits and converging hits

ADD REPLY
0
Entering edit mode

Genome A and genome B are too different to re-align the short reads from A straight to genome B.

ADD REPLY
0
Entering edit mode

@Jeremy Leipzig: Supposing one has already dealt with most of them using "bwa aln" to map the short reads and "bwa bwasw" to map the contigs, I would be happy to just copy the whole lot on the other coordinates.

ADD REPLY
0
Entering edit mode

@avilella, if you can write out the specifics of how you want this method to work then I can probably implement it for you (most likely in R) and get this question answered within 6 days.

ADD REPLY
0
Entering edit mode

I've now added an example dataset using lastz for mouse contigs against human chromosomes. From the lastz soft sam I created a bam. In parallel, I created the bam pileup of the mouse reads into the mouse contigs.

ADD REPLY
4
Entering edit mode
12.8 years ago
brentp 24k

UCSC has a tool called liftOver that may be of help. You'll have to create a .chain file that maps regions from one genome to the other. Here is an explanation of how to do this. Briefly, you use BLAT to find the regions of similarity that are used to make the alignments.

I'm not sure that this will do better than re-mapping the reads, but it seems to be what you are requesting.

After you have the chain file, then you can use liftOver on your BAM coordinates to map them to the new genome.

ADD COMMENT
2
Entering edit mode
12.8 years ago

if I understood correctly your point, since remapping reads against genome B is not an option, then brentp's solution is the one I would follow: if you need to project positions from one genome to other the best available tool nowadays is UCSC's LiftOver. if this solution is not what you are looking for then I would go for a combined approach of what most of us think you should do (remapping) and what it seems you want to do (project coordinates).

if you already have your genome A assembled contigs, I would split the BAM file into several BAM files containing only the reads of each contig (1 BAM file per contig). then I would project each contig start and end position into genomeB's reference, generating several sections of the entire reference (1 fasta reference of genomeB per genomeA aligned contig). finally I would pair each contig files (genomeA BAM file and genomeB fasta reference) and remap each genomeA contig reads against the associated genomeB projected reference.

in summary, if genomeA and genomeB are as similar as you expect, then forcing all reads from a genomeA contig to map its corresponding genomeB reference would retain a similar coverage, and you would end up with several genomeB BAM files which you could eventually merge into a single final genomeB BAM file containing all original reads now mapped as desired.

ADD COMMENT
2
Entering edit mode
12.8 years ago
Michael 54k

Since I read your question for the first time, the feeling was bothering me that the approach you want to take is invalid. I have been thinking about a solution since then and now see the need to supply an answer. However, that might not be the answer you would like to have, some critical points also due to missing information in your question.

In one sentence: you will make up (fake) read alignments, indirectly claiming that reads map in a position of genome B, while they in fact don't align there and could only be aligned to genome A.

First, I have to question your statement:

Genome A and genome B are too different to align the short reads from A straight to genome B, but close enough that I can align most of the thousands of assembled contigs from genome A to genome B

How can you possibly claim that with only trying bwa? And what is the real distance between those genomes, what is the sequence identity, and what is the degree of genome rearrangements/syntheny?

One of the core problems of your approach is using bwa to align the reads. bwa has low sensitivity, especially when it comes to more different sequences. So, try a more sensitive aligner first to map the reads directly, eg. bfast, blat, mosaic, novoalign, lastz, or try LAST (thanx to darked's comment).

Why is the mapping approach invalid? There are two approaches I could think of:

  • globally align whole contigs of genome A against Genome B (pretty much treating it like a large short read), yielding one mapping coordinate, ambiguity asside, for the contig of A in genome B.

  • locally align the contigs such that partial contigs of A, with minimum sequence identity cut-off, within a contig can be aligned to a region in B, yielding possibly a set of genomic coordinates in B per contig.

Both approaches have in common that they don't deal with ambiguities, and that is a big deal.

Then, I claim that approach 1 in invalid and will create bogus mapped coordinates of reads, assigning them to regions where definitely do not align.

Only one example of what could go wrong: Imagine the following situation, a contig of genome A, consisting of three regions R1, R2, R3 of high sequence conservation (think of that for example as three highly conserved genes). All these regions are present in genome B as well (R1', R2', R3'), but due to a rearrangement, R2' is replaced by R4' and relocated to a different position. Now, following your relatively low sequence-identity threshold R1,R2,R3 -> R1',R4',R3'. Now for genome A, you got a lot of reads for R2, which now will be assigned to R1',R4',R3' but they will be assigned to region R4', where they definitely do not belong to.

Based on the above, the second approach would be much better, that would be best accomplished by a bidirectional best blast hit between the genomes (with eg. identity cut-off 90%), and then mapping those regions. Would also need to treat ambiguities somehow, and again bwa is not the tool of choice. However, if stretches of high sequence similarity existed, then the reads would possibly also align there directly.

Conclusion: Before thinking about how to technically perform an analysis, we have to think more about the validity of the approach. Lacking some important pieces of information, the above couldn't be the final word there. I consider mapping read alignments between distant genomes problematic. The risk is to build up the impression (or tricking yourself into believing) that there exist alignments where in fact they couldn't be aligned, and that ambiguities go unnoticed. Only alignments that can be found and proven are valid alignments, if your reads don't align, yes, then they don't align. Try more sensitive tools to align reads directly.

ADD COMMENT
1
Entering edit mode

No, I tried lastZ, but ofc had to stripp the quality information converting to fasta, as a downside.

ADD REPLY
1
Entering edit mode

A few points. 1) I second that bwa is not sensitive enough for cross-species alignment, at least under the default. 2) No one is going to align contigs "globally". That is simply impossible most of time. 3) Most of large-scale changes are easier to detect from whole-genome alignment then from short reads. 4) In general, Albert takes the right strategy: aligning contigs and then converting the short read coordinates, which, if done correctly, should be better than directly mapping short reads. Nonetheless, there is no convenient tool for such an application.

ADD REPLY
0
Entering edit mode

Great answer. Short question about alternative mappers: did you use lastz for Illumina fastq reads mapping, or iz it a typo and it should be LAST (http://last.cbrc.jp/). BTW, I am using the later.

ADD REPLY
0
Entering edit mode

A few points. 1) I second that bwa is not sensitive enough for cross-species alignment, at least under the default. 2) No one is going to align contigs "globally". That is simply impossible most of time. 3) Most of large-scale changes are easier to detect from whole-genome alignment then from short reads. 4) In general, Albert takes the right strategy: aligning contigs and then converting the short read coordinates, which should be better than directly mapping short reads. Nonetheless, there is no convenient tool for such an application.

ADD REPLY
0
Entering edit mode

I've now added an example dataset using lastz for mouse contigs against human chromosomes. From the lastz soft sam I created a bam. In parallel, I created the bam pileup of the mouse reads into the mouse contigs.

ADD REPLY
0
Entering edit mode

@lh3, regarding 2) I had the impression that he just wanted to do this and that would be not a good idea, using bwa also for aligning the contigs, I read this out of one of avilella's comments. regarding 4) which albert, or do you mean avilella?

ADD REPLY
1
Entering edit mode
12.8 years ago
Darked89 4.6k

You did not provided info about the reads you are mapping (Illumina paired?), their lengths and the genomes (no exact species needed, but it may help if one knows if these are i.e. compact Eukaryote, complex mammals, or some even more exotic combination).

For inter-mammalian short read mapping you may look at LAST last.cbrc.jp): http://last.cbrc.jp/doc/last-tutorial.txt (see Example 8: Align aardvark fastq reads to the human genome). That way you may (or not...) sensibly align reads from A to B.

ADD COMMENT
0
Entering edit mode

I've now added an example where I first assembled short reads from species A in contigs in species A, cross-align contigs in species A to species B using LASTZ, and also pile up all the reads from species A onto the assembled contigs in species A. I would now want to do the coordinate projection from one bam to another as described in the example.

ADD REPLY

Login before adding your answer.

Traffic: 2571 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6