Since I read your question for the first time, the feeling was bothering me that the approach you want to take is invalid. I have been thinking about a solution since then and now see the need to supply an answer. However, that might not be the answer you would like to have, some critical points also due to missing information in your question.
In one sentence: you will make up (fake) read alignments, indirectly claiming that reads map in a position of genome B, while they in fact don't align there and could only be aligned to genome A.
First, I have to question your statement:
Genome A and genome B are too
different to align the short reads
from A straight to genome B, but close
enough that I can align most of the
thousands of assembled contigs from
genome A to genome B
How can you possibly claim that with only trying bwa? And what is the real distance between those genomes, what is the sequence identity, and what is the degree of genome rearrangements/syntheny?
One of the core problems of your approach is using bwa to align the reads. bwa has low sensitivity, especially when it comes to more different sequences. So, try a more sensitive aligner first to map the reads directly, eg. bfast, blat, mosaic, novoalign, lastz, or try LAST (thanx to darked's comment).
Why is the mapping approach invalid? There are two approaches I could think of:
globally align whole contigs of genome A against Genome B (pretty much treating it like a large short read), yielding one mapping coordinate, ambiguity asside, for the contig of A in genome B.
locally align the contigs such that partial contigs of A, with minimum sequence identity cut-off, within a contig can be aligned to a region in B, yielding possibly a set of genomic coordinates in B per contig.
Both approaches have in common that they don't deal with ambiguities, and that is a big deal.
Then, I claim that approach 1 in invalid and will create bogus mapped coordinates of reads, assigning them to regions where definitely do not align.
Only one example of what could go wrong: Imagine the following situation, a contig of genome A, consisting of three regions R1, R2, R3 of high sequence conservation (think of that for example as three highly conserved genes). All these regions are present in genome B as well (R1', R2', R3'), but due to a rearrangement, R2' is replaced by R4' and relocated to a different position. Now, following your relatively low sequence-identity threshold R1,R2,R3 -> R1',R4',R3'. Now for genome A, you got a lot of reads for R2, which now will be assigned to R1',R4',R3' but they will be assigned to region R4', where they definitely do not belong to.
Based on the above, the second approach would be much better, that would be best accomplished by a bidirectional best blast hit between the genomes (with eg. identity cut-off 90%), and then mapping those regions. Would also need to treat ambiguities somehow, and again bwa is not the tool of choice.
However, if stretches of high sequence similarity existed, then the reads would possibly also align there directly.
Conclusion: Before thinking about how to technically perform an analysis, we have to think more about the validity of the approach. Lacking some important pieces of information, the above couldn't be the final word there. I consider mapping read alignments between distant genomes problematic. The risk is to build up the impression (or tricking yourself into believing) that there exist alignments where in fact they couldn't be aligned, and that ambiguities go unnoticed. Only alignments that can be found and proven are valid alignments, if your reads don't align, yes, then they don't align.
Try more sensitive tools to align reads directly.