Short version: How would you go about generating multiple-species alignments of novel transcripts from Bos taurus (assembly UMD3.1) against human, mouse and dog for use with PhyloCSF?
Context and what I've tried so far:
Through a sequencing experiment, our lab has identified a large set of new transcripts in Bos taurus. We want to determine whether those transcripts are coding or non-coding. To do so, we thought of using the PhyloCSF software, as was done in Cabili et al., 2011.
To use PhyloCSF, we need to generate a multiple-species alignment of our transcripts. However, since these are unknown transcripts, we cannot look up their homologs in other species directly. Instead, we set out to generate a genomic multi-species alignment, from which we aimed to extract our regions of interest.
However, I've now spent most of the week banging my head trying to figure out the best way to do this. So far I have:
- Obtained pairwise alignments of the bosTau6 assembly of the cow genome to hg19, mm9 and canFam2 from the UCSC Genome Browser download page
- Converted those to MAF format
- Stitched those pairwise MAFs together using TBA
- Tried to extract regions of interest using bx_python
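To be concrete about the last step, this is roughly the extraction I am after. It is only a minimal sketch in plain Python rather than bx_python, assuming the standard MAF `s` line layout (`s src start size strand srcSize text`) and source names of the form `bosTau6.chr1`. The function names and the example data are made up for illustration.

```python
def iter_maf_blocks(lines):
    """Yield each alignment block as a list of its 's' lines."""
    block = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "a" or line.startswith("a "):
            # An 'a' line opens a new block; flush the previous one.
            if block:
                yield block
            block = []
        elif line.startswith("s "):
            block.append(line)
    if block:
        yield block


def parse_s_line(line):
    """Split an 's' line into (src, start, size, strand, src_size, text)."""
    _, src, start, size, strand, src_size, text = line.split()
    return src, int(start), int(size), strand, int(src_size), text


def blocks_overlapping(lines, ref_src, region_start, region_end):
    """Yield blocks whose ref_src row overlaps [region_start, region_end).

    Coordinates are zero-based, half-open, on the forward strand.
    """
    for block in iter_maf_blocks(lines):
        for s_line in block:
            src, start, size, strand, src_size, _ = parse_s_line(s_line)
            if src != ref_src:
                continue
            if strand == "-":
                # MAF gives minus-strand starts relative to the reverse
                # complement, so convert to forward-strand coordinates.
                start = src_size - (start + size)
            if start < region_end and start + size > region_start:
                yield block
            break  # only the first row for ref_src is considered


# Made-up toy alignment, not real data.
MAF_EXAMPLE = """##maf version=1
a score=100
s bosTau6.chr1 10 5 + 1000 ACGTA
s hg19.chr3 200 5 + 5000 ACGTT

a score=50
s bosTau6.chr1 500 4 + 1000 GGCC
s mm9.chr2 40 4 - 3000 GGCA
"""

hits = list(blocks_overlapping(MAF_EXAMPLE.splitlines(), "bosTau6.chr1", 12, 20))
for block in hits:
    print("\n".join(block))
```

This only selects whole blocks that overlap a region. It does not trim blocks to the region boundaries or join the exons of a transcript, which I would still need before running PhyloCSF.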
Currently, bx_python crashes when I try to index my MAF file, saying it cannot fit a range of 0..148,823,899 into a bin of 4681. That range is the length of a whole chromosome, so I'm guessing my MAF file is broken. wc -L reports a maximum line length of 158,337,101 characters, which I am pretty sure isn't normal.
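To narrow down which blocks are at fault, something like the following could list every `s` line whose aligned size looks implausibly large. It is a rough sketch in plain Python that assumes the standard MAF `s` line layout (`s src start size strand srcSize text`). The function name, the default threshold and the sample lines are my own placeholders, not anything taken from TBA or bx_python.

```python
def find_oversized_rows(lines, max_size=1_000_000):
    """Return (block_index, line_number, src, size) for oversized 's' lines.

    max_size is an arbitrary cutoff chosen for illustration; a sensible
    value depends on how the alignment was produced.
    """
    oversized = []
    block_index = 0
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line == "a" or line.startswith("a "):
            block_index += 1
        elif line.startswith("s "):
            fields = line.split()
            src, size = fields[1], int(fields[3])
            if size > max_size:
                oversized.append((block_index, line_number, src, size))
    return oversized


# Made-up sample: the second block is far longer than the first.
SAMPLE = [
    "##maf version=1",
    "a score=1",
    "s bosTau6.chr1 0 4 + 100 ACGT",
    "s hg19.chr1 0 4 + 200 ACGT",
    "",
    "a score=2",
    "s bosTau6.chr1 0 100 + 100 " + "A" * 100,
    "s canFam2.chr5 0 100 + 300 " + "A" * 100,
]

for block_index, line_number, src, size in find_oversized_rows(SAMPLE, max_size=50):
    print(f"block {block_index}, line {line_number}: {src} spans {size} bases")
```

On the real file I would pass the open file handle instead of a list, since the function reads one line at a time and never holds the whole file in memory.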
I'm going to keep trying to figure out where I went wrong, but I would be grateful for any suggestions of alternative data sources, tools or pipelines for generating my multi-species alignments.