Question: Generating Multiple Species Alignment Of Novel Transcripts For PhyloCSF
7.5 years ago, Eric Fournier (Quebec, Canada) wrote:

Short version: How would you go about generating multiple species alignments of novel transcripts from Bos taurus (assembly UMD3.1) with human/mouse/dog for use with PhyloCSF?

Context and what I've tried so far:

Through a sequencing experiment, our lab has identified a large set of new transcripts in Bos taurus. We want to determine whether those transcripts are coding or non-coding. To do so, we thought of using the PhyloCSF software, as was done in Cabili et al., 2011.

To use PhyloCSF, we need to generate a multiple species alignment of our transcripts. However, since these are unknown transcripts, it is impossible to find their inter-species homologs directly. Instead, we set out to generate a genomic multi-species alignment, from which we aimed to extract our regions of interest.

However, I've now spent most of the week banging my head trying to figure out the best way to do this. So far I have:

  1. Obtained pairwise alignments of the bosTau6 assembly of the cow genome to hg19, mm9 and canFam2 from the UCSC Genome Browser download page
  2. Converted those to MAF format
  3. Stitched those pairwise MAF files together using TBA
  4. Tried to extract regions of interest using bx_python
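
For readers following along, the extraction in step 4 amounts to mapping reference coordinates onto alignment columns while skipping gap characters in the reference row. The sketch below is plain Python and is not bx_python's implementation. The names `column_range` and `slice_block` are hypothetical, and it assumes the reference row is on the forward strand with 0-based, half-open coordinates, as in MAF.

```python
def column_range(ref_text, ref_start, want_start, want_end):
    """Return (first_col, end_col) of the alignment columns that cover
    reference positions [want_start, want_end), or (None, None).

    ref_text  -- gapped reference row of one block ('-' marks a gap)
    ref_start -- 0-based start of that row on the reference
    """
    pos = ref_start
    first = end = None
    for col, base in enumerate(ref_text):
        if base == "-":
            continue  # gaps do not consume a reference position
        if want_start <= pos < want_end:
            if first is None:
                first = col
            end = col + 1
        pos += 1
    return first, end


def slice_block(rows, ref_name, ref_start, want_start, want_end):
    """Cut every row of a block down to the columns overlapping the interval.

    rows -- dict mapping source name to gapped text, all of equal length
    Returns a new dict, or None if the block does not overlap the interval.
    """
    first, end = column_range(rows[ref_name], ref_start, want_start, want_end)
    if first is None:
        return None
    return {name: text[first:end] for name, text in rows.items()}
```

For a multi-exon transcript, one would presumably slice each exon separately and concatenate the per-species strings in transcript order.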

Currently, bx_python crashes when I try to index my MAF file saying it cannot fit a range of 0..148,823,899 into a bin of 4681. This is the length of a whole chromosome, so I'm guessing my MAF file must be broken. wc -L gives me a maximum line size of 158,337,101, which I am pretty sure isn't normal.
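
One way to check whether a handful of blocks span chromosome-sized ranges is to scan the `s` lines and report any whose size field is implausibly large. This is a minimal sketch in plain Python. The function name `oversized_rows` and the 1,000,000 bp default threshold are arbitrary assumptions, not anything bx_python prescribes.

```python
def oversized_rows(lines, max_size=1_000_000):
    """Yield (block_number, src, start, size) for each MAF 's' line whose
    ungapped size exceeds max_size.

    lines -- any iterable of text lines, e.g. an open file handle
    """
    block = 0
    for line in lines:
        # Split off only the leading fields so the (possibly huge)
        # sequence text is not chopped into extra pieces.
        fields = line.split(None, 4)
        if not fields:
            continue
        if fields[0] == "a":
            block += 1  # every 'a' line opens a new alignment block
        elif fields[0] == "s":
            # Layout: s src start size strand srcSize text
            src, start, size = fields[1], int(fields[2]), int(fields[3])
            if size > max_size:
                yield block, src, start, size
```

Passing an open file handle for the stitched MAF and printing each tuple shows which blocks, and which species rows within them, account for the very long lines.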

I'm going to keep trying to figure out where I went wrong, but I would be grateful for any suggestions of alternative data sources, tools or pipelines for generating my multi species alignments.

7.5 years ago, Repineme wrote:

Get the FASTA sequence of your genomic regions from Galaxy and use the stitchMAFblocks function to extract the 49 mammals MAFs. Then you can use PhyloCSF to process the data. But there is a catch: a few species names in PhyloCSF contain typos and will throw errors. You can correct them easily.

Otherwise, you can simply use the infamous CPC: http://cpc.cbi.pku.edu.cn.
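
Whichever route produces the alignment, the sequence names generally have to be translated into the species labels that PhyloCSF looks up. The sketch below assumes PhyloCSF is given a FASTA alignment whose headers are species names. The function name `to_phylocsf_fasta` is hypothetical and the labels in `NAME_MAP` are assumptions that should be checked against the parameter set actually used.

```python
# ASSUMED labels: verify them against the species list of the PhyloCSF
# parameter set you run with; they are not taken from this thread.
NAME_MAP = {"bosTau6": "Cow", "hg19": "Human", "mm9": "Mouse", "canFam2": "Dog"}


def to_phylocsf_fasta(rows, name_map):
    """Format aligned rows as FASTA, one record per species.

    rows     -- list of (source, gapped_sequence) pairs, reference first
    name_map -- assembly prefix -> species label
    """
    records = []
    for source, seq in rows:
        assembly = source.split(".", 1)[0]  # 'hg19.chr3' -> 'hg19'
        if assembly not in name_map:
            # Fail early instead of letting a mismatched name through.
            raise KeyError(f"no species label for {assembly!r}")
        records.append(f">{name_map[assembly]}\n{seq}")
    return "\n".join(records) + "\n"
```

Raising on an unknown assembly keeps a misspelled or unmapped name from reaching the downstream tool unnoticed.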


"Infamous" CPC? What's the story there?

written 7.5 years ago by Eric Fournier

CPC doesn't do multiple alignment. Instead, it BLASTs your genomic region against the known regions in a BLAST database, then calculates the ORF and gives a score that separates noncoding from coding regions.

written 7.5 years ago by Repineme

6.6 years ago, jgwang wrote:

Do you guys know how to prepare the multiple alignment now? If you know, please tell me. Thanks a lot!


Here's what I ended up doing, in a nutshell:

  1. Get species-to-species alignments from the UCSC Genome Browser (cow to mouse, cow to dog, cow to human, etc.) in axt format.
  2. Convert from axt to MAF format using the axtToMaf tool from the kent source tree (again, from the UCSC Genome Browser).
  3. Split all alignments into their chromosome parts and fix up the sequence names so they fit the format Multiz expects.
  4. Run multiz iteratively to stitch the alignments together.
  5. Extract intervals using bx-python.
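
Step 3 above can be sketched roughly as follows. This is not the author's script. It assumes pairwise MAF blocks whose first `s` line is the target (cow) and whose second is the query, with bare names such as `chr1`, and it assumes the expected name format is the usual `assembly.chromosome` MAF convention. The function name `prefix_and_split` is hypothetical.

```python
from collections import defaultdict


def prefix_and_split(lines, target_db, query_db):
    """Group pairwise MAF blocks by target chromosome and prefix the
    source names with their assembly (e.g. 'chr1' -> 'bosTau6.chr1').

    Returns {target_chrom: [block_text, ...]}.
    """
    by_chrom = defaultdict(list)
    block, chrom = [], None

    def flush():
        # Store the block collected so far, if it had a target row.
        if chrom is not None:
            by_chrom[chrom].append("\n".join(block) + "\n")

    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            continue  # file header and comments
        fields = line.split(None, 2)
        if not fields or fields[0] == "a":
            flush()  # a blank line or a new 'a' line ends the block
            block, chrom = ([line] if fields else []), None
        elif fields[0] == "s":
            first = chrom is None
            if first:
                chrom = fields[1]  # first 's' row is the target
            db = target_db if first else query_db
            block.append(f"s {db}.{fields[1]} {fields[2]}")
    flush()
    return dict(by_chrom)
```

Each list in the result can then be written to its own per-chromosome file, preceded by a `##maf version=1` header line, before the stitching step.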

It ended up being painful and complicated, but the results were good enough for my purposes. If you want, I can supply you with the scripts I used so you can get a better idea of exactly what I did.

written 6.6 years ago by Eric Fournier

Thanks for your reply. But I have my own transcripts. Do you think I should construct my own pairwise alignment (transcript vs. genome)? I'd like the scripts you used. My email is jgwang@mix.wvu.edu. Thank you very much.

written 6.6 years ago by jgwang