I am looking for ways to create the "chain" file needed to convert genome coordinates from one genome assembly to another. Malachi Griffith wrote an excellent summary, Converting Genome Coordinates From One Genome Version To Another, but most of the tools it covers need a chain file (the file that describes the pairwise alignments between the two genomes). How can I create this file? Alternatively, is there a tool that can transfer coordinates starting from just the two genomes and a file of annotated features (e.g. BED or GFF)? Thanks!
The post you mention includes flo (https://github.com/wurmlab/flo), which I recommend; it can also lift over GFF files.
Please use the search function.
The above post (answered by Pierre Lindenbaum) will bring you to this link. If you look into it, you will find an automated script that generates a chain file for you.
This is outdated and difficult to get working, so I don't recommend it. Instead, I'd recommend flo (https://github.com/wurmlab/flo), as @jean.elbers suggested. Just beware: it can be quite CPU- and memory-intensive for big genomes, and it takes quite a while.
The UCSC option provides a script that was relevant in 2018. That is a bit too recent for me to deem outdated... furthermore it's a single script.
Sorry, I meant the first page you linked is outdated: "This page is an interesting historical discussion and well worth the read. "
As for the script, I personally found it difficult to follow. I didn't manage to get it working, and a lot of stuff is hard-coded into the script. It's just honestly easier to use flo, but it's always good to give another option.
Does anyone know of a tool that makes the chain files required by all of these programs without relying on UCSC tools? Every program I have found for making a chain file for your own genome uses UCSC dependencies, which you need to pay for if you aren't an academic user.
While Liftoff (https://github.com/agshumate/Liftoff) doesn't make a chain file, it is another program for lifting over annotations, and it is released under the GPL-3.0 license. Please check with your legal department whether you can use it and its dependencies.
I found the post and top answer to be helpful. So, thank you very much!
For example, I downloaded the UCSC executables from here.
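Before running anything, it can help to confirm that the executables are actually on your PATH. This is only a minimal sketch: the helper name missing_tools is hypothetical, and the tool list is my assumption based on the commands used in this answer.

```shell
# Print the name of every requested tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of any UCSC package.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Assumed tool list: the four executables used in the steps of this answer.
missing=$(missing_tools faToTwoBit twoBitInfo blat axtChain)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
fi
```

If nothing is printed, all four tools were found.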
I then followed the minimal instructions, which I found I could further modify (for single-chromosome sequences that were each less than 500,000 bp):
# prepare files: convert each FASTA to 2bit and record the sequence sizes
cd "$ID1"
faToTwoBit "$ID1.fa" "$ID1.2bit"
twoBitInfo "$ID1.2bit" chrom.sizes
cd ..
cd "$ID2"
faToTwoBit "$ID2.fa" "$ID2.2bit"
twoBitInfo "$ID2.2bit" chrom.sizes
cd ..
# create .chain file: align ID2 (query) against ID1 (target) with blat, then chain the alignments
blat "$ID1/$ID1.2bit" "$ID2/$ID2.fa" "${ID1}to${ID2}.psl" -tileSize=12 -minScore=100 -minIdentity=98
axtChain -linearGap=medium -psl "${ID1}to${ID2}.psl" "$ID1/$ID1.2bit" "$ID2/$ID2.2bit" "${ID1}to${ID2}.chain"
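As a quick sanity check on the result, something like the following could summarize the chain file. It is a rough sketch, not part of the UCSC instructions: chain_summary is a hypothetical helper, and it assumes the standard chain layout in which header lines start with the word "chain" and each following block line begins with the size of an ungapped aligned block.

```shell
# Summarize a chain file: number of chains and total ungapped aligned bases.
# chain_summary is a hypothetical helper name.
chain_summary() {
    awk '
        $1 == "chain" { chains++; next }   # header line that starts a new chain
        NF >= 1       { aligned += $1 }    # block line: first column is the ungapped block size
        END { printf "chains=%d aligned_bases=%d\n", chains, aligned }
    ' "$1"
}

# Usage (path is a placeholder):
# chain_summary path/to/your.chain
```

An empty file or a total of zero aligned bases would suggest that the blat step found nothing above the chosen score and identity thresholds.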
I was also able to run CrossMap (installed using pip3 install CrossMap) to confirm that the .chain file can be used without generating any error messages:
CrossMap.py gff $CHAIN.gz $GFFIN $GFFOUT
Whether or not CrossMap provided the best conversion could be up for debate, and I am not sure if you might want to change the parameters to generate the .chain file in some circumstances.
However, I think this is enough to show that the custom .chain file generation was successful.
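For a rough idea of how much of the annotation made it through, one option is to compare the number of feature lines before and after the conversion. This is only a sketch under my own assumptions: count_features and lift_report are hypothetical helper names, and "feature line" here simply means any non-empty line that does not start with "#".

```shell
# Count feature lines: non-empty lines that are not "#" comments or directives.
count_features() {
    awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Print the feature counts of the input and output annotation files side by side.
lift_report() {
    in_n=$(count_features "$1")
    out_n=$(count_features "$2")
    echo "input=$in_n output=$out_n"
}

# Usage, with the same variables as the CrossMap command:
# lift_report "$GFFIN" "$GFFOUT"
```

Matching counts do not prove the coordinates are right, but a large drop would be a hint that the chain parameters deserve another look.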