I have chloroplast assemblies in the contig stage from about 90 individuals. I want to extract CDS from all of them for phylogenetic analysis, and I've tried a few different methods with varying success. I have a closely-related reference plastid genome that is annotated. I want to see if anyone else has some suggestions as to the best way to accomplish this.
The tool PGA was recommended to me for annotation the assemblies, but it will not accept a multifasta as input. I have to "fuse" the contigs somehow for this tool to work, which I've tried to do a couple different ways described below. Once I have assemblies with just a single sequence, I have no problems annotating them and getting the extracted CDS.
- I used minimap2 to align the contigs to the reference, then extract them in the correct order (I've also tried mauve contig mover for this). Once I get the contigs in the correct order, I use BBmap fuse.sh to "fuse" the contigs in the correct order. This script has the ability to add a given number of "Ns" in between each fused contig. I chose to add 5 between each contig.
- I used ABACAS from the PAGIT pipeline, which again aligns the contigs to the reference genome, but it will also output a single fasta sequence consisting of the ordered/fused contigs similar to my previous method. The difference here is ABACAS inserts a number of Ns based on the number of gaps in the reference alignment, which at first sounded like exactly what I wanted. The other VERY FRUSTRATING thing that ABACAS does is insert 100 N's if a contig overlaps. I have not found an options that disables this feature.
With both of these methods, the Ns they generate end up in some of the CDS--not enough to throw out the assemblies, but too much to go through and check on manually. When I generate alignments for the CDS, I can see some instances where the 5 "N"s from the BBmap fuse script end up creating a 5 nucleotide gap in all of the other sequences that should not be there. The ABACAS method sounds like the best option, but what it does with the overlapping contigs seems like it is bound to cause issues as well. Neither of the two methods really does what I want with overlapping contigs anyway.
(As an aside: One way of dealing with the ABACAS-generated sequences that sounded appealing was using something like IMAGE or Gapfiller, but there is no manual available for IMAGE and all links I have found to Gapfiller are dead.)
What I think I want is something similar to what you can do in Geneious: essentially just align the contigs to the reference genome, transfer annotations, and generate a consensus sequence from the alignment. I might do this if I had 10 samples instead of 90.
I guess my main questions boil down to:
- Are there any better tools/methods to generate a single sequence from multiple contigs?
- Am I going at this the wrong way? Should I look for an annotation and/or annotation transfer tool that can work with contigs?
Sorry if my explanation of things are unorganized and rambling. I have been working with this data for a long time and it has given me trouble in one way or another the whole time, which has been quite frustrating for a novice to bioinformatics like me. Thanks for the help and let me know if there is anything I can clarify.