Extracting CDS from fragmented assemblies
2
2
Entering edit mode
3.1 years ago
ebaldwin ▴ 30

I have chloroplast assemblies in the contig stage from about 90 individuals. I want to extract CDS from all of them for phylogenetic analysis, and I've tried a few different methods with varying success. I have a closely-related reference plastid genome that is annotated. I want to see if anyone else has some suggestions as to the best way to accomplish this.

The tool PGA was recommended to me for annotation the assemblies, but it will not accept a multifasta as input. I have to "fuse" the contigs somehow for this tool to work, which I've tried to do a couple different ways described below. Once I have assemblies with just a single sequence, I have no problems annotating them and getting the extracted CDS.

  • I used minimap2 to align the contigs to the reference, then extract them in the correct order (I've also tried mauve contig mover for this). Once I get the contigs in the correct order, I use BBmap fuse.sh to "fuse" the contigs in the correct order. This script has the ability to add a given number of "Ns" in between each fused contig. I chose to add 5 between each contig.
  • I used ABACAS from the PAGIT pipeline, which again aligns the contigs to the reference genome, but it will also output a single fasta sequence consisting of the ordered/fused contigs similar to my previous method. The difference here is ABACAS inserts a number of Ns based on the number of gaps in the reference alignment, which at first sounded like exactly what I wanted. The other VERY FRUSTRATING thing that ABACAS does is insert 100 N's if a contig overlaps. I have not found an options that disables this feature.

With both of these methods, the Ns they generate end up in some of the CDS--not enough to throw out the assemblies, but too much to go through and check on manually. When I generate alignments for the CDS, I can see some instances where the 5 "N"s from the BBmap fuse script end up creating a 5 nucleotide gap in all of the other sequences that should not be there. The ABACAS method sounds like the best option, but what it does with the overlapping contigs seems like it is bound to cause issues as well. Neither of the two methods really does what I want with overlapping contigs anyway.

(As an aside: One way of dealing with the ABACAS-generated sequences that sounded appealing was using something like IMAGE or Gapfiller, but there is no manual available for IMAGE and all links I have found to Gapfiller are dead.)

What I think I want is something similar to what you can do in Geneious: essentially just align the contigs to the reference genome, transfer annotations, and generate a consensus sequence from the alignment. I might do this if I had 10 samples instead of 90.

I guess my main questions boil down to:

  • Are there any better tools/methods to generate a single sequence from multiple contigs?
  • Am I going at this the wrong way? Should I look for an annotation and/or annotation transfer tool that can work with contigs?

Sorry if my explanation of things are unorganized and rambling. I have been working with this data for a long time and it has given me trouble in one way or another the whole time, which has been quite frustrating for a novice to bioinformatics like me. Thanks for the help and let me know if there is anything I can clarify.

Assembly alignment • 1.1k views
ADD COMMENT
0
Entering edit mode

Hi! Are these prokaryotic contigs? If so, please check FragGeneScan.

https://omics.informatics.indiana.edu/FragGeneScan/

ADD REPLY
1
Entering edit mode
3.1 years ago
shelkmike ★ 1.2k

If I were to solve this task, I would align reference proteins to my contigs using tblastn and then extract and join matching regions using custom scripts.

ADD COMMENT
0
Entering edit mode

Also, If you need to perform a phylogenetic analysis, you can do this without genome annotation, using JolyTree

ADD REPLY
1
Entering edit mode
3.1 years ago
Carambakaracho ★ 3.2k

Afaik, this is far from being a standard task and each time I had to do something like this it proofed to be relatively tedious and time consuming. My focus never was phylogenomics, and in addition it's been a while and since then I changed my job, so I the tools/solution might be a bit rusty.

As shelkmike suggested, you could use proteins, or for close enough species map/align transcripts using good old gmap/gsnap. For contig alignment, I used Last (or was it lastz?) (and a colleage then created multialignemnt mafs with mummer as far as I remember). For visualization I used mauve.

If you want to fuse contigs, I'd go with 100 or more Ns, ideally track their coordinates in a bed file. I know you're using contigs, but 5 N stretches are not rare in scaffolded assmeblies. I used a propriatry software for that, so I don't know an open tool, but frankly, that's not too complicated.

Hope that helps

ADD COMMENT
0
Entering edit mode

Thanks for the input. It seems every task I do in bioinformatics is not standard!

I'm considering staying with the fusing contigs method for now since I already have the pipeline to deal with that, and right now I need to move forward with this project and not mess around with custom scripts. I realized after I did it that 5 Ns was too little, but when I was doing that it was more to test that the scripts/rest of the pipeline were working.

ADD REPLY

Login before adding your answer.

Traffic: 2748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6