Hello, I would like some guidance on assembling and annotating a eukaryotic genome using only short reads.
Quick background: I'm fairly new to bioinformatics and no one in my lab is a computational person, so I'm really just figuring this out as I go. My project will involve assembling the genomes of a couple of non-model plants. I don't have the plants in my possession yet, so for now I'm practicing with data available from SRA. Conveniently, a team in China has done short-read sequencing on one of the plants I intend to study. Unfortunately, no long-read data is available for it, but I will do long-read sequencing once I actually have my plants.
I assembled the short-read data using ABySS, and the summary stats from Quast and ABySS are decent, I think? N50: 16 kb, L50: 10k, N75: 7 kb, L75: 25k, total length: 627 Mb (close to the expected genome size).
Now that I have this assembly, I would like to do repeat masking and annotation. For masking I want to use RepeatModeler and RepeatMasker, but when I tried to run RepeatModeler, the job finished in minutes, which seems wrong? It also stopped after completing only round 1. I did get a warning that the N50 was 7 kb, which it didn't like; I'm not sure why that N50 differs from the 16 kb reported by Quast and ABySS. Should I go about constructing a repeat library and masking in a different way? I know masking an assembly this fragmented won't be ideal, but I'm not sure whether there's something else I should do.
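For reference, here is a minimal Python sketch of how I understand N50/L50 to be computed, to illustrate one possible reason for the discrepancy: the tools may be applying different minimum-contig-length cutoffs before computing the statistic. That is only my guess, not something I have confirmed for RepeatModeler. The function name `nx_lx` and the contig lengths are made up for illustration and are not from my assembly.

```python
def nx_lx(lengths, fraction=0.5, min_len=0):
    """Return (Nx, Lx) for contigs of at least min_len bp.

    Nx is the length of the contig at which the running total of
    contig lengths (sorted longest first) reaches `fraction` of the
    total; Lx is how many contigs it took to get there.
    """
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    if not kept:
        raise ValueError("no contigs pass the length cutoff")
    target = fraction * sum(kept)
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= target:
            return length, count


# Toy contig lengths in bp (hypothetical): a few long contigs plus
# many short ones, as in a fragmented short-read assembly.
toy = [16000, 12000, 7000, 3000] + [400] * 50

print(nx_lx(toy))               # all contigs counted: (7000, 3)
print(nx_lx(toy, min_len=500))  # short contigs dropped: (12000, 2)
```

In this toy case the same set of contigs gives an N50 of 7 kb or 12 kb depending only on whether contigs under 500 bp are counted, so a difference in cutoffs could in principle explain numbers like the ones I'm seeing.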
For annotation I'm going to use BRAKER2. There is a high-quality assembly of another plant whose last common ancestor (LCA) with mine lived ~80 Mya, so I think that should help a lot.
Any suggestions on how I should move forward are very welcome and appreciated, thanks!