"Simplifying" a very fragmented genome for visualisation?
2
1
12 months ago
setschmann ▴ 10

Hi there,

I have a very fragmented reference genome that I want to visualize (with JBrowse or similar).

This is the paper for the genome: link

The genome can be found here: link2

Here are the QUAST statistics:

########
QUAST Results
########

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    Abal.1_1
# contigs (>= 0 bp)         37192295
# contigs (>= 1000 bp)      1276678
# contigs (>= 5000 bp)      529013
# contigs (>= 10000 bp)     343016
# contigs (>= 25000 bp)     145508
# contigs (>= 50000 bp)     46234
Total length (>= 0 bp)      18167382048
Total length (>= 1000 bp)   13017811908
Total length (>= 5000 bp)   11361640463
Total length (>= 10000 bp)  10034318481
Total length (>= 25000 bp)  6872368770
Total length (>= 50000 bp)  3406852776
# contigs                   1887964
Largest contig              297427
Total length                13450974050
GC (%)                      38.76
N50                         25814
N75                         9780
L50                         139726
L75                         348468
# N's per 100 kbp           1703.76


Any help on how I can wrangle this beast into something presentable would be appreciated.

I tried a cutoff to get rid of the contigs < 1000 bp, which helped, but is there a way to rescaffold or similar?
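To put the cutoff in numbers, here is a small sketch that uses only the figures from the QUAST report above. The dictionary layout and the variable names are my own; nothing here is QUAST output.

```python
# Counts and total lengths copied from the QUAST table above.
# Keys are the minimum contig length (bp) of each row.
quast = {
    0:     {"contigs": 37192295, "bases": 18167382048},
    1000:  {"contigs": 1276678,  "bases": 13017811908},
    5000:  {"contigs": 529013,   "bases": 11361640463},
    10000: {"contigs": 343016,   "bases": 10034318481},
    25000: {"contigs": 145508,   "bases": 6872368770},
    50000: {"contigs": 46234,    "bases": 3406852776},
}

total = quast[0]

# Fraction of sequences and of bases that survive each cutoff.
retained = {
    cutoff: (
        row["contigs"] / total["contigs"],
        row["bases"] / total["bases"],
    )
    for cutoff, row in quast.items()
}

for cutoff, (frac_contigs, frac_bases) in sorted(retained.items()):
    print(f">= {cutoff:>6} bp: {frac_contigs:7.2%} of contigs, "
          f"{frac_bases:7.2%} of bases")
```

The point of the comparison is that a length cutoff removes far more sequences than bases, which is why it helps a browser so much while keeping most of the assembly.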

fasta reference visualisation genome • 859 views
1

I know plant genomes are messed up, but more than 3 billion contigs longer than 50 kb? 18 billion contigs in total?? That would put an estimate of genome size in the order of 1e14 base pairs or more. Definitely not a great assembly, if I'm not missing something important.

Edit: the paper actually mentions 37 million scaffolds, for a total of 18 Gb, so maybe it would be better to start from there? (That is, scaffolds instead of contigs.)

2
12 months ago

Unless you have data that the original authors did not have, you won't be able to re-scaffold (that is, only new or better data will let you improve the contiguity).

Filtering on length as you did is likely the best way forward (though crude indeed). Alternatively, you can focus on the scaffolds/contigs that contain genes (or that have RNA-seq data aligned to them).
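A minimal sketch of the length filter, in plain Python with no external libraries. The function names, the default `min_len=1000` and the 60-column line width are my own choices for illustration. It streams the file one record at a time, which matters for an assembly of this size.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_handle, out_handle, min_len=1000, width=60):
    """Write only records of at least min_len bases; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(in_handle):
        if len(seq) >= min_len:
            out_handle.write(f">{header}\n")
            # Re-wrap the sequence to a fixed line width.
            for i in range(0, len(seq), width):
                out_handle.write(seq[i:i + width] + "\n")
            kept += 1
    return kept


# Tiny in-memory demo; with real data, pass handles from open(...) instead.
demo_in = io.StringIO(">a\nACGT\n>b\nACGTAC\nGTAC\n")
demo_out = io.StringIO()
n_kept = filter_by_length(demo_in, demo_out, min_len=5)
```

On real data you would call it with `open("assembly.fa")` and `open("filtered.fa", "w")` (hypothetical file names).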

0

I was thinking maybe cd-hit could help by clustering similar regions? Or is this the wrong program to use?

How could I find the scaffolds/contigs that have genes? Doing it manually is nigh impossible with the amount of data.

0

If you have an annotation at hand (which, from what I read, you do), then just look up all the sequence names in the GFF file or similar and go with those contigs/scaffolds.

Alternatively, run a blastx of the assembly against a protein DB. It's a bit crude, I know, and will take some time as well.
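The GFF lookup could be sketched like this. It assumes a standard tab-separated GFF with the sequence name in column 1 and the feature type in column 3, and that the first word of each FASTA header matches the GFF sequence name. The function names and the demo data are hypothetical.

```python
import io


def gene_seqids(gff_handle, feature_type="gene"):
    """Collect the names of sequences that carry at least one feature_type record."""
    ids = set()
    for line in gff_handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence section of a GFF3 file, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] == feature_type:
            ids.add(cols[0])
    return ids


def subset_fasta(in_handle, out_handle, keep_ids):
    """Copy only the records whose name (first word of the header) is in keep_ids."""
    writing = False
    kept = 0
    for line in in_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            seqid = fields[0] if fields else ""
            writing = seqid in keep_ids
            if writing:
                kept += 1
        if writing:
            out_handle.write(line)
    return kept


# Tiny in-memory demo.
demo_gff = io.StringIO(
    "##gff-version 3\n"
    "ctg2\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1\n"
    "ctg3\tsrc\texon\t1\t4\t.\t+\t.\tID=e1\n"
)
demo_fa = io.StringIO(">ctg1\nAAAA\n>ctg2 some description\nCCCC\n>ctg3\nGGGG\n")
demo_out = io.StringIO()

ids = gene_seqids(demo_gff)
n_kept = subset_fasta(demo_fa, demo_out, ids)
```

If the annotation uses a different top-level feature type (for example `mRNA`), pass it as `feature_type`.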

1
12 months ago

You probably won't be able to improve this assembly without adding long reads, e.g. PacBio HiFi or long Nanopore reads. That would be expensive.

cd-hit is not relevant here.

If you want to perform annotation (though it is probably already available), look at tools like MAKER2.

There are people who like to combine all of these unplaced contigs into a sort of chr0. I'm a little doubtful about doing this with someone else's (public) data though.

i.e.

>chr0
contig1 - 250Ns -contig2 - 250Ns -contig3.


This might give you some performance improvements in JBrowse. Otherwise, I'd just make sure the contigs have easy names like contig1, 2, 3 etc to allow easy browsing.
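A rough sketch of the chr0 idea, assuming the contigs come as (name, sequence) pairs and the annotation is GFF with 1-based coordinates. The function names are made up. It keeps an offset table so that features can be shifted onto chr0 as well. It also holds the whole pseudo-chromosome in memory, so treat it as an illustration rather than something to run on 13 Gb as is.

```python
def build_chr0(records, gap=250):
    """Join contigs with runs of N.

    Returns (chr0_sequence, offsets), where offsets[name] is the 0-based
    start of that contig inside chr0.
    """
    parts, offsets, pos = [], {}, 0
    spacer = "N" * gap
    for name, seq in records:
        if parts:  # put a spacer before every contig except the first
            parts.append(spacer)
            pos += gap
        offsets[name] = pos
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), offsets


def lift_gff_line(line, offsets, new_name="chr0"):
    """Shift one GFF feature line from contig coordinates to chr0 coordinates."""
    cols = line.rstrip("\n").split("\t")
    shift = offsets[cols[0]]
    cols[3] = str(int(cols[3]) + shift)  # start (1-based, inclusive)
    cols[4] = str(int(cols[4]) + shift)  # end (1-based, inclusive)
    cols[0] = new_name
    return "\t".join(cols)


# Tiny demo with a 3 bp gap so the result is easy to read.
chr0, offsets = build_chr0([("contig1", "ACGT"), ("contig2", "GG")], gap=3)
lifted = lift_gff_line("contig2\tsrc\tgene\t1\t2\t.\t+\t.\tID=g1", offsets)
```

In the demo, `contig2` starts at 0-based position 7 of chr0, so a feature at 1..2 on the contig ends up at 8..9.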

0

Everything's there: annotation, genes and everything, a complete reference genome. Is there an easy way to rename the contigs?

And a quick question about CD-HIT-EST, which is essentially cd-hit for DNA. I'm a little confused: why is it not applicable here?

0

If you rename the contigs, you'll have to rename the annotation file contig names too. Maybe use sed, but this might be complicated.
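As an alternative to sed, one option is to rename the FASTA first, keep the old-to-new mapping, and apply the same mapping to column 1 of the GFF. A sketch under those assumptions (function names and the `contig` prefix are my own; only column 1 and `##sequence-region` lines are touched):

```python
import io


def rename_fasta(in_handle, out_handle, prefix="contig"):
    """Rename records to prefix1, prefix2, ... and return {old_name: new_name}."""
    mapping = {}
    for line in in_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            old = fields[0] if fields else ""
            new = f"{prefix}{len(mapping) + 1}"
            mapping[old] = new
            out_handle.write(f">{new}\n")
        else:
            out_handle.write(line)
    return mapping


def rename_gff(in_handle, out_handle, mapping):
    """Apply the same mapping to the sequence names in a GFF file."""
    for line in in_handle:
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) >= 2:
                parts[1] = mapping.get(parts[1], parts[1])
            out_handle.write(" ".join(parts) + "\n")
        elif line.startswith("#") or not line.strip():
            out_handle.write(line)  # other comments and blank lines pass through
        else:
            cols = line.split("\t")
            cols[0] = mapping.get(cols[0], cols[0])
            out_handle.write("\t".join(cols))


# Tiny in-memory demo.
fa_in = io.StringIO(">scaffold_9|x some description\nACGT\n>scaffold_10\nGG\n")
fa_out = io.StringIO()
mapping = rename_fasta(fa_in, fa_out)

gff_in = io.StringIO(
    "##gff-version 3\n"
    "scaffold_10\tsrc\tgene\t1\t2\t.\t+\t.\tID=g1\n"
)
gff_out = io.StringIO()
rename_gff(gff_in, gff_out, mapping)
```

Writing the mapping out to a small two-column file is worth doing too, so the original names can always be recovered.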

Think about what CD-HIT would do to an assembly containing repeats. Why change the assembly and collapse the repeats? Why get rid of them?

I would prefer to select the contigs containing genes for preferential treatment. Having said that, this assembly is going to be tricky to work with, with genes fragmented across small contigs etc.