Question

Short Read Data Genome Assembly

2

22 months ago

SomeOne ▴ 240

Hello,

I have recently started working on the genome assembly of a fungus. I have Illumina paired-end short-read data (2x150 bp) downloaded from NCBI. Using this data, I am trying to set up a genome assembly pipeline that can later be used for our upcoming sequencing data.

After going through multiple papers and tutorials, I put together this workflow:

1. FastQC data check and read trimming (if needed)
2. De novo genome assembly using SPAdes (as no reference genome is available) -> contigs.fasta
3. Quality check of contigs.fasta with QUAST and BUSCO
4. RepeatMasking and RepeatModeling
5. Annotation of the assembly
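As a rough illustration of what the quality-check step reports, here is a minimal Python sketch that computes the sequence count, total length, largest sequence and N50 from a FASTA file. It is not QUAST's implementation, just the textbook N50 definition; the function names `read_fasta` and `assembly_stats` and the demo sequences are made up for this example.

```python
import io
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def assembly_stats(lengths: Iterable[int]) -> Dict[str, int]:
    """Return basic contiguity statistics for a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    n50, running = 0, 0
    # N50: length of the sequence at which the running sum of the
    # longest-first lengths reaches at least half of the total.
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "sequences": len(ordered),
        "total_bp": total,
        "largest": ordered[0] if ordered else 0,
        "n50": n50,
    }


if __name__ == "__main__":
    # Tiny in-memory FASTA standing in for contigs.fasta (hypothetical data).
    demo = io.StringIO(">c1\nACGTACGTAC\n>c2\nACGTACGT\n>c3\nACGTAC\n>c4\nACGT\n>c5\nAC\n")
    print(assembly_stats(len(seq) for _, seq in read_fasta(demo)))
```

To run it on a real assembly, replace the `io.StringIO` demo with `open("contigs.fasta")`.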

As every tutorial ends with these steps, my questions are:

1. SPAdes gave me a contigs.fasta file. Is there any method to make scaffolds from this (contigs.fasta) file? Can this be done based on just the Illumina short-read data?
2. Is it necessary to turn contigs -> scaffolds if only short-read data is available, or can contigs.fasta be used for further processing?
3. Are RepeatMasking and RepeatModeling two different steps, or one?
4. Is there any other analysis that should be done?

If you think these are naive questions, just know that I am new to genome assemblies. I am learning and trying to understand the steps that most tutorials and publications don't mention.

spades genome-assembly • 4.8k views

ADD COMMENT • link updated 22 months ago by ccstaats ▴ 40 • written 22 months ago by SomeOne ▴ 240

score 5 · Accepted Answer · 2023-09-06

22 months ago

alex.zaccaron ▴ 480

1. SPAdes also outputs a scaffolds.fasta, which has some contigs arranged into scaffolds. You can use this file for downstream analyses. In general, scaffolding with only short reads does not give big improvements.
2. Not necessary.
3. I am not familiar with RepeatModeling, but both terms should refer to the same step of masking repeats in the genome. For novel species, you will need to identify repeats de novo. RepeatModeler is a good tool for this, but there are other options, like EarlGrey and EDTA.
4. Depends on what you want to do. Usually, the next steps involve gene prediction and annotation.
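To make the repeat step concrete: with RepeatModeler, the usual pattern is to build a de novo repeat library first and then pass it to RepeatMasker. The Python sketch below only assembles and prints the command lines (a dry run) and executes nothing. The flags shown (`BuildDatabase -name`, `RepeatModeler -database`, `RepeatMasker -lib` and `-xsmall`) and the `<db>-families.fa` output name are how I understand those tools to work, so please check them against your installed versions. The database name `my_fungus` is a placeholder.

```python
import shlex
from typing import List


def repeat_commands(genome: str, db_name: str) -> List[List[str]]:
    """Build (but do not run) the commands for de novo repeat masking."""
    # Assumption: RepeatModeler writes its consensus library as <db_name>-families.fa.
    library = f"{db_name}-families.fa"
    return [
        # 1. Index the assembly for RepeatModeler.
        ["BuildDatabase", "-name", db_name, genome],
        # 2. Discover repeat families de novo.
        ["RepeatModeler", "-database", db_name],
        # 3. Soft-mask the assembly (-xsmall) using the new library.
        ["RepeatMasker", "-lib", library, "-xsmall", genome],
    ]


if __name__ == "__main__":
    # "my_fungus" is a hypothetical database name used only for illustration.
    for cmd in repeat_commands("scaffolds.fasta", "my_fungus"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the commands look right for your setup.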

ADD COMMENT • link 22 months ago by alex.zaccaron ▴ 480

1

This should be the answer

ADD REPLY • link 22 months ago by samuel.a.odonnell ▴ 600

0

Thank you. I did get a scaffolds.fasta file, but the number of scaffolds was just 2/3 lower than the number of contigs.
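A quick way to see how much scaffolding actually happened is to count the gaps inside the scaffolds. The sketch below assumes gaps between joined contigs are written as runs of `N`, so each run implies one join. Ambiguous bases inside a contig would also be counted, so treat the result as an approximation. The function names and the demo sequences are made up for this example.

```python
import io
import re
from typing import Dict, Iterator, TextIO, Tuple

# One or more consecutive N/n characters is treated as a single gap.
GAP = re.compile(r"[Nn]+")


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_gaps(sequence: str) -> int:
    """Count runs of N in one sequence."""
    return len(GAP.findall(sequence))


def scaffolding_summary(handle: TextIO) -> Dict[str, int]:
    """Summarise how many joins the scaffolds contain."""
    n_scaffolds = n_gaps = 0
    for _, seq in read_fasta(handle):
        n_scaffolds += 1
        n_gaps += count_gaps(seq)
    # Each gap joins two pieces, so pieces = scaffolds + gaps.
    return {
        "scaffolds": n_scaffolds,
        "gaps": n_gaps,
        "implied_contigs": n_scaffolds + n_gaps,
    }


if __name__ == "__main__":
    # Hypothetical stand-in for scaffolds.fasta.
    demo = io.StringIO(">s1\nACGTNNNNACGT\n>s2\nACGT\n>s3\nAAnnCCNNNGG\n")
    print(scaffolding_summary(demo))
```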

For the last point (4): my initial objective is to compare genome assemblies within species. What I'm planning to do is to create:

Short read assembly (some Samples)
Long Read Assembly (some Samples)
Hybrid assembly (some Samples)

ADD REPLY • link 22 months ago by SomeOne ▴ 240

score 2 · Accepted Answer · 2023-09-06

22 months ago

ccstaats ▴ 40

I would suggest using Funannotate. It is pretty straightforward and does all the work of predicting genes and annotating them. In a previous step, the pipeline can also repeat-mask your genome assembly.

For scaffolding, SPAdes does in fact arrange some contigs into scaffolds. But if you have a reference genome or an assembly from a phylogenetically close organism, consider using RagTag, which is also very useful. Best, Charley

ADD COMMENT • link 22 months ago by ccstaats ▴ 40

0

Yes, as I'm working with a fungal genome, I found that Funannotate is a good way to go for annotation. So far I am following this path:

contigs -> Clean (contigs >= 500bp) -> sort (big to small) -> Mask -> train -> Predict -> update -> Annotate
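For anyone wondering what the first two steps of that path amount to, here is a stand-alone Python sketch of the idea: drop sequences shorter than 500 bp, sort the rest from largest to smallest, and rename them in order. This is not Funannotate's own code (its clean and sort commands may do more than this); the `scaffold_` prefix, the function names and the demo records are assumptions made for illustration.

```python
import io
from typing import Iterable, List, TextIO, Tuple

Record = Tuple[str, str]  # (name, sequence)


def clean_and_sort(records: Iterable[Record], min_len: int = 500,
                   prefix: str = "scaffold_") -> List[Record]:
    """Keep sequences >= min_len, order them longest first, and rename them."""
    kept = [seq for _, seq in records if len(seq) >= min_len]
    kept.sort(key=len, reverse=True)
    # Renaming to short, ordered headers is an assumption for this sketch.
    return [(f"{prefix}{i}", seq) for i, seq in enumerate(kept, start=1)]


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


if __name__ == "__main__":
    # Hypothetical contigs: one too short, two that pass the 500 bp cutoff.
    demo = [("NODE_1", "A" * 400), ("NODE_2", "C" * 700), ("NODE_3", "G" * 500)]
    out = io.StringIO()
    write_fasta(clean_and_sort(demo), out)
    print(out.getvalue().count(">"), "sequences kept")
```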

As far as I understand, the TRAIN step requires a transcript assembly created from RNA-seq data. Correct me if I am wrong here.

ADD REPLY • link 22 months ago by SomeOne ▴ 240

1

You don't need to train if there is a phylogenetically close organism in the Funannotate DB. Please take a look at this tutorial

ADD REPLY • link 22 months ago by ccstaats ▴ 40