How do you assemble contigs without using a reference genome?
3.2 years ago
DNAngel ▴ 240

I have done WES for various vertebrate species and obtained contigs by running velveth and velvetg with various hash lengths, with good N50 values. My question now: if I want to try assembling the contigs into a genome without using a reference, what tools are available for non-mammalian vertebrates? I want to annotate all the assembled genomes afterwards to extract all protein-coding genes. Can this step be done without doing anything further with the contigs (i.e. can I feed the contigs.fa files produced by velvet straight into annotation)?

I will also try mapping the contigs to a reference genome with bwa mem, because I want to implement both methods and see how the results vary.

assembly genome • 1.4k views
3.2 years ago
h.mon 34k

Did you do whole genome sequencing (WGS) or whole exome sequencing (WES)? As you said you obtained good N50 values, I assume you performed whole genome sequencing.

When you used velvet, you already assembled (draft) genomes without using a reference. If you only have paired-end Illumina sequencing, there is not much else you can do to improve the quality of the draft except try different assemblers; after all, velvet is really old and no longer developed. Reference-based assemblies may introduce misassemblies, so I wouldn't recommend them unless there are very good reference genomes from very closely related species. In addition to N50 and related metrics, use BUSCO to evaluate the draft assemblies.
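For readers unsure how the N50 mentioned above is derived, here is a minimal sketch in Python. It assumes the usual definition (the length of the contig at which the running total of lengths, sorted longest first, reaches half of the assembly size). The function names `n50` and `fasta_lengths` are illustrative and not part of velvet or any other tool.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half of the total.
    """
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length


def fasta_lengths(path):
    """Yield the length of each record in a FASTA file (e.g. contigs.fa)."""
    length = None  # None until the first header line is seen
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
    if length is not None:
        yield length
```

Calling `n50(fasta_lengths("contigs.fa"))` would report the N50 of one velvet output file. A high N50 alone does not say whether the contigs are correct or complete, which is why a completeness check such as BUSCO is suggested alongside it.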

You can certainly predict protein-coding genes from these drafts, but I expect you will be missing a lot of genes due to the fragmented draft assembly.


I did WES. I wanted to use ALLPATHS-LG, but none of my servers are Linux-based and it was not installing properly. Velvet was working fine, so I don't see why it is a problem even if it's old?


You can't assemble WES data. Have a look in a genome browser at what WES data actually looks like: small covered regions around exons. Exons neither overlap nor sit close enough together to carry spatially relevant information.

And by the way, Linux is really non-optional for bioinformatics. Time to reformat a server, or at least get a virtual machine, or apply for a cluster allocation elsewhere.


But exome data has to be assembled or aligned in some way to find variants (I'm not doing this, but it is the most common use of exome data). Even if there are larger gaps, it should be possible to assemble the reads and obtain exons, which can later be concatenated into coding sequences. With that said, most tools that "assemble" exome sequence data utilize BAM files to simply view the results against a reference sequence to locate variants. My goal is to skip that last step and just obtain the sequences (with the variants included) so I can do other analyses by gene. I find the IGV consensus does not work for me because I need to obtain all genes, not just one specific region of interest.


With that said, most tools that "assemble" exome sequence data utilize BAM files to simply view the results against a reference sequence to locate variants

Mapping the sequencing data and calling SNPs and indels in relation to a reference genome is probably faster than assembling the exome, and is almost certainly more precise and less error-prone.

Assembling the exome will not give you coding genes; it will not even give you clean exons, because the parts of the introns at the exon boundaries will also be captured. Such an assembly will be a big mess of hundreds of thousands of contigs, with no way of knowing which contigs belong to the same gene except by mapping them to an annotated reference genome.


Okay, so now I am more uncertain about what to do because I have heard both sides. I have exome data. I can map the reads to a reference genome using bwa, but then I have to obtain coding sequences. This works when my reference is JUST a coding sequence: the reads are mapped, and after mpileup I can pull out my sequence with no problem. It becomes a problem when I use the entire reference genome, because my scripts then fail to pull out the sequences (they try to pull out one large sequence). That is why I also thought of trying de novo methods to obtain contigs, then mapping those contigs in some way, or at least blasting them against the reference genome to see where they belong.

If I did map my reads to the reference genome and got to the point of having mpileup files, perhaps there is a way to use the .gff file to pull out exon boundaries? I have not worked with that yet, but I am not sure how else to proceed since I cannot find tutorials for my specific task. I can view my BAM files in IGV and I can see the SNPs, but I need sequences (at least by exon, if not by coding sequence, because of intergenic regions).
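The idea of using the .gff file to pull out exon boundaries can be sketched as follows. This is a minimal illustration, not a tested pipeline. It assumes a GFF3 annotation whose CDS features carry a `Parent` attribute, and a per-sample consensus sequence that is still in reference coordinates (SNPs substituted in place, no indels applied, since indels would shift every downstream position). The names `parse_cds` and `coding_sequences` are hypothetical helpers written for this example.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_cds(gff_lines):
    """Collect CDS intervals per parent transcript from GFF3 lines.

    Returns {parent_id: (seqid, strand, [(start, end), ...])} using the
    1-based inclusive coordinates of GFF3.
    """
    cds = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # A CDS may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                cds.setdefault(parent, (seqid, strand, []))[2].append((start, end))
    return cds


def coding_sequences(consensus, cds):
    """Concatenate the CDS pieces of each transcript.

    consensus: {seqid: sequence string in reference coordinates}.
    Pieces are joined in genomic order, and the joined sequence is
    reverse-complemented for transcripts on the minus strand.
    """
    result = {}
    for parent, (seqid, strand, intervals) in cds.items():
        chrom = consensus[seqid]
        joined = "".join(chrom[start - 1:end] for start, end in sorted(intervals))
        result[parent] = reverse_complement(joined) if strand == "-" else joined
    return result
```

With the consensus loaded into a dictionary keyed by chromosome name, `coding_sequences(consensus, parse_cds(open("annotation.gff")))` would return one concatenated coding sequence per transcript, which avoids pulling out one large sequence per chromosome. The file name here is a placeholder.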
