Question

How to combine de novo and genome-guided assemblies

0

7.0 years ago

Bioinfonext ▴ 460

Dear All,

I am working on a plant species for which a draft genome sequence is available. I have done de novo and genome-guided assembly separately. Please suggest how I can combine the two assemblies to generate a reference transcriptome for raw read counting.

Should I run blastn between them to remove overlapping sequences, or something else? Should I assemble the two transcriptome assemblies further?

Thanks

RNA-Seq • 3.9k views

ADD COMMENT • link updated 7.0 years ago by Rohit ★ 1.5k • written 7.0 years ago by Bioinfonext ▴ 460

0

Transcriptome? You can make hybrid assemblies using Trinity, IDBA_hybrid, or even SPAdes from both of your attempts, or using the draft genome as a trusted reference.

ADD REPLY • link 7.0 years ago by Buffo ★ 2.4k

0

Yes, it is RNA-seq data.

ADD REPLY • link 7.0 years ago by Bioinfonext ▴ 460

0

You want to assemble RNAseq reads? Why?

ADD REPLY • link 7.0 years ago by Joe 21k

0

Actually, a draft genome sequence is available, but it does not contain all gene sequences. So I have done a genome-guided assembly using StringTie and a de novo assembly using Trinity.

Now, to make a complete set of reference gene sequences for raw read counting, I want to remove the gene sequences that overlap between these two assemblies so that I have non-redundant gene sequences.

ADD REPLY • link 7.0 years ago by Bioinfonext ▴ 460

1

Why have you assembled RNAseq reads? What are you trying to do? All the short-read sequencing in the world is never going to allow you to close a genome and get a completed sequence.

If you want to get raw read counts from your RNAseq you should be mapping the reads (e.g. with bwa, bowtie2 etc) to the existing reference (or a reassembly if you have access to the original sequencing) and then calculating the raw read counts from the alignment map, not assembling.
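As a rough illustration of the "counts from the alignment map" step, here is a minimal Python sketch that tallies primary mapped reads per reference sequence from SAM text. The function name `count_reads_per_reference` and the toy records are made up for illustration, and it counts each mate of a pair separately; for real data a dedicated counting tool such as featureCounts or htseq-count is the usual choice.

```python
from collections import Counter


def count_reads_per_reference(sam_lines):
    """Count primary, mapped alignments per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        counts[rname] += 1
    return counts


if __name__ == "__main__":
    # Toy SAM records (hypothetical read and transcript names).
    toy_sam = [
        "@SQ\tSN:tx1\tLN:1000",
        "r1\t0\ttx1\t10\t60\t50M\t*\t0\t0\t*\t*",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    print(dict(count_reads_per_reference(toy_sam)))
```

With a real alignment file you would pass an open file handle (for example the text output of `samtools view -h`) instead of the toy list.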

ADD REPLY • link 7.0 years ago by Joe 21k

0

Thanks. I am not expecting to get all gene sequences, but at least to retrieve those transcripts that are present in my transcriptome data and may have an important functional role but are not present in the currently reported CDS sequences.

ADD REPLY • link 7.0 years ago by Bioinfonext ▴ 460

0

So you are looking for untranslated genome features? sRNAs, pseudogenes etc?

ADD REPLY • link 7.0 years ago by Joe 21k

0

I am looking for protein-coding gene sequences which are not present in the current annotated CDS. Yes, I want to remove redundancy and select the longest transcript from both the genome-guided and de novo assembled transcripts.

ADD REPLY • link 7.0 years ago by Bioinfonext ▴ 460

0

I'm genuinely curious, what is the problem with assembling RNA-seq reads?

ADD REPLY • link 7.0 years ago by cschu181 ★ 2.8k

0

They can potentially be assembled, they are just short reads after all, but why would you want to? If you have a region of no transcription, you won't reverse transcribe any cDNA to be sequenced from that region of the genome in the library prep. If there are no reads there, then the assembler will have to terminate the contig there as there will be no more read sequences to overlap. Even in the best case, you'll have very different coverage at intergenic and genic regions. You'd end up with an assembly but it would probably be full of short contigs so it'd be pretty shitty.

ADD REPLY • link 7.0 years ago by Joe 21k

0

That sounds like an issue with a genome assembly from RNA-seq reads, which I would totally support would be bonkers. But a transcriptome assembly (which seems to be what OP wants to do) should be fine with RNA-seq, no?

ADD REPLY • link 7.0 years ago by cschu181 ★ 2.8k

0

When you say 'assembling a transcriptome' though, what exactly do you mean? Maybe it's just a syntactic difference, because when I hear the word 'assembly' I take that to mean literally using an assembler. If you want to do transcriptomics from RNAseq via mapping though, then sure! I think we possibly just misunderstand one another when we're using the word 'assemble' in the context of RNAseq/transcriptomics.

ADD REPLY • link 7.0 years ago by Joe 21k

0

In this case I was assuming de novo, e.g. using Trinity, Velvet/Oases, IDBA-Tran etc. So yes, literally using an assembler.

ADD REPLY • link 7.0 years ago by cschu181 ★ 2.8k

0

I've never done it personally, as I've always had a reference genome to map against so I'm not sure of the use case. Particularly in this case as the OP said there was a reference for the organism too.

It's possible I'm misunderstanding the question, as I've personally never come across the need to do transcriptome assembly (instead of just mapping etc. to a reference).

ADD REPLY • link 7.0 years ago by Joe 21k

0

I think it's simply complementary, especially if the genomic reference is missing, incomplete, or otherwise of bad quality. In my field we often don't have a reference genome, so no other choice. I was just asking, because your earlier reply sounded to me as if that is something completely out of the question and I wanted to know why.

ADD REPLY • link 7.0 years ago by cschu181 ★ 2.8k

0

Sorry, can you be more specific? I think it is not clear what you have and what you want.

ADD REPLY • link 7.0 years ago by Buffo ★ 2.4k

score 0 · Answer 1 · 2017-04-04

0

7.0 years ago

Rohit ★ 1.5k

From your comments, I think what you want is redundancy removal. This can be done in two ways:

1) Without the reference, using Vmatch, CD-HIT, or UCLUST: combine the de novo and genome-guided assembly transcripts and keep only the longest from the superset, removing sequences that are completely overlapped by a longer one.

2) With the reference: map the transcripts to the genome with a split-aware mapper and, where transcripts overlap in a region as subsets of one another, keep only the longest.
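To make the first option concrete, here is a minimal Python sketch of the "keep the longest, drop what is fully contained" idea. It is not what Vmatch, CD-HIT, or UCLUST do internally: it only removes exact substrings (on either strand), whereas those tools cluster at an identity threshold and scale to real transcriptomes. The function names and the toy sequences are made up for illustration.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def remove_contained(transcripts):
    """Keep only transcripts that are not an exact substring of a longer one.

    transcripts: dict {transcript_id: sequence}, e.g. the de novo and
    genome-guided sets merged into one dict with distinct IDs.
    Comparison is all-against-all, so this is only suitable for small sets.
    """
    kept = {}
    # Visit longest first so every candidate is compared to longer sequences.
    ordered = sorted(transcripts.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        seq = seq.upper()
        rc = revcomp(seq)
        if any(seq in longer or rc in longer for longer in kept.values()):
            continue  # redundant: fully covered by a longer transcript
        kept[name] = seq
    return kept


if __name__ == "__main__":
    # Hypothetical toy transcripts from the two assemblies.
    de_novo = {"dn1": "ATGCCCGGGTTTAAA", "dn2": "CCCGGG"}
    guided = {"gg1": "ATGCCCGGGTTT", "gg2": "GGGGGGGGGGTTTTTCACAC"}
    non_redundant = remove_contained({**de_novo, **guided})
    print(sorted(non_redundant))
```

Prefixing the IDs by assembly (as with `dn` and `gg` here) makes it easy to see afterwards which assembly each retained transcript came from.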

It would have been easier if you had framed the question as "keeping the longest transcripts or removing the redundancy" :)

ADD COMMENT • link 7.0 years ago by Rohit ★ 1.5k

0

Like this one: "How to map genome guided assembly TRANSCRIPTS to Genome and extract the longest one for each genome locus"
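The "longest one per locus" step can be sketched in Python as below, assuming the transcripts have already been mapped to the genome and reduced to one record each. The record layout `(transcript_id, chrom, start, end, transcript_length)` with 0-based half-open coordinates, the function name, and the toy values are all assumptions for illustration. Strand is ignored and overlaps are chained, so A overlapping B and B overlapping C puts all three in one locus.

```python
def longest_per_locus(alignments):
    """Return one transcript ID per locus: the longest among overlapping hits.

    alignments: iterable of (transcript_id, chrom, start, end, length),
    where start/end are 0-based half-open genome coordinates.
    """
    selected = []
    cur_chrom, cur_end, cur_best = None, None, None
    # Sort by chromosome and start so overlapping hits are adjacent.
    for aln in sorted(alignments, key=lambda a: (a[1], a[2], a[3])):
        _, chrom, start, end, length = aln
        if chrom == cur_chrom and start < cur_end:
            # Same locus: extend it and keep the longer transcript.
            cur_end = max(cur_end, end)
            if length > cur_best[4]:
                cur_best = aln
        else:
            # New locus: emit the winner of the previous one, if any.
            if cur_best is not None:
                selected.append(cur_best[0])
            cur_chrom, cur_end, cur_best = chrom, end, aln
    if cur_best is not None:
        selected.append(cur_best[0])
    return selected


if __name__ == "__main__":
    # Hypothetical mapped transcripts from both assemblies.
    hits = [
        ("dn1", "chr1", 100, 500, 400),
        ("gg1", "chr1", 150, 900, 700),
        ("gg2", "chr1", 1000, 1200, 200),
        ("dn2", "chr2", 100, 300, 200),
    ]
    print(longest_per_locus(hits))
```

De novo transcripts that do not map to the draft genome at all never enter this step, so they would need to be carried over separately.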

ADD REPLY • link 7.0 years ago by WouterDeCoster 47k

0

I missed this one, so the answer fits that question well :)

ADD REPLY • link 7.0 years ago by Rohit ★ 1.5k

0

Yes, exactly, this is what I want from the genome-guided assembly. After that I also want to extract from the de novo assembly some of the gene sequences that cannot be retrieved through the draft-genome-guided assembly (because the genome sequence is incomplete and does not contain all protein-coding gene sequences).

ADD REPLY • link 7.0 years ago by Bioinfonext ▴ 460