Question: How to combine de novo and genome-guided assemblies
2.0 years ago by
Bioinfonext140
Korea
Bioinfonext140 wrote:

Dear All,

I am working on a plant species for which a draft genome sequence is available. I have done de novo and genome-guided assemblies separately. Please suggest how I can combine these two assemblies to generate a reference set of transcriptome sequences for raw read counting.

Should I run blastn between them to remove overlapping sequences, or something else? Should I assemble the two transcriptome assemblies further?

Thanks

rna-seq • 1.4k views
modified 2.0 years ago by Rohit1.3k • written 2.0 years ago by Bioinfonext140

Transcriptome? You can make hybrid assemblies using Trinity, IDBA_hybrid, or even SPAdes from both of your attempts, or by using the draft genome as a trusted reference.

written 2.0 years ago by Buffo1.5k

Yes, it is RNAseq data.

written 2.0 years ago by Bioinfonext140

You want to assemble RNAseq reads? Why?

written 2.0 years ago by jrj.healey11k

Actually, a draft genome sequence is available, but it does not contain all gene sequences. So I have done a genome-guided assembly using StringTie and a de novo assembly using Trinity.

Now, to build a complete set of reference gene sequences for raw read counting, I want to remove the gene sequences that overlap between these two assemblies, so that I end up with non-redundant gene sequences.

written 2.0 years ago by Bioinfonext140

Why have you assembled RNAseq reads? What are you trying to do? All the short-read sequencing in the world is never going to allow you to close a genome and get a completed sequence.

If you want to get raw read counts from your RNAseq you should be mapping the reads (e.g. with bwa, bowtie2 etc) to the existing reference (or a reassembly if you have access to the original sequencing) and then calculating the raw read counts from the alignment map, not assembling.
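As a rough illustration of "calculating the raw read counts from the alignment map", here is a minimal sketch that tallies primary mapped alignments per reference sequence from SAM-format text. The function name `count_reads_per_reference` is hypothetical, and the sketch assumes the reads were mapped to transcript or gene sequences so that the SAM reference name is the unit being counted. For reads mapped to a genome, counts are normally assigned per annotated feature with a dedicated tool such as featureCounts or htseq-count.

```python
from collections import Counter

def count_reads_per_reference(sam_lines):
    """Count primary, mapped alignments per reference name (SAM column 3).

    Each alignment record is counted once, so the two mates of a
    paired-end fragment are counted separately.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        flag = int(fields[1])
        rname = fields[2]
        # SAM FLAG bits: 0x4 unmapped, 0x100 secondary, 0x800 supplementary.
        if flag & 0x4 or flag & 0x100 or flag & 0x800 or rname == "*":
            continue
        counts[rname] += 1
    return counts

if __name__ == "__main__":
    # Toy SAM records, made up for illustration only.
    toy_sam = [
        "@SQ\tSN:tx1\tLN:1000",
        "r1\t0\ttx1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    for name, n in sorted(count_reads_per_reference(toy_sam).items()):
        print(name, n)
```

On a real file the same function can be fed an open file handle, for example `count_reads_per_reference(open("aln.sam"))`, where `aln.sam` is a placeholder name.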

modified 2.0 years ago • written 2.0 years ago by jrj.healey11k

Thanks. I am not expecting to get all gene sequences, but at least to retrieve those gene transcripts that are present in my transcriptome data and may have an important functional role, but are not present in the currently reported CDS sequences.

written 2.0 years ago by Bioinfonext140

So you are looking for untranslated genome features? sRNAs, pseudogenes etc?

written 2.0 years ago by jrj.healey11k

I am looking for protein-coding gene sequences that are not present in the current annotated CDS. Yes, I want to remove redundancy and select the longest transcript from both the genome-guided and the de novo assembled transcripts.

written 2.0 years ago by Bioinfonext140

I'm genuinely curious, what is the problem with assembling RNA-seq reads?

written 2.0 years ago by cschu1811.6k

They can potentially be assembled, they are just short reads after all, but why would you want to? If you have a region of no transcription, you won't reverse transcribe any cDNA to be sequenced from that region of the genome in the library prep. If there are no reads there, then the assembler will have to terminate the contig there, as there will be no more read sequences to overlap. Even in the best case, you'll have very different coverage at intergenic and genic regions. You'd end up with an assembly, but it would probably be full of short contigs, so it'd be pretty shitty.

modified 2.0 years ago • written 2.0 years ago by jrj.healey11k

That sounds like an issue with a genome assembly from RNA-seq reads, which I totally agree would be bonkers. But a transcriptome assembly (which seems to be what OP wants to do) should be fine with RNA-seq, no?

written 2.0 years ago by cschu1811.6k

When you say 'assembling a transcriptome' though, what exactly do you mean? Maybe it's just a syntactic difference, because when I hear the word 'assembly' I take that to mean literally using an assembler. If you want to do transcriptomics from RNAseq via mapping though, then sure! I think we possibly just misunderstand one another when we're using the word 'assemble' in the context of RNAseq/transcriptomics.

written 2.0 years ago by jrj.healey11k

In this case I was assuming de novo, e.g. using Trinity, Velvet/Oases, IDBA-tran etc. So, yes, literally using an assembler.

written 2.0 years ago by cschu1811.6k

I've never done it personally, as I've always had a reference genome to map against so I'm not sure of the use case. Particularly in this case as the OP said there was a reference for the organism too.

It's possible I'm misunderstanding the question, as I've personally never come across the need to do transcriptome assembly (instead of just mapping etc. to a reference).

modified 2.0 years ago • written 2.0 years ago by jrj.healey11k

I think it's simply complementary, especially if the genomic reference is missing, incomplete, or otherwise of bad quality. In my field we often don't have a reference genome, so no other choice. I was just asking, because your earlier reply sounded to me as if that is something completely out of the question and I wanted to know why.

written 2.0 years ago by cschu1811.6k

Sorry, can you be more specific? I think it is not clear what you have and what you want.

modified 2.0 years ago • written 2.0 years ago by Buffo1.5k
2.0 years ago by
Rohit1.3k
California
Rohit1.3k wrote:

From your comments I think what you want is redundancy removal. This can be done in two ways:

1) Without the reference, using Vmatch, CD-HIT or UCLUST: combine the de novo and genome-guided assembly transcripts into one superset and keep only the longest ones, removing transcripts that are completely overlapped by a longer one.

2) With the reference: map the transcripts to the genome with a split-aware mapper and, wherever transcripts in a region are overlapping subsets of one another, keep only the longest one.
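Option 1 can be sketched in a few lines, under a strong simplifying assumption: a transcript is dropped only if it is an exact substring (on either strand) of a longer transcript that has already been kept. This is a stand-in for illustration, not what Vmatch, CD-HIT or UCLUST do internally; those tools cluster by sequence similarity and tolerate mismatches. The function names and the toy sequences are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of name -> sequence."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def remove_contained(transcripts):
    """Keep the longest transcripts; drop exact sub-sequences of kept ones.

    transcripts: dict of name -> sequence (both assemblies merged).
    Returns a dict with the surviving names and their original sequences.
    """
    kept_upper = []  # upper-case sequences already accepted
    kept = {}
    # Longest first, ties broken by name so the result is deterministic.
    order = sorted(transcripts.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in order:
        fwd = seq.upper()
        rev = reverse_complement(fwd)
        if any(fwd in k or rev in k for k in kept_upper):
            continue  # fully contained in a longer (or identical) transcript
        kept_upper.append(fwd)
        kept[name] = seq
    return kept

if __name__ == "__main__":
    merged = read_fasta([
        ">TRINITY_1", "ACGTACGTTTGACCA",
        ">STRG.1", "CGTACGTTTG",
    ])
    print(sorted(remove_contained(merged)))
```

The all-against-all substring check is quadratic, so this is only practical for small sets; for a full transcriptome a dedicated clustering tool is the realistic choice.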

It would have been easier if you had framed the question as "keeping the longest transcripts or removing the redundancy" :)

written 2.0 years ago by Rohit1.3k

Like this one: "How to map genome guided assembly TRANSCRIPTS to Genome and extract the longest one for each genome locus"
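For the reference-based route, picking the longest transcript per locus could look roughly like the sketch below. It assumes the transcripts have already been mapped to the draft genome and that each mapping has been reduced to one record holding the transcript name, chromosome, strand, genomic start and end, and transcript length. The `Hit` record and the function name are hypothetical, and a "locus" is taken here to be a run of mappings on the same chromosome and strand whose genomic spans overlap.

```python
from collections import namedtuple
from itertools import groupby

# One mapped transcript: genomic span (closed interval) plus transcript length.
Hit = namedtuple("Hit", "transcript chrom strand start end length")

def longest_per_locus(hits):
    """Return one Hit per locus: the transcript with the greatest length."""
    ordered = sorted(hits, key=lambda h: (h.chrom, h.strand, h.start, h.end))
    winners = []
    for _, group in groupby(ordered, key=lambda h: (h.chrom, h.strand)):
        best, locus_end = None, None
        for hit in group:
            if best is None or hit.start > locus_end:
                # No overlap with the current locus: close it, start a new one.
                if best is not None:
                    winners.append(best)
                best, locus_end = hit, hit.end
            else:
                # Overlaps the current locus: extend it, keep the longer one.
                locus_end = max(locus_end, hit.end)
                if hit.length > best.length:
                    best = hit
        if best is not None:
            winners.append(best)
    return winners

if __name__ == "__main__":
    # Toy coordinates, made up for illustration only.
    toy = [
        Hit("STRG.1.1", "chr1", "+", 100, 900, 600),
        Hit("TRINITY_A", "chr1", "+", 150, 1200, 950),
        Hit("STRG.2.1", "chr1", "+", 5000, 5800, 700),
    ]
    for hit in longest_per_locus(toy):
        print(hit.transcript, hit.chrom, hit.start, hit.end)
```

Merging by simple span overlap chains transcripts together, so neighbouring or nested genes on the same strand can collapse into one locus; whether that is acceptable depends on how compact the genome is.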

written 2.0 years ago by WouterDeCoster38k

I missed this one, so the answer fits that question well :)

modified 2.0 years ago • written 2.0 years ago by Rohit1.3k

Yes, exactly. This is what I want from the genome-guided assembly. After that, I also want to extract some of the gene sequences from the de novo assembly that cannot be retrieved through the draft genome-guided assembly (because the genome sequence is incomplete and does not contain all protein-coding gene sequences).
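One way to express that last step, as a sketch only: take the genome-guided set as the base and add the de novo transcripts that did not map to the draft genome. The function name, the `min_length` parameter and the idea of passing in a pre-computed set of mapped transcript IDs are all assumptions made for illustration; the set of mapped IDs would have to come from mapping the de novo transcripts to the genome beforehand.

```python
def supplement_with_unmapped(genome_guided, de_novo, mapped_ids, min_length=0):
    """Add de novo transcripts that did not map to the draft genome.

    genome_guided, de_novo: dicts of name -> sequence.
    mapped_ids: set of de novo transcript names that mapped to the genome.
    min_length: optional length cutoff for the added transcripts.
    """
    combined = dict(genome_guided)
    for name, seq in de_novo.items():
        if name in mapped_ids or name in combined:
            continue  # already represented via the genome, or a name clash
        if len(seq) < min_length:
            continue
        combined[name] = seq
    return combined

if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    guided = {"STRG.1": "ACGTACGT"}
    denovo = {"TRINITY_1": "AAAACCCC", "TRINITY_2": "GGGGTTTTAA"}
    print(sorted(supplement_with_unmapped(guided, denovo, {"TRINITY_1"})))
```

Transcripts that fail to map are not necessarily missing genes; they can also be contaminants or misassemblies, so they are worth checking (for example by similarity search) before they go into the reference set.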

modified 2.0 years ago • written 2.0 years ago by Bioinfonext140