Please suggest an appropriate genome-guided transcriptome assembler
2.9 years ago
seta ★ 1.5k

Dear all,

I have RNA-seq data generated on an Illumina HiSeq 2000 (100 bp, paired-end) from human control and diseased samples. I'm looking for polymorphic simple sequence repeat (SSR) markers between the control and disease groups. I plan to do a genome-guided transcriptome assembly for each group and then survey the probable polymorphic markers between them. For genome-guided transcriptome assembly, I know about Cufflinks and StringTie, but I found that some people here suggest avoiding them. Could you please suggest an appropriate tool for this purpose?

Any other comments on the issue would be highly appreciated.

Thanks

RNA-Seq genome alignment marker • 1.2k views
0

Can you elaborate on why some people advise avoiding them? Is "here" the Biostars forum, by the way?

0

Yes, here on the Biostars forum; I don't remember exactly why. However, I performed genome-guided transcriptome assembly with two programs, Cufflinks and StringTie, and obtained very different results. It seems that StringTie misses a lot of genes, unlike Cufflinks.

1
2.9 years ago

The one tool that is no longer recommended (even by its own developers) is TopHat / TopHat2. I have not seen anybody advise against HISAT2 as the aligner for genome-guided transcriptome assembly. Use HISAT2 for alignment and StringTie for assembly.

Kevin

0

Thank you, Kevin. For alignment I used STAR, and it works well. My issue is the genome-guided assembler: as I said in my previous comment, the results of STAR/Cufflinks and STAR/StringTie are very different. StringTie reported far fewer genes than Cufflinks, and I don't know why it missed so many. So I'm looking for another suitable genome-guided assembler. What about Trinity?

1

You can of course also use Trinity, and there are other de novo transcriptome assemblers too. I would imagine that many of the differences between Cufflinks and StringTie relate to low-abundance transcripts. The two tools very likely apply different default thresholds, too.

Trinity has a good reputation, if you want to use that instead.
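On the threshold point above, one way to test it is to rerun StringTie with its filters set explicitly and see whether the "missing" genes reappear. This is only a sketch: the BAM and GTF file names are hypothetical, and the values are illustrative rather than recommended. `-c` (minimum read coverage), `-f` (minimum isoform fraction), `-j` (minimum junction coverage) and `-m` (minimum transcript length) are StringTie options whose defaults can differ between versions. The `run` helper is my own addition; it only prints the command unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

# stringtie_relaxed BAM OUT_GTF
# Illustrative values only: lower -c/-f/-j keep more low-abundance transcripts.
stringtie_relaxed() {
    run stringtie "$1" -c 1 -f 0.01 -j 1 -m 200 -o "$2"
}

# Hypothetical file names
stringtie_relaxed sample.sorted.bam sample.relaxed.gtf
```

Comparing the gene counts of this output against a default run should show how much of the difference comes from filtering.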

0

Dear @Kevin, hi. I have RNA-seq data from a fish (3 cond1 and 3 cond2 samples as biological replicates), and I have done a Trinity de novo assembly on it. The draft genome of that species has now been released. In your opinion, which pipeline and approach would be better for a genome-guided comparison? Thanks

1

HISAT2 would be okay to use. It would be interesting to use the new reference genome as a guide (in HISAT2) for the purposes of identifying the transcriptional 'landscape' of this fish species. That would make for a very good publication.

0

Thank you. So, prior to using HISAT2, should I build a reference index for my species of interest using STAR, or not?

2

After you run HISAT2, you then use a program called 'StringTie' to identify transcripts in the aligned data. If you encounter errors during this process, please feel free to open new questions on Biostars.

0

Could you kindly suggest a source or paper with appropriate scripts for HISAT2 usage? I mean scripts for indexing, for mapping all 6 pairs of left and right reads to the reference genome, and then for downstream DEG analysis. It looks like this program has many options.

1

Sure thing! Here is the publication in Nature Protocols: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.

If you cannot obtain the PDF, then I may be able to get it for you.

0

Dear @Kevin, I used the command "./hisat2-build -p 6 '/home/Salmon-genome/GCF_salmon_genome.fna' ht2_base_salmon_genome" to build a HISAT2 index of the genome. (Is it a good command?)

8 *.ht2 files have been created.

I guess that before using StringTie, I should map each sample's paired-end reads to the indexed reference genome using HISAT2. Is that correct?

0

Hello Farbod, yes, that is correct. HISAT2 is a 'splice-aware' alignment program, i.e., it can take RNA-seq reads and faithfully map them back to a reference genome, taking into account the fact that RNA-seq reads derive from mRNA and consist [mostly] of exonic sequence.
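As a rough sketch of the mapping step for one paired-end sample: the index base name `ht2_base_salmon_genome` comes from the command above, while the read file naming (`<sample>_1.fastq.gz` / `<sample>_2.fastq.gz`) and the sample name are assumptions for illustration. `--dta` asks HISAT2 to report alignments tailored for transcript assemblers such as StringTie. The `run` helper is my own addition and only prints the command unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

# align_sample INDEX_BASE SAMPLE [THREADS]
# Assumes reads are named <sample>_1.fastq.gz and <sample>_2.fastq.gz (hypothetical).
align_sample() {
    local index="$1" sample="$2" threads="${3:-6}"
    # -x: index base name, -1/-2: mate files, -S: SAM output
    run hisat2 -p "$threads" --dta -x "$index" \
        -1 "${sample}_1.fastq.gz" -2 "${sample}_2.fastq.gz" \
        -S "${sample}.sam"
}

align_sample ht2_base_salmon_genome cond1_rep1 6
```

Calling `align_sample` once per sample gives one SAM file per biological replicate.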

0

Dear @Kevin, hi. It seems that each .sam file produced by the mapping step will be huge, yes?

1

That is likely, yes, because SAM is uncompressed data. You can compress these to BAM or CRAM (both binary) in order to save disk space. BAM is likely more appropriate, as many programs do not yet explicitly support CRAM.
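A minimal sketch of the SAM-to-BAM conversion with samtools, assuming files named `<sample>.sam` (hypothetical). Sorting by coordinate at the same time is convenient because StringTie expects sorted input. As before, the `run` helper is my own addition and only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

# sam_to_bam SAMPLE [THREADS]
sam_to_bam() {
    local sample="$1" threads="${2:-4}"
    # Coordinate-sort and write compressed BAM in one step, then index it
    run samtools sort -@ "$threads" -o "${sample}.sorted.bam" "${sample}.sam" &&
    run samtools index "${sample}.sorted.bam"
}

sam_to_bam cond1_rep1 4
```

Once the sorted BAM has been checked, the original SAM file can be deleted to recover the disk space.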

0

Hi @Kevin, Now I have 6 .sam files for my 12 fastq files (3 for cond1 and 3 for cond2), and my final goal is DEG analysis.

Should I use them (the 6 SAM files) directly in StringTie, or should I merge them into just one file? Or should I run SAMtools/BCFtools before StringTie?

1

You should keep them separate. If you merge them, you cannot then obtain any useful statistics because it would be a 1 versus 1 comparison. By keeping them separate, you will have 3 versus 3, which is the bare minimum that anyone should have for differential expression analysis.
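One simple way to keep the 3 versus 3 design explicit is a small sample sheet that records which file belongs to which condition. This is only a sketch: the sample names, the column names `ids` and `condition`, and the file name `phenodata.csv` are all hypothetical and should be adapted to whatever your downstream tool expects.

```shell
# write_sample_sheet: print a CSV with one row per sample (hypothetical names)
write_sample_sheet() {
    echo "ids,condition"
    for rep in 1 2 3; do echo "cond1_rep${rep},cond1"; done
    for rep in 1 2 3; do echo "cond2_rep${rep},cond2"; done
}

# Six data rows: three replicates per condition, never merged
write_sample_sheet > phenodata.csv
```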

0

You are right. Thank you very much.

So I should now proceed to the StringTie step, yes?

2

Yes, indeed, Sir. If you have aligned your data with HISAT2, then StringTie is the next step. StringTie will allow you to identify the expression level of your transcripts ('transcript abundances').

You should aim to read through the entire online manual: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#run
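For orientation before reading the manual, a per-sample StringTie run might look roughly like this. It is a sketch under assumptions: the input is a coordinate-sorted BAM named `<sample>.sorted.bam`, the sample names are hypothetical, and the annotation file passed with `-G` is optional (a draft genome may not have one yet). The `run` helper is my own addition and only prints the commands unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

# assemble_sample SAMPLE [THREADS] [ANNOTATION_GTF]
assemble_sample() {
    local sample="$1" threads="${2:-6}" annotation="${3:-}"
    if [ -n "$annotation" ]; then
        # -G guides assembly with known transcripts; -l sets the transcript name prefix
        run stringtie "${sample}.sorted.bam" -p "$threads" -G "$annotation" \
            -l "$sample" -o "${sample}.gtf"
    else
        run stringtie "${sample}.sorted.bam" -p "$threads" -l "$sample" -o "${sample}.gtf"
    fi
}

# One assembly per sample (hypothetical sample names)
for s in cond1_rep1 cond1_rep2 cond1_rep3 cond2_rep1 cond2_rep2 cond2_rep3; do
    assemble_sample "$s" 6
done
```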

1

Here is the recommended workflow for differential expression analysis: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de
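In outline, that workflow merges the per-sample assemblies into one transcript set and then re-estimates abundances against it. The sketch below assumes per-sample files named `<sample>.gtf` and `<sample>.sorted.bam`, an annotation file called `annotation.gtf`, and hypothetical sample names; if no annotation exists, drop `-G` from the merge step. `-e` restricts estimation to the given transcripts and `-B` writes Ballgown table files. Note that the function writes `mergelist.txt` even in dry-run mode; the `run` helper (my own addition) only prints the StringTie commands unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

# quantify_all ANNOTATION_GTF SAMPLE...
quantify_all() {
    local annotation="$1" s
    shift
    # List the per-sample assemblies to merge, one GTF path per line
    : > mergelist.txt
    for s in "$@"; do echo "${s}.gtf" >> mergelist.txt; done
    # Build one non-redundant transcript set across all samples
    run stringtie --merge -G "$annotation" -o merged.gtf mergelist.txt
    # Re-estimate abundances per sample against the merged set
    for s in "$@"; do
        run stringtie -e -B -G merged.gtf -o "ballgown/${s}/${s}.gtf" "${s}.sorted.bam"
    done
}

quantify_all annotation.gtf cond1_rep1 cond1_rep2 cond1_rep3 \
    cond2_rep1 cond2_rep2 cond2_rep3
```

The samples stay separate throughout: only the transcript definitions are merged, not the alignments.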