Best pipeline for RNAseq assembly and analysis (or help with stringtie assembly)
7 weeks ago
Katherine • 0


I am new to bioinformatics and have been working with three Illumina RNAseq datasets: two from a disease condition and one control (this low sample number makes analysis very hard). The end goal is to identify and/or confirm biomarkers for the disease.

I have assembled the sequences using hisat2 to map to the GRCh38 human reference genome and stringtie for the assembly, creating output files that I could feed into Ballgown for analysis. I have also done some analysis with DESeq2. I have several problems:

  1. My assembled transcripts are all (I think) labeled with MSTRG.[#], a labelling convention that stringtie assigns to unknown transcripts. However, when I manually take some of these sequences and look at them in a genome viewer, they clearly match a gene. How do I get stringtie to actually map the gene names to the transcripts? Is this a problem with my reference files?

  2. I have been unable to extract the fasta files from the gtf files that I have created. I have tried gffread which gave me errors and cannot get agat to download. How do I extract the fasta files?

  3. How do I analyze only 3 datasets? Are there databases with other RNAseq data that I could easily download to supplement my analysis? I am really trying to get together a pipeline for when we get hopefully 200 samples.

I mainly want to know if there are important steps that I am missing or if there is a better assembly and analysis pipeline that is more up-to-date for human transcriptome assembly. Tips about checking the quality of assembly would be fantastic too. Happy to provide more info; any advice/resources would be great!

human assembly transcriptome • 326 views

For point 1), use an up-to-date reference GTF when assigning transcripts.
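To check whether a given reference annotation is actually being picked up, one option is to look at the attributes in the assembled GTF. The sketch below is only an illustration: it assumes the StringTie output carries a `ref_gene_name` attribute on transcript lines that matched a reference transcript (attribute names can differ between versions and between `--merge` output and per-sample output, so check your own file first). The function names are made up for this example.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute -> value."""
    return dict(ATTR_RE.findall(field))


def mstrg_to_gene_names(gtf_lines):
    """Map each gene_id (e.g. MSTRG.1) to the reference gene names
    found on its transcript lines. Genes with no reference match
    are simply absent from the result."""
    mapping = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        name = attrs.get("ref_gene_name")  # assumed attribute name
        if gene_id and name:
            mapping[gene_id].add(name)
    return {gene_id: sorted(names) for gene_id, names in mapping.items()}


# Usage (file name is hypothetical):
# with open("sample1.gtf") as handle:
#     names = mstrg_to_gene_names(handle)
```

If the resulting dictionary is empty, none of the assembled transcripts were matched to the reference, which points at the annotation file or how it was supplied rather than at the reads.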


I thought that the file I used was the most up to date. Here is the header of the file:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-source NCBI Homo sapiens Annotation Release 110
##sequence-region NC_000001.11 1 248956422

Is there a separate program that can assign the transcripts after assembly?
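One thing that might be worth ruling out, given that the header above uses RefSeq accessions such as NC_000001.11: if the genome index used for alignment names its sequences differently (for example `chr1` or `1`), the annotation and the assembly would not share any sequence names and no reference gene could be assigned. That is only a guess about this setup, not something the thread confirms. A minimal sketch for comparing the first column of the two files, with hypothetical function and file names:

```python
def seqids(lines):
    """Collect the sequence names (first tab-separated column)
    from GTF/GFF3 lines, skipping comments and blank lines."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def compare_seqids(annotation_lines, assembly_lines):
    """Report which sequence names are shared between a reference
    annotation and an assembled GTF, and which appear in only one."""
    ann = seqids(annotation_lines)
    asm = seqids(assembly_lines)
    return {
        "shared": sorted(ann & asm),
        "annotation_only": sorted(ann - asm),
        "assembly_only": sorted(asm - ann),
    }


# Usage (file names are hypothetical):
# with open("reference.gff") as ref, open("sample1.gtf") as asm:
#     report = compare_seqids(ref, asm)
#     print(report["shared"][:5])
```

An empty `shared` list would suggest a naming mismatch between the annotation and the genome the reads were aligned to.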

