Question

Best pipeline for RNAseq assembly and analysis (or help with stringtie assembly)

0

22 months ago

Katherine • 0

Hello,

I am new to bioinformatics and have been working with three Illumina RNA-seq datasets: two from a disease condition and one control (this low sample number makes analysis very hard). The end goal is to identify and/or confirm biomarkers for the disease.

I aligned the reads to the GRCh38 human reference genome with HISAT2 and assembled transcripts with StringTie, creating output files that I could feed into Ballgown for analysis. I have also done some analysis with DESeq2. I have several problems:

1. My assembled transcripts are all (I think) labeled MSTRG.[#], the labelling convention that StringTie uses for transcripts it cannot match to the annotation. However, when I manually take some of these sequences and look at them in a genome viewer, they clearly match a known gene. How do I get StringTie to map the gene names to the transcripts? Is this a problem with my reference files?
2. I have been unable to extract FASTA files from the GTF files that I have created. I tried gffread, which gave me errors, and I cannot get AGAT to download. How do I extract the FASTA files?
3. How do I analyze only 3 datasets? Are there databases with other RNA-seq data that I could easily download to supplement my analysis? I am really trying to put together a pipeline for when we hopefully get 200 samples.
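For the first problem, one quick check is whether the assembled GTF already carries reference gene names. When StringTie is run with a reference annotation (-G), transcripts that match a known transcript normally carry extra attributes such as ref_gene_name. The following is a minimal sketch, not a definitive tool: the function names and the example records (GENE_A, chr1) are made up for illustration, and it assumes a tab-separated GTF with quoted attribute values.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute -> value."""
    return dict(ATTR_RE.findall(field))


def mstrg_to_gene_names(gtf_lines):
    """Map each StringTie gene_id to the reference gene names found on its transcripts.

    Only 'transcript' records are inspected. Gene IDs whose transcripts have
    no ref_gene_name attribute are left out of the result.
    """
    mapping = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        name = attrs.get("ref_gene_name")
        if name and "gene_id" in attrs:
            mapping[attrs["gene_id"]].add(name)
    return {gene_id: sorted(names) for gene_id, names in mapping.items()}


# Hypothetical records for illustration only (not real data).
example = [
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; ref_gene_name "GENE_A";',
    'chr1\tStringTie\texon\t100\t900\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";',
    'chr1\tStringTie\ttranscript\t2000\t2500\t1000\t-\t.\t'
    'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
]

print(mstrg_to_gene_names(example))
```

If the mapping comes back empty for a real file, that suggests no assembled transcript was matched to the annotation at all, which points at the reference files rather than at the assembly itself.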

I mainly want to know if there are important steps that I am missing or if there is a better assembly and analysis pipeline that is more up-to-date for human transcriptome assembly. Tips about checking the quality of assembly would be fantastic too. Happy to provide more info; any advice/resources would be great!
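For reference, the workflow described above can be sketched as a list of commands. This is only an outline under stated assumptions: paired-end reads, a prebuilt HISAT2 index, and placeholder file names (sample1, grch38_index, annotation.gtf, mergelist.txt are all hypothetical). The function only builds the command lines and does not run anything.

```python
import shlex


def build_commands(sample, index, reads_1, reads_2, annotation):
    """Return the argv lists for one sample of a HISAT2 -> StringTie -> Ballgown run."""
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # Align; --dta reports alignments tailored for transcript assemblers.
        ["hisat2", "--dta", "-x", index, "-1", reads_1, "-2", reads_2, "-S", sam],
        # StringTie expects a coordinate-sorted BAM.
        ["samtools", "sort", "-o", bam, sam],
        # Assemble, guided by the reference annotation (-G).
        ["stringtie", bam, "-G", annotation, "-o", f"{sample}.gtf"],
        # Merge step: run once for all samples; mergelist.txt (hypothetical)
        # lists the per-sample GTF files, one per line.
        ["stringtie", "--merge", "-G", annotation, "-o", "merged.gtf", "mergelist.txt"],
        # Re-estimate abundances against the merged set (-e) and write
        # Ballgown input tables (-B).
        ["stringtie", bam, "-e", "-B", "-G", "merged.gtf",
         "-o", f"ballgown/{sample}/{sample}.gtf"],
    ]


commands = build_commands(
    sample="sample1",
    index="grch38_index",
    reads_1="sample1_R1.fastq.gz",
    reads_2="sample1_R2.fastq.gz",
    annotation="annotation.gtf",
)
for argv in commands:
    print(shlex.join(argv))
```

Keeping the commands as data makes it easy to loop over many samples later, for example when the larger cohort arrives.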

human assembly transcriptome • 784 views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 22 months ago by Katherine • 0

0

For point 1), use an updated GTF for assigning transcripts.

ADD REPLY • link 22 months ago by cpad0112 21k

0

I thought that the file I used was the most up to date. Here is the header of the file:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCh38.p14
#!genome-build-accession NCBI_Assembly:GCF_000001405.40
#!annotation-source NCBI Homo sapiens Annotation Release 110
##sequence-region NC_000001.11 1 248956422
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606

Is there a separate program that can assign the transcripts after assembly?
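One thing the header above suggests checking is sequence naming: this annotation uses RefSeq accessions such as NC_000001.11, and an annotation can only be matched to assembled transcripts if both files use the same sequence names. Whether the alignment index used different names (for example chr1 or 1) is not known from this thread, so the sketch below is only a diagnostic under that assumption. The function names and the example lines are hypothetical.

```python
def seqids(lines):
    """Collect the first-column sequence names from GFF/GTF lines."""
    ids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def compare_seqids(annotation_lines, assembly_lines):
    """Report which sequence names are shared between two annotation-style files."""
    ann = seqids(annotation_lines)
    asm = seqids(assembly_lines)
    return {
        "shared": sorted(ann & asm),
        "annotation_only": sorted(ann - asm),
        "assembly_only": sorted(asm - ann),
    }


# Hypothetical lines for illustration: a RefSeq-style annotation record
# and an assembled transcript that uses a UCSC-style chromosome name.
annotation_example = [
    "##gff-version 3",
    "NC_000001.11\tBestRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene-EXAMPLE",
]
assembly_example = [
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
]

print(compare_seqids(annotation_example, assembly_example))
```

An empty "shared" list on real files would mean that no annotated gene can overlap any assembled transcript by name, which would leave every transcript labeled MSTRG.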

ADD REPLY • link updated 21 months ago by Ram 43k • written 22 months ago by Katherine • 0