I'm wondering if someone can check my work using STAR to map and StringTie for assembly and quantification. My data is from RNAseq on 29 timecourse samples, ~60M PE 100bp reads/sample. My initial goal is to discover novel isoforms that are present in my timecourse compared to the genome annotation file: Mus_musculus.GRCm38.100.gtf
Can someone please confirm that my merged.stats make sense and there aren't any red flags in the results (or glaring problems with the pipeline)? Also, help to determine how many novel transcripts are present in my samples and not in the reference? I see novel introns and novel exons and novel loci, but not sure how to calculate how many novel transcripts are there. Maybe this is a dumb question.....Help!
STAR to generate a genome index
STAR --runThreadN 40 --runMode genomeGenerate --genomeDir star --genomeFastaFiles genome/GRCm38.primary_assembly.genome.fa --sjdbOverhang 99 --sjdbGTFfile genes/Mus_musculus.GRCm38.100.gtf
Aligned (total of 29 timecourse samples)
STAR --genomeDir star --readFilesCommand zcat --readFilesIn samples/2_Forward.fq.gz samples/2_Reverse.fq.gz --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM --outSAMstrandField intronMotif --runThreadN 40 --outFileNamePrefix "2_star/"
Use StringTie assemble transcripts for each sample (named WT*.gtf, final total of 29 .gtf files):
stringtie-2.1.4/stringtie -p 40 -G genes/Mus_musculus.GRCm38.100.gtf -o WT2.gtf 2_star/Aligned.sortedByCoord.out.bam
Merge transcripts from all 29 samples:
stringtie-2.1.4/stringtie --merge -p 40 -G genes/Mus_musculus.GRCm38.100.gtf -o adipo_stringtie_merged.gtf mergelist.txt
Examine how the transcripts compare to the reference annotation
gffcompare-0.12.1/gffcompare -r genes/Mus_musculus.GRCm38.100.gtf -G -o merged adipo_stringtie_merged.gtf
Contents of merged.stats
Summary for dataset: adipo_stringtie_merged.gtf Query mRNAs : 171578 in 67086 loci (143536 multi-exon transcripts) (26390 multi-transcript loci, ~2.6 transcripts per locus) Reference mRNAs : 141881 in 53534 loci (114993 multi-exon) Super-loci w/ reference transcripts: 53533 -----------------| Sensitivity | Precision | Base level: 100.0 | 67.5 | Exon level: 100.0 | 73.3 | Intron level: 100.0 | 66.8 | Intron chain level: 100.0 | 80.1 | Transcript level: 100.0 | 82.7 | Locus level: 100.0 | 79.8 |
Matching intron chains: 114993 Matching transcripts: 141881 Matching loci: 53534 Missed exons: 0/446708 ( 0.0%) Novel exons: 162803/610424 ( 26.7%) Missed introns: 0/284950 ( 0.0%) Novel introns: 141429/426396 ( 33.2%) Missed loci: 0/53534 ( 0.0%) Novel loci: 13553/67086 ( 20.2%)
Total union super-loci across all input datasets: 67086 171578 out of 171578 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)
Thank you for your response.
As per your comment RE: STAR arguments- That is useful advice. I will read more about those options in the STAR manual.
Yes, my data is strand-specific. The STAR manual says I do not need any additional options if my data is strand-specific. Also, for the mapping step, I did include --outSAMstrandField intronMotif so that the output is usable for StringTie. Is there something else I am missing that I should include in my STAR run regarding an option for strandedness?
And I apologize. I should've been more specific as to what step I am on in the analysis. I paused after I compared the merged.gtf file to the genome annotation file so that I could see how many novel transcripts were found in the entire time-course. After this step I will run the below command for all my samples using the merged.gtf file and then look at novel transcripts in each of the samples using gffcompare.
While I have you- I actually did try running this command on one sample and then looked at the "WT2_gene_abund.txt" output file and saw that only the MSTRG transcripts had counts. All other transcripts (those annotated with gene names) had all "0.0" so something obviously went wrong... any idea why this is??
You are right, you do not need to have provided anything additional to STAR for strand-sensitivity. But for StringTie, you need to specify either the
--frparam to indicate which strand is being sequenced. If you are not sure which strand has been sequenced, then you can identify that empirically by running STAR with
--quantMode GeneCountsparam, which generates a count file. If you then look into that count file and see something like this (for e.g.) -
then it indicates that the alignment outcome is strand-specific. In the above e.g. R2 is aligning with the sense strand of the gene. The StringTie param would be then
--rf. Read the section 7 of STAR manual for more.
Coming to StringTie run with the merged GTF; I am not sure why you would find only MSTRG loci with counts. Maybe you should check from STAR stage if things are going alright. Look into Log.final.out created by STAR run, and within there what is the value for Uniquely mapped reads % ? A decent dataset would have anything upwards of ~70% If this value is not upto the mark then that might indicate something wrong. Also, before the alignment itself, you would have run fastQC on the raw reads. That should have told if the fastq reads are more or less ok. Finally, you could look into the GTF produced during the first run of StringTie. Did you have FPKM or TPM values present for known genes? Pick up some positive controls (genes which you know would have expression value) and
greptheir symbol (or their ENSMUSG ID) in the GTF file (from 1st run). If you still are not seeing values for known markers then open your STAR BAM within IGV and look at some of those genes. Do you see alignment with coverage rising above exons, and "blue bridges" spanning introns (visual representation in IGV for splice-aware alignment). As an e.g., here is a IGV snapshot for a human RNA-seq data.
The above steps will tell you what might gone wrong.
Greater than 90% of reads are uniquely mapped and the quality of reads is very good (determined by fastqc), so it's not anything to do with the quality of the sequence files.
I looked at the WT2.gtf file (the one created during the first run of StringTie and strangely it includes a very small number of known genes (i.e. mostly STRNG identifiers). I searched for well known genes and could not find them, so that explains why they are not being counted in the second run of StringTie. I'm not sure why this is.... I will look at the STAR output in IGV. Was the appropriate STAR .bam file "Aligned.sortedByCoord.out.bam" or should In have used "Aligned.toTranscriptome.out.bam" for the first run of StringTie??
Ok, I'm looking at the IGV visualization for one of my STAR alignment files "Aligned.sortedByCoord.out.bam" and I do not see the blue bridges. There are only the grey boxes.
Yes, "Aligned.sortedByCoord.out.bam" is the file to look at using IGV. Why should you not see blue bridges.. if the read quality was ok (as per fastQC) and STAR says ~90% unique alignment. That is quite puzzling, to be honest. If there was a chance of contamination in the reads (some other organism.. just wondering) then your STAR unique algn. % should have been much lower. You can do some more debugging - 1) Look further into the Log.final.out from STAR and look at the values for the -
Are the annotated splice junctions ~90% of Total? If not, then that might indicate an issue. I would recommend to run STAR again with the some of the confidence measures (ENCODE options) as in STAR manual (for long RNA-seq).
Here is the whole Log.Final.Out file. Annotated (sjdb) / Total = 99.6%
Started job on | Dec 22 17:01:37 Started mapping on | Dec 22 17:12:11 Finished on | Dec 22 17:35:52 Mapping speed, Million of reads per hour | 161.82
I will run STAR again with your recommendations and see what happens.
I ran the following command
STAR --genomeDir star --readFilesIn samples/2_Forward.fq samples/2_Reverse.fq --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --alignSJoverhangMin 8 --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --quantMode TranscriptomeSAM --outSAMstrandField intronMotif --runThreadN 40 --outFileNamePrefix "2_ENCODE_star/"
Started job on | Dec 27 13:27:30 Started mapping on | Dec 27 13:39:38 Finished on | Dec 27 14:31:05 Mapping speed, Million of reads per hour | 79.96
I also checked the alignment in IGV and saw the same thing as above (no blue bridges). I also ran it through StringTie and encountered the same problems as I did previously with annotated genes showing no counts.
I literally have no idea what is wrong. When I look at the STAR Aligned.sortedByCoord.out.bam file in IGV, it's showing that reads mapped to annotated genes, but then StringTie just isn't assigning the counts correctly. It's really weird.
Hi, I am at my wit's end here, not sure where the problem is arising. Don't worry about IGV, as STAR is recognising splice junctions (predominantly known ones), so I think splice-aware alignment is happening. When you load BAM file into IGV, the default mode is where each read is visible (as a horizontal strip), and if the read has been splice-aligned then there is thin blue line connecting the stretches split across exon boundary. The "bridges" become prominent if you right-click the BAM name in the left-hand panel and choose "Collapsed" view.
In any case, the only time I have come across such a situation is when the chromosome naming pattern was different in the input BAM as compared to the input GTF for StringTie. A warning was thrown. Maybe check your StingTie log file and also (just to be sure) that the chromosome name pattern in STAR Aligned.sortedByCoord.out.bam and the input GTF have the same chr name style.
If there is no headway still, then I would suggest to provide both these values to
--quantMode TranscriptomeSAM GeneCounts. So, you would get gene counts from STAR, do that at least for a couple of samples and check if gene level counts are making sense in terms of the biology you are chasing.
1) If the gene count data from STAR seem alright then the issue might be somewhere in StringTie. 2) If quantification of alternative transcripts satisfies your question then you could skip all this alignment+ assembly and go to tools like Kallisto. It would (once you have the index ready) give you isoform-level (NOTE - known isoforms only) quantification in 10-15min per sample. 3) Finally, if novel events (novel exons, novel exon junction boundaries, retained intron) are your interest and StringTie is fails to work, there are other tools as well. Check rMATS, MAJIQ, LeafCutter etc.
Looking at the gene counts file, I see the following. There are only 90 gene IDs listed. There should be ~55,000 genes. Should they all be listed in this file, or is this just a subset? I should say that I checked a few of these and they are found in the merged.gtf that Stringtie gives me (sorry for jumbled formatting...not sure how to make it look like it does in the file I'm coping from)
N_unmapped 1339885 1339885 1339885 N_multimapping 4906514 4906514 4906514 N_noFeature 62306436 62317452 62306473 N_ambiguous 0 0 0 ENSMUSG00000094836 0 0 0 ENSMUSG00000094383 0 0 0 ENSMUSG00000094474 0 0 0 ENSMUSG00000094791 0 0 0 ENSMUSG00000096550 0 0 0 ENSMUSG00000094172 0 0 0 ENSMUSG00000094887 0 0 0 ENSMUSG00000091585 0 0 0 ENSMUSG00000095763 0 0 0 ENSMUSG00000095523 0 0 0 ENSMUSG00000095475 0 0 0 ENSMUSG00000094855 0 0 0 ENSMUSG00000051412 2755 0 2755 ENSMUSG00000090805 0 0 0 ENSMUSG00000089163 0 0 0 ENSMUSG00000061654 197 0 197 ENSMUSG00000079834 1420 0 1420 ENSMUSG00000095250 0 0 0 ENSMUSG00000095787 0 0 0 ENSMUSG00000096100 0 0 0 ENSMUSG00000094054 0 0 0 ENSMUSG00000095672 0 0 0 ENSMUSG00000079190 0 0 0 ENSMUSG00000094514 0 0 0 ENSMUSG00000094121 0 0 0 ENSMUSG00000094576 0 0 0 ENSMUSG00000079764 0 0 0 ENSMUSG00000079774 0 0 0 ENSMUSG00000079773 0 0 0 ENSMUSG00000096271 0 0 0 ENSMUSG00000095456 0 0 0 ENSMUSG00000095247 5 0 5 ENSMUSG00000095585 0 0 0 ENSMUSG00000096873 0 0 0 ENSMUSG00000095755 0 0 0 ENSMUSG00000096646 0 0 0 ENSMUSG00000096506 12 0 12 ENSMUSG00000095552 0 0 0 ENSMUSG00000095207 0 0 0 ENSMUSG00000079222 2 0 2 ENSMUSG00000094874 0 0 0 ENSMUSG00000095500 0 0 0 ENSMUSG00000095450 0 0 0 ENSMUSG00000094728 0 0 0 ENSMUSG00000062783 4 0 4 ENSMUSG00000096808 0 0 0 ENSMUSG00000094303 0 0 0 ENSMUSG00000096236 0 0 0 ENSMUSG00000096756 0 0 0 ENSMUSG00000095666 0 0 0 ENSMUSG00000096680 0 0 0 ENSMUSG00000095505 0 0 0 ENSMUSG00000094741 0 0 0 ENSMUSG00000074720 0 0 0 ENSMUSG00000094337 0 0 0 ENSMUSG00000095570 0 0 0 ENSMUSG00000096728 0 0 0 ENSMUSG00000093828 0 0 0 ENSMUSG00000095623 0 0 0 ENSMUSG00000094661 0 0 0 ENSMUSG00000095320 0 0 0 ENSMUSG00000094350 0 0 0 ENSMUSG00000096237 0 0 0 ENSMUSG00000094722 0 0 0 ENSMUSG00000095728 0 0 0 ENSMUSG00000095076 0 0 0 ENSMUSG00000096776 0 0 0 ENSMUSG00000096244 0 0 0 ENSMUSG00000079800 0 0 0 ENSMUSG00000095092 0 0 0 ENSMUSG00000079192 0 0 0 ENSMUSG00000079794 0 0 0 ENSMUSG00000094799 0 0 0 ENSMUSG00000095019 0 0 0 ENSMUSG00000094915 0 0 0 ENSMUSG00000079808 1 0 1 ENSMUSG00000095041 5903 13 5890 ENSMUSG00000063897 347 9 338 ENSMUSG00000084520 0 0 0 ENSMUSG00000099278 0 0 0 ENSMUSG00000095434 0 0 0 ENSMUSG00000094431 0 0 0 ENSMUSG00000094621 1 0 1 ENSMUSG00000098647 0 0 0 ENSMUSG00000096730 0 0 0 ENSMUSG00000095742 406 15 391
Ok, I see. Something is not right. Irrespective of whether a gene gets any count assigned or not, all of the IDs are present in the count file created. So, it should have ~55k lines (or, the number genes in the GTF file provided when STAR index was created).
Just wondering if it could be possible that the GTF file (or maybe the genome file even) was corrupted while being downloaded (?). Or, somehow the STAR index creation step didn't go properly. The STAR index size is around ~25Gb in case of GRCh38. The mouse genome and GTF are near about the same size (as human genome and GTF) and so the index should be in that ballpark of size.
You could further check by going into your index dir. and looking for
geneInfo.tabfile. If you do
headcommand on it, then the 1st line tells you the number of genes in the index. You could also
wc -lthat file for the same outcome. You could also check the log file created while the index was made.
If either of these checks are not satisfactory, I would recommend creating the index again. ** For formatting, select the text and click the "Code sample" button on the formatting panel.
I re-created the index a bunch of times using different genome and annotation files and the same thing kept happening, BUT I had a star index I did a few months ago on a different drive and that one looks fine. The arguments for the one that worked (from GenomeParameters.txt) was-
and the arguments for one that didn't work is-
Other than changing the number of threads, I have no idea what went wrong. In any case, I'm going to re-run my pipeline with the old star index and it should work.
Ok, this is new information as I had the impression that you have always used "Mus_musculus.GRCm38.100.gtf" as GTF file. The command set-up that you pointed out which didn't work, has gencode GTF has input instead. The genome file is the same in both case (and I am assuming it is from Ensembl and has chromosome names like
1, 2, 3etc. The gencode GTF (gencode.vM25.primary_assembly.annotation.gtf) on the other hand would have names like
chr1. I am not sure if STAR index creation throws error or not in such a case, but the BAM that gets created from alignment to that index would chr names like in the genome fasta file. If that BAM is given as input to StringTie with the gencode GTF then definitely the quantification file would only have MSTRG ids as output (because the chr names wouldn't match for the known genes). Please ensure that the chr naming style is consistent across the fasta/GTF inputs given all along the steps.
I actually tried running the indexing step on a different machine and it worked with the same exact code that failed above (looked at the geneInfo.tab file and everything looked good), so it's something funky behind the scenes on my computer. No idea. You've been a ton of help! I appreciate it.
I did go through and run the pipeline again (for only 4 samples tol see if the results make sense) and am looking at the merged.stats file and the numbers for novel stuff is REALLY low, which makes me think I need to change my run parameters somewhere to be less stringent somehow. Here is all the code I ran and the contents of my merged stats file:
Generate genome index using STAR
Align to genome using STAR
Assemble Samples 2 to 5 only
Merge sample 2 to 5 only with main annotation
Compare merged vs main annotation
#Contents of merged_2to5only.stats file
gffcompare v0.12.1 | Command line was:
gffcompare-0.12.1/gffcompare -r genes/gencode.vM25.primary_assembly.annotation.gtf -G -o merged_2to5only adipo_stringtie_merged_2to5only.gtf
= Summary for dataset: adipo_stringtie_merged_2to5only.gtf
Query mRNAs : 148525 in 53729 loci (121240 multi-exon transcripts)
(20486 multi-transcript loci, ~2.8 transcripts per locus)
Reference mRNAs : 142004 in 53534 loci (115073 multi-exon)
Super-loci w/ reference transcripts: 53294
-----------------| Sensitivity | Precision |
Intron chain level: 99.9 | 94.8 | Transcript level: 99.4 | 95.0 | Locus level: 99.7 | 99.0 |
Total union super-loci across all input datasets: 53729 148525 out of 148525 consensus transcripts written in merged_2to5only.annotated.gtf (0 discarded as redundant)
Hi, This STAR+StringTie workflow is able to detect novel transcripts as far as I can tell from my own experience with multiple datatypes (splicing defects, CRISPR knockouts etc.). But to be fair, I knew which gene to look at. So, I do not have an idea of how much novel stuff you can detect on average in human/ mouse RNA-seq.
In any case, 435 novel loci is not too bad. There could be two conditions I think, broadly which would lead to detecting novel stuff - a) A transcript/ gene which is expressed in that tissue, but not known in reference annotation b) A transcript/ gene arising due to aberrant splicing. This could be happening due to say a splice-site altering mutation. Or it could also happen if splicing complex proteins are compromised (some known cases in human data include SF3B1, U1 snRNA etc.)
Now, in case of model organisms which have been studied for quite long, like human, mouse, it would be rare to get condition _a)_ as consortia like Gencode ensure that we have comprehensive annotation of the transcriptome.
If you are suspecting condition _b)_ then by all means check the BAM file for that particular gene, or if you are suspecting systemic aberrant splicing then you should probably profile donor/ acceptor site statistics from the data (like Fig.1c of this paper)
So, I would recommend to look at your data with some hypothesis you might have, which should better tell if the amount of novel loci being detected are low or at par.
== If you are going to invest more time into testing STAR params, then I would recommend to turn on chimeric loci detection (i.e. fusion genes). By default thats off; see here for some recommended settings.
I would not be expecting _b)_ , so potentially these numbers are OK. I may try playing around with STAR parameters more, too. I'll check out chimeric loci detection parameters. Thanks :)