STAR to generate a genome index

Question

STAR and StringTie to identify novel isoforms (please check my work)

3

Entering edit mode

3.3 years ago

Jen ▴ 70

I'm wondering if someone can check my work using STAR to map and StringTie for assembly and quantification. My data is from RNAseq on 29 timecourse samples, ~60M PE 100bp reads/sample. My initial goal is to discover novel isoforms that are present in my timecourse compared to the genome annotation file: Mus_musculus.GRCm38.100.gtf

Can someone please confirm that my merged.stats make sense and there aren't any red flags in the results (or glaring problems with the pipeline)? Also, help to determine how many novel transcripts are present in my samples and not in the reference? I see novel introns and novel exons and novel loci, but not sure how to calculate how many novel transcripts are there. Maybe this is a dumb question.....Help!

STAR to generate a genome index

STAR --runThreadN 40 --runMode genomeGenerate --genomeDir star --genomeFastaFiles genome/GRCm38.primary_assembly.genome.fa --sjdbOverhang 99 --sjdbGTFfile genes/Mus_musculus.GRCm38.100.gtf

Aligned (total of 29 timecourse samples)

STAR --genomeDir star --readFilesCommand zcat --readFilesIn samples/2_Forward.fq.gz samples/2_Reverse.fq.gz --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM --outSAMstrandField intronMotif --runThreadN 40 --outFileNamePrefix "2_star/"

Use StringTie assemble transcripts for each sample (named WT*.gtf, final total of 29 .gtf files):

stringtie-2.1.4/stringtie -p 40 -G genes/Mus_musculus.GRCm38.100.gtf -o WT2.gtf 2_star/Aligned.sortedByCoord.out.bam

Merge transcripts from all 29 samples:

stringtie-2.1.4/stringtie --merge -p 40 -G genes/Mus_musculus.GRCm38.100.gtf -o adipo_stringtie_merged.gtf mergelist.txt

Examine how the transcripts compare to the reference annotation

gffcompare-0.12.1/gffcompare -r genes/Mus_musculus.GRCm38.100.gtf -G -o merged adipo_stringtie_merged.gtf

Contents of merged.stats

Summary for dataset: adipo_stringtie_merged.gtf Query mRNAs : 171578 in 67086 loci (143536 multi-exon transcripts) (26390 multi-transcript loci, ~2.6 transcripts per locus) Reference mRNAs : 141881 in 53534 loci (114993 multi-exon) Super-loci w/ reference transcripts: 53533 -----------------| Sensitivity | Precision | Base level: 100.0 | 67.5 | Exon level: 100.0 | 73.3 | Intron level: 100.0 | 66.8 | Intron chain level: 100.0 | 80.1 | Transcript level: 100.0 | 82.7 | Locus level: 100.0 | 79.8 |

 Matching intron chains:  114993
   Matching transcripts:  141881
          Matching loci:   53534

      Missed exons:       0/446708  (  0.0%)
       Novel exons:  162803/610424  ( 26.7%)
    Missed introns:       0/284950  (  0.0%)
     Novel introns:  141429/426396  ( 33.2%)
       Missed loci:       0/53534   (  0.0%)
        Novel loci:   13553/67086   ( 20.2%)

Total union super-loci across all input datasets: 67086 171578 out of 171578 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant)

RNA-Seq STAR StringTie gffcompare novel isoforms • 6.9k views

ADD COMMENT • link written 3.3 years ago by Jen ▴ 70

score 1 · Answer 1 · 2020-12-26

1

Entering edit mode

3.3 years ago

Amitm ★ 2.2k

Hi, Since you have 100bp PE reads (that too in good lib. size), you should use additional confidence measures for alignment in STAR. In the STAR manual, look at the ENCODE options section. Specifically --outFilterType and --alignSJoverhangMin are worthwhile in keeping false +ve rate low if StringTie (or any other transcriptome assembler) is to be used downstream. Also, you are using --outFilterMultimapNmax 1, which means any read can map to max. 1 loc. only. Mammalian genomes have loads of repetitive/ transposable element content and sometimes these are biologically relevant. You might be loosing out genuine alignments where only one of the read has multi-hits, but the other in the pair has a unique hit. The default value of 10 is sensible and not too lenient.

Another aspect before coming to StringTie, is whether this is strand-specific data? If yes, then that info. helps in filtering out counts that are from the minus strand of the gene.

Now to StringTie; I think that you are missing one step after stringtie --merge. The GTF output created from this step has all transcript models detected in any of your samples. The next step, before going to gffcompare, is to run StringTie again for each of the samples using the output GTF from stringtie --merge. Remember that for this second run of StringTie, the -e param should be activated, so that only transcript models present in the input GTF (i.e. adipo_stringtie_merged.gtf) are used. We do not want any more novel isoform detection in this second run.

The output GTFs from each of the sample then should be given to gffcompare. That would tell you what is happening in each sample. The stats that you have at the moment is giving you a collapsed summary. When you get sample-wise summary, you might see a trend.

Also, when you re-run StringTie with the merged GTF, the novel transcripts are comparable across samples because they would share the same tracking ID.

ADD COMMENT • link 3.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

Thank you for your response.

As per your comment RE: STAR arguments- That is useful advice. I will read more about those options in the STAR manual.

Yes, my data is strand-specific. The STAR manual says I do not need any additional options if my data is strand-specific. Also, for the mapping step, I did include --outSAMstrandField intronMotif so that the output is usable for StringTie. Is there something else I am missing that I should include in my STAR run regarding an option for strandedness?

And I apologize. I should've been more specific as to what step I am on in the analysis. I paused after I compared the merged.gtf file to the genome annotation file so that I could see how many novel transcripts were found in the entire time-course. After this step I will run the below command for all my samples using the merged.gtf file and then look at novel transcripts in each of the samples using gffcompare.

stringtie-2.1.4/stringtie -e -B -p 40 -A WT2_gene_abund.txt -G adipo_stringtie_merged.gtf -o ballgown/WT2/WT2.gtf WT2.bam

While I have you- I actually did try running this command on one sample and then looked at the "WT2_gene_abund.txt" output file and saw that only the MSTRG transcripts had counts. All other transcripts (those annotated with gene names) had all "0.0" so something obviously went wrong... any idea why this is??

ADD REPLY • link 3.3 years ago by Jen ▴ 70

0

Entering edit mode

Hi,

You are right, you do not need to have provided anything additional to STAR for strand-sensitivity. But for StringTie, you need to specify either the --rf or --fr param to indicate which strand is being sequenced. If you are not sure which strand has been sequenced, then you can identify that empirically by running STAR with --quantMode GeneCounts param, which generates a count file. If you then look into that count file and see something like this (for e.g.) -

ENSG00000227232 65  0   65
ENSG00000278267 29  1   28

then it indicates that the alignment outcome is strand-specific. In the above e.g. R2 is aligning with the sense strand of the gene. The StringTie param would be then --rf. Read the section 7 of STAR manual for more.

Coming to StringTie run with the merged GTF; I am not sure why you would find only MSTRG loci with counts. Maybe you should check from STAR stage if things are going alright. Look into Log.final.out created by STAR run, and within there what is the value for Uniquely mapped reads % ? A decent dataset would have anything upwards of ~70% If this value is not upto the mark then that might indicate something wrong. Also, before the alignment itself, you would have run fastQC on the raw reads. That should have told if the fastq reads are more or less ok. Finally, you could look into the GTF produced during the first run of StringTie. Did you have FPKM or TPM values present for known genes? Pick up some positive controls (genes which you know would have expression value) and grep their symbol (or their ENSMUSG ID) in the GTF file (from 1st run). If you still are not seeing values for known markers then open your STAR BAM within IGV and look at some of those genes. Do you see alignment with coverage rising above exons, and "blue bridges" spanning introns (visual representation in IGV for splice-aware alignment). As an e.g., here is a IGV snapshot for a human RNA-seq data.

The above steps will tell you what might gone wrong.

ADD REPLY • link 3.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

Greater than 90% of reads are uniquely mapped and the quality of reads is very good (determined by fastqc), so it's not anything to do with the quality of the sequence files.

I looked at the WT2.gtf file (the one created during the first run of StringTie and strangely it includes a very small number of known genes (i.e. mostly STRNG identifiers). I searched for well known genes and could not find them, so that explains why they are not being counted in the second run of StringTie. I'm not sure why this is.... I will look at the STAR output in IGV. Was the appropriate STAR .bam file "Aligned.sortedByCoord.out.bam" or should In have used "Aligned.toTranscriptome.out.bam" for the first run of StringTie??

ADD REPLY • link 3.3 years ago by Jen ▴ 30

0

Entering edit mode

Ok, I'm looking at the IGV visualization for one of my STAR alignment files "Aligned.sortedByCoord.out.bam" and I do not see the blue bridges. There are only the grey boxes.

ADD REPLY • link 3.3 years ago by Jen ▴ 70

0

Entering edit mode

Yes, "Aligned.sortedByCoord.out.bam" is the file to look at using IGV. Why should you not see blue bridges.. if the read quality was ok (as per fastQC) and STAR says ~90% unique alignment. That is quite puzzling, to be honest. If there was a chance of contamination in the reads (some other organism.. just wondering) then your STAR unique algn. % should have been much lower. You can do some more debugging - 1) Look further into the Log.final.out from STAR and look at the values for the -

Number of splices: Total
Number of splices: Annotated (sjdb)
Number of splices: Non-canonical

Are the annotated splice junctions ~90% of Total? If not, then that might indicate an issue. I would recommend to run STAR again with the some of the confidence measures (ENCODE options) as in STAR manual (for long RNA-seq).

ADD REPLY • link 3.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

Here is the whole Log.Final.Out file. Annotated (sjdb) / Total = 99.6%

Started job on | Dec 22 17:01:37 Started mapping on | Dec 22 17:12:11 Finished on | Dec 22 17:35:52 Mapping speed, Million of reads per hour | 161.82

                      Number of input reads |   63874228
                  Average input read length |   200
                                UNIQUE READS:
               Uniquely mapped reads number |   58056474
                    Uniquely mapped reads % |   90.89%
                      Average mapped length |   199.50
                   Number of splices: Total |   38467474
        Number of splices: Annotated (sjdb) |   38328311
                   Number of splices: GT/AG |   38167345
                   Number of splices: GC/AG |   268039
                   Number of splices: AT/AC |   20120
           Number of splices: Non-canonical |   11970
                  Mismatch rate per base, % |   0.21%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.77
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.20
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   0
         % of reads mapped to multiple loci |   0.00%
    Number of reads mapped to too many loci |   4261524
         % of reads mapped to too many loci |   6.67%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   2.33%
                 % of reads unmapped: other |   0.11%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

I will run STAR again with your recommendations and see what happens.

ADD REPLY • link 3.3 years ago by Jen ▴ 70

0

Entering edit mode

I ran the following command

STAR --genomeDir star --readFilesIn samples/2_Forward.fq samples/2_Reverse.fq --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --alignSJoverhangMin 8 --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --quantMode TranscriptomeSAM --outSAMstrandField intronMotif --runThreadN 40 --outFileNamePrefix "2_ENCODE_star/"

Log.final.out

Started job on | Dec 27 13:27:30 Started mapping on | Dec 27 13:39:38 Finished on | Dec 27 14:31:05 Mapping speed, Million of reads per hour | 79.96

                      Number of input reads |   68563888
                  Average input read length |   200
                                UNIQUE READS:
               Uniquely mapped reads number |   62317489
                    Uniquely mapped reads % |   90.89%
                      Average mapped length |   199.51
                   Number of splices: Total |   41031275
        Number of splices: Annotated (sjdb) |   41028273
                   Number of splices: GT/AG |   40835874
                   Number of splices: GC/AG |   175040
                   Number of splices: AT/AC |   17948
           Number of splices: Non-canonical |   2413
                  Mismatch rate per base, % |   0.18%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.83
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.21
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   4906514
         % of reads mapped to multiple loci |   7.16%
    Number of reads mapped to too many loci |   164562
         % of reads mapped to too many loci |   0.24%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   1.58%
                 % of reads unmapped: other |   0.14%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

I also checked the alignment in IGV and saw the same thing as above (no blue bridges). I also ran it through StringTie and encountered the same problems as I did previously with annotated genes showing no counts.

I literally have no idea what is wrong. When I look at the STAR Aligned.sortedByCoord.out.bam file in IGV, it's showing that reads mapped to annotated genes, but then StringTie just isn't assigning the counts correctly. It's really weird.

ADD REPLY • link 3.3 years ago by Jen ▴ 70

0

Entering edit mode

Hi, I am at my wit's end here, not sure where the problem is arising. Don't worry about IGV, as STAR is recognising splice junctions (predominantly known ones), so I think splice-aware alignment is happening. When you load BAM file into IGV, the default mode is where each read is visible (as a horizontal strip), and if the read has been splice-aligned then there is thin blue line connecting the stretches split across exon boundary. The "bridges" become prominent if you right-click the BAM name in the left-hand panel and choose "Collapsed" view.

In any case, the only time I have come across such a situation is when the chromosome naming pattern was different in the input BAM as compared to the input GTF for StringTie. A warning was thrown. Maybe check your StingTie log file and also (just to be sure) that the chromosome name pattern in STAR Aligned.sortedByCoord.out.bam and the input GTF have the same chr name style.

If there is no headway still, then I would suggest to provide both these values to --quantMode TranscriptomeSAM GeneCounts. So, you would get gene counts from STAR, do that at least for a couple of samples and check if gene level counts are making sense in terms of the biology you are chasing.

1) If the gene count data from STAR seem alright then the issue might be somewhere in StringTie. 2) If quantification of alternative transcripts satisfies your question then you could skip all this alignment+ assembly and go to tools like Kallisto. It would (once you have the index ready) give you isoform-level (NOTE - known isoforms only) quantification in 10-15min per sample. 3) Finally, if novel events (novel exons, novel exon junction boundaries, retained intron) are your interest and StringTie is fails to work, there are other tools as well. Check rMATS, MAJIQ, LeafCutter etc.

ADD REPLY • link 3.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

Looking at the gene counts file, I see the following. There are only 90 gene IDs listed. There should be ~55,000 genes. Should they all be listed in this file, or is this just a subset? I should say that I checked a few of these and they are found in the merged.gtf that Stringtie gives me (sorry for jumbled formatting...not sure how to make it look like it does in the file I'm coping from)

N_unmapped 1339885 1339885 1339885 N_multimapping 4906514 4906514 4906514 N_noFeature 62306436 62317452 62306473 N_ambiguous 0 0 0 ENSMUSG00000094836 0 0 0 ENSMUSG00000094383 0 0 0 ENSMUSG00000094474 0 0 0 ENSMUSG00000094791 0 0 0 ENSMUSG00000096550 0 0 0 ENSMUSG00000094172 0 0 0 ENSMUSG00000094887 0 0 0 ENSMUSG00000091585 0 0 0 ENSMUSG00000095763 0 0 0 ENSMUSG00000095523 0 0 0 ENSMUSG00000095475 0 0 0 ENSMUSG00000094855 0 0 0 ENSMUSG00000051412 2755 0 2755 ENSMUSG00000090805 0 0 0 ENSMUSG00000089163 0 0 0 ENSMUSG00000061654 197 0 197 ENSMUSG00000079834 1420 0 1420 ENSMUSG00000095250 0 0 0 ENSMUSG00000095787 0 0 0 ENSMUSG00000096100 0 0 0 ENSMUSG00000094054 0 0 0 ENSMUSG00000095672 0 0 0 ENSMUSG00000079190 0 0 0 ENSMUSG00000094514 0 0 0 ENSMUSG00000094121 0 0 0 ENSMUSG00000094576 0 0 0 ENSMUSG00000079764 0 0 0 ENSMUSG00000079774 0 0 0 ENSMUSG00000079773 0 0 0 ENSMUSG00000096271 0 0 0 ENSMUSG00000095456 0 0 0 ENSMUSG00000095247 5 0 5 ENSMUSG00000095585 0 0 0 ENSMUSG00000096873 0 0 0 ENSMUSG00000095755 0 0 0 ENSMUSG00000096646 0 0 0 ENSMUSG00000096506 12 0 12 ENSMUSG00000095552 0 0 0 ENSMUSG00000095207 0 0 0 ENSMUSG00000079222 2 0 2 ENSMUSG00000094874 0 0 0 ENSMUSG00000095500 0 0 0 ENSMUSG00000095450 0 0 0 ENSMUSG00000094728 0 0 0 ENSMUSG00000062783 4 0 4 ENSMUSG00000096808 0 0 0 ENSMUSG00000094303 0 0 0 ENSMUSG00000096236 0 0 0 ENSMUSG00000096756 0 0 0 ENSMUSG00000095666 0 0 0 ENSMUSG00000096680 0 0 0 ENSMUSG00000095505 0 0 0 ENSMUSG00000094741 0 0 0 ENSMUSG00000074720 0 0 0 ENSMUSG00000094337 0 0 0 ENSMUSG00000095570 0 0 0 ENSMUSG00000096728 0 0 0 ENSMUSG00000093828 0 0 0 ENSMUSG00000095623 0 0 0 ENSMUSG00000094661 0 0 0 ENSMUSG00000095320 0 0 0 ENSMUSG00000094350 0 0 0 ENSMUSG00000096237 0 0 0 ENSMUSG00000094722 0 0 0 ENSMUSG00000095728 0 0 0 ENSMUSG00000095076 0 0 0 ENSMUSG00000096776 0 0 0 ENSMUSG00000096244 0 0 0 ENSMUSG00000079800 0 0 0 ENSMUSG00000095092 0 0 0 ENSMUSG00000079192 0 0 0 ENSMUSG00000079794 0 0 0 ENSMUSG00000094799 0 0 0 ENSMUSG00000095019 0 0 0 ENSMUSG00000094915 0 0 0 ENSMUSG00000079808 1 0 1 ENSMUSG00000095041 5903 13 5890 ENSMUSG00000063897 347 9 338 ENSMUSG00000084520 0 0 0 ENSMUSG00000099278 0 0 0 ENSMUSG00000095434 0 0 0 ENSMUSG00000094431 0 0 0 ENSMUSG00000094621 1 0 1 ENSMUSG00000098647 0 0 0 ENSMUSG00000096730 0 0 0 ENSMUSG00000095742 406 15 391

ADD REPLY • link 3.3 years ago by Jen ▴ 70

0

Entering edit mode

Ok, I see. Something is not right. Irrespective of whether a gene gets any count assigned or not, all of the IDs are present in the count file created. So, it should have ~55k lines (or, the number genes in the GTF file provided when STAR index was created).

Just wondering if it could be possible that the GTF file (or maybe the genome file even) was corrupted while being downloaded (?). Or, somehow the STAR index creation step didn't go properly. The STAR index size is around ~25Gb in case of GRCh38. The mouse genome and GTF are near about the same size (as human genome and GTF) and so the index should be in that ballpark of size.

You could further check by going into your index dir. and looking for geneInfo.tab file. If you do head command on it, then the 1st line tells you the number of genes in the index. You could also wc -l that file for the same outcome. You could also check the log file created while the index was made.

If either of these checks are not satisfactory, I would recommend creating the index again. ** For formatting, select the text and click the "Code sample" button on the formatting panel.

ADD REPLY • link 3.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

I re-created the index a bunch of times using different genome and annotation files and the same thing kept happening, BUT I had a star index I did a few months ago on a different drive and that one looks fine. The arguments for the one that worked (from GenomeParameters.txt) was-

STAR   --runMode genomeGenerate   --runThreadN 16   --genomeDir star   --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa      --sjdbGTFfile Mus_musculus.GRCm38.100.gtf
versionGenome   20201
genomeFastaFiles    Mus_musculus.GRCm38.dna.primary_assembly.fa 
genomeSAindexNbases 14
genomeChrBinNbits   18
genomeSAsparseD 1
sjdbOverhang    100
sjdbFileChrStartEnd - 
sjdbGTFfile Mus_musculus.GRCm38.100.gtf
sjdbGTFchrPrefix    -
sjdbGTFfeatureExon  exon
sjdbGTFtagExonParentTranscript  transcript_id
sjdbGTFtagExonParentGene    gene_id
sjdbInsertSave  Basic
genomeFileSizes 2798565611 22356046436

and the arguments for one that didn't work is-

STAR   --runMode genomeGenerate   --runThreadN 40   --genomeDir star   --genomeFastaFiles genome/Mus_musculus.GRCm38.dna.primary_assembly.fa      --sjdbGTFfile genes/gencode.vM25.primary_assembly.annotation.gtf
versionGenome   20201
genomeFastaFiles    genome/Mus_musculus.GRCm38.dna.primary_assembly.fa 
genomeSAindexNbases 14
genomeChrBinNbits   18
genomeSAsparseD 1
sjdbOverhang    100
sjdbFileChrStartEnd - 
sjdbGTFfile genes/gencode.vM25.primary_assembly.annotation.gtf
sjdbGTFchrPrefix    -
sjdbGTFfeatureExon  exon
sjdbGTFtagExonParentTranscript  transcript_id
sjdbGTFtagExonParentGene    gene_id
sjdbInsertSave  Basic
genomeFileSizes 2741279807 21885792228

Other than changing the number of threads, I have no idea what went wrong. In any case, I'm going to re-run my pipeline with the old star index and it should work.

ADD REPLY • link 3.3 years ago by Jen ▴ 70

0

Entering edit mode

Ok, this is new information as I had the impression that you have always used "Mus_musculus.GRCm38.100.gtf" as GTF file. The command set-up that you pointed out which didn't work, has gencode GTF has input instead. The genome file is the same in both case (and I am assuming it is from Ensembl and has chromosome names like 1, 2, 3 etc. The gencode GTF (gencode.vM25.primary_assembly.annotation.gtf) on the other hand would have names like chr1. I am not sure if STAR index creation throws error or not in such a case, but the BAM that gets created from alignment to that index would chr names like in the genome fasta file. If that BAM is given as input to StringTie with the gencode GTF then definitely the quantification file would only have MSTRG ids as output (because the chr names wouldn't match for the known genes). Please ensure that the chr naming style is consistent across the fasta/GTF inputs given all along the steps.

ADD REPLY • link 3.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

I actually tried running the indexing step on a different machine and it worked with the same exact code that failed above (looked at the geneInfo.tab file and everything looked good), so it's something funky behind the scenes on my computer. No idea. You've been a ton of help! I appreciate it.

ADD REPLY • link 3.3 years ago by Jen ▴ 70

0

Entering edit mode

I did go through and run the pipeline again (for only 4 samples tol see if the results make sense) and am looking at the merged.stats file and the numbers for novel stuff is REALLY low, which makes me think I need to change my run parameters somewhere to be less stringent somehow. Here is all the code I ran and the contents of my merged stats file:

Generate genome index using STAR

STAR --runThreadN 16 --runMode genomeGenerate --genomeDir star --genomeFastaFiles genome/GRCm38.primary_assembly.genome.fa --sjdbOverhang 99 --sjdbGTFfile genes/gencode.vM25.primary_assembly.annotation.gtf

Align to genome using STAR

STAR --genomeDir star_THIS --readFilesCommand zcat --readFilesIn samples/2_Forward.fq.gz samples/2_Reverse.fq.gz --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --alignSJoverhangMin 8 --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --quantMode TranscriptomeSAM GeneCounts --outSAMstrandField intronMotif --runThreadN 16 --outFileNamePrefix "2_star_THIS/"

STAR --genomeDir star_THIS --readFilesCommand zcat --readFilesIn samples/3_Forward.fq.gz samples/3_Reverse.fq.gz --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --alignSJoverhangMin 8 --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --quantMode TranscriptomeSAM GeneCounts --outSAMstrandField intronMotif --runThreadN 16 --outFileNamePrefix "3_star_THIS/"

STAR --genomeDir star_THIS --readFilesCommand zcat --readFilesIn samples/4_Forward.fq.gz samples/4_Reverse.fq.gz --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --alignSJoverhangMin 8 --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --quantMode TranscriptomeSAM GeneCounts --outSAMstrandField intronMotif --runThreadN 16 --outFileNamePrefix "4_star_THIS/"

STAR --genomeDir star_THIS --readFilesCommand zcat --readFilesIn samples/5_Forward.fq.gz samples/5_Reverse.fq.gz --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --alignSJoverhangMin 8 --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --quantMode TranscriptomeSAM GeneCounts --outSAMstrandField intronMotif --runThreadN 16 --outFileNamePrefix "5_star_THIS/"

Assemble Samples 2 to 5 only

stringtie-2.1.4/stringtie -p 40 -G genes/gencode.vM25.primary_assembly.annotation.gtf -o WT2_star_THIS.gtf 2_star_THIS/Aligned.sortedByCoord.out.bam
stringtie-2.1.4/stringtie -p 40 -G genes/gencode.vM25.primary_assembly.annotation.gtf -o WT3_star_THIS.gtf 3_star_THIS/Aligned.sortedByCoord.out.bam
stringtie-2.1.4/stringtie -p 40 -G genes/gencode.vM25.primary_assembly.annotation.gtf -o WT4_star_THIS.gtf 4_star_THIS/Aligned.sortedByCoord.out.bam
stringtie-2.1.4/stringtie -p 40 -G genes/gencode.vM25.primary_assembly.annotation.gtf -o WT5_star_THIS.gtf 5_star_THIS/Aligned.sortedByCoord.out.bam

Merge sample 2 to 5 only with main annotation

stringtie-2.1.4/stringtie --merge -p 40 -G genes/gencode.vM25.primary_assembly.annotation.gtf -o adipo_stringtie_merged_2to5only.gtf mergelist_2to5only.txt

Compare merged vs main annotation

gffcompare-0.12.1/gffcompare -r genes/gencode.vM25.primary_assembly.annotation.gtf -G -o merged_2to5only adipo_stringtie_merged_2to5only.gtf

#Contents of merged_2to5only.stats file

gffcompare v0.12.1 | Command line was:

gffcompare-0.12.1/gffcompare -r genes/gencode.vM25.primary_assembly.annotation.gtf -G -o merged_2to5only adipo_stringtie_merged_2to5only.gtf

#

= Summary for dataset: adipo_stringtie_merged_2to5only.gtf

Query mRNAs : 148525 in 53729 loci (121240 multi-exon transcripts)

(20486 multi-transcript loci, ~2.8 transcripts per locus)

Reference mRNAs : 142004 in 53534 loci (115073 multi-exon)

Super-loci w/ reference transcripts: 53294

-----------------| Sensitivity | Precision |

    Base level:   100.0     |    96.6    |
    Exon level:    94.1     |    97.3    |
  Intron level:   100.0     |    98.9    |

Intron chain level: 99.9 | 94.8 | Transcript level: 99.4 | 95.0 | Locus level: 99.7 | 99.0 |

 Matching intron chains:  114995
   Matching transcripts:  141172
          Matching loci:   53363

      Missed exons:       0/446814 (  0.0%)
       Novel exons:    1754/421810 (  0.4%)
    Missed introns:       4/284950 (  0.0%)
     Novel introns:     783/288229 (  0.3%)
       Missed loci:       0/53534 (  0.0%)
        Novel loci:     435/53729 (  0.8%)

Total union super-loci across all input datasets: 53729 148525 out of 148525 consensus transcripts written in merged_2to5only.annotated.gtf (0 discarded as redundant)

ADD REPLY • link 3.3 years ago by Jen ▴ 30

0

Entering edit mode

Hi, This STAR+StringTie workflow is able to detect novel transcripts as far as I can tell from my own experience with multiple datatypes (splicing defects, CRISPR knockouts etc.). But to be fair, I knew which gene to look at. So, I do not have an idea of how much novel stuff you can detect on average in human/ mouse RNA-seq.

In any case, 435 novel loci is not too bad. There could be two conditions I think, broadly which would lead to detecting novel stuff - a) A transcript/ gene which is expressed in that tissue, but not known in reference annotation b) A transcript/ gene arising due to aberrant splicing. This could be happening due to say a splice-site altering mutation. Or it could also happen if splicing complex proteins are compromised (some known cases in human data include SF3B1, U1 snRNA etc.)

Now, in case of model organisms which have been studied for quite long, like human, mouse, it would be rare to get condition _a)_ as consortia like Gencode ensure that we have comprehensive annotation of the transcriptome.

If you are suspecting condition _b)_ then by all means check the BAM file for that particular gene, or if you are suspecting systemic aberrant splicing then you should probably profile donor/ acceptor site statistics from the data (like Fig.1c of this paper)

So, I would recommend to look at your data with some hypothesis you might have, which should better tell if the amount of novel loci being detected are low or at par.

== If you are going to invest more time into testing STAR params, then I would recommend to turn on chimeric loci detection (i.e. fusion genes). By default thats off; see here for some recommended settings.

ADD REPLY • link 3.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

I would not be expecting _b)_ , so potentially these numbers are OK. I may try playing around with STAR parameters more, too. I'll check out chimeric loci detection parameters. Thanks :)

ADD REPLY • link 3.3 years ago by Jen ▴ 30