Hi,
I'm using Hisat2, Stringtie, Stringtie-Merge, gffcompare and ballgown for RNAseq analysis. I wanted to detect differential expressed lncRNAs.
For Hisat2, I downloaded GRCh38 genome index "genome_snp_tran" from Hisat2 website Hisat2 Indexes. After running the hisat2 command with the available fastq files it gave a bam file as output. Using samtools sorted the bam files and used sorted.bam files as input for Stringtie.
stringtie -p 8 -G gencode_Release27_GRCh38.p10_PRI/gencode.v27.primary_assembly.annotation.gtf -o /dir/A133/A133_GRCh38.gtf -l /A133/A133 /dir/A133/A133.sorted.bam"
This gave the output A133_GRCh38.gtf file. It looks like following:
1 StringTie transcript 14007 16679 1000 - . gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; cov "7.175807"; FPKM "0.578671"; TPM "1.022094";
1 StringTie exon 14007 15038 1000 - . gene_id "/dir/STB133/STB133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "1"; cov "7.036415";
1 StringTie exon 15796 15947 1000 - . gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "2"; cov "9.265032";
1 StringTie exon 16607 16679 1000 - . gene_id "/dir/A133/A133.1"; transcript_id "/dir/A133/A133.1.1"; exon_number "3"; cov "4.796226";
- I don't understand why there are no gene_id and transcript_id? Why it shows PATH there?
I used the same command on the other samples A765, A987, A456. It gave A765_GRCh38.gtf, A987_GRCh38.gtf, A456_GRCh38.gtf files as outputs.
Next I used Stringtie-Merge. I wrote the path of all the Stringtie output gtf files in text file "mergelist.txt"
stringtie --merge -p 8 -G gencode_Release27_GRCh38.p10_PRI/gencode.v27.primary_assembly.annotation.gtf -o /stringtiemerge/stringtie_merged.gtf mergelist.txt”
Output "stringtie_merged_gtf" is like following:
# StringTie version 1.3.3
1 StringTie transcript 13989 26963 1000 - . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";
1 StringTie exon 13989 15038 1000 - . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";
1 StringTie exon 15796 15947 1000 - . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "2";
1 StringTie exon 16607 16765 1000 - . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "3";
GL000218.1 StringTie transcript 51867 54893 1000 - . gene_id "MSTRG.15857"; transcript_id "ENST00000612565.1"; gene_name "AL354822.1"; ref_gene_id "ENSG00000278384.1";
GL000218.1 StringTie exon 51867 54893 1000 - . gene_id "MSTRG.15857"; transcript_id "ENST00000612565.1"; exon_number "1"; gene_name "AL354822.1"; ref_gene_id "ENSG00000278384.1";
chr1 ENSEMBL exon 1529223 1529331 . + . gene_id "ENSG00000197785.13"; transcript_id "ENST00000536055.5"; exon_number "15"; gene_name "ATAD3A";
chr1 ENSEMBL exon 1533926 1534687 . + . gene_id "ENSG00000197785.13"; transcript_id "ENST00000536055.5"; exon_number "16"; gene_name "ATAD3A";
chr1 HAVANA transcript 1517070 1517775 . + . gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; gene_name "ATAD3A"; ref_gene_id "ENSG00000197785.13";
chr1 HAVANA exon 1517070 1517412 . + . gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; exon_number "1"; gene_name "ATAD3A";
chr1 HAVANA exon 1517716 1517775 . + . gene_id "ENSG00000197785.13"; transcript_id "ENST00000439513.1"; exon_number "2"; gene_name "ATAD3A";
chr1 HAVANA transcript 1517621 1522780 . + . gene_id "ENSG00000197785.13"; transcript_id "ENST00000429957.1"; gene_name "ATAD3A"; ref_gene_id "ENSG00000197785.13";
chr6 ENSEMBL exon 32034860 32034992 . + . gene_id "ENSG00000224389.8"; transcript_id "ENST00000425700.3"; exon_number "39"; gene_name "C4B";
chr6 ENSEMBL exon 32035136 32035417 . + . gene_id "ENSG00000224389.8"; transcript_id "ENST00000425700.3"; exon_number "40"; gene_name "C4B";
chr6 HAVANA transcript 32015581 32016297 . + . gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; gene_name "C4B"; ref_gene_id "ENSG00000224389.8";
chr6 HAVANA exon 32015581 32015657 . + . gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; exon_number "1"; gene_name "C4B";
chr6 HAVANA exon 32015867 32016297 . + . gene_id "ENSG00000224389.8"; transcript_id "ENST00000490071.1"; exon_number "2"; gene_name "C4B";
chr6 HAVANA transcript 32015619 32017136 . + . gene_id "ENSG00000224389.8"; transcript_id "ENST00000485543.1"; gene_name "C4B"; ref_gene_id "ENSG00000224389.8";
- What are those "MSTRG" for gene_id and transcript_id?
- Why in the second column I see Stringtie, HAVANA, ENSEMBL?
- In the first column what is "GL000218.1"?
- Why the first column"Chromosome" is 1 sometimes and chr1 sometimes?
Please help me to understand these outputs. Did I do anything wrong? Is there any problem with gtf or genome index?
Thank Devon. What about A133_GRCh38.gtf file output. Why there is no gene_id, transcript_id or genename? Why there is path?
I have no idea what you're talking about. Try posting a few lines of the relevant file.
It was already mentioned in the above question. I gave a A133_GRCh38.gtf file output.
They're just arbitrary names, they have no importance in and of themselves.
Hello Devon,
Sorry for asking you again. The "MSTRG" in the above stringtie_merged.gtf file are the novel transcripts? because I see some with Ensembl gene id and transcript id. I think that if stringtie doesn't find any ref gene locations from gtf file but finds rna-seq alignments in that location stringtie constructs that as novel transcripts? Am I right?
Another question: After stringtie-merge, I used gffcompare to examine how the transcripts compare with reference annotation. the reference annotation is with both protein coding and lncRNAs. How will I know which is protein coding gene and known lncRNAs from stringtie_merged.gtf file?
Sequencing is strand-specific (RF). How to get the counts for the pc genes and known lncRNAs from all samples? With counts I would like to do differential analysis.
It's not entirely clear if they're purely novel transcripts or whether modified transcripts are also included.
You'd need to filter the GTF files for lncRNAs or protein coding genes to know for sure, I don't think gffcompare will report that.
You can use featureCounts to get counts (allegedly stringTie can produce some sort of counts too...but I'd be hesitant to trust that).
I would like to use htseq-counts.
htseq-count -f bam -q sample1.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt
Is this right for strand-specific (RF)? Is bam fine or I need to give sorted.bam?
And could you please tell how to strip "chr" from gtf file in linux.
You'll want
-s reverse
I think. You can stripchr
characters withsed
.I tried with this:
htseq-count –s reverse –f bam –q sample1.sorted.bam gencode.v27.primary_assembly.annotation.gtf > sample1_counts.txt
I have this error:
/path/soft/apps/HTSeq/0.6.1p1-goolf-1.7.20-Python-2.7.11/bin/htseq-count: Error: Please provide two arguments.
Call with '-h' to get usage information.
There's presumably an extra space somewhere (or in that version maybe
-q
is doing something different).Could you please tell something about my previous comment.