I am using gffread to extract transcript sequences from a genome, based on a StringTie merged GTF file.
The command I used to generate the StringTie merged GTF file was:
And the command I used for gffread was:
After getting the resulting transcripts.fa file, I took a quick look with (grep -A 1 ">" transcripts.fa | head -n 10), and the first few lines are:
MSTRG.1.1 gene=MSTRG.1|Csa01g001000 GTATCTGAAGTGTTCTGGCGTTTCCATGAAATTTTGGGTTTTGAAGAGGCCGTCTCACC-
Csa01g001000.1 gene=MSTRG.1|Csa01g001000 AAGAAAAACCTCTTTTTTGCTCACTTTCTCGCAATATACAAATCTTCTCTTCTTCTTCTTC-
Csa01g001000.2 gene=MSTRG.1|Csa01g001000 AAGAAAAACCTCTTTTTTGCTCACTTTCTCGCAATATACAAATCTTCTCTTCTTCTTCTTC-
MSTRG.1.4 gene=MSTRG.1|Csa01g001000 AATGGGCTTCCACTGCAGTTTGAAGATTTTTTTGTGCTGTCACTTGGACGTATTGACAT
I know that MSTRG is the default gene ID prefix assigned by StringTie merge, and that the Csa IDs are gene IDs from my Cs_genes_v2.gff3 file. My 1st question: do MSTRG.1 and Csa01g001000 refer to the same gene? (I assume so, but I want to double-check.) If so, my 2nd question: are MSTRG.1.1 and MSTRG.1.4 two novel transcripts of gene MSTRG.1|Csa01g001000?
Also, after playing around with the transcripts.fa file, I realized that some genes have ONLY an MSTRG gene ID, e.g.
MSTRG.64921.1 gene=MSTRG.64921 CATTTGGATGTGATTCACCATGCATGTTGCTTCAGAGAACGGCTAATATTCACCATG
MSTRG.64922.1 gene=MSTRG.64922 CCGTACGCATCAATAAATCCCTGAAAGACCTTGGTAAACGAACGTGGTGGAAAGAC
My 3rd question: are these two novel genes (i.e. genes with no annotation in my Cs_genes_v2.gff3)?
My last question: does StringTie assign a unique MSTRG ID to EVERY gene during the merge process? If so, how come I have transcripts without an MSTRG ID, e.g.
Csa37250s010.1 gene=Csa37250s010 ATGCTTAGGTTCAAAACTAATAAGCGAACGTCTACACCCTTTGGAATTGAAGCTGGTA
Csa37304s010.1 gene=Csa37304s010 CCGGCAGTGATAGGCGGTTGGAGAGGTGCGTATGTGGTGAACGTGGTGGTGGTCGT
Csa37329s010.1 gene=Csa37329s010 ATGAAAGGTAAAGGAGGACCTGAGAATCCTCACTGTAGTTTTAGAGGTGTTAGACAAA
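To make the three header patterns above concrete, here is a minimal Python sketch that tallies them from FASTA header lines. It assumes headers look like ">TRANSCRIPT_ID gene=GENE_ID[|GENE_ID]" as in my file; the function names and the category labels (mstrg_and_ref, mstrg_only, ref_only) are my own, not anything defined by StringTie or gffread. It only counts patterns and does not by itself show whether an MSTRG-only gene is really novel.

```python
import re
from collections import Counter

# Matches the gene= attribute in a header such as
# ">MSTRG.1.1 gene=MSTRG.1|Csa01g001000" (format assumed from my file).
GENE_ATTR = re.compile(r"\bgene=(\S+)")


def classify_header(header):
    """Return a hypothetical category label for one FASTA header line."""
    match = GENE_ATTR.search(header)
    if match is None:
        return "no_gene_attr"
    gene_ids = match.group(1).split("|")
    has_mstrg = any(g.startswith("MSTRG.") for g in gene_ids)
    has_ref = any(not g.startswith("MSTRG.") for g in gene_ids)
    if has_mstrg and has_ref:
        return "mstrg_and_ref"  # MSTRG ID paired with a reference gene ID
    if has_mstrg:
        return "mstrg_only"     # only a StringTie-assigned gene ID
    return "ref_only"           # only a reference gene ID


def tally_fasta_headers(lines):
    """Count header categories over an iterable of FASTA lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            counts[classify_header(line)] += 1
    return counts


if __name__ == "__main__":
    # Headers copied from the examples in this post; sequences omitted.
    example = [
        ">MSTRG.1.1 gene=MSTRG.1|Csa01g001000",
        ">Csa01g001000.1 gene=MSTRG.1|Csa01g001000",
        ">MSTRG.64921.1 gene=MSTRG.64921",
        ">Csa37250s010.1 gene=Csa37250s010",
    ]
    print(tally_fasta_headers(example))
```

To run it on the real file, the same function can be given an open file handle, e.g. tally_fasta_headers(open("transcripts.fa")).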
Sorry for such a long list of questions.
Thank you very much! Liyong