Question

Detecting isoforms in raw counts (or FPKM) after htseq-count

0

Entering edit mode

7.1 years ago

Fill ▴ 70

Hi, Is it possible to detect isoforms for every exact gene in raw counts (or FPKM units) after htseq-count with flag -i gene_id?

Thanks in advance!

RNA-Seq htseq-count fpkm alternative splicing • 4.7k views

ADD COMMENT • link 7.1 years ago by Fill ▴ 70

0

Entering edit mode

Every row of the file is an isoform. Just open it up in Notepad and take a look. Now you've detected them!

ADD REPLY • link 7.1 years ago by karl.stamm 4.1k

0

Entering edit mode

Thanks for replay. But how can I find any information about each isoform? There are many isoforms with adding ".some_number". E.g. "ENSG00000001167.13" So, I could find information on UniProt about "ENSG00000001167" (without .some_number) That's what I found:

sp|P23511|NFYA_HUMAN Nuclear transcription factor Y subunit alpha OS=Homo sapiens GN=NFYA PE=1 SV=2 MEQYTANSNSSTEQIVVQAGQIQQQQQGGVTAVQLQTEAQVASASGQQVQTLQVVQGQPL MVQVSGGQLITSTGQPIMVQAVPGGQGQTIMQVPVSGTQGLQQIQLVPPGQIQIQGGQAV QVQGQQGQTQQIIIQQPQTAVTAGQTQTQQQIAVQGQQVAQTAEGQTIVYQPVNADGTIL QQVTVPVSGMITIPAASLAGAQIVQTGANTNTTSSGQGTVTVTLPVAGNVVNSGGMVMMV PGAGSVPAIQRIPLPGAEMLEEEPLYVNAKQYHRILKRRQARAKLEAEGKIPKERRKYLH ESRHRHAMARKRGEGGRFFSPKEKDSPHMQDPNQADEEAMTQIIRVS

sp|P23511-2|NFYA_HUMAN Isoform Short of Nuclear transcription factor Y subunit alpha OS=Homo sapiens GN=NFYA MEQYTANSNSSTEQIVVQAGQIQQQVQGQPLMVQVSGGQLITSTGQPIMVQAVPGGQGQT IMQVPVSGTQGLQQIQLVPPGQIQIQGGQAVQVQGQQGQTQQIIIQQPQTAVTAGQTQTQ QQIAVQGQQVAQTAEGQTIVYQPVNADGTILQQVTVPVSGMITIPAASLAGAQIVQTGAN TNTTSSGQGTVTVTLPVAGNVVNSGGMVMMVPGAGSVPAIQRIPLPGAEMLEEEPLYVNA KQYHRILKRRQARAKLEAEGKIPKERRKYLHESRHRHAMARKRGEGGRFFSPKEKDSPHM QDPNQADEEAMTQIIRVS

And how I should identify the only .13 isoform? How htseq-count appropriate numbers (.some_number) for each isoform?

Thanks in advance!

ADD REPLY • link 7.1 years ago by Fill ▴ 70

0

Entering edit mode

You can get isoforms expression from Cufflinks in the isoforms.fpkm_tracking file.

ADD REPLY • link 7.1 years ago by Ron ★ 1.2k

score 4 · Accepted Answer · 2017-03-23

4

Entering edit mode

7.1 years ago

Santosh Anand 5.7k

Unless you are doing differential Expression analysis, you should not count the alternate isoform transcripts via htseq-count. The main reason is that htseq-count removes all the ambiguous reads from counting, and any read which are common to two or more transcript will be ambiguous in this case. See how htseq counts the reads in default (union) mode

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count

In your case, gene_A and gene_B become transcript_A and transcript_B of the same gene. That said, of course, you can do it and there is a small note on the same page regarding that:

Can I use htseq-count to count reads mapping to transcripts rather than genes?

In principle, you could instruct htseq-count to count for each of a gene’s transcript individually, by specifying --idattr transcript_id. However, all reads mapping to exons shared by several transcripts will then be considered ambiguous. (See second question.) Counting them for each transcript that contains the exons would be possible but makes little sense for typical use cases. (See first question.) If you want to perform differential expression analysis on the level of individual transcripts, maybe ahve a look at our paper on DEXSeq for a discussion on why we prefer performing such analyses on the level of exons instead.

If you are not doing differential Expression, then you may try featureCounts which is more flexible and extremely fast. http://bioinf.wehi.edu.au/featureCounts/

ADD COMMENT • link 7.1 years ago by Santosh Anand 5.7k

0

Entering edit mode

When I use the flag -i gene_id counts are summarized for exact gene? Maybe I should change to -i transcript_id, and then gene_A after alterntaive splicng become transcript_a and transcript_b, and with the -i transcript_id each trascript would be exact feature for summarizing counts?
What if I'd use the -intersection-nonempty mode?

ADD REPLY • link 7.1 years ago by Fill ▴ 70

1

Entering edit mode

yes, -i transcript_id will do counting on transcript level. But it will grossly underestimate the counts as there will be many common exons among transcripts, and all the reads to common exons will be discarded. You have been warned!
-intersection-nonempty mode will not help. The situation with multiple transcript of same gene is like shown in the last picture (the penultimate figure's situation is very less possible in case of transcripts). And once again, htseq developers warns against counting on transcript level. Your same situation is written in this post; and see the reply by Simon Anders (developer) http://seqanswers.com/forums/showthread.php?t=18068

Why don't you try some of the progs which are specialized to do that (Cufflinks, RSEM...?) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4712774/