Question: Detecting isoforms in raw counts (or FPKM) after htseq-count
0
gravatar for Fill
2.4 years ago by
Fill70
Fill70 wrote:

Hi, Is it possible to detect isoforms for every exact gene in raw counts (or FPKM units) after htseq-count with flag -i gene_id?

Thanks in advance!

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Fill70

Every row of the file is an isoform. Just open it up in Notepad and take a look. Now you've detected them!

ADD REPLYlink written 2.4 years ago by karl.stamm3.5k

Thanks for replay. But how can I find any information about each isoform? There are many isoforms with adding ".some_number". E.g. "ENSG00000001167.13" So, I could find information on UniProt about "ENSG00000001167" (without .some_number) That's what I found:

sp|P23511|NFYA_HUMAN Nuclear transcription factor Y subunit alpha OS=Homo sapiens GN=NFYA PE=1 SV=2 MEQYTANSNSSTEQIVVQAGQIQQQQQGGVTAVQLQTEAQVASASGQQVQTLQVVQGQPL MVQVSGGQLITSTGQPIMVQAVPGGQGQTIMQVPVSGTQGLQQIQLVPPGQIQIQGGQAV QVQGQQGQTQQIIIQQPQTAVTAGQTQTQQQIAVQGQQVAQTAEGQTIVYQPVNADGTIL QQVTVPVSGMITIPAASLAGAQIVQTGANTNTTSSGQGTVTVTLPVAGNVVNSGGMVMMV PGAGSVPAIQRIPLPGAEMLEEEPLYVNAKQYHRILKRRQARAKLEAEGKIPKERRKYLH ESRHRHAMARKRGEGGRFFSPKEKDSPHMQDPNQADEEAMTQIIRVS

sp|P23511-2|NFYA_HUMAN Isoform Short of Nuclear transcription factor Y subunit alpha OS=Homo sapiens GN=NFYA MEQYTANSNSSTEQIVVQAGQIQQQVQGQPLMVQVSGGQLITSTGQPIMVQAVPGGQGQT IMQVPVSGTQGLQQIQLVPPGQIQIQGGQAVQVQGQQGQTQQIIIQQPQTAVTAGQTQTQ QQIAVQGQQVAQTAEGQTIVYQPVNADGTILQQVTVPVSGMITIPAASLAGAQIVQTGAN TNTTSSGQGTVTVTLPVAGNVVNSGGMVMMVPGAGSVPAIQRIPLPGAEMLEEEPLYVNA KQYHRILKRRQARAKLEAEGKIPKERRKYLHESRHRHAMARKRGEGGRFFSPKEKDSPHM QDPNQADEEAMTQIIRVS

And how I should identify the only .13 isoform? How htseq-count appropriate numbers (.some_number) for each isoform?

Thanks in advance!

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Fill70

You can get isoforms expression from Cufflinks in the isoforms.fpkm_tracking file.

ADD REPLYlink written 2.4 years ago by Ron970
4
gravatar for Santosh Anand
2.4 years ago by
Santosh Anand4.9k
Santosh Anand4.9k wrote:

Unless you are doing differential Expression analysis, you should not count the alternate isoform transcripts via htseq-count. The main reason is that htseq-count removes all the ambiguous reads from counting, and any read which are common to two or more transcript will be ambiguous in this case. See how htseq counts the reads in default (union) mode

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count

In your case, gene_A and gene_B become transcript_A and transcript_B of the same gene. That said, of course, you can do it and there is a small note on the same page regarding that:

Can I use htseq-count to count reads mapping to transcripts rather than genes?

In principle, you could instruct htseq-count to count for each of a gene’s transcript individually, by specifying --idattr transcript_id. However, all reads mapping to exons shared by several transcripts will then be considered ambiguous. (See second question.) Counting them for each transcript that contains the exons would be possible but makes little sense for typical use cases. (See first question.) If you want to perform differential expression analysis on the level of individual transcripts, maybe ahve a look at our paper on DEXSeq for a discussion on why we prefer performing such analyses on the level of exons instead.

If you are not doing differential Expression, then you may try featureCounts which is more flexible and extremely fast. http://bioinf.wehi.edu.au/featureCounts/

ADD COMMENTlink written 2.4 years ago by Santosh Anand4.9k
  1. When I use the flag -i gene_id counts are summarized for exact gene? Maybe I should change to -i transcript_id, and then gene_A after alterntaive splicng become transcript_a and transcript_b, and with the -i transcript_id each trascript would be exact feature for summarizing counts?
  2. What if I'd use the -intersection-nonempty mode?
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Fill70
1
  1. yes, -i transcript_id will do counting on transcript level. But it will grossly underestimate the counts as there will be many common exons among transcripts, and all the reads to common exons will be discarded. You have been warned!

  2. -intersection-nonempty mode will not help. The situation with multiple transcript of same gene is like shown in the last picture (the penultimate figure's situation is very less possible in case of transcripts). And once again, htseq developers warns against counting on transcript level. Your same situation is written in this post; and see the reply by Simon Anders (developer) http://seqanswers.com/forums/showthread.php?t=18068

Why don't you try some of the progs which are specialized to do that (Cufflinks, RSEM...?) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4712774/

ADD REPLYlink written 2.4 years ago by Santosh Anand4.9k

Yes, now I will try RSEM or MISO. Thanks!

ADD REPLYlink written 2.4 years ago by Fill70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1335 users visited in the last hour