I am analyzing 50 bp single-end stranded (Illumina TruSeq from HiSeq2000) reads from Arabidopsis, using the TAIR10 scaffold for alignment. This scaffold contains alternative splicing annotations (GENEA.1, GENEA.2), which I would like to ignore for the purposes of analyzing the data.
When I Tophat to align to this scaffold with alternative splicing annotations, I get lots of read counts for say, Actin, which has two alternative splicing isoforms.
However, when I count reads with htseq-count, I get zero reads for both alternative splicing isoforms of Actin.
Is there a way to simply get the number of reads per individual gene, regardless of whether it's a splice isoform?
My hypothesis is that I could change the original .gff from the original annotations
Chr1 TAIR10 gene 5928 8737 . - . ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1 TAIR10 mRNA 5928 8737 . - . ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1 TAIR10 protein 6915 8666 . - . ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1 TAIR10 exon 8571 8737 . - . Parent=AT1G01020.1
...............
Chr1 TAIR10 mRNA 6790 8737 . - . ID=AT1G01020.2;Parent=AT1G01020;Name=AT1G01020.2;Index=1
Chr1 TAIR10 protein 7315 8666 . - . ID=AT1G01020.2-Protein;Name=AT1G01020.2;Derives_from=AT1G01020.2
Chr1 TAIR10 exon 8571 8737 . - . Parent=AT1G01020.2
and replace them with the genes as a whole, then run htseq-count
:
Chr1 TAIR10 gene 5928 8737 . - . ID=AT1G01020; Note=protein_coding_gene;Name=AT1G01020
Chr1 TAIR10 mRNA 5928 8737 . - . ID=AT1G01020; Parent=AT1G01020;Name=AT1G01020;Index=1
Chr1 TAIR10 protein 6915 8666 . - . ID=AT1G01020-Protein;Name=AT1G01020;Derives_from=AT1G01020
Chr1 TAIR10 exon 8571 8737 . - . Parent=AT1G01020
.............
Chr1 TAIR10 mRNA 6790 8737 . - . ID=AT1G01020;Parent=AT1G01020;Name=AT1G01020;Index=1
Chr1 TAIR10 protein 7315 8666 . - . ID=AT1G01020-Protein;Name=AT1G01020;Derives_from=AT1G01020
Chr1 TAIR10 exon 8571 8737 . - . Parent=AT1G01020
but I didn't know if this was Kosher practice.
Thanks for your help!
Usually htseq-count does count the reads for each whole gene, summing up all the counts for every splice variant. in order to do this, it uses by default the field
gene_id
present in standard ensembl gtf files. to use another field, you should set the-i
flag. you may want to use "ID" for example. then the software will count all the reads that can be assigned to each gene, regardless of the isoform. see http://www-huber.embl.de/users/anders/HTSeq/doc/count.htmlBoth "Name" and "ID" are listed several times in the GFF, once for "gene" (where there is NO splice annotation) then for "mRNA" (where the annotation is given). When I run HTseq I was using "Parent" as geneID, which identifies each exon. Obviously "Parent" is NOT annotated in the first instance, but annotated at the exon level. But any other potential identifiers, ID, Parent, gene, name, all are annotated. So unless I can specifiy a specific instance of an identifier this is not possible.
I see... why don't you add a new field with the TAIR10 identifier (second field of your gtf), which is always the same in your example. with awk it would be something like:
then you can run:
htseq-count -i new_id ...
and it should summarize the counts for every gene
I don't understand. By second "field" do you mean the second column? That would not work because the second column is all "TAIR10". However, the third column describes the feature (chromosome, gene, mRNA, protein, exon, five_prime_UTR, etc.).
So are you saying if I replace the "gene" feature with the geneID (without annotation), then regardless of what the features (mRNA, protein, Parent, etc.), HTseq will summarize the counts?
Like this:
That will be some fancy capture/replace code, because it will need to catch the first instance of the ID column following each occurrence of "gene" feature.