Pipeline to get htseq read counts, ignoring alternative splicing
1
1
Entering edit mode
8.3 years ago
Malia ▴ 10

I am analyzing 50 bp single-end stranded (Illumina TruSeq from HiSeq2000) reads from Arabidopsis, using the TAIR10 scaffold for alignment. This scaffold contains alternative splicing annotations (GENEA.1, GENEA.2), which I would like to ignore for the purposes of analyzing the data.

When I Tophat to align to this scaffold with alternative splicing annotations, I get lots of read counts for say, Actin, which has two alternative splicing isoforms.

However, when I count reads with htseq-count, I get zero reads for both alternative splicing isoforms of Actin.

Is there a way to simply get the number of reads per individual gene, regardless of whether it's a splice isoform?

My hypothesis is that I could change the original .gff from the original annotations

Chr1    TAIR10    gene    5928    8737    .    -    .    ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1    TAIR10    mRNA    5928    8737    .    -    .    ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1    TAIR10    protein    6915    8666    .    -    .    ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1    TAIR10    exon    8571    8737    .    -    .    Parent=AT1G01020.1
...............
Chr1    TAIR10    mRNA    6790    8737    .    -    .    ID=AT1G01020.2;Parent=AT1G01020;Name=AT1G01020.2;Index=1
Chr1    TAIR10    protein    7315    8666    .    -    .    ID=AT1G01020.2-Protein;Name=AT1G01020.2;Derives_from=AT1G01020.2
Chr1    TAIR10    exon    8571    8737    .    -    .    Parent=AT1G01020.2


and replace them with the genes as a whole, then run htseq-count:

Chr1    TAIR10    gene    5928    8737    .    -    .    ID=AT1G01020; Note=protein_coding_gene;Name=AT1G01020
Chr1    TAIR10    mRNA    5928    8737    .    -    .    ID=AT1G01020; Parent=AT1G01020;Name=AT1G01020;Index=1
Chr1    TAIR10    protein    6915    8666    .    -    .    ID=AT1G01020-Protein;Name=AT1G01020;Derives_from=AT1G01020
Chr1    TAIR10    exon    8571    8737    .    -    .    Parent=AT1G01020
.............
Chr1    TAIR10    mRNA    6790    8737    .    -    .    ID=AT1G01020;Parent=AT1G01020;Name=AT1G01020;Index=1
Chr1    TAIR10    protein    7315    8666    .    -    .    ID=AT1G01020-Protein;Name=AT1G01020;Derives_from=AT1G01020

Chr1    TAIR10    exon    8571    8737    .    -    .    Parent=AT1G01020


but I didn't know if this was Kosher practice.

RNA-Seq htseq tophat alignment • 4.1k views
0
Entering edit mode

Usually htseq-count does count the reads for each whole gene, summing up all the counts for every splice variant. in order to do this, it uses by default the field gene_id present in standard ensembl gtf files. to use another field, you should set the -i flag. you may want to use "ID" for example. then the software will count all the reads that can be assigned to each gene, regardless of the isoform. see http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

0
Entering edit mode
Chr1    TAIR10    gene    3631    5899    .    +    .    ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10    mRNA    3631    5899    .    +    .    ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10    exon    3631    3913    .    +    .    Parent=AT1G01010.1


Both "Name" and "ID" are listed several times in the GFF, once for "gene" (where there is NO splice annotation) then for "mRNA" (where the annotation is given). When I run HTseq I was using "Parent" as geneID, which identifies each exon. Obviously "Parent" is NOT annotated in the first instance, but annotated at the exon level. But any other potential identifiers, ID, Parent, gene, name, all are annotated. So unless I can specifiy a specific instance of an identifier this is not possible.

0
Entering edit mode

I see... why don't you add a new field with the TAIR10 identifier (second field of your gtf), which is always the same in your example. with awk it would be something like:

awk '{print $0";new_id="$2}' your.gtf >new.gtf


then you can run:

htseq-count -i new_id ...

and it should summarize the counts for every gene

0
Entering edit mode

I don't understand. By second "field" do you mean the second column? That would not work because the second column is all "TAIR10". However, the third column describes the feature (chromosome, gene, mRNA, protein, exon, five_prime_UTR, etc.).

So are you saying if I replace the "gene" feature with the geneID (without annotation), then regardless of what the features (mRNA, protein, Parent, etc.), HTseq will summarize the counts?

Like this:

Chr1    TAIR10    chromosome    1    30427671    .    .    .    ID=Chr1;Name=Chr1
Chr1    TAIR10    **gene=AT1G01010**    3631    5899    .    +    .    ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10    mRNA    3631    5899    .    +    .    ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10    protein    3760    5630    .    +    .    ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10    exon    3631    3913    .    +    .    Parent=AT1G01010.1


That will be some fancy capture/replace code, because it will need to catch the first instance of the ID column following each occurrence of "gene" feature.

1
Entering edit mode
8.3 years ago
Martombo ★ 3.0k

Sorry there has been some misunderstanding... I realized now that TAIR10 is your scaffold, so we don't need it for this purpose. by the way shouldn't it be the first column? anyway that's not important now...

I'll explain what I meant with my previous comments in more details, describing a bit better also how htseq-count works:

So, htseq-count is generally meant to provide counts for every gene, independently of the splice variants (though it can be used for such purpose as well). the definition of "gene" used by htseq is the union of all the exons of a gene. even the ones that would be mutually exclusive for example, so this is not a strict biological term. htseq counts all the reads that can be assigned to the exons of a gene (and not to the exons of any other gene) and outputs the final sum. in practice, it uses only the entries with "exon" in the third column (can be changed with the -t option) and sums them up for all the entries with the same "gene_id" identifier (can be changed with the -i option) in the last column. so you can forget about all the other CDS, mRNA, UTRs entries. the only thing you want to check, is that your exons have all a unique id for each gene.

if I understood correctly from you examples, you were using "Parent" as gene id and you changed its value in your gene, in order to have the same for all the splice variants, is that so? (it's a bit hard to read your example, since there is no line break) in principle, that should work, did you try it already? I would anyway suggest you to modify your whole file, so that you can get the proper results for a gene level count for all your genes.

0
Entering edit mode

Ah...so if htseq assigns read counts based on exon features and not gene, then that may be my problem, because Parent has the splice variant annotation at the exon feature. I edited the original post to try to make things clearer.

Because there is only one gene feature (but multiple mRNA features for one gene), I could change the -i option to "ID" instead of "Parent", since that will be how htseq assigns reads? I will try this to see how it works.

0
Entering edit mode

Just an update, your suggestions (and the following code) solved my problem! The issue was two-sided, since tophat2 needed the fr-firststrand argument as specified for Illumina TruSeq stranded library preps. Then htseq needed to be run using --stranded-reverse because tophat2 reads all align to genome opposite to the true orientation.

tophat2 --b2-very-sensitive --no-novel-juncs --library-type fr-firststrand
htseq-count -t gene -i ID --stranded=reverse

0
Entering edit mode

Yes, the reads orientation can often be confusing