Question: Pipeline to get htseq read counts, ignoring alternative splicing
1
gravatar for Malia
4.8 years ago by
Malia10
University of Utah
Malia10 wrote:

I am analyzing 50 bp single-end stranded (Illumina TruSeq from HiSeq2000) reads from Arabidopsis, using the TAIR10 scaffold for alignment. This scaffold contains alternative splicing annotations (GENEA.1, GENEA.2), which I would like to *ignore* for the purposes of analyzing the data. 

When I Tophat to align to this scaffold with alternative splicing annotations, I get lots of read counts for say, Actin, which has two alternative splicing isoforms. 

However, when I count reads with htseq-count, I get zero reads for *both* alternative splicing isoforms of Actin.

Is there a way to simply get the number of reads per individual gene, regardless of whether it's a splice isoform?

My hypothesis is that I could change the original .gff from the original annotations

Chr1    TAIR10    gene    5928    8737    .    -    .    ID=AT1G01020;Note=protein_coding_gene;Name=AT1G01020
Chr1    TAIR10    mRNA    5928    8737    .    -    .    ID=AT1G01020.1;Parent=AT1G01020;Name=AT1G01020.1;Index=1
Chr1    TAIR10    protein    6915    8666    .    -    .    ID=AT1G01020.1-Protein;Name=AT1G01020.1;Derives_from=AT1G01020.1
Chr1    TAIR10    exon    8571    8737    .    -    .    Parent=AT1G01020.1
...............
Chr1    TAIR10    mRNA    6790    8737    .    -    .    ID=AT1G01020.2;Parent=AT1G01020;Name=AT1G01020.2;Index=1
Chr1    TAIR10    protein    7315    8666    .    -    .    ID=AT1G01020.2-Protein;Name=AT1G01020.2;Derives_from=AT1G01020.2
Chr1    TAIR10    exon    8571    8737    .    -    .    Parent=AT1G01020.2

 

and replace them with the genes as a whole, then run htseq-count:

Chr1    TAIR10    gene    5928    8737    .    -    .    ID=AT1G01020; Note=protein_coding_gene;Name=AT1G01020
Chr1    TAIR10    mRNA    5928    8737    .    -    .    ID=AT1G01020; Parent=AT1G01020;Name=AT1G01020;Index=1
Chr1    TAIR10    protein    6915    8666    .    -    .    ID=AT1G01020-Protein;Name=AT1G01020;Derives_from=AT1G01020
Chr1    TAIR10    exon    8571    8737    .    -    .    Parent=AT1G01020
.............
Chr1    TAIR10    mRNA    6790    8737    .    -    .    ID=AT1G01020;Parent=AT1G01020;Name=AT1G01020;Index=1
Chr1    TAIR10    protein    7315    8666    .    -    .    ID=AT1G01020-Protein;Name=AT1G01020;Derives_from=AT1G01020

Chr1    TAIR10    exon    8571    8737    .    -    .    Parent=AT1G01020

 

but I didn't know if this was Kosher practice.

Thanks for your help!

rna-seq alignment tophat htseq • 2.8k views
ADD COMMENTlink modified 4.8 years ago by RamRS23k • written 4.8 years ago by Malia10

usually htseq-count does count the reads for each whole gene, summing up all the counts for every splice variant. in order to do this, it uses by default the field "gene_id" present in standard ensembl gtf files. to use another field, you should set the -i flag. you may want to use "ID" for example. then the software will count all the reads that can be assigned to each gene, regardless of the isoform. see http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Martombo2.5k

Chr1    TAIR10    gene    3631    5899    .    +    .    ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010

Chr1    TAIR10    mRNA    3631    5899    .    +    .    ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1

Chr1    TAIR10    exon    3631    3913    .    +    .    Parent=AT1G01010.1

Both "Name" and "ID" are listed several times in the GFF, once for "gene" (where there is NO splice annotation) then for "mRNA" (where the annotation is given). When I run HTseq I was using "Parent" as geneID, which identifies each exon. Obviously "Parent" is NOT annotated in the first instance, but annotated at the exon level. But any other potential identifiers, ID, Parent, gene, name, all are annotated. So unless I can specifiy a specific instance of an identifier this is not possible.

 

 

 

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Malia10

I see... why don't you add a new field with the TAIR10 identifier (second field of your gtf), which is always the same in your example. with awk it would be something like:

awk '{print $0";new_id="$2}' your.gtf >new.gtf

then you can run:

htseq-count -i new_id ...

and it should summarize the counts for every gene

ADD REPLYlink written 4.8 years ago by Martombo2.5k

I see... why don't you add a new field with the TAIR10 identifier (second field of your gtf), which is always the same in your example. with awk it would be something like:

awk '{print $0";new_id="$2}' your.gtf >new.gtf

then you can run:

htseq-count -i new_id ...

and it should summarize the counts for every gene

ADD REPLYlink written 4.8 years ago by Martombo2.5k

I don't understand. By second "field" do you mean the second column? That would not work because the second column is all "TAIR10". However, the third column describes the feature (chromosome, gene, mRNA, protein, exon, five_prime_UTR, etc.).

So are you saying if I replace the "gene" feature with the geneID (without annotation), then regardless of what the features (mRNA, protein, Parent, etc.), HTseq will summarize the counts?

Like this:

Chr1    TAIR10    chromosome    1    30427671    .    .    .    ID=Chr1;Name=Chr1

Chr1    TAIR10    gene=AT1G01010    3631    5899    .    +    .    ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010

Chr1    TAIR10    mRNA    3631    5899    .    +    .    ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1

Chr1    TAIR10    protein    3760    5630    .    +    .    ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1

Chr1    TAIR10    exon    3631    3913    .    +    .    Parent=AT1G01010.1

That will be some fancy capture/replace code, because it will need to catch the first instance of the ID column following each occurrence of "gene" feature. 

ADD REPLYlink written 4.8 years ago by Malia10
1
gravatar for Martombo
4.8 years ago by
Martombo2.5k
Seville, ES
Martombo2.5k wrote:

sorry there has been some misunderstanding... I realized now that TAIR10 is your scaffold, so we don't need it for this purpose. by the way shouldn't it be the first column? anyway that's not important now...

I'll explain what I meant with my previous comments in more details, describing a bit better also how htseq-count works:

so, htseq-count is generally meant to provide counts for every gene, independently of the splice variants (though it can be used for such purpose as well). the definition of "gene" used by htseq is the union of all the exons of a gene. even the ones that would be mutually exclusive for example, so this is not a strict biological term. htseq counts all the reads that can be assigned to the exons of a gene (and not to the exons of any other gene) and outputs the final sum. in practice, it uses only the entries with "exon" in the third column (can be changed with the -t option) and sums them up for all the entries with the same "gene_id" identifier (can be changed with the -i option) in the last column. so you can forget about all the other CDS, mRNA, UTRs entries. the only thing you want to check, is that your exons have all a unique id for each gene.

if I understood correctly from you examples, you were using "Parent" as gene id and you changed its value in your gene, in order to have the same for all the splice variants, is that so? (it's a bit hard to read your example, since there is no line break) in principle, that should work, did you try it already? I would anyway suggest you to modify your whole file, so that you can get the proper results for a gene level count for all your genes.

ADD COMMENTlink written 4.8 years ago by Martombo2.5k

Ah...so if htseq assigns read counts based on exon features and not gene, then that may be my problem, because Parent has the splice variant annotation at the exon feature. I edited the original post to try to make things clearer.

Because there is only one gene feature (but multiple mRNA features for one gene), I could change the -i option to "ID" instead of "Parent", since that will be how htseq assigns reads? I will try this to see how it works.

 

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Malia10

Just an update, your suggestions (and the following code) solved my problem! The issue was two-sided, since tophat2 needed the fr-firststrand argument as specified for Illumina TruSeq stranded library preps. Then htseq needed to be run using --stranded-reverse because tophat2 reads all align to genome opposite to the true orientation.

tophat2 --b2-very-sensitive --no-novel-juncs --library-type fr-firststrand
htseq-count -t gene -i ID --stranded=reverse
ADD REPLYlink written 4.7 years ago by Malia10
glad I could help. yes, the reads orientation can often be confusing
ADD REPLYlink written 4.7 years ago by Martombo2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 539 users visited in the last hour