Question

htseq-count settings for differential expression

1

Entering edit mode

3.5 years ago

candron ▴ 10

Hi All,

I am doing a differential gene expression analysis with samples form a non-model plant species, Nicotiana benthamiana.

I am now at the step of counting, trying out htseq-count on Hisat2-generated bam files. Recommended options are --type=exon and --idattr=gene_id, i.e. count exons and summarize on genes.

However, my GFF file does not contain gene_id as an attribute.

My question: Is it OK to use Parent as --idattr in place of the (non-existent) gene_id as follows?

htseq-count -f bam -s yes -t exon -i Parent example.bam Niben101_annotation.allfeatures.gff > example_htseq_counts_file.txt

Few lines from the GFF file:

Niben101Ctg00116    maker   gene    2   501 .   -   .   ID=Niben101Ctg00116g00002;Name=Niben101Ctg00116g00002;PredictionNote=maker-snap;AltID=maker-Niben101Ctg00116-snap-gene-0.1
Niben101Ctg00116    maker   mRNA    2   501 .   -   .   ID=Niben101Ctg00116g00002.1;Parent=Niben101Ctg00116g00002;Name=Niben101Ctg00116g00002.1;_AED=0.00;_eAED=0.00;_QI=0|1|0.33|1|0|0|3|0|110;AltID=maker-Niben101Ctg00116-snap-gene-0.1-mRNA-1
Niben101Ctg00116    maker   exon    2   25  .   -   .   ID=Niben101Ctg00116g00002.1:exon:003;Parent=Niben101Ctg00116g00002.1;AltID=maker-Niben101Ctg00116-snap-gene-0.1-mRNA-1:exon:4517
Niben101Ctg00116    maker   exon    120 314 .   -   .   ID=Niben101Ctg00116g00002.1:exon:002;Parent=Niben101Ctg00116g00002.1;AltID=maker-Niben101Ctg00116-snap-gene-0.1-mRNA-1:exon:4516
Niben101Ctg00116    maker   exon    391 501 .   -   .   ID=Niben101Ctg00116g00002.1:exon:001;Parent=Niben101Ctg00116g00002.1;AltID=maker-Niben101Ctg00116-snap-gene-0.1-mRNA-1:exon:4515

Thank you all in advance,

RNA-Seq htseq-count • 1.0k views

ADD COMMENT • link 3.4 years ago by candron ▴ 10

score 0 · Answer 1 · 2020-11-30

yoooooooo,

I just spent a few weeks on this very issue.

From what I can see, it looks like you wouldn't want to use "Parent" as your gene_ID because there are multiple features with the same value for the "Parent" attribute. If you counted by this attribute, it would result in the 3rd, 4th, and 5th line above being counted as one feature, so rather than seeing the counts for exon:001, exon:002, and exon:003 separately, you would see the sum of the counts of those three exons for the parent Niben101Ctg00116g00002.1

It would probably be better to use "AltID" in lieu of gene_ID as each exon seems to have its own unique AltID.

But before you do any of that, you'll need to convert your GFF into the GFF style that htseq can recognize. Apparently, GFF is a bullshit format with not a lot of consistency in the conventions. Take a look at the yeast genome GFF they provide for their walkthrough:

2-micron    protein_coding  exon    252 1523    .   +   .    gene_id "R0010W"; transcript_id "R0010W"; exon_number "1"; gene_name "FLP1"; transcript_name "FLP1";
2-micron    protein_coding  CDS 252 1520    .   +   0    gene_id "R0010W"; transcript_id "R0010W"; exon_number "1"; gene_name "FLP1"; transcript_name "FLP1"; protein_id "R0010W";
2-micron    protein_coding  start_codon 252 254 .   +   0    gene_id "R0010W"; transcript_id "R0010W"; exon_number "1"; gene_name "FLP1"; transcript_name "FLP1";
2-micron    protein_coding  stop_codon  1521    1523    .   +   0    gene_id "R0010W"; transcript_id "R0010W"; exon_number "1"; gene_name "FLP1"; transcript_name "FLP1";
2-micron    protein_coding  exon    1887    3008    .   -   .    gene_id "R0020C"; transcript_id "R0020C"; exon_number "1"; gene_name "REP1"; transcript_name "REP1";

As you can see, the 9th column of this file is very different from yours (instead of "=" there is simply a space " " between the attribute and its value, and all of the attribute values are enclosed in quotation marks. Also, there is a space after each semicolon, as well as a semicolon at the end of the attribute list.) I had no luck in finding a way to do these conversions automatically so I more or less did find and replace until it looked just like the example, and even then it took a few rounds of error from htseq to get everything perfect. Moral of the story, if you're not working with a model organism that has an ideally formatted gff, they don't give a damn about you bruh.