How to modify a gff3 file for HTSeq?
6.2 years ago
Gary ▴ 480

I would like to use HTSeq (htseq-count) and edgeR to analyze our alligator RNA-Seq data. The alligator gff3 file I downloaded from GIGADB (http://gigadb.org/dataset/100126) was not accepted by htseq-count; an excerpt is shown below. What I need is a gene symbol in each exon row, e.g.

scaffold-729 AUGUSTUS exon 101305 101913 . - . ID=exon67799;Parent=rna5642;Name=WNT3A.

However, there is no gene symbol in the exon rows. The gene symbol I need appears only in the gene row. Could you teach me how to modify the gff3 file so that htseq-count can accept it? Many thanks.

Gary

scaffold-729    AUGUSTUS    gene    101305    186845    1    -    .    ID=gene3770;Name=WNT3A;gene=WNT3A;Dbxref=CrocBase:AMISG003770,GeneID:395396,PhylomeDB:Phy004KWLF_ALLMI;Note=WNT3A inferred by phylogenetic tree homology from Gallus gallus EntrezGene:395396 PhylomeDB:Phy004KWLF_ALLMI
scaffold-729    AUGUSTUS    mRNA    101305    186845    .    -    .    ID=rna5642;Name=AMIST005642;transcript_id=AMIST005642;gene=WNT3A;Dbxref=CrocBase:AMIST005642,GeneID:395396,PhylomeDB:Phy004KWLF_ALLMI;Parent=gene3770;Note=WNT3A inferred by phylogenetic tree homology from Gallus gallus EntrezGene:395396 PhylomeDB:Phy004KWLF_ALLMI
scaffold-729    AUGUSTUS    CDS    101434    101913    .    -    0    ID=cd59543;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    106298    106563    .    -    2    ID=cd59544;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    141700    141941    .    -    1    ID=cd59545;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    186490    186560    .    -    0    ID=cd59546;Parent=rna5642
scaffold-729    AUGUSTUS    exon    101305    101913    .    -    .    ID=exon67799;Parent=rna5642
scaffold-729    AUGUSTUS    exon    106298    106563    .    -    .    ID=exon67800;Parent=rna5642
scaffold-729    AUGUSTUS    exon    141700    141941    .    -    .    ID=exon67801;Parent=rna5642
scaffold-729    AUGUSTUS    exon    186490    186845    .    -    .    ID=exon67802;Parent=rna5642
scaffold-729    AUGUSTUS    intron    101914    106297    .    -    .    ID=intron53902;Parent=rna5642
scaffold-729    AUGUSTUS    intron    106564    141699    .    -    .    ID=intron53903;Parent=rna5642
scaffold-729    AUGUSTUS    intron    141942    186489    .    -    .    ID=intron53904;Parent=rna5642
RNA-Seq rna-seq next-gen HTSeq gff3 • 7.9k views

Maybe you can try -i="Name". See the documentation.


Many thanks. However, after trying -i=Name or -i='Name', htseq-count shows an error: Error occured when processing GFF file (line 6 of file amis_RNASeqSoftware_v1.2.gff3): Feature exon1 does not contain a Name attribute [Exception type: ValueError, raised in count.py:53]. I guess that htseq-count can only identify the Name attribute if it is in the same row as the exon feature.

Gary
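
The error above says that the exon rows themselves lack a Name attribute, so one possible fix is to follow the Parent links (exon to mRNA to gene) and copy the gene's Name onto each exon row. Below is a minimal Python sketch of that idea. It assumes a tab-separated GFF3 file whose rows carry ID and Parent attributes as in the excerpt in the question; the function names `parse_attributes` and `add_gene_name_to_exons` are made up for illustration and are not part of HTSeq.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def add_gene_name_to_exons(gff3_lines):
    """Return GFF3 rows with Name=<gene symbol> appended to each exon row."""
    rows = [line.rstrip("\n") for line in gff3_lines]
    gene_name = {}  # gene ID -> gene symbol (falls back to the gene ID)
    parent_of = {}  # child feature ID -> first Parent ID

    # First pass: remember gene symbols and every ID -> Parent link.
    for row in rows:
        cols = row.split("\t")
        if row.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" not in attrs:
            continue
        if cols[2] == "gene":
            gene_name[attrs["ID"]] = attrs.get("Name", attrs["ID"])
        elif "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"].split(",")[0]

    # Second pass: for each exon, walk up the Parent chain to its gene.
    out = []
    for row in rows:
        cols = row.split("\t")
        if not row.startswith("#") and len(cols) == 9 and cols[2] == "exon":
            attrs = parse_attributes(cols[8])
            node = attrs.get("Parent", "").split(",")[0]
            seen = set()  # guards against circular Parent links
            while node in parent_of and node not in seen:
                seen.add(node)
                node = parent_of[node]
            if node in gene_name and "Name" not in attrs:
                cols[8] = cols[8].rstrip(";") + ";Name=" + gene_name[node]
                row = "\t".join(cols)
        out.append(row)
    return out


if __name__ == "__main__":
    # Three rows shortened from the excerpt in the question.
    sample = [
        "scaffold-729\tAUGUSTUS\tgene\t101305\t186845\t1\t-\t.\tID=gene3770;Name=WNT3A;gene=WNT3A",
        "scaffold-729\tAUGUSTUS\tmRNA\t101305\t186845\t.\t-\t.\tID=rna5642;Name=AMIST005642;Parent=gene3770",
        "scaffold-729\tAUGUSTUS\texon\t101305\t101913\t.\t-\t.\tID=exon67799;Parent=rna5642",
    ]
    for line in add_gene_name_to_exons(sample):
        print(line)
```

To process a whole file, the lines could be read with `open(path)` and the result written back out one row per line. Exons with more than one Parent are assigned to the first one only, which may not be what you want for shared exons.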


As Geek_y implied, the defaults are appropriate for GTF files from Ensembl. They aren't always applicable to any random GFF file (that's part of the problem with GFF as a format). When something doesn't work, reading the documentation should be your first step.


Thanks. You are right. By default, htseq-count expects a GTF file. I can run htseq-count without problems on mouse and chicken RNA-Seq data, using RefSeq or Ensembl annotation files downloaded from iGenomes. I think my problem is that I don’t know how to modify an alligator GFF file to match the format that htseq-count needs, as described in its documentation.

Gary

6.2 years ago
michael.ante ★ 3.6k

You may have a look at Convertion Of Gff3 To Gtf.

I tried out the "gffread" approach as well as the "rtracklayer" approach. Both worked perfectly fine for me.
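
For reference, the gffread route is usually a single conversion command. The Python sketch below only assembles that command so it can be inspected before running it. The file names are placeholders and `gffread_command` is a hypothetical helper; `-T` asks gffread for GTF output and `-o` names the output file. It has not been tested against this alligator file.

```python
import shlex


def gffread_command(gff3_path, gtf_path):
    """Build the argument list for converting a GFF3 file to GTF with gffread."""
    # -T: write GTF instead of GFF3; -o: output file name.
    return ["gffread", gff3_path, "-T", "-o", gtf_path]


if __name__ == "__main__":
    # Placeholder file names; replace them with your own paths.
    cmd = gffread_command("annotation.gff3", "annotation.gtf")
    print(shlex.join(cmd))
    # To actually run it (requires gffread on your PATH):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

The resulting GTF should carry gene_id and transcript_id attributes on its exon rows, which is what htseq-count looks for by default, but it is worth checking a few lines of the output by eye.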


The problem here is that htseq-count looks for the gene_id attribute by default when counting. In this case you can just tell htseq-count to use Name instead of gene_id.
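
As a rough illustration of that suggestion, the Python sketch below assembles an htseq-count call that counts over exon rows and groups them by the Name attribute. The file names are placeholders and `htseq_count_command` is a hypothetical helper; `--type` and `--idattr` are the long forms of `-t` and `-i`. Note that this only works if every exon row in the annotation actually carries the chosen attribute.

```python
import shlex


def htseq_count_command(alignment_path, annotation_path,
                        feature_type="exon", id_attr="gene_id"):
    """Build the argument list for an htseq-count run."""
    return [
        "htseq-count",
        "--type=" + feature_type,  # feature type (column 3) to count over
        "--idattr=" + id_attr,     # attribute used to group features
        alignment_path,
        annotation_path,
    ]


if __name__ == "__main__":
    # Placeholder file names; replace them with your own paths.
    cmd = htseq_count_command("alignments.sam", "annotation.gff3",
                              id_attr="Name")
    print(shlex.join(cmd))
    # To actually run it (requires htseq-count on your PATH):
    # import subprocess
    # subprocess.run(cmd, check=True)
```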


Many thanks. However, after trying -i=Name, htseq-count shows an error: Error occured when processing GFF file (line 6 of file amis_RNASeqSoftware_v1.2.gff3): Feature exon1 does not contain a Name attribute [Exception type: ValueError, raised in count.py:53]. I guess that htseq-count can only identify the Name attribute if it is in the same row as the exon feature.


Thank you so much. I believe it could be just what I need. I will try to learn gffread, even though using Unix command lines is still not easy for me. Thanks again.


Did you solve the problem?