Question: How to modify a gff3 file for HTSeq?
5.8 years ago by
Gary480
Taiwan/Taichung/China Medical University Hospital
Gary480 wrote:

I would like to use HTSeq (htseq-count) and edgeR to analyze our alligator RNA-Seq data. htseq-count does not accept the alligator gff3 file I downloaded from GIGADB (http://gigadb.org/dataset/100126); an excerpt is shown below. What I need is a gene symbol in each exon row, e.g.

scaffold-729 AUGUSTUS exon 101305 101913 . - . ID=exon67799;Parent=rna5642;Name=WNT3A.

However, the exon rows contain no gene symbol; the symbol I need appears only in the gene row. Could you show me how to modify the gff3 file so that htseq-count accepts it? Many thanks.

Gary

 

scaffold-729    AUGUSTUS    gene    101305    186845    1    -    .    ID=gene3770;Name=WNT3A;gene=WNT3A;Dbxref=CrocBase:AMISG003770,GeneID:395396,PhylomeDB:Phy004KWLF_ALLMI;Note=WNT3A inferred by phylogenetic tree homology from Gallus gallus EntrezGene:395396 PhylomeDB:Phy004KWLF_ALLMI
scaffold-729    AUGUSTUS    mRNA    101305    186845    .    -    .    ID=rna5642;Name=AMIST005642;transcript_id=AMIST005642;gene=WNT3A;Dbxref=CrocBase:AMIST005642,GeneID:395396,PhylomeDB:Phy004KWLF_ALLMI;Parent=gene3770;Note=WNT3A inferred by phylogenetic tree homology from Gallus gallus EntrezGene:395396 PhylomeDB:Phy004KWLF_ALLMI
scaffold-729    AUGUSTUS    CDS    101434    101913    .    -    0    ID=cd59543;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    106298    106563    .    -    2    ID=cd59544;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    141700    141941    .    -    1    ID=cd59545;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    186490    186560    .    -    0    ID=cd59546;Parent=rna5642
scaffold-729    AUGUSTUS    exon    101305    101913    .    -    .    ID=exon67799;Parent=rna5642
scaffold-729    AUGUSTUS    exon    106298    106563    .    -    .    ID=exon67800;Parent=rna5642
scaffold-729    AUGUSTUS    exon    141700    141941    .    -    .    ID=exon67801;Parent=rna5642
scaffold-729    AUGUSTUS    exon    186490    186845    .    -    .    ID=exon67802;Parent=rna5642
scaffold-729    AUGUSTUS    intron    101914    106297    .    -    .    ID=intron53902;Parent=rna5642
scaffold-729    AUGUSTUS    intron    106564    141699    .    -    .    ID=intron53903;Parent=rna5642
scaffold-729    AUGUSTUS    intron    141942    186489    .    -    .    ID=intron53904;Parent=rna5642

Maybe you can try -i Name (the --idattr option of htseq-count). See the documentation.

modified 5.8 years ago • written 5.8 years ago by geek_y11k

Many thanks. However, after trying -i=Name or -i='Name', htseq-count shows an error: Error occured when processing GFF file (line 6 of file amis_RNASeqSoftware_v1.2.gff3): Feature exon1 does not contain a Name attribute [Exception type: ValueError, raised in count.py:53]. I guess that htseq-count can only find the Name attribute if it is on the same row as the exon feature.

Gary

written 5.8 years ago by Gary480

As Geek_y implied, the defaults are appropriate for GTF files from Ensembl. They aren't always applicable to any random GFF file (that's part of the problem with GFF as a format). When something doesn't work, reading the documentation should be your first step.

written 5.8 years ago by Devon Ryan97k

Thanks. You are right. By default, htseq-count expects a GTF file. It runs fine on our mouse and chicken RNA-Seq data with RefSeq or Ensembl annotation files downloaded from iGenomes. I think my problem is that I don't know how to modify the alligator GFF file to match the format htseq-count needs, as described in its documentation.

Gary

written 5.8 years ago by Gary480

5.8 years ago by
michael.ante3.6k
Austria/Vienna
michael.ante3.6k wrote:

You may have a look at Convertion Of Gff3 To Gtf.

I tried out both the "gffread" and the "rtracklayer" approaches. Both worked perfectly fine for me.
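
If installing gffread or R is a hurdle, the same idea can be sketched in plain Python. This is not gffread itself, only a minimal illustration under some assumptions: the file is tab-delimited GFF3, each exon has a single Parent that is an mRNA (or transcript) row, and that row's Parent is a gene row. The function names are made up for this example. The sketch writes the gene's Name (falling back to its ID) as gene_id on every exon row, which is the attribute htseq-count looks for by default.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff3_exons_to_gtf(lines):
    """Return GTF-style exon lines carrying gene_id and transcript_id.

    Assumes tab-delimited GFF3 with a gene -> mRNA -> exon Parent chain.
    """
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            rows.append(cols)

    gene_label = {}       # gene ID -> gene symbol (Name), or the ID itself
    transcript_gene = {}  # transcript ID -> parent gene ID
    for cols in rows:
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID")
        if feature_id is None:
            continue
        if cols[2] == "gene":
            gene_label[feature_id] = attrs.get("Name", feature_id)
        elif cols[2] in ("mRNA", "transcript"):
            transcript_gene[feature_id] = attrs.get("Parent")

    gtf_lines = []
    for cols in rows:
        if cols[2] != "exon":
            continue
        transcript = parse_attributes(cols[8]).get("Parent")
        gene = transcript_gene.get(transcript)
        if gene is None:
            continue  # exon whose parent chain cannot be resolved
        label = gene_label.get(gene, gene)
        attributes = f'gene_id "{label}"; transcript_id "{transcript}";'
        gtf_lines.append("\t".join(cols[:8] + [attributes]))
    return gtf_lines
```

Writing the returned lines to a file, one per line, should give something htseq-count can read with its default settings. Note that if two genes share the same Name their counts would be merged, so using the gene ID as gene_id may be the safer choice.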

modified 5.8 years ago • written 5.8 years ago by michael.ante3.6k

The problem here is that htseq-count looks for the gene_id attribute by default when counting. In this case you can just tell htseq-count to use Name instead of gene_id.
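
Whether Name can stand in for gene_id depends on which rows actually carry it, since htseq-count reads the id attribute (-i) from the rows of the counted feature type (-t, exon by default). A quick way to check a file before running it is to list the attribute keys that every row of a given type has. The sketch below assumes a tab-delimited GFF3 file; the function name shared_attribute_keys is hypothetical.

```python
def shared_attribute_keys(lines, feature_type="exon"):
    """Return the attribute keys present on every row of feature_type."""
    shared = None
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        keys = {
            field.split("=", 1)[0].strip()
            for field in cols[8].split(";")
            if "=" in field
        }
        # Keep only keys seen on all rows of this type so far.
        shared = keys if shared is None else shared & keys
    return shared if shared is not None else set()
```

Any key missing from the result for "exon" cannot be used with -i while counting exon features. For rows like the ones in the question, the exon rows only share ID and Parent, so Name would not be found there.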

modified 5.8 years ago • written 5.8 years ago by geek_y11k

Many thanks. However, after trying -i=Name, htseq-count shows an error: Error occured when processing GFF file (line 6 of file amis_RNASeqSoftware_v1.2.gff3): Feature exon1 does not contain a Name attribute [Exception type: ValueError, raised in count.py:53]. I guess that htseq-count can only find the Name attribute if it is on the same row as the exon feature.

written 5.8 years ago by Gary480

Thank you so much. I believe this could be just what I need. I will try to learn gffread, even though using Unix command lines is still not easy for me. Thanks again.

written 5.8 years ago by Gary480

Did you solve the problem?

written 4.8 years ago by kanika.15180