Error when trying to run HTSeq
1
0
Entering edit mode
4.1 years ago
oma219 ▴ 30

Hello,

I was running HTSeq count script and I keep getting an error saying "Error: Feature EDO49941-1 does not contain a 'gene_id' attribute"? I'm guessing its because I used a .gff3 instead of .gtf. So if I were to convert the file, should I use the method makeTxDdfromGFF() from the genomicFeatures package or is there a better way?

UPDATED QUESTION: I tried using the -i command line option but keep getting the error below, so I was hoping some one just help me figure out how to use the -i command line option so it works with my .gff3 file? Thanks.

Error: Feature EDO49941-1 does not contain a 'ID' attribute

Here is the command I used:

htseq-count -i ID  /home/ubuntu/data/rnaseq/nematostella/venus/alignment/1_STARAligned.out.sam /home/ubuntu/data/rnaseq/nematostella/venus/genome/Nematostella_vectensis.GCA_000209225.1.33.gff3


Here are a couple lines of my .gff3 file:

NEMVEscaffold_1 JGI gene    25970   37222   .   -   .   ID=gene:NEMVEDRAFT_v1g196074;biotype=protein_coding;description=Predicted protein  [Source:UniProtKB/TrEMBL%3BAcc:A7REN5];gene_id=NEMVEDRAFT_v1g196074;logic_name=jgi_nemve
NEMVEscaffold_1 JGI gene    40959   43943   .   -   .   ID=gene:NEMVEDRAFT_v1g196075;biotype=protein_coding;description=Predicted protein  [Source:UniProtKB/TrEMBL%3BAcc:A7REN6];gene_id=NEMVEDRAFT_v1g196075;logic_name=jgi_nemve
NEMVEscaffold_1 JGI pseudogene  46093   47031   .   +   .   ID=gene:NEMVEDRAFT_v1g5443;biotype=pseudogene;gene_id=NEMVEDRAFT_v1g5443;logic_name=jgi_nemve
NEMVEscaffold_1 JGI gene    47232   48589   .   +   .   ID=gene:NEMVEDRAFT_v1g237689;biotype=protein_coding;description=Predicted protein  [Source:UniProtKB/TrEMBL%3BAcc:A7REN7];gene_id=NEMVEDRAFT_v1g237689;logic_name=jgi_nemve


Thanks.

RNA-Seq Assembly • 1.8k views
0
Entering edit mode

Have you tried -i gene_id which is the default.

0
Entering edit mode

Yeah I started with that, -i gene_id. But it says "gene_id" is not found so I wasn't confused? I thought I may need to change it since it was .gff3 file.

0
Entering edit mode
4.1 years ago

See HTSeq-count options, you need to set the -i flag correctly.

0
Entering edit mode

Yeah I tried running it with -i ID because there is an ID attribute in the .gff3 file I think. But it wasn't working, it just says it cannot find ID either. So I wasn't sure what I should set the -i flag as so its compatible with the .gff3 file?

1
Entering edit mode

Let me quickly check your gff3 file to see which attributes are present, just a second.

Ah no, wait, I can't see it. Because it's on your computer, not mine. Inconvenient.

Damn.

You may have to add more information if you want this to work. We are working on a new feature to read the mind of those who post questions. But since that is not yet fully ready, please show a few lines of the gff file.

0
Entering edit mode

Sorry, I thought that the format for .gff3 files was consistent across all files. I'm still trying to learn and I'm obviously not at your level of understanding yet. So I would really appreciate it if you could just take it down a notch and just be a little more patient. I'm trying my best to learn how the pipeline works and its various components. I'll try to be more informative with questions in the future.

0
Entering edit mode

Then show us a few lines of your file.

0
Entering edit mode

I did, I posted it with the original question along with the command I was running.

0
Entering edit mode

Good, I hadn't seen you had updated the post. My apologies for my rude reaction, but we get so many posts with insufficient information and it's taking so much time to get what we need so we can help you. I constantly think "do I want to spend time trying to get the necessary information or will I just give up on this one." It's not just you, obviously.

Is there something odd with the line containing "EDO49941-1"? Have a look at it, e.g. using grep. The error seems to be specific for that feature, not for those before it...

0
Entering edit mode

I understand, I appreciate you taking the time to help. I searched the .gff3 and the error seems to occur when the ID goes from genes to transcripts in the lines below. If I want to study differential analysis, I'm guessing I should be counting the transcript reads but I'm not entirely sure how to edit this file so it would be compatible? Should I remove all the lines that don't say ID=transcript?

NEMVEscaffold_1 Ensembl_Metazoa pseudogenic_tRNA    3211073 3211144 .   -   .   ID=gene:EMNVEG00000012497;Name=tRNA-Pseudo;biotype=tRNA_pseudogene;description=tRNA-Pseudo for anticodon CAU [Source:TRNASCAN_SE%3BAcc:tRNA-Pseudo];gene_id=EMNVEG00000012497;logic_name=ncrna_eg
NEMVEscaffold_1 JGI gene    3213112 3216799 .   -   .   ID=gene:NEMVEDRAFT_v1g79543;biotype=protein_coding;description=Predicted protein  [Source:UniProtKB/TrEMBL%3BAcc:A7RFD9];gene_id=NEMVEDRAFT_v1g79543;logic_name=jgi_nemve
NEMVEscaffold_1 JGI gene    3224805 3225727 .   -   .   ID=gene:NEMVEDRAFT_v1g78975;biotype=protein_coding;description=Predicted protein  [Source:UniProtKB/TrEMBL%3BAcc:A7RFE0];gene_id=NEMVEDRAFT_v1g78975;logic_name=jgi_nemve
NEMVEscaffold_1 JGI mRNA    25970   37222   .   -   .   ID=transcript:EDO49941;Parent=gene:NEMVEDRAFT_v1g196074;biotype=protein_coding;transcript_id=EDO49941
NEMVEscaffold_1 JGI mRNA    40959   43943   .   -   .   ID=transcript:EDO49942;Parent=gene:NEMVEDRAFT_v1g196075;biotype=protein_coding;transcript_id=EDO49942
NEMVEscaffold_1 JGI pseudogene  46093   47031   .   +   .   ID=transcript:NEMVEDRAFT_v1g5443;Parent=gene:NEMVEDRAFT_v1g5443;biotype=pseudogene;transcript_id=NEMVEDRAFT_v1g5443
NEMVEscaffold_1 JGI mRNA    47232   48589   .   +   .   ID=transcript:EDO49826;Parent=gene:NEMVEDRAFT_v1g237689;biotype=protein_coding;transcript_id=EDO49826

0
Entering edit mode

You could use the following to keep only lines with 'ID=transcript:'

grep 'ID=transcript:' yourfile.gff > newfile.gff

0
Entering edit mode

Is that the typical protocol when you go to perform counts using htseq-count if you don't mind me asking? Do you usually have to extract only certain parts of your .gff3 files to use with the program? Thanks for the help!

0
Entering edit mode

I haven't worked with anything else than human annotation, and those files contain in the last column always the same tags. So no, I don't need to edit my file prior to running htseq-count.

0
Entering edit mode

Once you have got this working give featureCounts a try. It is straight forward to use and may be more tolerant to differences in GFF files like this.

0
Entering edit mode

So just to make sure I understand the purpose of the .gff3 file for htseq-count, the .gff3 provides htseq-count with the genome so it can compare? I'm not sure of the purpose of it because doesn't the .sam file already contain the reads mapped to the genome?

0
Entering edit mode

Annotation file provides the locations (co-ordinates) where the coding regions (genes) are present in the genome. So while the sam file contains the locations with reference to the genome (I assume you have aligned to the genome) they need to be correlated to the exon/gene boundary co-ordinates provided by GFF fie to get the read counts.

0
Entering edit mode

Oh okay, so if I want to modify my .gff3 so its compatible with htseq-count, do I need to just take all the lines that say gene or exon? Originally, I thought I wanted the transcripts but I realized that was contained in my .sam files. Sorry for all the questions, I'm just want to make sure i modify the .gff3 file correctly and don't lose a bunch of reads.