Question: Ambiguous reads in htseq-count
gravatar for drkousi31
13 days ago by
drkousi310 wrote:

Hi all, I have a dount in htseq-count output.

I used samtools to extract uniquely mapped properly paired reads and sorted by name using the command

samtools view -bS -h -q 40 -f 2 -F 512 ${i} | samtools sort -n -o ${i}_sorted.bam -@ 20

I used this output for htseq-count using the coomand

htseq-count -f bam -r name -s no -t exon -i ID ${i} GCF_003121395.1_UOA_WB_1_genomic.gff >${i}_htseqcnt.

eg. output from two files

file 1

__no_feature 1537976

__ambiguous 5157805

__too_low_aQual 0

__not_aligned 0

__alignment_not_unique 0


__no_feature 3546477

__ambiguous 11880801

__too_low_aQual 0

__not_aligned 0

__alignment_not_unique 0

Mygff file is like this

gene 3503760 3523716 . - . ID=gene7;

mRNA 3503760 3523716 . - . ID=rna12;Parent=gene7;

exon 3515344 3515454 . - . ID=id60;Parent=rna12;

CDS ID=cds9;Parent=rna12

CDS ID=cds9;Parent=rna12

CDS ID=cds9;Parent=rna12

mRNA 3503760 3523659 . - . ID=rna13;Parent=gene7;

exon 3523589 3523659 . - . ID=id63;Parent=rna13;

My questions: why are there ambiguous reads when my input is properly paired reads (output from samtools)? My gff file doesn't have a gene_id option. can anyone please advice, where am I wrong?

rna-seq htseq-count • 128 views
ADD COMMENTlink modified 23 hours ago • written 13 days ago by drkousi310
gravatar for lieven.sterck
13 days ago by
VIB, Ghent, Belgium
lieven.sterck4.5k wrote:

ambiguous points here to reads that can uniquely be assigned to a single genic feature, not that the mapping itself was/is ambiguous. This can of instance happen when genes overlap in their UTR region and a read aligned in that region will (depending on the counting mode) often be assigned as ambiguous.

see here for the explanation (and image explaining it)

If your gff file does not have the (default) gene_id you can tell to htseq-count to use a different tag (see in doc above to know how to)

now, looking at your output, it rather is an issue of indeed not providig the correct tag to use for the gene-assignment (try to use for instance 'ID' in your case)

ADD COMMENTlink modified 13 days ago • written 13 days ago by lieven.sterck4.5k

Thank you for pointing out the mistake in assigning the gene tag in "-i". The output I got is by using "-i ID", if you see my command above.

As per the manual: If the name of the attribute containing the gene ID for exon lines is not gene_id, use the --idattr. Often, its is, for example, Parent, GeneID or ID. Make sure it is the gene ID and not the exon ID.

Sorry I didn't paste my full gff file in the question

NC_037545.1 Gnomon gene 3503760 3523716 . - . ID=gene-NKIRAS1;Dbxref=GeneID:102394338;Name=NKIRAS1;gbkey=Gene;gene=NKIRAS1;gene_biotype=protein_coding

NC_037545.1 Gnomon mRNA 3503760 3523716 . - . ID=rna-XM_006074115.2;Parent=gene-NKIRAS1;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;Name=XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;model_evidence=Supporting evidence includes similarity to: 5 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 152 samples with support for all annotated introns;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2

NC_037545.1 Gnomon exon 3523593 3523716 . - . ID=exon-XM_006074115.2-1;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2

NC_037545.1 Gnomon exon 3515344 3515454 . - . ID=exon-XM_006074115.2-2;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2

NC_037545.1 Gnomon exon 3509449 3509690 . - . ID=exon-XM_006074115.2-3;Parent=rna-XM_006074115.2;Dbxref=GeneID:102394338,Genbank:XM_006074115.2;gbkey=mRNA;gene=NKIRAS1;product=NFKB inhibitor interacting Ras like 1%2C transcript variant X1;transcript_id=XM_006074115.2

ID: is different for different exon

Parent: is same for exons within mRNA, but different for exons between mRNAs of same gene

gene: is same for all exons of a gene.

Is it correct to use "-i gene" ?? Link to my full gff file is this

Please suggest the correct attribute for "-i". Thank you

ADD REPLYlink modified 12 days ago • written 12 days ago by drkousi310

from this , yes I would say -i gene is the way to go. Does it not work?

ADD REPLYlink written 12 days ago by lieven.sterck4.5k

I tried with -i gene and got this error

100000 GFF lines processed.

200000 GFF lines processed.

300000 GFF lines processed. .



1600000 GFF lines processed.

Error occured when processing GFF file (line 1688729 of file /DATA/kous/g_index/GCF_003121395.1_UOA_WB_1_genomic.gff): Feature id836346 does not contain a 'gene' attribute [Exception type: ValueError, raised in]

Output of sed -n '1688729p' my_gff is

NC_006295.1 RefSeq exon 377 445 . + . ID=id836346;Parent=rna71515;gbkey=tRNA;product=tRNA-Phe

I found some of the lines (53 lines) with the seqname NC_006295.1 doesnot have a 'gene' attribute. Is there basic linux command to insert gene=unknown to these lines alone in the 9th column, with which I can edit the gff file slightly and use it?

command grep NC_006295.1 < my_gff | grep -v gene gives the lines without gene. Is there a way to insert "gene=unknown" to the 9th column for these lines in gff file?

Please give me suggestions.

ADD REPLYlink modified 1 day ago • written 1 day ago by drkousi310

it sounds like you are using a 'faulty' gff file. where did you get that from?

Looking at your extract it shows that this is a tRNA-exon. Most likley you are not interested in those , so you could filter your gff file and remove all tRNA lines from it

ADD REPLYlink written 1 day ago by lieven.sterck4.5k

I downloaded gff file from ncbi with this link

Yes I am not interested in tRNAs. Should I filter all tRNAs or only the lines without 'gene' ?

ADD REPLYlink written 1 day ago by drkousi310

yes, go ahead and filter all tRNA lines

ADD REPLYlink written 1 day ago by lieven.sterck4.5k

Sorry. Please help me how to filter tRNA lines?

As the 3rd column has 3 features: gene, tRNA and exon. gbkey has variation between gene and tRNA. Some lines have gene_biotype=tRNA and some lines have product=tRNA instead.

In some cases like in tRNA-methyltransferase gene, gbkey=mRNA and product=tRNA-methyltransferase.

ADD REPLYlink modified 1 day ago • written 1 day ago by drkousi310
grep -v 'tRNA' <your gff file> > new_gff file


awk '$3 !~ /tRNA/' <your gff file> > new_gff file

from the top of my head and without knowing your gff file, so do check the output you obtain

ADD REPLYlink written 1 day ago by lieven.sterck4.5k

Using grep -v 'tRNA' removes mRNA of genes like tRNA methyltransferase.
Using awk '$3 !~ /tRNA/' retains exons of tRNA.

Please check if this approach is correct.

write lines without 'gene'

grep -v gene my.gff > no_gene_gff

Check features of no_gene_gff

awk '{print $3}' no_gene_gff | sort | uniq -c

509 1
1 annotwriter
1 Bubalus
26169 cDNA_match
2 D_loop
24 exon
1 origin_of_replication
509 region
2 rRNA
22 tRNA

Since the lines without genes do not have mRNA as evident from above is it ok to use,

grep 'gene' my_gff file > gene_gff file

and use this gene_gff file for htseq count?

ADD REPLYlink modified 20 hours ago • written 20 hours ago by drkousi310

yes, you're absolutely correct on using the -v tRNA approach.

Hmm, ok , it's worth a try at least indeed. I'm just wondering if you might not miss things that way ... anyway, no way I can tell from here, so go ahead and give it a try I would say

ADD REPLYlink written 20 hours ago by lieven.sterck4.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1729 users visited in the last hour