HTSeq-Count: no_feature too high?
0
0
Entering edit mode
3 months ago
sea.joson ▴ 10

Hello!

I am learning how to do differential expression. I was given transcriptome data and I am fortunate that the sample already has an established annotated genome. So far, I aligned the transcripts to the reference genome via STAR and I've had good mapping stats (>95%). Now, I'm quantifying the counts with HTSeq-Count. I expected it to be a relatively smooth process, but the results have been bugging me.

Here is the line I used:

htseq-count -f bam -r pos -t exon $RNA/AN0_1.Aligned.sortedByCoord.out.bam$GEN > AN0_1.no.exon.counts


The results:

__no_feature    14805713
__ambiguous     4766268
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  113292


I feel like that's a high value for no_features right?

I've tried playing around with the strandedness (-s), but all options give similar results. I also went back and made sure the GTF file has the gene_id attribute. I also tried using a GFF file instead.

Might it also have to do with this warning sign from the err file?

Warning: Mate pairing was ambiguous for 23164 records; mate key for first such record: ('A00564:87:HJKFCDSXX:2:2205:13838:9330', 'first', 'CP059858.1', 91661, 'CP059858.1', 91831, 317).


Any help will be much appreciated!

htseq-count • 1.4k views
0
Entering edit mode

Warning: Mate pairing was ambiguous

It makes me think you have paired-end reads that have been trimmed inappropriately without using paired-end mode.

Also, can you try running it again with -t gene and let me know what results you get?

0
Entering edit mode

This is run with -t gene

__no_feature    14786206
__ambiguous     4691
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique  113292

*Warning: Mate pairing was ambiguous for 23164 records; mate key for first such record: ('A00564:87:HJKFCDSXX:2:2205:13838:9330', 'first', 'CP059858.1', 91661, 'CP059858.1', 91831, 317).*


I'm not super certain about the trimming because when I got the files, I think they've already been pre-processed a bit. I checked them out using fastqc and they seem fine.

0
Entering edit mode

Do you have identical number of reads in R1/R2 files?

Can you run the following program from BBMap suite on each file and show the output

\$ readlength.sh -Xmx2g in=your_file.fastq.gz
Executing jgi.MakeLengthHistogram [-Xmx2g, in=your_file.fastq.gz]

#Bases: 10838700
#Max:   300
#Min:   300
#Avg:   300.0
#Median:        300
#Mode:  300
#Std_Dev:       0.0

0
Entering edit mode

Processed 56800750 reads.
#Bases: 8520112500
#Max:   150
#Min:   150
#Avg:   150.0
#Median:        150
#Mode:  150
#Std_Dev:       0.0
150     56800750        100.000%        56800750        100.000%        8520112500      100.000%        8520112500      100.000%

0
Entering edit mode

Can you run the command independently on the two files?

0
Entering edit mode

Processed 28400375 reads.
#Bases: 4260056250
#Max:   150
#Min:   150
#Avg:   150.0
#Median:        150
#Mode:  150
#Std_Dev:       0.0
150     28400375        100.000%        28400375        100.000%        4260056250      100.000%        4260056250      100.000%
Time:   67.509 seconds.


Processed 28400375 reads.
#Bases: 4260056250
#Max:   150
#Min:   150
#Avg:   150.0
#Median:        150
#Mode:  150
#Std_Dev:       0.0
150     28400375        100.000%        28400375        100.000%        4260056250      100.000%        4260056250      100.000%
Time:   69.977 seconds.

0
Entering edit mode

These files appear to be non-trimmed and intact.

Did you make sure that these actually have reads from R1 and R2 by inspecting the headers? Hopefully you don't have the same data in the two files.

Also inspect the aligned BAM in IGV to see where the reads are aligning (i.e. are they piling up on exons etc). Do you have rRNA gene models in your GTF? Perhaps you may have a lot of rRNA contamination.

0
Entering edit mode

Oh. I did not trim my reads. Do you recommend I try again but with trimmed reads instead? Also, may I ask how you are able to tell that the files "appear to be non-trimmed and intact?"

I did check the R1 and R2 and they are not the same data.

I also went ahead and checked the bam file in IGV. Unfortunately, I do no yet know how to fully interpret what I was seeing.

0
Entering edit mode

No need. STAR should be able to handle any adapter sequences present by soft-clipping them.

Also, may I ask how you are able to tell that the files "appear to be non-trimmed and intact?"

You can see that the length of reads is identical (150 bp ) in both files and number of bases is also identical. If reads are trimmed to remove adapters there will be a range of sizes present and the number of bases will likely be different in two files.

I do no yet know how to fully interpret what I was seeing.

Find a smallish gene in IGV (which shows exon structure) and post a screen shot of what the alignments look like.

0
Entering edit mode

Is this right? https://imgur.com/hXGSf4b

0
Entering edit mode

That looks right. Reads piling up under exons and a good number too.

If you are open to trying another counting program out can you try featureCounts? It is part of a package called subread (LINK). A tutorial is available: https://subread.sourceforge.net/featureCounts.html

0
Entering edit mode

Yeah I think this is a next move for me.

0
Entering edit mode

Finally got to try featureCounts. It seems to be better, correct?

Status  AN0_1.Aligned.sortedByCoord.out.bam
Assigned        45141493
Unassigned_Unmapped     0
Unassigned_Singleton    0
Unassigned_MappingQuality       0
Unassigned_Chimera      0
Unassigned_FragmentLength       0
Unassigned_Duplicate    0
Unassigned_MultiMapping 226584
Unassigned_Secondary    0
Unassigned_NonSplit     0
Unassigned_NoFeatures   8469135
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    24850

0
Entering edit mode
0
Entering edit mode

0
Entering edit mode

Thank you both for all the help! Now I just gotta figure why that was the case in the first place.

0
Entering edit mode

0
Entering edit mode

sea.joson: And to add another point to this question, do the chromosome names in your reference and GTF file match perfectly.

0
Entering edit mode

Yep. I checked and made sure that the chromosome names used are the same across the board.

0
Entering edit mode

https://www.ncbi.nlm.nih.gov/genome/360?genome_assembly_id=968151

Fortunately, the genome has already been worked on. The samples I'm working on are transcriptomes of the same strain grown in different conditions.

0
Entering edit mode

Would you give a try with another .gtf for the same organism from another source? Maybe Ensembl?