Dear all,
I'm in the middle of a standard RNA-seq analysis and have a rather basic question that I cannot answer myself due to lack of experience. HTSeq-count assigns quite a high percentage of my read counts to 'no_feature': roughly 25% (for some samples up to 30%). I have already checked the standard issues, such as matching chromosome names in the GTF and SAM files. I also tried -m intersection-strict instead of -m union, but then basically all counts were assigned to no_feature. It's important to add that I'm working on a non-model organism (horse), so I would expect a higher number in that regard. Still, I'm not sure whether any troubleshooting is necessary or whether this is as good as it can get. I would really appreciate your advice, and if you need any information or clarification, please just let me know.
Please find below my command lines, sample lines from the GTF and SAM files, and the HTSeq output I get.
HTSeq command line:
htseq-count -m union -r pos -i transcript_id -a 10 -o ${NAME}_out.sam --stranded=no -f bam $path ref_EquCab2.0_top_level.chr.gtf > count_table.txt
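For context on the -m option in the command above: as far as I understand the HTSeq documentation, for every position i of a read it takes the set S(i) of features covering that position, and the mode decides how these sets are combined (union of all S(i) for union, intersection of all S(i) for intersection-strict). The snippet below is only a toy sketch of that rule, not HTSeq's actual code; the function name assign and the features rnaA/rnaB are made up for illustration.

```python
def assign(read_positions, features, mode="union"):
    """Toy model of how a read is assigned to a feature.

    read_positions: iterable of reference positions covered by the read
    features: dict mapping a feature name to an inclusive (start, end) pair
    """
    # S(i): the set of features covering position i of the read
    sets = [
        {name for name, (start, end) in features.items() if start <= pos <= end}
        for pos in read_positions
    ]
    if mode == "union":
        hits = set().union(*sets)
    elif mode == "intersection-strict":
        hits = set.intersection(*sets) if sets else set()
    else:
        raise ValueError("mode not covered by this sketch: " + mode)

    if not hits:
        return "__no_feature"
    if len(hits) > 1:
        return "__ambiguous"
    return next(iter(hits))


# Hypothetical exon and a read that runs 10 bases past its end
features = {"rnaA": (100, 200)}
read = range(190, 211)
print(assign(read, features, "union"))
print(assign(read, features, "intersection-strict"))
```

In this toy model, a read that sticks out of the annotated exon by even one base is counted under union but falls into __no_feature under intersection-strict, which would be consistent with what I see when switching modes.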
HTSeq sample output (tail):
rna9995 5
rna9996 0
rna9997 0
rna9998 0
rna9999 0
__no_feature 10342058
__ambiguous 8603905
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 1814027
HTSeq errors/output:
34600000 SAM alignment record pairs processed.
34700000 SAM alignment record pairs processed.
34800000 SAM alignment record pairs processed.
Warning: Mate records missing for 1498 records; first such record: <SAM_Alignment object: Paired-end read 'J00121:58:H75JHBBXX:6:2113:6908:24261' aligned to chr22:[27095622,27095771)/->.
Warning: Mate pairing was ambiguous for 105845 records; mate key for first such record: ('J00121:58:H75JHBBXX:6:2227:26808:37501', 'second', 'chr1', 942, 'chr1', 1108, 312).
34833739 SAM alignment pairs processed.
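To put the special counters into proportion, here is a small Python sketch that turns them into percentages. It assumes that the total number of processed pairs from the log (34833739) is the right denominator, which is my assumption; the function name summarize is made up, and only the tail of the count table shown above is used as input.

```python
def summarize(count_lines, total_pairs):
    """Return (reads assigned to features, {special counter: percent of total})."""
    assigned = 0
    special = {}
    for line in count_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, value = parts[0], int(parts[1])
        if name.startswith("__"):
            special[name] = value   # htseq-count's special counters
        else:
            assigned += value       # ordinary feature rows
    percentages = {name: 100.0 * v / total_pairs for name, v in special.items()}
    return assigned, percentages


# Tail of the count table as posted; a real run would read the whole file
tail_lines = [
    "rna9995 5",
    "rna9999 0",
    "__no_feature 10342058",
    "__ambiguous 8603905",
    "__too_low_aQual 0",
    "__not_aligned 0",
    "__alignment_not_unique 1814027",
]
assigned, percentages = summarize(tail_lines, total_pairs=34833739)
for name, pct in percentages.items():
    print(f"{name}\t{pct:.2f}%")
```

For this sample that works out to roughly 30% no_feature, about 25% ambiguous and about 5% not unique, so the ambiguous fraction is nearly as large as the no_feature one.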
GTF file I used (head):
chrMT RefSeq exon 1 70 . + . transcript_id "rna43393";
chrMT RefSeq exon 71 1045 . + . transcript_id "rna43394";
chrMT RefSeq exon 1046 1112 . + . transcript_id "rna43395";
chrMT RefSeq exon 1113 2693 . + . transcript_id "rna43396";
chrMT RefSeq exon 2694 2768 . + . transcript_id "rna43397";
chrMT RefSeq CDS 2771 3727 . + 0 transcript_id "gene27150"; gene_id "gene27150"; gene_name "ND1";
chrMT RefSeq exon 3727 3795 . + . transcript_id "rna43398"; gene_id "gene27150"; gene_name "ND1";
chrMT RefSeq exon 3793 3865 . - . transcript_id "rna43399";
chrMT RefSeq exon 3868 3936 . + . transcript_id "rna43400";
chrMT RefSeq CDS 3937 4977 . + 0 transcript_id "gene27151"; gene_id "gene27151"; gene_name "ND2";
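Since the lines above mix exon and CDS records, and some of them carry no gene_id, a quick summary of the annotation may help whoever looks at this. The following Python sketch counts feature types and records without a gene_id; it assumes a regular tab-separated GTF (the tabs are flattened to spaces in the excerpt above), and the names summarize_gtf and ATTR_RE are made up for illustration.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def summarize_gtf(lines):
    """Count feature types (column 3) and records lacking a gene_id attribute."""
    feature_types = Counter()
    missing_gene_id = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        feature_types[cols[2]] += 1
        if "gene_id" not in attrs:
            missing_gene_id += 1
    return feature_types, missing_gene_id


# Three records from the head above, re-typed with tabs
sample = [
    'chrMT\tRefSeq\texon\t1\t70\t.\t+\t.\ttranscript_id "rna43393";',
    'chrMT\tRefSeq\tCDS\t2771\t3727\t.\t+\t0\ttranscript_id "gene27150"; gene_id "gene27150"; gene_name "ND1";',
    'chrMT\tRefSeq\texon\t3727\t3795\t.\t+\t.\ttranscript_id "rna43398"; gene_id "gene27150"; gene_name "ND1";',
]
types, missing = summarize_gtf(sample)
print(dict(types), "records without gene_id:", missing)
```

On the full file this would be called as summarize_gtf(open("ref_EquCab2.0_top_level.chr.gtf")).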
Head of a sample SAM file:
J00121:58:H75JHBBXX:6:1101:24200:42337 163 chr1 689 255 151M = 833 294 CGGGGCCTTGCGGGGGAGGCCCGTGGAGGGCGCGACGGGCTCGGCCGCCGGGCTGGCCTTTTCCCCACTGGTCTTCCGAGTCGACCGGCTCTGGCGGTGGGGACCGGGCCCGGTCCTCGGATGCCTCCTCCTCCGTGGCAGTTTTTTGTCC AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJJ7FJJFJ7F<AFJJJJJ NH:i:1 HI:i:1 AS:i:293 nM:i:3 NM:i:0 MD:Z:151 jM:B:c,-1 jI:B:i,-1
J00121:58:H75JHBBXX:6:1109:2595:15838 163 chr1 716 255 150M = 818 251 GGGCGCGACGGGCTCGGCCGCCGGGCTGGCCTTTTCCCCACTGGTCTTCCGAGTCGACCGGCTCTGGCGGTGGGGACCGGGCCCGGTCCTCGGATGCCTCCTCCTCCGTGGCAGTTTTTTGTCCAAGTCCCGCCCTGGAGAAGAGCGTGG AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJAFJJJF<JJJAFJJJJJJJJJJJJJJJJFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJFJJAJJAAA-<-7FF7A-<FFJJFAFJ NH:i:1 HI:i:1 AS:i:289 nM:i:4 NM:i:1 MD:Z:144C5 jM:B:c,-1 jI:B:i,-1
J00121:58:H75JHBBXX:6:2206:15006:18915 163 chr1 716 255 150M = 818 251 GGGCGCGACGGGCTCGGCCGCCGGGCTGGCCTTTTCCCCACTGGTCTTCCGAGTCGACCGGCTCTGGCGGTGGGGACCGGGCCCGGTCCTCGGATGCCTCCTCCTCCGTGGCAGTTTTTTGTCCAAGTCCCGCCCTGGAGAAGAGCGTGG AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJFJJJJJJJJJJJJ<JJFJJJJJJJJJJ<JJJJAJJJJJJJJJJFJF7FJJJJAFJJJJJFFFFJJJJFAJJJJJJJAJJ<FFJFJFJJ NH:i:1 HI:i:1 AS:i:289 nM:i:4 NM:i:1 MD:Z:144C5 jM:B:c,-1 jI:B:i,-1
J00121:58:H75JHBBXX:7:1202:30107:19337 163 chr1 716 255 150M = 818 251 GGGCGCGACGGGCTCGGCCGCCGGGCTGGCCTTTTCCCCACTGGTCTTCCGAGTCGACCGGCTCTGGCGGTGGGGACCGGGCCCGGTCCTCGGATGCCTCCTCCTCCGTGGCAGTTTTTTGTCCAAGTCCCGCCCTGGAGAAGAGCGTGG AAAFFAJJJJJJFJJFJJJFJJJJJJJJJJJJJJJFFFJ<JJJJJJJJJJJJJFJJJJJFFJJJJJFJJJJFJJA77AJ<JJFJFJAJJJJFF-F-AFFJJJFF<F<AFFFFAAF-AJJJFF--7A<FAJ)))7<-FF-)7-<AFJF<A< NH:i:1 HI:i:1 AS:i:289 nM:i:4 NM:i:1 MD:Z:144C5 jM:B:c,-1 jI:B:i,-1
J00121:58:H75JHBBXX:6:2115:19948:27408 163 chr1 722 255 151M = 869 298 GACGGGCTCGGCCGCCGGGCTGGCCTTTTCCCCACTGGTCTTCCGAGTCGACCGGCTCTGGCGGTGGGGACCGGGCCCGGTCCTCGGATGCCTCCTCCTCCGTGGCAGTTTTTTGTCCAAGTCCCGCCCTGGAGAAGACCGTGGACCGGCC AAFFFJJFJFJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAJJJFJJJ<JJJJJJJJFJJJJJJJJJFJJFFJJJJAJJJJJJJJJAJJJJJ-FFJJJJJJJJJJFJJJFJJJJJFFJJFFFJJJJJJFJ7JFJJ NH:i:1 HI:i:1 AS:i:296 nM:i:2 NM:i:0 MD:Z:151 jM:B:c,-1 jI:B:i,-1
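The chromosome-name check I mentioned can be sketched in Python roughly as follows. This is a minimal sketch that works on text lines, so a BAM file would first have to be converted to SAM text (for example with samtools view); the function names and the toy records are made up for illustration.

```python
def gtf_seqnames(lines):
    """Collect the sequence names from column 1 of GTF records."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def sam_seqnames(lines):
    """Collect the reference names (RNAME, column 3) of aligned SAM records."""
    names = set()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        cols = line.split("\t")
        if len(cols) > 2 and cols[2] != "*":  # '*' marks an unaligned read
            names.add(cols[2])
    return names


def compare_seqnames(gtf_lines, sam_lines):
    """Return (names only in the GTF, names only in the SAM), both sorted."""
    gtf_names = gtf_seqnames(gtf_lines)
    sam_names = sam_seqnames(sam_lines)
    return sorted(gtf_names - sam_names), sorted(sam_names - gtf_names)


# Toy records with a deliberate mismatch ("chrMT" vs "MT")
toy_gtf = [
    'chr1\tRefSeq\texon\t100\t200\t.\t+\t.\ttranscript_id "rnaA";',
    'chrMT\tRefSeq\texon\t1\t70\t.\t+\t.\ttranscript_id "rnaB";',
]
toy_sam = [
    "@HD\tVN:1.4",
    "read1\t163\tchr1\t689\t255\t151M\t=\t833\t294\tACGT\tFFFF",
    "read2\t99\tMT\t10\t255\t4M\t=\t60\t54\tACGT\tFFFF",
]
only_gtf, only_sam = compare_seqnames(toy_gtf, toy_sam)
print("only in GTF:", only_gtf)
print("only in SAM:", only_sam)
```

Names that show up on only one side are candidates for a naming mismatch, although chromosomes without any annotation or without any reads will also appear there.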
STAR command line:
STAR --outFileNamePrefix $SEED --outFilterMultimapNmax 50 --outFilterMismatchNmax 4 --seedSearchStartLmax 25 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 100000 --sjdbGTFfile Equus_caballus.EquCab2.72.gtf --outFilterMismatchNmax 4 --outFilterType BySJout --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outSAMattributes All --outTmpDir ./$SGE_TASK_ID --runThreadN 4 --genomeDir /data/references/horse/StarIdx --readFilesIn ../raw_data/${SEED}_combined_R1.fastq ../raw_data/${SEED}_combined_R2.fastq