I am dealing with Illumina genomic (not transcriptomic) paired-end sequencing. For some purpose, I'd like to have a number of reads that cover a particular feature (let's say a gene) defined in a BED or GFF file.
I have sorted SAM and BAM files.
My BED files looks like this:
RandomSeqSTD 500000 505001 RandomSeqSTD 1000000 1001001 RandomSeqSTD 1500000 1500501 RandomSeqSTD 2000000 2000251 RandomSeqSTD 2500000 2500101
My GFF file (which I made) looks like this:
##gff-version 3 ##sequence-region RanSeqSTD 1 4659625 RanSeqSTD random CDS 500000 505000 . . . gene_id=5000bp;transcript_id=same RanSeqSTD random CDS 1000000 1001000 . . . gene_id=1000bp;transcript_id=same RanSeqSTD random CDS 1500000 1500500 . . . gene_id=500bp;transcript_id=same RanSeqSTD random CDS 2000000 2000250 . . . gene_id=250bp;transcript_id=same RanSeqSTD random CDS 2500000 2500100 . . . gene_id=100bp;transcript_id=same
First I tried getting read number for each of these features with:
bedtools multicov -bams *.bam -bed BEDfile.bed
I got a read count and in theory this might be enough, however the problem is that I have limited control over what is considerate to be part of the feature. So I tried using HTSeq with the sorted SAM file and the above GFF file with the following command line:
htseq-count -s no -m union -i gene_id -t CDS STD100_sorted.sam RanSeqGFF.gff3 > HTSeq.txt
However, I keep getting count "0" for all gene_id features.
Any idea why HTSeq gives me this and any idea how this can be done differently?
Note: I am fully aware that HTSeq was not developed for genomic sequencing and thus should not be used for this.