Entering edit mode
5.0 years ago
b10hazard ▴ 30
I have a bam file and a gtf file that contains a single line....
chr5 UCSC exon 13744354 13744380 . + . gene_id "UDHG"; transcript_id "NM_033847"; gene_name "UDHG";
The bam file contains paired end reads. How do I get the number of reads that align to this exon?
This question has been asked a gazillion of times. Did you search for it ? What did you find ?
Of course I did. The gazillion answers I found all sucked. I tried bedtools coverage via...
But just got a 8.6 GB file containing a ton of useless info an none of it was grouped by exon. Then I tried...
After 2 hours I got a lovely 4.9GB file that was pretty much useless since it also was not broken up by exon.
Both these methods suck and result in a ton of data that is not targeted to my exon of interest. Why on earth these tools parse the entire bam instead of the region of interest I specified in my gtf/bed file is beyond me. I actually don't think any tool exists that can parse only the specified region of my bam and give me an single line output containing the count of my exon of interest.
how about ?
When you test a method, it's wise to create a model subset of your data and test on it. Even subsetting 10% of your BAM data would yield a better testing method than running all your data through a tool.
Already tried that. The output of htseq-count and bedtools coverage is smaller but it doesn't solve the fundamental problem of the output file being exon specific. I'm trying dexseq right now. Hopefully that tool won't produce gibberish.
Are you trying each new tool on the 10% subset or the whole dataset?
And everything is gibberish? ¯\_(ツ)_/¯
Yes, initially I was trying the tool on the entire bam thinking that the tools would use the provided GTF or BED files to target the specified exon instead of parsing the entire file. Why the heck these tools parse the entire file when the user specifies a targeted region is beyond me. Then I made the bam subset and reran. The bedtools output keeps breaking up my exon into subsets (no idea why). Htseq-count outputs... nothing. I've found a solution though. samtools depth does give me depths at each base which is close enough for me. I should have used that from the beginning.
How can you get a 4.9Gb file with htseq-count and a gtf with only one line? What is the content of this file?
that marks a region of 26 bp. Coverage has to be in the ... tens of thousands (or a magnitude of scale larger) to get this big a file.
Doesn't htseq outputs only count numbers? Output should be:
And counts for each one of those. How can such a file be 4.9Gb? What I am missing here?
Yup, the file is populated with repeats of what you posted above. I've given up on htseq-count. I used samtools depth to get my answer. I have no idea what is up with htseq-count. Could be some config/package dependency bug on our cluster.
Maybe the output is not gibberish and just needs summarizing? Did you try Google-ing
how to interpret htseq output?
doesn't the GTF file for input to htseq-count needs to have this kind of format in the last column?:
so with '=' sign and no spaces ?
Were you able to find a way out to count number of reads aligning to a specific exon? Please help. I am facing the same problem.
I wasn't really looking for one. This thread is 2 years old though, so I don't recall what I was doing at that time.