Question: Htseq-Count Not Working With Bam From Tophat
1
gravatar for siddharth.sethi5
7.1 years ago by
Medical Research Council, Oxford, United Kingdom
siddharth.sethi5230 wrote:

Hi everyone
I have RNA seq data. I used Tophat2 for the alignment, got the junctions successfully.

When I use HTSeq-count (latest version) to count the reads, I get all counts as zero .The output below is for only chr10

nofeature 0 ambiguous 0 toolowaQual 0 notaligned 0 alignmentnotunique 1213

Here is my command
python -m HTSeq.scripts.count -s no -i genename chr10.sorted.sam Musmusculus.gtf > count_output

I have checked the SAM file, it do contains unique reads ("NH:1"). I sorted the BAM file using samtools (samtools sort -n ).

My SAM file looks like this:

WTCHG33331111:3:1101:2760:117234#TCGCATAA 161 chr10 85405316 255 28M610N23M = 85406036 771 CACCAAT AAGCACTGGTTGTCCTGTGAGTTGATAATCCACAGCAATGAGGG bbeeeeeggggghiibffdgghb[]gghiehdghghhihgghiiihihf AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 WTCHG33331111:3:1101:2760:117234#TCGCATAA 81 chr10 85406036 255 51M = 85405316 -771 TTCCCATCCACCACG TGCTGCTCCACTTTGGCGATATTGTGCGACACATCA fhiihgffc^geciihhihiihhffiihhfhfgefgiigeggecccce_ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:51 YT:Z:UU XS:A:- NH:i:1 WTCHG33331111:3:1101:4247:53268#TCGCATAC 97 chr10 84641325 255 51M = 88137631 3496357 ATTCAGCTTACACTT TCCACATTGCTGTTGATCACAAAAGGAAGTCAAGAC bbbeeeeegggggiiiiihiiiiiihiiiiiiihiidhiiifiihiihfbf AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0MD:Z:51 YT:Z:UU NH:i:1 WTCHG33331111:3:1101:4963:41163#TCGCATAC 97 chr10 85405305 255 39M610N12M = 85406105 1268 CCCCATG GTGCCACCAATAAGCACTGGTTGTCCTGTGAGTTGATAATCCAC Y\^cccWc^ca^aeefaaegfJee]cgggfghfX^YbY^caeeadh[cX AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 WTCHG33331111:3:1101:4963:41163#TCGCATAC 145 chr10 85406105 255 38M417N13M = 85405305 -1268 AGGTCAT CAGGGGTTGTGTTGAAGACTTTGGCAAAAGCCTGACGGGTTAAG bbeeeac^eaVfdfgfehgafabfdgfcaafecfbZecgegcca^^^_a AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1

My gtf file was downloaded from Ensemble and i replaced the contig names according to my BAM file. It looks like
chr1 snRNA exon 3092097 3092206 . + . geneid "ENSMUSG00000064842"; transcriptid "ENSMUST00000082908"; exonnumber "1"; genename "U6"; genebiotype "snRNA"; transcriptname "U6.149-201"; chr1 processedtranscript exon 3203690 3206425 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000016289 7"; exonnumber "1"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-003"; chr1 processedtranscript exon 3195982 3197398 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000016289 7"; exonnumber "2"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-003"; chr1 processedtranscript exon 3203520 3205713 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000015926 5"; exonnumber "1"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-002"; chr1 processedtranscript exon 3196604 3197398 . - . geneid "ENSMUSG00000051951"; transcriptid "ENSMUST0000015926 5"; exonnumber "2"; genename "Xkr4"; genebiotype "proteincoding"; transcriptname "Xkr4-002";

I tried HTSeq-count with one of my BAM (BWA align) from DNA-seq and it worked properly, so i think something's wrong with my tophat BAM file or I am doing something wrong?? I will appreciate any kind of help.

Thanks
Sid

tophat • 8.4k views
ADD COMMENTlink modified 7.1 years ago by shrirambhosle30 • written 7.1 years ago by siddharth.sethi5230

can you provide minimal snippets that reproduce the problem? (e.g., your GTF example doesn't have annotations for chr10; in the sam file, seq and qualscores don't appear to be the same length).

ADD REPLYlink written 7.1 years ago by Ryan Dale4.8k

Hi Daler, i checked the seq and quality scores in the sam file, they appear to be of the same length(51) , atleast for the reads i checked. The gtf file do contains all chromosomes:

chr10 proteincoding exon 3134304 3134909 . + . geneid "ENSMUSG00000015202"; transcriptid "ENSMUST00000015346"; exon _number "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcriptname "Cnksr3-001";
chr10 protein
coding CDS 3134858 3134909 . + 0 geneid "ENSMUSG00000015202"; transcriptid "ENSMUST00000015346"; exon number "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcriptname "Cnksr3-001"; proteinid "ENSMUSP00000015346";
chr10 proteincoding startcodon 3134858 3134860 . + 0 geneid "ENSMUSG00000015202"; transcriptid "ENSMUST0000001534 6"; exonnumber "1"; genename "Cnksr3"; genebiotype "proteincoding"; transcript_name "Cnksr3-001";

Actually i have tried using the complete BAM(all chr) also, but the result is the same. All my BAMs from tophat are not working. The BAM files looks fine, i used them with cufflinks and cuffdiff, it worked fine

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by siddharth.sethi5230

Maybe it's just a formatting issue on Biostar, but for the first sequence in your example, but I get

len('CACCAATAAGCACTGGTTGTCCTGTGAGTTGATAATCCACAGCAATGAGGG')= 51

and

len('bbeeeeeggggghiibffdgghb[]gghiehdghghhihgghiiihihf') = 49

Since my own files from TopHat work fine with htseq-count, I think it's necessary to troubleshoot with the exact input that gives you problems. To avoid Biostar formatting issues (like the loss of tabs and whatever causes the above length discrepancy), try pastebin or create a gist.

ADD REPLYlink written 7.1 years ago by Ryan Dale4.8k
1
gravatar for shrirambhosle
7.1 years ago by
shrirambhosle30 wrote:

Hi

Try using the latest version of HTSeq (HTSeq-0.5.3p9): http://pypi.python.org/pypi/HTSeq

Cheers, Shriram

ADD COMMENTlink written 7.1 years ago by shrirambhosle30
0
gravatar for Mikael Huss
7.1 years ago by
Mikael Huss4.6k
Stockholm
Mikael Huss4.6k wrote:

I am not sure but it could be that you have to explicitly specify the feature name (e g "geneid"). If you look at the HTSeq-count manual (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html), it specifies that the default is to look for "gene_id" in the GTF (including the underscore). You may have to specify "-i geneid" for it to work.

There is nothing intrinsically incompatible with Tophat BAM/SAM files and HTSeq; I use this combo myself all the time.

ADD COMMENTlink written 7.1 years ago by Mikael Huss4.6k

Hi Mikael, actually somehow the underscores got omitted both from my command and the GTF file when i posted them here. I double checked it, they are present. Still i tried changing the "-i" parameter, used default and all possible things, but the result is the same. Can you think of any other problem that might be causing this? thanks

ADD REPLYlink written 7.1 years ago by siddharth.sethi5230

Not really. Maybe something silly like the endline characters being wrong in the GTF file after you changed contig names. On Mac OS X you can accidentally get "\r"as newline characters instead of "\n" which isn't visible to you in e g TextEdit but which is shown as ^M in Emacs or the shell. But that is just a wild guess

ADD REPLYlink written 7.1 years ago by Mikael Huss4.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1045 users visited in the last hour