Question: Counting Short-Read Gene Alignments With Bowtie/Tophat.
gravatar for Chris Warth
10.2 years ago by
Chris Warth90
United States
Chris Warth90 wrote:

Let me start by saying I am very new to this so I may be asking naive questions.

I am aligning short-reads against the mm9 mouse genome using tophat (which uses bowtie, of course). What I would like to get is a count of how many times each known gene is hit by one of the reads. I have seen suggestions that one only need to sort and count the RNAME field of the resulting SAM file to get this info.

  cat accepted_hits.sam | cut -f 3 | sort | uniq -c

However because I am aligning to the complete mouse genome, my SAM file has RNAME values like "chr1", "chr10", etc.

So how do people normally get a hit count per gene in experiments like this? It seems like there are two ways I can proceed.

1) align against a known gene table downloaded from ucsc. That way the RNAME field in each alignment will identify the name of the gene it aligned to. The problem is this will not discover novel splice sites (the purpose of using tophat), nor will it allow me to discover novel gene expression.

2) align against the complete genome as usual, but then postprocess to find overlap with between the alignment location and any known genes.

Is that how people do this, and are there any existing tools that can help turn a SAM file (or BAM) into a gene expression count?

Thank you in advance.

gene tophat alignment • 6.2k views
ADD COMMENTlink written 10.2 years ago by Chris Warth90

I'm missing something here. The SAM/BAM files only describes the alignments (short reads vs ref). If you want to find some novel splice sites, don't you have to 'call' the new variants (with, e.g. samtools ) ?

ADD REPLYlink written 10.2 years ago by Pierre Lindenbaum131k

Being a novice, I'm not sure I understand the question. Tophat outputs a "junctions.bed" file with splice sites that it has discovered without reference to a pre-existing gene map (although you can provide a file of known genes to help it avoid spurious junctions.)

ADD REPLYlink written 10.2 years ago by Chris Warth90

ok, I didn't know Tophat

ADD REPLYlink written 10.2 years ago by Pierre Lindenbaum131k
gravatar for biobot 0.0.77.a.1099
10.2 years ago by
biobot 0.0.77.a.10996.1k wrote:

To some extent, you can do either, but I think 2) is better. You've identified two problems with option 1) - missing novel genes and novel splice variants. There is also the potential problem of any redundancy in your reference sequences (e.g. from multiple transcripts per gene) causing mapping scores to be reduced. Now that the technology has improved so that "short" reads are relatively long, there isn't so much benefit to be had from mapping directly to transcriptome sequences, especially now that we have tools like TopHat.

You will most likely find that if you plot mean gene coverage by percentile of gene length for all your genes, that there will be a 3' bias (deeper at the 3' end). Some people only count reads in the 3' region for this reason. You may, or may not find this, depending on the lab methods used.

Counting reads per gene and finding expression levels are two separate problems. For the former, you can use any of the many overlap-finders e.g. BEDTools. For the latter there are packages such as DEGSeq and the (unfortunately) very similarly named DESeq.

Importantly, for any differential gene expression, you should also have an experimental design including biological replicates, but that's another story.

ADD COMMENTlink written 10.2 years ago by biobot 0.0.77.a.10996.1k

Thank you very much. I'll look into using BEDTools as an overlap finder.

ADD REPLYlink written 10.2 years ago by Chris Warth90

which packages do you think is better, DEGSeq or DEseq?

ADD REPLYlink written 9.1 years ago by Beifish50
gravatar for Ryan Dale
10.2 years ago by
Ryan Dale4.9k
Bethesda, MD
Ryan Dale4.9k wrote:

Another option for finding overlaps is HTSeq, by the author of DESeq. You can either use the off-the-shelf script htseq-count that comes with it, or you can roll your own to get more detailed output (provided you know some Python).

ADD COMMENTlink written 10.2 years ago by Ryan Dale4.9k

Thanks for the pointer. I thought I was going to have to code my own overlap-finder in python so rolling my own script to work with HTSeq shouldn't be a problem.

ADD REPLYlink written 10.2 years ago by Chris Warth90
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1861 users visited in the last hour