Question

Counting reads for gene in bacterial genomes

1

Entering edit mode

9.6 years ago

Asaf 10k

Hi,

I'm analyzing paired-end expression libraries of E. coli and wanted to get some advise about how to count the number of reads for each gene. After mapping the reads to the genome I use htseq-count to generate a list of reads per gene. So far so good. The problem is that when using paired-end reads the entire fragment between the reads is taken into consideration as should be, but when a fragment overlaps two genes it's tagged as ambiguous and not counted for either of the genes. This could be a major problem in the case of small genes which are a part of a larger operon, and in adittion throws away about 5%-10% of the mapped reads.

I changed the script of htseq-count and added the option to count fragments that overlap more than one gene, for each of these genes. This significantly increased the number of reads per gene, especially for the reads that previously had a low number of reads.

My question is, is this okay? should I take it into consideration when analyzing DE genes? Obviously normalizing by total counts won't be applicable but I use DESeq so this shouldn't be an issue. Has anyone encountered this problem? have you had a better solution?

Thanks

RNA-Seq htseq-count bacteria paired-end • 4.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.6 years ago by Asaf 10k

Ram · Answer 1 · 2014-10-02

So,

After I haven't found what I was looking for I implemented such a read counting paradigm. It works only for paired-end data and count the number of fragments (even the parts that aren't found in the reads) that overlap with each gene, even if the fragment overlaps several genes. The length of overlap between the gene and the fragment can be adjusted (default 5).

It's available at https://github.com/asafpr/count_PE_fragments

The script contains the documentation.

This method changed the counts per gene significantly, especially at the lower levels of expression, and more significantly there is a better correlation between replicates.

Ram · Answer 2 · 2014-10-01

1

Entering edit mode

9.6 years ago

Devon Ryan 104k

Which of the htseq-count modes did you use (e.g., intersection-nonempty)? Was this a strand-specific dataset? BTW, featureCounts has the built-in option to do exactly what you had to modify htseq-count to do.

In general, it's going to be a bad idea to assign counts to multiple genes like that. If you can't change settings to something that keeps an acceptable number of counts (throwing away 5-10% isn't so bad in some circumstances) then I would suggest that you use an EM algorithm like that provided by eXpress to get expected counts. You can then use that with limma::voom.

ADD COMMENT • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks Divon,

I used the default union mode, the dataset is strand-specific.

I took a look at featureCounts, it does have the -O option for counting reads for more than one feature but the fragment between the two reads is not counted so in the rare cases where there is a small gene surrounded by two other genes in the same transcript this gene will usually not be counted.

About eXpress, it seems that the main focus of this tool is to distinguish between alternative transcripts of the same gene. The problem is that the transcripts of E. coli are unannotated, only the ORFs. I'm afraid that it won't be able to deal with fragments that overlap some of the feature or several features, I didn't find any reference for this issue in the website or paper.

Thanks again

ADD REPLY • link 9.6 years ago by Asaf 10k

0

Entering edit mode

Do the bacteria you're working on use splicing? If so, I wouldn't count the fragment between the paired-ends. If not, then there's probably not much else you can do other than lose some data.

ADD REPLY • link 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

No splicing, I'm confident using the fragment between the reads. The major problem is not loosing reads but underestimating short genes.

ADD REPLY • link 9.6 years ago by Asaf 10k

0

Entering edit mode

I guess it becomes a question of whether things like this even really constitute different genes then, or whether it makes more sense to think of them as different transcripts of the same gene.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.6 years ago by Devon Ryan 104k

0

Entering edit mode

After looking at the code of htseq-count, it only treats the reads, not the entire fragment. I'll need to write my own script.

ADD REPLY • link 9.6 years ago by Asaf 10k