Question: Counting reads for gene in bacterial genomes
1
gravatar for Asaf
6.2 years ago by
Asaf8.5k
Israel
Asaf8.5k wrote:

Hi,

I'm analyzing paired-end expression libraries of E. coli and wanted to get some advise about how to count the number of reads for each gene. After mapping the reads to the genome I use htseq-count to generate a list of reads per gene. So far so good. The problem is that when using paired-end reads the entire fragment between the reads is taken into consideration as should be, but when a fragment overlaps two genes it's tagged as ambiguous and not counted for either of the genes. This could be a major problem in the case of small genes which are a part of a larger operon, and in adittion throws away about 5%-10% of the mapped reads.

I changed the script of htseq-count and added the option to count fragments that overlap more than one gene, for each of these genes. This significantly increased the number of reads per gene, especially for the reads that previously had a low number of reads.

My question is, is this okay? should I take it into consideration when analyzing DE genes? Obviously normalizing by total counts won't be applicable but I use DESeq so this shouldn't be an issue. Has anyone encountered this problem? have you had a better solution?

Thanks

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Asaf8.5k
3
gravatar for Asaf
6.2 years ago by
Asaf8.5k
Israel
Asaf8.5k wrote:

So,

After I haven't found what I was looking for I implemented such a read counting paradigm. It works only for paired-end data and count the number of fragments (even the parts that aren't found in the reads) that overlap with each gene, even if the fragment overlaps several genes. The length of overlap between the gene and the fragment can be adjusted (default 5).

It's available at: https://github.com/asafpr/count_PE_fragments

The script contains the documentation.

This method changed the counts per gene significantly, especially at the lower levels of expression, and more significantly there is a better correlation between replicates.

ADD COMMENTlink written 6.2 years ago by Asaf8.5k
1
gravatar for Devon Ryan
6.2 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

Which of the htseq-count modes did you use (e.g., intersection-nonempty)? Was this a strand-specific dataset? BTW, featureCounts has the built-in option to do exactly what you had to modify htseq-count to do.

In general, it's going to be a bad idea to assign counts to multiple genes like that. If you can't change settings to something that keeps an acceptable number of counts (throwing away 5-10% isn't so bad in some circumstances) then I would suggest that you use an EM algorithm like that provided by eXpress to get expected counts. You can then use that with limma::voom.

ADD COMMENTlink written 6.2 years ago by Devon Ryan97k

Thanks Divon,

I used the default union mode, the dataset is strand-specific.

I took a look at featureCounts, it does have the -O option for counting reads for more than one feature but the fragment between the two reads is not counted so in the rare cases where there is a small gene surrounded by two other genes in the same transcript this gene will usually not be counted.

About eXpress, it seems that the main focus of this tool is to distinguish between alternative transcripts of the same gene. The problem is that the transcripts of E. coli are unannotated, only the ORFs. I'm afraid that it won't be able to deal with fragments that overlap some of the feature or several features, I didn't find any reference for this issue in the website or paper.

Thanks again

ADD REPLYlink written 6.2 years ago by Asaf8.5k

Do the bacteria you're working on use splicing? If so, I wouldn't count the fragment between the paired-ends. If not, then there's probably not much else you can do other than lose some data.

ADD REPLYlink written 6.2 years ago by Devon Ryan97k

No splicing, I'm confident using the fragment between the reads. The major problem is not loosing reads but underestimating short genes.

ADD REPLYlink written 6.2 years ago by Asaf8.5k

I guess it becomes a question of whether things like this even really constitute different genes then, or whether it makes more sense to think of them as different transcripts of the same gene.
 

ADD REPLYlink written 6.2 years ago by Devon Ryan97k

After looking at the code of htseq-count, it only treats the reads, not the entire fragment. I'll need to write my own script.

ADD REPLYlink written 6.2 years ago by Asaf8.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1041 users visited in the last hour