Question

Find read distribution over every gene in RNA-seq

0

Entering edit mode

3.5 years ago

Shred ★ 1.4k

Hi guys,

I'm looking for an existing tool to see the read distribution over genes in an RNA-seq sample. Just to be clearer, I want something like the output of the script read_distribution.pyinto the RseQC toolset but into every gene, not just a resume of the whole alignment file. Something like:

GENE_ID,3'UTR/CDS
gene_1, n
gene_2, n

What I've tried so far is to count specifically for 3' UTR and for CDS using htseq-count: then I've normalized the count value to the feature length (so a value count/kb ) and after I've done the relationship between the UTR and CDS values.

Now, could this approach considered legit to have a similar information, considering that I'm working with a poorly annotated genome? If no, is there a tool out there which does the task? Coding it by scratch in Python requires time and I don't want to reinvent the weel, so it's my last choice.

RNA-Seq python • 1.2k views

ADD COMMENT • link updated 3.5 years ago by Istvan Albert 100k • written 3.5 years ago by Shred ★ 1.4k

score 1 · Accepted Answer · 2020-10-20

1

Entering edit mode

3.5 years ago

Istvan Albert 100k

There are many tools that compute coverages over features, say ~~bedtools genomcov~~, featurecounts and htseq-count. The latter two are designed for RNA-Seq and may apply other RNA-Seq specific assumptions.

Once you have the coverages however you would have to parse that to place or group them in certain ways. It is not a counting tools' job to format the information in all kinds of "meaningful" formats. There are just too many ways people might want this information displayed.

The resulting file is thankfully simple enough to process.

ADD COMMENT • link 3.5 years ago by Istvan Albert 100k

0

Entering edit mode

Ok but take for example genomecov. It gives you the output of the whole genome, or by each position. It is not designed to give stats about each gene by default giving an annotation file and an alignment one. That's the reason why I've tried the approach with htseq-count. The output results, while normalized by feature lengths, seems so different from the read_distribution.py one.

ADD REPLY • link 3.5 years ago by Shred ★ 1.4k

1

Entering edit mode

ah indeed genomecov does not do that, I meant to write bedtools coverage instead, for that you need to transform the alignment to bed file first though.

when it comes to computing coverages there could be substantial differences in what each tool considers an overlapping read/feature pair (must be full, partial to a threshold, any overlap etc) and also what to do in situations where a read overlaps with multiple features (take both, take one, take none etc), and what to do when a read aligns in multiple regions but only one such region overlaps with a feature etc.

Often the various choices lead to minor differences, in other times there may substantial changes.

ADD REPLY • link 3.5 years ago by Istvan Albert 100k

0

Entering edit mode

bedtools coverage seems doing what I was looking for, after some adjustment on output. Thanks!

ADD REPLY • link 3.5 years ago by Shred ★ 1.4k