Question: Find read distribution over every gene in RNA-seq
0
gravatar for Shred
5 weeks ago by
Shred230
Shred230 wrote:

Hi guys,

I'm looking for an existing tool to see the read distribution over genes in an RNA-seq sample. Just to be clearer, I want something like the output of the script read_distribution.pyinto the RseQC toolset but into every gene, not just a resume of the whole alignment file. Something like:

GENE_ID,3'UTR/CDS
gene_1, n
gene_2, n

What I've tried so far is to count specifically for 3' UTR and for CDS using htseq-count: then I've normalized the count value to the feature length (so a value count/kb ) and after I've done the relationship between the UTR and CDS values.

Now, could this approach considered legit to have a similar information, considering that I'm working with a poorly annotated genome? If no, is there a tool out there which does the task? Coding it by scratch in Python requires time and I don't want to reinvent the weel, so it's my last choice.

rna-seq python • 141 views
ADD COMMENTlink modified 5 weeks ago by Istvan Albert ♦♦ 85k • written 5 weeks ago by Shred230
1
gravatar for Istvan Albert
5 weeks ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

There are many tools that compute coverages over features, say bedtools genomcov, featurecounts and htseq-count. The latter two are designed for RNA-Seq and may apply other RNA-Seq specific assumptions.

Once you have the coverages however you would have to parse that to place or group them in certain ways. It is not a counting tools' job to format the information in all kinds of "meaningful" formats. There are just too many ways people might want this information displayed.

The resulting file is thankfully simple enough to process.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Istvan Albert ♦♦ 85k

Ok but take for example genomecov. It gives you the output of the whole genome, or by each position. It is not designed to give stats about each gene by default giving an annotation file and an alignment one. That's the reason why I've tried the approach with htseq-count. The output results, while normalized by feature lengths, seems so different from the read_distribution.py one.

ADD REPLYlink written 5 weeks ago by Shred230
1

ah indeed genomecov does not do that, I meant to write bedtools coverage instead, for that you need to transform the alignment to bed file first though.

when it comes to computing coverages there could be substantial differences in what each tool considers an overlapping read/feature pair (must be full, partial to a threshold, any overlap etc) and also what to do in situations where a read overlaps with multiple features (take both, take one, take none etc), and what to do when a read aligns in multiple regions but only one such region overlaps with a feature etc.

Often the various choices lead to minor differences, in other times there may substantial changes.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Istvan Albert ♦♦ 85k

bedtools coverage seems doing what I was looking for, after some adjustment on output. Thanks!

ADD REPLYlink written 5 weeks ago by Shred230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1957 users visited in the last hour