My question is: does anyone know how to combine the output files from the Tuxedo pipeline to find out how many reads were mapped to transfrags in each of the class codes reported by Cuffcompare (=, c, j, i, etc.)?
I have tried calculating this from the coverage values reported in the .tmap files, but the resulting read counts add up to more than the total number of mapped reads. So back-calculating the number of reads from "average coverage" may not be the best approach, but then where or how can I obtain these values directly, without "reverse engineering" them?
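A minimal sketch of this kind of coverage-based estimate, for anyone who wants to see the arithmetic. It assumes the .tmap file is tab-separated with a header row containing `class_code`, `cov` and `len` columns, and that read count can be approximated as coverage × transcript length / read length. The function names, the `read_length` parameter and the example rows are hypothetical, and this is not necessarily how the paper's numbers were produced.

```python
import csv
from collections import defaultdict


def estimate_reads_by_class(tmap_lines, read_length):
    """Approximate reads per class code as sum(cov * len / read_length).

    tmap_lines: iterable of tab-separated lines, header first
                (assumed columns: class_code, cov, len).
    read_length: read length in bases (assumed uniform across reads).
    """
    totals = defaultdict(float)
    for row in csv.DictReader(tmap_lines, delimiter="\t"):
        covered_bases = float(row["cov"]) * float(row["len"])
        totals[row["class_code"]] += covered_bases / read_length
    return dict(totals)


def as_percentages(totals):
    """Convert per-class estimates into percentages of their sum."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {code: 100.0 * n / grand_total for code, n in totals.items()}


# Made-up rows for illustration only (not real Cuffcompare output).
example = [
    "ref_gene_id\tref_id\tclass_code\tcuff_gene_id\tcuff_id\tcov\tlen",
    "geneA\ttxA\t=\tCUFF.1\tCUFF.1.1\t10.0\t1000",
    "geneA\ttxA\tj\tCUFF.1\tCUFF.1.2\t2.5\t800",
    "-\t-\tu\tCUFF.2\tCUFF.2.1\t1.0\t500",
]
estimates = estimate_reads_by_class(example, read_length=50)
percentages = as_percentages(estimates)
```

With a real file, the lines could be passed in as `open("sample.tmap")` (hypothetical file name). Because the percentages are normalised to the sum of the estimates rather than to the number of mapped reads, they always total 100 even when the raw estimates overshoot.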
I would like to reproduce, for my dataset, the analysis reported in Table 2 of the supplementary material of the 2010 Nature Biotechnology paper describing Cufflinks (http://www.nature.com/nbt/journal/v28/n5/full/nbt.1621.html); the table is transcribed at the end of this post. However, I have not found an elegant and straightforward way of calculating the "Assembled reads (%)" values (column 4).
Thank you in advance for your help.
Table 2 from the supplementary material of PMID: 20436464
Table 2. Classification of all transfrags produced at any time point with respect to annotated gene models and masked repeats in the mouse genome. Transfrags that are present in multiple time point assemblies are multiply counted to preserve the relative distribution of transfrags among the categories across the full experiment.
Category                      Transfrags   % of total transfrags   Assembled reads (%)
Match to known isoform            39,857                    13.5                  76.7
Novel isoform of known gene       18,565                     6.3                  11.3
Contained in known isoform        71,029                    24.1                   4.6
Repeat                            41,906                    14.2                   0.6
Intronic                          32,658                    11.1                   0.6
Polymerase run-on                 18,522                     6.3                   0.5
Intergenic                        48,604                    16.5                   1.2
Other artifacts                   22,483                     7.7                   4.5
Total transfrags                 293,624                     100                   100