Hi,
I am trying to get gene expression data but have no reference genome. As a sort of reference, I have mapped HiSeq reads onto representative Iso-Seq sequences from my samples. I put the Iso-Seq data through IsoSeq3 -> Cogent -> ANGEL to get full-length, unique isoforms in open reading frame. I then ran Kallisto to map the HiSeq reads onto those, imported the results into R using catchSalmon, and have been analysing isoform counts in edgeR.
How do I collapse these into actual gene counts?
Can I use the naming convention output by Cogent/ANGEL, which is PB.[locus index#].[isoform index#]|, to sum the isoform counts per locus in edgeR?
(I have read "How to convert transcript level TPM to gene level TPM?" and other answers.) Is using tximport (I assume with my Kallisto abundance.tsv) still the way to go? If so, how can I get it to cut the IDs at the 2nd "." (e.g. at PB.3 for the IDs below)? My target IDs aren't Ensembl identifiers, and the tximport ignoreAfterBar=TRUE option would still keep my isoform indexes, so each isoform would remain a separate ID.
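To make the ID question concrete, here is a minimal Python sketch of the trimming I mean (Python only to illustrate the string handling, not something I run in my R workflow). The helper name `locus_id` and the regular expression are my own assumptions about the Cogent/ANGEL naming; neither comes from tximport or edgeR.

```python
import re

# Assumed ID layout: PB.<locus index>.<isoform index>|<other fields>
LOCUS_RE = re.compile(r"^(PB\.\d+)\.\d+(?:\||$)")

def locus_id(target_id: str) -> str:
    """Return the 'PB.<locus>' prefix of a target ID (hypothetical helper)."""
    match = LOCUS_RE.match(target_id)
    if match is None:
        raise ValueError(f"unexpected target ID format: {target_id!r}")
    return match.group(1)

ids = [
    "PB.2.1|002537|path0:1-1624(+)|transcript/18304|m.1",
    "PB.3.1|004815|path2:1-3039(+)|transcript/3426|m.4",
    "PB.3.2|004815|path2:1047-3035(+)|transcript/13401|m.5",
]
print([locus_id(i) for i in ids])  # ['PB.2', 'PB.3', 'PB.3']
```

If this prefix is a sensible gene-level ID, I imagine the same pattern could be used to build the gene column of a transcript-to-gene table, but I am not sure that is the right approach.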
_
One of my Kallisto abundance.tsv files looks like this:
target_id length eff_length est_counts tpm
PB.2.1|002537|path0:1-1624(+)|transcript/18304|m.1 405 260.579 198 10.9076
PB.3.1|004815|path2:1-3039(+)|transcript/3426|m.4 2187 2042.14 1920.76 13.5017
PB.3.2|004815|path2:1047-3035(+)|transcript/13401|m.5 933 788.137 60.0475 1.09369
etc.
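For reference, this is roughly the collapse I have in mind, sketched in Python on the three rows above. It assumes the file is tab-separated with the header shown, and that a plain sum of est_counts and tpm per PB.[locus] prefix is meaningful. The function name `sum_by_locus` is just my own illustration. Whether a simple sum like this is appropriate, or whether the length handling that tximport does is needed, is part of what I am asking.

```python
import csv
import io
from collections import defaultdict

# The three example rows, tab-separated (assumed abundance.tsv layout).
SAMPLE = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "PB.2.1|002537|path0:1-1624(+)|transcript/18304|m.1\t405\t260.579\t198\t10.9076\n"
    "PB.3.1|004815|path2:1-3039(+)|transcript/3426|m.4\t2187\t2042.14\t1920.76\t13.5017\n"
    "PB.3.2|004815|path2:1047-3035(+)|transcript/13401|m.5\t933\t788.137\t60.0475\t1.09369\n"
)

def sum_by_locus(handle):
    """Sum est_counts and tpm over isoforms sharing a PB.<locus> prefix."""
    totals = defaultdict(lambda: {"est_counts": 0.0, "tpm": 0.0})
    for row in csv.DictReader(handle, delimiter="\t"):
        isoform = row["target_id"].split("|", 1)[0]  # e.g. "PB.3.2"
        locus = isoform.rsplit(".", 1)[0]            # e.g. "PB.3"
        totals[locus]["est_counts"] += float(row["est_counts"])
        totals[locus]["tpm"] += float(row["tpm"])
    return dict(totals)

for locus, sums in sum_by_locus(io.StringIO(SAMPLE)).items():
    print(locus, sums)
```

Here PB.3.1 and PB.3.2 end up in a single PB.3 entry, while PB.2 keeps its single isoform's values.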