I want to calculate the distance to the polyA site for each exon of each gene. The final data.frame should look something like this:
GeneA.name Exon1 Start End Distance
GeneA.name Exon2 Start End Distance
GeneB.name Exon1 Start End Distance
GeneB.name Exon2 Start End Distance
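To make the target table concrete, here is a minimal Python sketch of how I picture filling the Distance column. Everything in it is an assumption on my part: the function name `exon_distance_rows` is made up, coordinates are taken to be 0-based half-open as in BED, each gene is assumed to have one known polyA coordinate, and "distance" is taken to mean the gap between the exon's 3' end and that coordinate. The numbers are toy values, not real annotation.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 0-based half-open as in BED


def exon_distance_rows(gene: str, strand: str, exons: List[Exon],
                       polya_pos: int) -> List[Tuple[str, str, int, int, int]]:
    """Build (gene, exon_label, start, end, distance) rows for one gene.

    Assumption: distance is measured from the exon's 3' end to polya_pos.
    """
    # Number exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    rows = []
    for i, (start, end) in enumerate(ordered, start=1):
        if strand == "+":
            distance = polya_pos - end    # 3' end is the exon end
        else:
            distance = start - polya_pos  # 3' end is the exon start
        rows.append((gene, f"Exon{i}", start, end, distance))
    return rows


# Toy input: two already-merged exons and a made-up polyA coordinate.
rows = exon_distance_rows("GeneA", "+", [(100, 200), (300, 400)], 1000)
for row in rows:
    print("\t".join(map(str, row)))
```

This only works once each gene has a single, non-redundant exon list, which is exactly the part I am stuck on.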
Each gene has several isoforms, e.g. NM_1234, NM_12345, NM_123456. If I don't collapse the isoforms into a single merged annotation per gene, the exons will be duplicated.
My idea is to get all exon locations for a given gene, but the isoform information gets in the way.
For a given gene (say HIPK1), I want all of its exons assembled into a single line in BED12 format.
Is there any method that gives me such a merged data set? Maybe the UCSC Genome Browser has a built-in tool for this?
I have a backup plan: use mergeBed (from the BEDTools suite) to collapse the isoforms. Because mergeBed by default drops the ID information and keeps only the coordinates, I would need a hash mapping gene names to "NM" accessions in a home-made Perl script, which is a waste of time if there is already an easy way to export this.
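For reference, the backup plan could be sketched roughly like this in Python instead of Perl. This is only an illustration under stated assumptions: the accessions, coordinates, and the `isoform_to_gene` lookup are made up, the helper names `merge_intervals` and `gene_bed12` are hypothetical, and touching (book-ended) exons are merged along with overlapping ones. It is not what mergeBed or UCSC does internally.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def gene_bed12(chrom, strand, gene, exons):
    """Collapse all exons of one gene into a single BED12 line."""
    blocks = merge_intervals(exons)
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]
    block_sizes = ",".join(str(e - s) for s, e in blocks)
    # blockStarts are relative to chromStart in BED12.
    block_starts = ",".join(str(s - chrom_start) for s, _ in blocks)
    fields = [chrom, chrom_start, chrom_end, gene, 0, strand,
              chrom_start, chrom_end, "0", len(blocks),
              block_sizes, block_starts]
    return "\t".join(map(str, fields))


# Hypothetical isoform -> gene hash (made-up accessions).
isoform_to_gene = {"NM_0001": "HIPK1", "NM_0002": "HIPK1"}

# Hypothetical per-isoform exon records: (chrom, strand, isoform, start, end).
records = [
    ("chr1", "+", "NM_0001", 100, 200),
    ("chr1", "+", "NM_0001", 300, 400),
    ("chr1", "+", "NM_0002", 150, 250),
    ("chr1", "+", "NM_0002", 300, 450),
]

# Group exons by gene so the gene name survives the merge.
by_gene = defaultdict(list)
for chrom, strand, isoform, start, end in records:
    by_gene[(chrom, strand, isoform_to_gene[isoform])].append((start, end))

bed12_lines = [gene_bed12(c, st, g, ex)
               for (c, st, g), ex in sorted(by_gene.items())]
for line in bed12_lines:
    print(line)
```

Grouping by gene before merging is what keeps the name attached to the merged blocks, which is the piece a plain coordinate merge loses.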