How to deal with many-to-one matching of IDs?
8.4 years ago
jack ▴ 960

Hi,

I have expression data with UCSC IDs, which I am converting to RefSeq IDs. The mapping is many-to-one.

I need to work with RefSeq IDs. Should I sum the expression levels when several UCSC IDs map to the same RefSeq ID?

UCSC          RefSeq
uc002cie.2    NM_138418
uc002cic.1    NM_138418
uc002cid.1    NM_138418
uc002cif.1    NM_138418
uc002cig.1    NM_145294
uc002cih.1    NM_145294
uc002cik.1    NM_145294
uc002cim.1    NM_145294
uc010uul.1    NM_145294
uc002cii.1    NM_145294
uc002cij.1    NM_145294
uc002cil.1    NM_145294

genomics genome RNA-Seq • 1.1k views

I see that you've tagged this RNA-seq, but this typically only occurs with microarray data. Is this really RNA-seq and, if so, why not just get expression data for the RefSeq features directly?


This is RNA-seq. I got it from TCGA, so it's not possible to get it in RefSeq features.

8.4 years ago

Ah, TCGA data, that explains it :)

Assuming you're using the "Expected counts" from RSEM that TCGA provides, just add up the counts of all UCSC IDs that map to the same RefSeq ID.
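A minimal sketch of that summing step in Python, using only the standard library. The IDs come from the question, but the count values and the names `ucsc_to_refseq`, `expected_counts`, and `sum_by_refseq` are made up for illustration; real TCGA files would need to be parsed into these dictionaries first.

```python
from collections import defaultdict

# UCSC -> RefSeq mapping (a few IDs from the question).
ucsc_to_refseq = {
    "uc002cie.2": "NM_138418",
    "uc002cic.1": "NM_138418",
    "uc002cig.1": "NM_145294",
    "uc002cih.1": "NM_145294",
}

# Made-up expected counts per UCSC ID, for illustration only.
expected_counts = {
    "uc002cie.2": 10.5,
    "uc002cic.1": 4.5,
    "uc002cig.1": 100.0,
    "uc002cih.1": 20.0,
}


def sum_by_refseq(counts, mapping):
    """Add up the counts of all UCSC IDs that share a RefSeq ID."""
    totals = defaultdict(float)
    for ucsc_id, count in counts.items():
        refseq_id = mapping.get(ucsc_id)
        if refseq_id is None:
            # No RefSeq equivalent: this sketch simply skips the ID.
            continue
        totals[refseq_id] += count
    return dict(totals)


refseq_counts = sum_by_refseq(expected_counts, ucsc_to_refseq)
print(refseq_counts)
```

Note that UCSC IDs without a RefSeq match are silently dropped here; whether that is acceptable depends on the downstream analysis.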


Why should I sum them? I'm a bit confused. If they are the same isoform, why do they have different UCSC IDs?


They're not the same in the UCSC annotation (or Ensembl, if you were using that), just in RefSeq. In UCSC, Fam195A and C16orf14 are different; in RefSeq they're the same.


I see. Do you know which annotation is more accurate?


My personal order of preference would be:

1. Ensembl or Gencode
2. UCSC
3. RefSeq

If you need RefSeq for a downstream analysis that depends on it, then there's no way around it. As a general principle, try to stick with the original annotation system as much as you can. Converting between the various annotation systems always leads to a bit of increased noise and loss of data.
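To get a feel for how much a conversion loses, one option is to tally which source IDs have no target and which targets absorb several source IDs. The sketch below is only illustrative: `mapping_report` is a hypothetical helper, and `uc_hypothetical.1` is an invented ID standing in for a transcript with no RefSeq match.

```python
from collections import Counter


def mapping_report(source_ids, mapping):
    """Return (unmapped source IDs, targets hit by more than one source ID)."""
    unmapped = [i for i in source_ids if i not in mapping]
    hits = Counter(mapping[i] for i in source_ids if i in mapping)
    collapsed = {target: n for target, n in hits.items() if n > 1}
    return unmapped, collapsed


# Mapping built from IDs in the question.
mapping = {
    "uc002cie.2": "NM_138418",
    "uc002cic.1": "NM_138418",
    "uc002cig.1": "NM_145294",
}

# "uc_hypothetical.1" is invented to show the unmapped case.
source_ids = ["uc002cie.2", "uc002cic.1", "uc002cig.1", "uc_hypothetical.1"]

unmapped, collapsed = mapping_report(source_ids, mapping)
print("unmapped:", unmapped)
print("collapsed:", collapsed)
```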