Question

ColData and rowData Sample_id mapping from TCGA database


16 months ago

Jakpa ▴ 50

Hello everyone,

I downloaded bladder cancer data from TCGA. I extracted the sample IDs with this code:

head(Blca_res$id)

output: 'a8c61671-89cb-43bc-8c88-5c107954d11c,''b03b7b9b-00ef-4e0d-bac2-0b1059d57a87,''bf98764d-1604-4a14-8e06-1c785a085db9,''c0bc697a-ac64-4605-9abc-f0fe85eb481a,''bd52f6c8-6f8b-4056-8a3e-8cdc96644952,''ab504dbf-e1f0-46d2-83f9-0f4066055c71'

I wrote this to get the same from the clinical data:

head(tcgaBlca_data@colData$sample_id)

output: 'f9bd70b2-6cde-48e5-9f0d-55d86ccfeba8,''3cae49a3-6deb-40f9-84cc-68b9b53543ff,''015e6b08-ab3c-4d1d-99e4-77b5e10bd7fc,''f09e1eeb-bcd5-4dba-92f0-7d4b34b81ce7,''0ac8e522-3c64-42f2-a66f-bd40530a328a,''3c71158d-98ff-4ef5-923f-ba31a25036ec'.

There are more than 60,000 rows with these sample IDs. What I want to find out is whether each sample ID in Blca_res$id is also present in tcgaBlca_data@colData$sample_id. E.g., is 'a8c61671-89cb-43bc-8c88-5c107954d11c' from Blca_res$id also in tcgaBlca_data@colData$sample_id?

Any suggestion on how I can implement this with lines of code in R?

Regards,

GeneExpression R TCGA • 609 views

ADD COMMENT • link 16 months ago by Jakpa ▴ 50


is the format of your head output correct? Do the sample ids actually have commas in the string?

ADD REPLY • link 16 months ago by jv ★ 1.8k


No, there are no commas, just a dot like this: .

But I have sorted it out using a more readable column in the data.

Thanks

ADD REPLY • link 16 months ago by Jakpa ▴ 50

score 0 · Answer 1 · 2022-11-28

One option to get a quick count would be to use the R table function. In this case I would use table twice to count how many of the sample ids are present once or twice between the two vectors, e.g.,

table(table(c(Blca_res$id, tcgaBlca_data@colData$sample_id)))

To show how this would play out:

df <- data.frame("id" = c('a8c61671-89cb-43bc-8c88-5c107954d11c', 'b03b7b9b-00ef-4e0d-bac2-0b1059d57a87', 'bf98764d-1604-4a14-8e06-1c785a085db9', 'c0bc697a-ac64-4605-9abc-f0fe85eb481a', 'bd52f6c8-6f8b-4056-8a3e-8cdc96644952' , 'ab504dbf-e1f0-46d2-83f9-0f4066055c71'), 
                 "sample_id" = c('f9bd70b2-6cde-48e5-9f0d-55d86ccfeba8', '3cae49a3-6deb-40f9-84cc-68b9b53543ff', '015e6b08-ab3c-4d1d-99e4-77b5e10bd7fc','f09e1eeb-bcd5-4dba-92f0-7d4b34b81ce7','0ac8e522-3c64-42f2-a66f-bd40530a328a','3c71158d-98ff-4ef5-923f-ba31a25036ec'))
table(table(c(df$id, df$sample_id)))

 1 
12

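meaning that all 12 of the ids in df$id and df$sample_id occur exactly once each, so none of them is shared between the two columns. An id present in both would be counted under 2 instead (assuming there are no duplicates within either vector).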
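
To check individual ids rather than just counting, base R's `%in%`, `intersect()` and `setdiff()` may be a more direct route. The following is a minimal sketch, not tested on the actual TCGA objects: `res_ids` and `col_ids` are hypothetical toy vectors standing in for Blca_res$id and tcgaBlca_data@colData$sample_id, with two ids deliberately shared.

```r
# Hypothetical toy vectors standing in for Blca_res$id and
# tcgaBlca_data@colData$sample_id; "id-3" and "id-4" are shared on purpose.
res_ids <- c("id-1", "id-2", "id-3", "id-4")
col_ids <- c("id-3", "id-4", "id-5", "id-6")

# For each element of res_ids: is it also present in col_ids?
in_coldata <- res_ids %in% col_ids

# Are all ids matched, and how many are?
all_present <- all(in_coldata)
n_shared <- sum(in_coldata)

# Which ids are shared, and which occur only in res_ids?
shared_ids <- intersect(res_ids, col_ids)
only_in_res <- setdiff(res_ids, col_ids)

# The double-table count from the answer, applied to the same toy vectors:
# names are the number of occurrences, values are how many ids occur that often.
occurrence_counts <- table(table(c(res_ids, col_ids)))

print(in_coldata)
print(shared_ids)
print(only_in_res)
print(occurrence_counts)
```

On the real data, the two toy vectors would be replaced by the two columns from the question. Note that `%in%` only tests membership; if the two vectors also need to be in the same order, `identical()` on the two vectors or `match()` would be the thing to look at.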