Microarray - multiple probe-ids matching to the same gene symbol but different ensembl_gene_id
8 months ago
manaswwm ▴ 490

Hello all,

Newbie in microarray analysis here. I am trying to run a differential expression analysis on some Affymetrix microarray data. I know that the array used in the experiment was the HG U95A. To identify the corresponding ensembl_gene_id for every probe ID, I am using this biomaRt code:

library(biomaRt)

#declaring hsap mart
hsap_mart = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")

#extracting the gene symbols and geneIDs based on the affymetrix probe ID
affy_probe_genenames = getBM(attributes = c("ensembl_gene_id", "affy_hg_u95a"),
                             filters = "affy_hg_u95a", values = "1007_s_at",
                             mart = hsap_mart, useCache = FALSE)

I notice that for probe 1007_s_at I get the following 5 ensembl_gene_ids: "ENSG00000234078", "ENSG00000137332", "ENSG00000230456", "ENSG00000215522" and "ENSG00000204580".

Since there is only one expression value for 1007_s_at in the dataset, I was wondering how one usually chooses the corresponding ensembl_gene_id when a single probe ID maps to multiple gene IDs, as it does here.

All 5 Ensembl gene IDs do seem to have the same gene symbol (DDR1).

Thanks in advance!

microarray affymetrix • 903 views
8 months ago

Hello,

It appears to be related to the fact that, at this locus, there are alternate haplotype sequences, which each have their own ENSG ID for this gene. One can take a look at the locus targeted by this probe at the UCSC Genome Browser: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&las...

The alternative sequences in question are labelled 'chr6_GL000251v2_alt'.

The 'true' ID of the gene seems to be ENSG00000204580, based on Ensembl and UCSC records.

I am unsure how to automate the correct selection of the ENSG ID in cases like this. Perhaps if you also pull the contig / chromosome via biomaRt, the difference will show up there. Alternatively, you could also pull entrezgene_id; hopefully, it will be blank for those ENSG IDs that are on the alternate sequences.

Kind regards,

Kevin


Thanks for your message! I see. The trick with entrezgene_id does not seem to work, as all five gene IDs have the same Entrez ID (also visible in the answer from @bk11). However, the trick with the contig/chromosome name does seem to work, as pointed out by @bk11. So is the logic here that genes on the chromosomes are preferred over those on scaffolds?

8 months ago
bk11 ★ 2.4k

You could choose the longest gene that also has a cytogenetic band. In your case, ENSG00000204580 is the longest and the only one with band information.

library(biomaRt)

hsap_mart = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# pull location attributes (chromosome, band, coordinates) alongside the IDs
affy_probe_genenames = getBM(attributes = c("ensembl_gene_id", "affy_hg_u95a", "hgnc_symbol",
                                            "chromosome_name", "band", "entrezgene_id",
                                            "start_position", "end_position"),
                             filters = "affy_hg_u95a", values = "1007_s_at",
                             mart = hsap_mart, useCache = FALSE)
# gene span in bp (end minus start)
affy_probe_genenames$size = affy_probe_genenames$end_position - affy_probe_genenames$start_position
affy_probe_genenames

  ensembl_gene_id affy_hg_u95a hgnc_symbol      chromosome_name   band entrezgene_id start_position end_position  size
1 ENSG00000234078    1007_s_at        DDR1 HSCHR6_MHC_MANN_CTG1                  780        2191217      2210394 19177
2 ENSG00000137332    1007_s_at        DDR1  HSCHR6_MHC_COX_CTG1                  780        2360744      2379927 19183
3 ENSG00000230456    1007_s_at        DDR1  HSCHR6_MHC_DBB_CTG1                  780        2137270      2156466 19196
4 ENSG00000215522    1007_s_at        DDR1  HSCHR6_MHC_QBL_CTG1                  780        2136143      2155326 19183
5 ENSG00000204580    1007_s_at        DDR1                    6 p21.33           780       30876421     30900156 23735
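
One way to automate this selection, sketched here under stated assumptions rather than as an established rule, is to keep only the rows whose chromosome_name is a primary-assembly chromosome. The data frame below is typed in by hand from the output above so that the example runs without querying Ensembl. The names hits, primary_chromosomes, primary_hits and ambiguous_probes are made up for illustration, and the list of primary chromosome names (1-22, X, Y, MT) is an assumption about how Ensembl labels the human primary assembly.

```r
# Rows copied by hand from the getBM() output above (only the columns needed here)
hits <- data.frame(
  ensembl_gene_id = c("ENSG00000234078", "ENSG00000137332", "ENSG00000230456",
                      "ENSG00000215522", "ENSG00000204580"),
  affy_hg_u95a    = "1007_s_at",
  chromosome_name = c("HSCHR6_MHC_MANN_CTG1", "HSCHR6_MHC_COX_CTG1",
                      "HSCHR6_MHC_DBB_CTG1", "HSCHR6_MHC_QBL_CTG1", "6"),
  stringsAsFactors = FALSE
)

# Assumption: primary-assembly chromosomes are named 1-22, X, Y and MT
primary_chromosomes <- c(as.character(1:22), "X", "Y", "MT")

# Drop gene IDs located on alternate haplotype contigs / scaffolds
primary_hits <- hits[hits$chromosome_name %in% primary_chromosomes, ]

# Probes that still map to more than one gene ID after filtering need a closer look
ids_per_probe <- table(primary_hits$affy_hg_u95a)
ambiguous_probes <- names(ids_per_probe)[ids_per_probe > 1]

print(primary_hits)
```

For this probe the filter leaves only the ENSG00000204580 row. With a full getBM() result for all probes, the same filter could be applied to the returned data frame, and anything listed in ambiguous_probes would still need a separate decision.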

Thanks! The trick of extracting the chromosome/scaffold name does seem to work in this case. So is the logic that genes on chromosomes are preferred over those on scaffolds? If so, is there any intuitive reasoning for this (probably a newbie question)?
