Question: match() function returning NA even when there is a match
3.3 years ago, lech.kaczmarczyk wrote:

I have a data frame with gene names like this:

> test
1                  mmu-miR-181a-5p
2                  mmu-miR-181b-5p
3 mmu-miR-199a-3p__mmu-miR-199b-3p
4 mmu-miR-669o-3p__mmu-miR-669a-3p
5                  mmu-miR-669d-5p
6                   mmu-miR-103-3p

I truncate the names as follows, to be able to match them with miRBase IDs:

> test$A <- gsub( "-3p*$", "", test$A)
> test$A <- gsub( "-5p*$", "", test$A)
> test
1                  mmu-miR-181a
2                  mmu-miR-181b
3 mmu-miR-199a-3p__mmu-miR-199b
4 mmu-miR-669o-3p__mmu-miR-669a
5                  mmu-miR-669d
6                   mmu-miR-103

Now I would like to use biomaRt to find the Ensembl IDs for the genes, but match() fails to find them:

> ensembl = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
> genemap <- getBM( attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name","mirbase_id" ,"mirbase_trans_name"),
+                   mart = ensembl )
> idx <- match(test$A, genemap$mirbase_id )
> idx

Out of this list, mmu-mir-669d should give a match, but it doesn't. This is just an example: out of the complete list I got about 16 matches, while I was expecting hundreds.

I was thinking of spaces generated by the gsub function, but there are no spaces. It's likely a stupid error, but where? Any educated guesses will be welcome...

3.3 years ago, Kevin Blighe wrote:


The match function looks only for exact (case-sensitive) matches, which, in this scenario, is a good thing, because gene annotation can be difficult and frustrating, with vagueness and ambiguity between different naming systems.
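
The exact-matching behaviour can be seen in a minimal base R sketch. The vector `ids` below is a made-up stand-in for `genemap$mirbase_id`, not real biomaRt output, and lower-casing both sides is only one possible way to compare the names.

```r
# Hypothetical stand-in for genemap$mirbase_id (lower-case "mir" form)
ids <- c("mmu-mir-103-1", "mmu-mir-669d", "mmu-mir-181a-1")

# Name as it appears in the question, with a capital "R"
query <- "mmu-miR-669d"

# match() compares strings exactly, so the capital "R" prevents a hit
exact_idx <- match(query, ids)

# Lower-casing both sides makes the comparison case-insensitive
folded_idx <- match(tolower(query), tolower(ids))

print(exact_idx)   # NA
print(folded_idx)  # 2
```

Case folding alone would not rescue names that also lack a suffix such as "-1", so the names still need tidying.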

The only issue that you are facing is with the names of the miRNAs for which you are searching. I was able to identify each of your miRNAs in the test data-frame using the following code:

library(biomaRt)

# Rebuild the test data frame from the question
test <- data.frame(A = c("mmu-miR-181a-5p", "mmu-miR-181b-5p",
                         "mmu-miR-199a-3p__mmu-miR-199b-3p",
                         "mmu-miR-669o-3p__mmu-miR-669a-3p",
                         "mmu-miR-669d-5p", "mmu-miR-103-3p"),
                   stringsAsFactors = FALSE)

# Drop the trailing -3p / -5p arm suffix, as in the question
test$A <- gsub("-3p*$", "", test$A)
test$A <- gsub("-5p*$", "", test$A)

# Tidy the names so that they agree with the mirbase_id values
test$A <- gsub("R", "r", test$A)  # miR -> mir
test$A <- gsub("mmu-mir-181a", "mmu-mir-181a-1", test$A)
test$A <- gsub("mmu-mir-181b", "mmu-mir-181b-1", test$A)
# For the paired entries, drop the first miRNA and keep the one after "__"
test$A <- gsub("^mmu-mir-[0-9]*[a-z]-[35]p__", "", test$A)
test$A <- gsub("mmu-mir-103", "mmu-mir-103-1", test$A)
test$A <- gsub("mmu-mir-669a", "mmu-mir-669a-1", test$A)

# Query Ensembl, filtering on mirbase_id with the tidied names as values
ensembl <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
matches <- getBM(mart = ensembl,
                 attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name", "mirbase_id"),
                 filters = "mirbase_id", values = test$A, uniqueRows = TRUE)

     ensembl_gene_id gene_biotype external_gene_name     mirbase_id
1 ENSMUSG00000065553        miRNA           Mir103-1  mmu-mir-103-1
2 ENSMUSG00000065565        miRNA          Mir181a-1 mmu-mir-181a-1
3 ENSMUSG00000065458        miRNA          Mir181b-1 mmu-mir-181b-1
4 ENSMUSG00000092807        miRNA            Mir199b   mmu-mir-199b
5 ENSMUSG00000096583        miRNA          Mir669a-1 mmu-mir-669a-1
6 ENSMUSG00000095699        miRNA            Gm26092 mmu-mir-669a-1
7 ENSMUSG00000077834        miRNA            Mir669d   mmu-mir-669d

You can see that I first tidy up the names of your miRNAs in the second block of my code. For example, the match and getBM functions will never find matches for lookup terms like mmu-miR-199a-3p__mmu-miR-199b or mmu-miR-669o-3p__mmu-miR-669a. In this example, I have actually just eliminated the first miRNA in these 2 lookup terms and only focused on the miRNA after the '__'. For the other miRNAs, I searched for them HERE to see what the official term could be.
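
If both miRNAs in a paired entry should be kept rather than only the one after the '__', one possible approach is to split on the separator first. This is a sketch in base R, not what the code above does; the vector `a` is a shortened copy of the truncated names from the question.

```r
# Truncated names as they appear in the question (after the two gsub calls)
a <- c("mmu-miR-181a", "mmu-miR-199a-3p__mmu-miR-199b", "mmu-miR-669d")

# Split each entry on the "__" separator; fixed = TRUE treats it literally
parts <- unlist(strsplit(a, "__", fixed = TRUE))

# Strip any remaining -3p / -5p arm suffix, then switch to the lower-case form
parts <- sub("-[35]p$", "", parts)
parts <- sub("miR", "mir", parts, fixed = TRUE)

print(parts)
# "mmu-mir-181a" "mmu-mir-199a" "mmu-mir-199b" "mmu-mir-669d"
```

Some of the resulting names may still need a suffix such as "-1" before they can be used as lookup terms, as with mmu-mir-181a-1 in the output above.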

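You can also see that I used the getBM function differently here: instead of downloading the whole table and calling match, I filter on mirbase_id and pass the tidied names as the values, so only the matching records are returned.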


Thanks a bunch for a comprehensive reply and for teaching me useful regular expressions. What a shame, I was (apparently) too tired to notice that there was a capital R in the query :)

written 3.3 years ago by lech.kaczmarczyk

No problem! I cannot be sure about the case sensitivity of the getBM function, but the other parts that I modified in the micro-RNA names are important!

written 3.3 years ago by Kevin Blighe