Question: match() function returning NA even when there is a match
lech.kaczmarczyk wrote (4 weeks ago):

I have a data frame with gene names like this:

> test
                                 A
1                  mmu-miR-181a-5p
2                  mmu-miR-181b-5p
3 mmu-miR-199a-3p__mmu-miR-199b-3p
4 mmu-miR-669o-3p__mmu-miR-669a-3p
5                  mmu-miR-669d-5p
6                   mmu-miR-103-3p

I truncate the names as follows, to be able to match them with miRBase IDs:

> test$A <- gsub( "-3p*$", "", test$A)
> test$A <- gsub( "-5p*$", "", test$A)
> test
                              A
1                  mmu-miR-181a
2                  mmu-miR-181b
3 mmu-miR-199a-3p__mmu-miR-199b
4 mmu-miR-669o-3p__mmu-miR-669a
5                  mmu-miR-669d
6                   mmu-miR-103

Now I would like to use biomaRt to find the Ensembl IDs for the genes, but match() fails to find any match:

> ensembl = useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
> genemap <- getBM( attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name","mirbase_id" ,"mirbase_trans_name"),
+                   mart = ensembl )
> idx <- match(test$A, genemap$mirbase_id )
> idx
[1] NA NA NA NA NA NA

Out of this list, mmu-mir-669d should give a match but it doesn't. This is just an example - out of the complete list I got about 16 matches, while I was expecting hundreds.

I was thinking of spaces generated by the gsub function, but there are no spaces. It's likely a stupid error, but where? Any educated guesses will be welcome...

Kevin Blighe wrote (4 weeks ago):

Hey,

The match function looks for exact, case-sensitive matches, which, in this scenario, is a good thing because gene annotation can be very difficult and frustrating, with vagueness and ambiguity between different naming systems.
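As a minimal illustration of that exact-matching behaviour, here is a small sketch using one name from the question. The lookup vector is a made-up stand-in for genemap$mirbase_id, not real biomaRt output:

```r
# Toy stand-in for genemap$mirbase_id (assumption: the IDs are lower-case "mir")
lookup <- c("mmu-mir-103-1", "mmu-mir-669d")

# Name as it appears in the question after truncation (capital R)
query <- "mmu-miR-669d"

# match() compares strings exactly, so the capital R prevents a hit
idx_exact <- match(query, lookup)
idx_exact
# [1] NA

# Lower-casing both sides before matching gives the expected index
idx_lower <- match(tolower(query), tolower(lookup))
idx_lower
# [1] 2
```

Lower-casing only addresses the capital R; names that need a suffix such as -1 still have to be fixed separately.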

The only issue that you are facing is with the names of the miRNAs for which you are searching. I was able to identify each of your miRNAs in the test data-frame using the following code:

# Rebuild the example data frame with a character column named A
test <- data.frame(A = c("mmu-miR-181a-5p","mmu-miR-181b-5p","mmu-miR-199a-3p__mmu-miR-199b-3p","mmu-miR-669o-3p__mmu-miR-669a-3p","mmu-miR-669d-5p","mmu-miR-103-3p"), stringsAsFactors = FALSE)
# Strip the trailing arm suffix (-3p or -5p) from each name
test$A <- gsub("-[35]p$", "", test$A)

test$A <- gsub("R", "r", test$A)
test$A <- gsub("mmu-mir-181a", "mmu-mir-181a-1", test$A)
test$A <- gsub("mmu-mir-181b", "mmu-mir-181b-1", test$A)
test$A <- gsub("^mmu-mir-[0-9]*[a-z]-[35]p__", "", test$A)
test$A <- gsub("mmu-mir-103", "mmu-mir-103-1", test$A)
test$A <- gsub("mmu-mir-669a", "mmu-mir-669a-1", test$A)

library(biomaRt)
ensembl <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
# Ask biomaRt only for the rows whose mirbase_id is one of the cleaned names
matches <- getBM(mart = ensembl,
                 attributes = c("ensembl_gene_id", "gene_biotype", "external_gene_name", "mirbase_id"),
                 filters = "mirbase_id", values = test$A, uniqueRows = TRUE)
matches

     ensembl_gene_id gene_biotype external_gene_name     mirbase_id
1 ENSMUSG00000065553        miRNA           Mir103-1  mmu-mir-103-1
2 ENSMUSG00000065565        miRNA          Mir181a-1 mmu-mir-181a-1
3 ENSMUSG00000065458        miRNA          Mir181b-1 mmu-mir-181b-1
4 ENSMUSG00000092807        miRNA            Mir199b   mmu-mir-199b
5 ENSMUSG00000096583        miRNA          Mir669a-1 mmu-mir-669a-1
6 ENSMUSG00000095699        miRNA            Gm26092 mmu-mir-669a-1
7 ENSMUSG00000077834        miRNA            Mir669d   mmu-mir-669d

You can see that I first tidy up the names of your miRNAs in the second block of my code. For example, the match and getBM functions will never find matches for lookup terms like mmu-miR-199a-3p__mmu-miR-199b or mmu-miR-669o-3p__mmu-miR-669a. In this example, I have actually just eliminated the first miRNA in these 2 lookup terms and only focused on the miRNA after the '__'. For the other miRNAs, I searched for them HERE to see what the official term could be.
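For a longer list, the generic part of this tidying could be wrapped in a small helper. The sketch below is only an illustration: normalise_mirna is a hypothetical name, and unlike the code above it keeps both miRNAs of a '__' pair instead of dropping the first one. It does not add suffixes such as -1, which still have to be looked up by hand:

```r
# Hypothetical helper (not part of biomaRt): turn names like
# "mmu-miR-181a-5p" into candidate lower-case "mir" names.
normalise_mirna <- function(x) {
  # Split combined entries such as "a__b" into their component names
  parts <- unlist(strsplit(x, "__", fixed = TRUE))
  # Drop a trailing arm suffix (-3p or -5p)
  parts <- sub("-[35]p$", "", parts)
  # Replace the capital R in "miR" with a lower-case r
  parts <- sub("-miR-", "-mir-", parts, fixed = TRUE)
  # Return each candidate name once
  unique(parts)
}

candidates <- normalise_mirna(c("mmu-miR-181a-5p",
                                "mmu-miR-199a-3p__mmu-miR-199b-3p"))
candidates
# [1] "mmu-mir-181a" "mmu-mir-199a" "mmu-mir-199b"
```

The resulting vector could then be passed as the values argument of getBM; any name that returns nothing is a candidate for a manual check.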

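You can also see that I used the getBM function differently here: rather than downloading the whole table and calling match on it, I passed the tidied names as values for the mirbase_id filter, so only the matching rows are returned.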
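One practical difference between the two approaches is worth a small offline sketch. match() returns only the first position of each name, whereas a filter-style lookup returns every row, which matters because mmu-mir-669a-1 appears twice in the output above. The toy genemap below is built from three rows of that output and is only a stand-in for a real getBM result:

```r
# Toy table copied from three rows of the output above (not a live query)
genemap <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000096583", "ENSMUSG00000095699", "ENSMUSG00000077834"),
  mirbase_id      = c("mmu-mir-669a-1", "mmu-mir-669a-1", "mmu-mir-669d"),
  stringsAsFactors = FALSE
)
query <- c("mmu-mir-669a-1", "mmu-mir-669d")

# match() gives one index per query name: the first occurrence only
first_hits <- match(query, genemap$mirbase_id)
first_hits
# [1] 1 3

# %in% keeps every row whose ID is in the query, including the duplicate
all_hits <- genemap[genemap$mirbase_id %in% query, ]
nrow(all_hits)
# [1] 3
```

If the full table has already been downloaded, subsetting with %in% (or merge) is therefore closer to what the filtered getBM call returns than match() is.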


Thanks a bunch for the comprehensive reply and for teaching me useful regular expressions. What a shame, I was (apparently) too tired to notice the capital R in the query :)

Reply written 4 weeks ago by lech.kaczmarczyk

No problem! I cannot be sure about the case sensitivity of the getBM function, but the other parts that I modified in the micro-RNA names are important!

Reply written 4 weeks ago by Kevin Blighe