Extracting strings from the fasta header
3.9 years ago
lokraj2003 ▴ 120

I want to extract the gene name, gene start position and gene stop position from the headers of a FASTA file. I tried extracting them by position, but the positions are not consistent across headers. Is there another way to extract them?

This is what I have tried so far.

    # I have a vector of these headers; here it has just one element
    library(stringr)

    names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

    # Then I extracted the words from the string
    string_list1 <- str_extract_all(names1, boundary("word"))

    # result
    string_list1[1]

    [[1]]
     [1] "lcl"                           "NC_005336.1_cds_NP_957781.1_1"
     [3] "locus_tag"                     "ORFVgORF001"                  
     [5] "db_xref"                       "GeneID"                       
     [7] "2947687"                       "protein"                      
     [9] "ORF001"                        "hypothetical"                 
    [11] "protein"                       "protein_id"                   
    [13] "NP_957781.1"                   "location"                     
    [15] "complement"                    "3162"                         
    [17] "3611"                          "gbkey"                        
    [19] "CDS"

So I was trying to extract the 4th, 16th and 17th elements from this list. That works for this particular example, but not for other headers where the positions differ. The gene name is usually at the 4th position, but the positions of the start and stop coordinates vary between headers. So this strategy does not work, and I can't think of another one.

fasta R string • 1.1k views

Split on, or focus on, the actual keys such as locus_tag= or location=complement, if those are consistent. This will probably require regular expressions.
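A minimal sketch of that idea in base R, assuming every field in the header has the form [key=value] and that values never contain a closing square bracket. The helper name get_field is hypothetical, not part of any package:

```r
# Hypothetical helper: return the value of one "[key=value]" field per header.
# Headers that lack the key give NA.
get_field <- function(header, key) {
  # "[^]]*" means "any run of characters other than ']'"
  pattern <- paste0("\\[", key, "=([^]]*)\\]")
  hits <- regmatches(header, regexec(pattern, header))
  vapply(hits,
         function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1),
         USE.NAMES = FALSE)
}

headers <- c(
  "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
  "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [gbkey=CDS]"
)

gene     <- get_field(headers, "locus_tag")
location <- get_field(headers, "location")
```

Because the lookup is by key rather than by position, it does not matter where in the header a field appears, and a missing field shows up as NA instead of shifting the other values.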

3.9 years ago
Alex Nesmelov ▴ 200

If the gene name looks like [locus_tag=gene_name] and the coordinates like [location=complement(3162..3611)], this works:

library(tidyverse)

names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687][protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

(res <-  
 str_replace_all(names1,
              "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$",
              "\\1___\\2___\\3") %>%
 str_split("___")
)

If the headers are in a "gene_name" column of a data.frame called df, a clean final table can be produced easily:

df %>% 
mutate(gene_name =  str_replace_all(gene_name,
                                   "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$",
                                   "\\1___\\2___\\3")) %>% 
separate(gene_name,
         sep="___",
         into = c("gene", "start", "end"))

Awesome. It works. Actually I have my gene names in the column of a data frame, so this is perfect. Would you mind telling me briefly what these regular expressions are doing? Thanks again for taking your time!


We replace the whole string with the three values of interest, which are captured with parentheses and referred to in the replacement as \\1, \\2 and \\3. The trick is that the pattern has to match the entire string, so that everything outside the captured groups is dropped.

  • ^.*?locus_tag= ------ anything from the start of the string (^) up to and including locus_tag=. This part is matched only so that the replacement discards it.
  • (.*?)\\] --------- anything after locus_tag= up to the next closing square bracket. This is the gene name, captured by the parentheses as \\1.
  • .*?\\[location.*?(\\d+) -------- anything up to "[location", then anything up to the first run of one or more digits (\\d+). That number is captured as the gene start (\\2); the other matched parts are discarded.
  • \\.\\.(\\d+) ------ the two dots separating the coordinates, followed by the second number, captured as the gene end (\\3).
  • .*?$ -------- anything up to the end of the string ($).
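The same capture groups can also be read out directly instead of going through a replacement and a split. The following is only a sketch in base R, with a hypothetical function name parse_header and a slightly simplified pattern; it assumes the location holds a single start..end pair, so for something like join(...) it would return only the first pair:

```r
# Three capture groups: gene name, start, end.
# No need to match the whole string, because nothing is being replaced.
pattern <- "locus_tag=([^]]+)\\].*\\[location=[^0-9]*([0-9]+)\\.\\.([0-9]+)"

# Hypothetical helper: one row per header, NA where the pattern does not match.
parse_header <- function(header) {
  hits <- regmatches(header, regexec(pattern, header))
  parts <- t(vapply(hits,
                    function(hit) if (length(hit) == 4) hit[2:4] else rep(NA_character_, 3),
                    character(3),
                    USE.NAMES = FALSE))
  data.frame(gene  = parts[, 1],
             start = as.integer(parts[, 2]),
             end   = as.integer(parts[, 3]),
             stringsAsFactors = FALSE)
}

headers <- c(
  "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
  "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [protein_id=NP_957781.1] [gbkey=CDS]"
)

res <- parse_header(headers)
```

One difference worth knowing: with the replacement approach, a header that does not match the pattern is returned unchanged, whereas here it becomes a row of NA values, which is easier to spot. Note that the second header has a locus_tag but no location, so the whole row is NA, gene included.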
3.9 years ago
zx8754 11k

Here is a start:

# example data
x <- c("lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
       "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [gbkey=CDS]")

f1 <- function(x, pattern){
  lapply(strsplit(x, " "), function(i){
    grep(pattern, i, value = TRUE)
  })
  }

f1(x, "locus_tag")
# [[1]]
# [1] "[locus_tag=ORFVgORF001]"
# 
# [[2]]
# [1] "[locus_tag=Test001]"
f1(x, "location")
# [[1]]
# [1] "[location=complement(3162..3611)]"
# 
# [[2]]
# character(0)
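One possible next step, sketched under assumptions: turn each location token into numeric coordinates. The function name parse_location is hypothetical. It takes the smallest and largest number in the token as start and end, and it reports strand "-" when the token contains complement, on the assumption that complement(...) marks the reverse strand as in GenBank location syntax:

```r
# Hypothetical helper: parse one location token, e.g. "[location=complement(3162..3611)]".
# An empty token (no location field in the header) gives a row of NAs.
parse_location <- function(token) {
  na_row <- data.frame(start = NA_integer_, end = NA_integer_,
                       strand = NA_character_, stringsAsFactors = FALSE)
  if (length(token) == 0) return(na_row)
  token <- token[1]
  nums <- as.integer(regmatches(token, gregexpr("[0-9]+", token))[[1]])
  if (length(nums) == 0) return(na_row)
  data.frame(start  = min(nums),
             end    = max(nums),
             strand = if (grepl("complement", token, fixed = TRUE)) "-" else "+",
             stringsAsFactors = FALSE)
}

# Tokens shaped like the per-header output shown above: one element per header
tokens <- list("[location=complement(3162..3611)]", character(0))
coords <- do.call(rbind, lapply(tokens, parse_location))
```

Using min and max means a multi-part location would be collapsed to its overall span, which may or may not be what you want.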

This works. Since I have all the names in a data frame, the solution provided by Alex Nesmelov suits my needs. Thank you!
