Question: Extracting strings from the fasta header
1
5 months ago by
lokraj200390
lokraj200390 wrote:

I want to extract the gene name, gene start position, and gene stop position from the headers of a FASTA file. I tried extracting them by position, but those positions are not consistent. Is there another way to extract them?

This is what I have tried so far.

    # I have a vector of these names. Here it has just one element.
    library(stringr)

    names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

    # Then I extracted the words from each string in the vector
    string_list1 <- str_extract_all(names1, boundary("word"))

    # result
    string_list1[1]

    [[1]]
     [1] "lcl"                           "NC_005336.1_cds_NP_957781.1_1"
     [3] "locus_tag"                     "ORFVgORF001"
     [5] "db_xref"                       "GeneID"
     [7] "2947687"                       "protein"
     [9] "ORF001"                        "hypothetical"
    [11] "protein"                       "protein_id"
    [13] "NP_957781.1"                   "location"
    [15] "complement"                    "3162"
    [17] "3611"                          "gbkey"
    [19] "CDS"

So I tried to extract the 4th, 16th, and 17th elements from this list. That works for this particular example, but not for other headers, where the positions differ. The gene name is usually at the 4th position, but the positions of the start and stop coordinates vary among the FASTA headers. So this strategy does not work, and I can't think of another one.

string R fasta • 228 views
modified 5 months ago by Alex Nesmelov170 • written 5 months ago by lokraj200390

Split on, or focus on, the actual keys such as locus_tag or location=complement, if those are consistent. This might require regular expressions.
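A minimal base R sketch of this key-based idea, assuming every field has the form [key=value]. The helper name get_field and the trimmed example headers are made up for illustration; they are not from the original post.

```r
# Hypothetical helper: return the value of `key` from each header, or NA if the key is absent.
get_field <- function(headers, key) {
  # Match "[key=value]" and capture everything up to the closing bracket
  pattern <- paste0("\\[", key, "=([^]]*)\\]")
  hits <- regmatches(headers, regexec(pattern, headers))
  # Each hit is c(full match, captured value), or character(0) when there is no match
  vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_, character(1))
}

# Trimmed versions of the headers discussed in this thread (assumed input)
headers <- c(
  "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [location=complement(3162..3611)] [gbkey=CDS]",
  "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [gbkey=CDS]"
)

get_field(headers, "locus_tag")
get_field(headers, "location")
```

Because the lookup is by key rather than by position, a header that lacks a field gives NA for that field instead of shifting the other values.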

modified 5 months ago • written 5 months ago by curious460
3
5 months ago by
Alex Nesmelov170
Alex Nesmelov170 wrote:

If the gene name looks like [locus_tag=gene_name] and the coordinates look like [location=complement(3162..3611)]:

library(tidyverse)

names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687][protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

# replace the whole header with "gene___start___end", then split on "___"
(res <-
  str_replace_all(names1,
                  "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$",
                  "\\1___\\2___\\3") %>%
  str_split("___")
)

If the headers are stored in a "gene_name" column of a data.frame called df, a clean final table can be produced easily:

# replace each header with "gene___start___end", then split it into three columns
df %>%
  mutate(gene_name = str_replace_all(gene_name,
                                     "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$",
                                     "\\1___\\2___\\3")) %>%
  separate(gene_name,
           sep = "___",
           into = c("gene", "start", "end"))
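One thing to watch with the pipeline above: a header without a location field does not match the pattern, so it is left unchanged instead of being split. Below is a base R sketch of an NA-safe variant, not the answer's method. It assumes the same gene_name column; the helper first_group, the toy df, and the three separate patterns are illustrative choices.

```r
# Toy data frame standing in for the real `df`; the second header has no location field
df <- data.frame(
  gene_name = c(
    "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
    "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [protein_id=NP_957781.1] [gbkey=CDS]"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: capture group 1 of `pattern` for each string, NA where nothing matches
first_group <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(h) if (length(h) >= 2) h[2] else NA_character_, character(1))
}

out <- data.frame(
  gene  = first_group(df$gene_name, "\\[locus_tag=([^]]*)\\]"),
  # first number inside the location field that is followed by ".."
  start = as.integer(first_group(df$gene_name, "\\[location=[^]0-9]*([0-9]+)\\.\\.")),
  # number right after ".." that closes the location field
  end   = as.integer(first_group(df$gene_name, "\\[location=[^]]*\\.\\.([0-9]+)[^]0-9]*\\]")),
  stringsAsFactors = FALSE
)
out
```

Matching each field with its own pattern means the gene name is still recovered when the location is missing, and start and end come back as integers.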
modified 5 months ago • written 5 months ago by Alex Nesmelov170

Awesome. It works. Actually I have my gene names in the column of a data frame, so this is perfect. Would you mind telling me briefly what these regular expressions are doing? Thanks again for taking your time!

written 5 months ago by lokraj200390

We replace the whole string with the three values of interest, which are captured with parentheses and referred to in the replacement as \\1, \\2, and \\3. The trick is to make the pattern match the whole string, so that everything outside the captured groups is discarded.

  • ^.*?locus_tag= ------ anything from the start of the string (^) up to and including locus_tag=. This part is matched only so that the replacement removes it.
  • (.*?)\\] --------- anything after locus_tag= up to the next closing square bracket. This is the gene name, and it is captured by the parentheses.
  • .*?\\[location.*?(\\d+) -------- anything up to "[location", then anything after it up to a number of one or more digits (\\d+). The number is captured as the gene start via the parentheses; the other matched parts are removed.
  • \\.\\. ------ the two literal dots separating the gene coordinates.
  • (\\d+) ------ the number after the dots, captured as the gene end.
  • .*?$ -------- anything up to the end of the string ($).
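To see what each pair of parentheses captures, here is a small base R sketch that runs the same pattern through regexec instead of a replacement. Using perl = TRUE and the labels gene, start, and end are my own choices for illustration.

```r
header <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

# Same pattern as in the answer; perl = TRUE is assumed here so that .*? and \\d follow PCRE rules
pattern <- "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$"

m <- regmatches(header, regexec(pattern, header, perl = TRUE))[[1]]

# m[1] is the full match (here the whole header); m[2:4] are the three capture groups
groups <- m[-1]
names(groups) <- c("gene", "start", "end")  # illustrative labels
groups
```

The full match in m[1] is exactly the text that str_replace_all swaps out, and the three groups are what \\1, \\2, and \\3 put back.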
modified 5 months ago • written 5 months ago by Alex Nesmelov170
2
5 months ago by
zx87549.7k
London
zx87549.7k wrote:

Here is a start:

# example data
x <- c("lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
       "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [gbkey=CDS]")

# return the space-separated fields of each header that contain `pattern`
f1 <- function(x, pattern){
  lapply(strsplit(x, " "), function(i){
    grep(pattern, i, value = TRUE)
  })
}

f1(x, "locus_tag")
# [[1]]
# [1] "[locus_tag=ORFVgORF001]"
# 
# [[2]]
# [1] "[locus_tag=Test001]"
f1(x, "location")
# [[1]]
# [1] "[location=complement(3162..3611)]"
# 
# [[2]]
# character(0)
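One possible way to continue from here, as a sketch only: strip the brackets from the locus_tag field and pull the numbers out of the location field. The helper parse_location is made up for illustration. Taking the smallest and largest numbers as start and end is an assumption that suits simple locations such as complement(3162..3611).

```r
x <- c("lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
       "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [gbkey=CDS]")

# f1 as defined above: keep the space-separated fields that contain `pattern`
f1 <- function(x, pattern) {
  lapply(strsplit(x, " "), function(i) grep(pattern, i, value = TRUE))
}

# Gene names: assumes every header has exactly one locus_tag field
genes <- gsub("^\\[locus_tag=|\\]$", "", unlist(f1(x, "locus_tag")))

# Hypothetical helper: integers found in the location field, NA when the field is missing
parse_location <- function(tokens) {
  missing <- c(start = NA_integer_, end = NA_integer_)
  if (length(tokens) == 0) return(missing)
  nums <- as.integer(regmatches(tokens[1], gregexpr("[0-9]+", tokens[1]))[[1]])
  if (length(nums) == 0) return(missing)
  c(start = min(nums), end = max(nums))
}

# One row per header, with columns "start" and "end"
coords <- t(sapply(f1(x, "location"), parse_location))
genes
coords
```

The second header has no location field, so its row in coords is NA rather than an error.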
written 5 months ago by zx87549.7k

This works. Since I have all the names in a data frame, the solution provided by @Alex Nesmelov suits my needs. Thank you!

written 5 months ago by lokraj200390