Question: Extracting strings from the fasta header
22 days ago by
lokraj200390 wrote:

I want to extract the gene name, gene start position, and gene stop position from the headers of a FASTA file. I tried extracting them by position, but the positions are not consistent across headers. Is there another way to extract them?

This is what I have tried so far.

    #I have a vector of these file names. Here I have just one element

    names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

     #Then I extracted words from the string list 

     string_list1 <-  str_extract_all(names1, boundary("word"))


             [1] "lcl"                           "NC_005336.1_cds_NP_957781.1_1"
             [3] "locus_tag"                     "ORFVgORF001"                  
             [5] "db_xref"                       "GeneID"                       
             [7] "2947687"                       "protein"                      
             [9] "ORF001"                        "hypothetical"                 
            [11] "protein"                       "protein_id"                   
            [13] "NP_957781.1"                   "location"                     
            [15] "complement"                    "3162"                         
            [17] "3611"                          "gbkey"                        
            [19] "CDS"

So I was trying to extract the 4th, 16th, and 17th elements of this list. That works for this particular example, but not for other headers where the positions differ. The gene name is usually at the 4th position, but the start and stop positions vary among the FASTA headers, so this strategy fails and I can't think of another one.

Tags: string, R, fasta
modified 21 days ago by Alex Nesmelov100 • written 22 days ago by lokraj200390

Split on, or focus on, the actual keys such as locus_tag= or location=complement, if those are consistent. This will probably require regular expressions.

modified 21 days ago • written 21 days ago by curious320
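To illustrate the key-based idea from the comment above, here is a minimal base R sketch that turns every "[key=value]" field of a header into a named vector, so fields can be looked up by name instead of by position. It assumes every field is wrapped in square brackets and contains no nested "]"; the variable names (`fields`, `tags`) are just illustrative.

```r
header <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

# Pull out every "[...]" field, then strip the surrounding brackets
fields <- regmatches(header, gregexpr("\\[[^]]+\\]", header))[[1]]
fields <- gsub("^\\[|\\]$", "", fields)

# Split each field on its first "=" into a key and a value
keys   <- sub("=.*$", "", fields)
values <- sub("^[^=]*=", "", fields)
tags   <- setNames(values, keys)

# Look fields up by name, regardless of where they sit in the header
tags[["locus_tag"]]
tags[["location"]]
```

A header that lacks a given key simply has no entry under that name, so it is worth checking `"location" %in% names(tags)` before indexing.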
21 days ago by
Alex Nesmelov100 wrote:

If the gene name looks like [locus_tag=gene_name] and the coordinates look like [location=complement(3162..3611)]:


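library(stringr)  # str_replace_all(), str_split() and the %>% pipe
names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687][protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

# Pattern as explained later in this thread; the closing str_split() is an assumed completion.
(res <- str_replace_all(names1,
                        "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$",
                        "\\1___\\2___\\3") %>%
   str_split("___"))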

If the headers are stored in a "gene_name" column of a data.frame called df, a clean final table can be produced directly:

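# Needs dplyr (mutate), tidyr (separate) and stringr (str_replace_all).
# Pattern as explained later in this thread; separate() is assumed from the `into =` argument.
df %>%
  mutate(gene_name = str_replace_all(gene_name,
                                     "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$",
                                     "\\1___\\2___\\3")) %>%
  separate(gene_name, sep = "___",
           into = c("gene", "start", "end"))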
modified 20 days ago • written 21 days ago by Alex Nesmelov100

Awesome. It works. Actually I have my gene names in the column of a data frame, so this is perfect. Would you mind telling me briefly what these regular expressions are doing? Thanks again for taking your time!

written 20 days ago by lokraj200390

We replace the whole string with the three values of interest, which are captured with parentheses and referred to in the replacement as \\1, \\2, and \\3. The trick is to match the entire string, so that everything outside the captured groups is discarded.

  • ^.*?locus_tag= ------ anything from the start of the string (^) up to and including locus_tag=. This part is matched so that it gets replaced, i.e. deleted.
  • (.*?)\\] --------- anything after locus_tag= up to the next closing square bracket. This is the gene name, and it is captured by the parentheses.
  • .*?\\[location.*?(\\d+) -------- anything up to "[location", and then anything up to the first run of one or more digits (\\d+). That number is captured as the gene start via the parentheses; the other matched parts are removed.
  • \\.\\. ------ the two dots separating the gene coordinates.
  • (\\d+) ------ the second number, captured as the gene end.
  • .*?$ -------- anything up to the end of the string ($).
modified 19 days ago • written 19 days ago by Alex Nesmelov100
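The capture groups described above can also be read out directly, without the ___ placeholder round trip. The following is a minimal base R sketch, not the approach from the answer itself; it assumes R 3.3 or later for `regexec(..., perl = TRUE)`, and `parse_header` and `genes` are hypothetical names. The second header, which has no location field, is taken from the test data in the other answer.

```r
headers <- c(
  "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
  "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [gbkey=CDS]"
)

# Same three capture groups; no need to match the whole string here
pattern <- "locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+)"

# regexec() reports the full match plus one entry per capture group;
# regmatches() yields character(0) for headers that do not match
m <- regmatches(headers, regexec(pattern, headers, perl = TRUE))

# Hypothetical helper: drop the full match, return NAs when nothing matched
parse_header <- function(hit) {
  if (length(hit) == 0) {
    return(c(gene = NA_character_, start = NA_character_, end = NA_character_))
  }
  setNames(hit[-1], c("gene", "start", "end"))
}

genes <- as.data.frame(do.call(rbind, lapply(m, parse_header)),
                       stringsAsFactors = FALSE)
genes$start <- as.integer(genes$start)
genes$end   <- as.integer(genes$end)
genes
```

Note that a header without a location field yields NA in all three columns, including the gene name, because the pattern has to match as a whole.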
21 days ago by
zx87549.3k wrote:

Here is a start:

# example data
x <- c("lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
       "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [gbkey=CDS]")

# split each header on spaces and keep the pieces that match the pattern
f1 <- function(x, pattern){
  lapply(strsplit(x, " "), function(i){
    grep(pattern, i, value = TRUE)
  })
}

f1(x, "locus_tag")
# [[1]]
# [1] "[locus_tag=ORFVgORF001]"
# [[2]]
# [1] "[locus_tag=Test001]"
f1(x, "location")
# [[1]]
# [1] "[location=complement(3162..3611)]"
# [[2]]
# character(0)
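
One possible way to continue from here is to turn the location fields into numeric start and end columns. This is only a sketch: it assumes each location holds a single start..end pair (a joined location with several ranges would need more care), and the `coords` name is illustrative.

```r
x <- c("lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
       "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [gbkey=CDS]")

# f1 as above: keep the space-separated pieces that match the pattern
f1 <- function(x, pattern) {
  lapply(strsplit(x, " "), function(i) grep(pattern, i, value = TRUE))
}

loc <- f1(x, "location")

# One row per header; NA where the header has no location field
coords <- t(vapply(loc, function(field) {
  if (length(field) == 0) return(c(start = NA_integer_, end = NA_integer_))
  # take the first two runs of digits in the field as start and end
  nums <- as.integer(regmatches(field, gregexpr("[0-9]+", field))[[1]])
  c(start = nums[1], end = nums[2])
}, c(start = 0L, end = 0L)))

coords
```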
written 21 days ago by zx87549.3k

This works. Since I have all the names in a data frame, the solution provided by @Alex Nesmelov suits my needs. Thank you!

written 20 days ago by lokraj200390