Question: Get cDNA sequences from NCBI for given protein IDs
0
gravatar for Chirag Parsania
4 weeks ago by
Chirag Parsania1.5k
University of Macau
Chirag Parsania1.5k wrote:

Hi,

I have following NCBI protein ids, which I got from blast result. I want cDNA sequences for all of them. How can I get them. I tried batch entrez but it didn't work

WP_114529907
WP_086161776
WP_023545800
WP_055635421
WP_046726840
WP_019985024
WP_030816474
WP_015656893
WP_066997879
WP_076090756
WP_046591128
WP_037737828
WP_019061909
WP_031182504
WP_029184528
WP_107430231
WP_075738394
WP_018547093
WP_050371941
WP_058922273
WP_014675804
WP_010039119
WP_059042595
WP_031045624
WP_003993952
WP_067124593
WP_069770111
WP_085209034
proteinid cdna ncbi • 184 views
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Chirag Parsania1.5k

Since these are WP_* records they are special non-redundant protein ID's. They refer to more than one genome so that is a reason you are not able to get cDNA sequence directly.

ADD REPLYlink written 4 weeks ago by genomax71k

Thanks!

Somehow I got the nucleotide (cDNA) co-ordinates for my query protein id (Wp_*). Is there a way to get sequences from NCBI using nucletide co-ordinates ?

Protein        `Nucleotide Accession`   Start    Stop Strand
  <chr>          <chr>                    <dbl>   <dbl> <chr> 
1 WP_115909503.1 NZ_QTTG01000001.1      5250192 5251037 +     
2 WP_115909964.1 NZ_QTTG01000001.1      5851735 5852526 -     
3 WP_116145412.1 NZ_QREB01000001.1      1915696 1916487 +     
4 WP_117489828.1 NZ_QVIG01000001.1      7216177 7217025 +     
5 WP_119102858.1 NZ_QXMJ01000158.1        70757   71548 -     
6 WP_119295284.1 NZ_KZ984569.1            27367   28866 -
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Chirag Parsania1.5k
2
gravatar for Sej Modha
4 weeks ago by
Sej Modha4.4k
Glasgow, UK
Sej Modha4.4k wrote:

You could get all CDS associated with the nucleotide record and filter the fasta file for sequences of interest, based on the presence of the protein ID in the header.

elink -target nuccore -db protein -name protein_nuccore -id WP_114529907|efetch -format fasta_cds_na > all_cds.fasta
ADD COMMENTlink written 4 weeks ago by Sej Modha4.4k

Bingo... !! How to give file containing several Ids instead of just one id

ADD REPLYlink written 4 weeks ago by Chirag Parsania1.5k

I'd suggest writing a for loop to process each ID separately.

ADD REPLYlink written 4 weeks ago by Sej Modha4.4k
0
gravatar for Chirag Parsania
4 weeks ago by
Chirag Parsania1.5k
University of Macau
Chirag Parsania1.5k wrote:

Once all_cds.fasta downloaded as mentioned above by @SejModha, below is the R script to get the query sequences.

library(tidyverse)

## ncbi query protein ids  for which  we need cDNA fasta file 
query_ids <- c("WP_030938735.1", "WP_073767452.1", "WP_075030623.1", "WP_073737703.1", "WP_084907526.1", "WP_058922273.1", "WP_069743312.1", "WP_063483499.1", "WP_086727825.1", "WP_100828975.1") %>% tibble(V1= .)


## get sequences in R
fasta_cdna <- Biostrings::readDNAStringSet("./all_cds.fasta")
fasta_cdna_ids <- base::names(fasta_cdna) 
fasta_cdna_ids %>% head()
#> [1] "lcl|NZ_QLLY01000003.1_cds_WP_146614694.1_1 [locus_tag=K377_RS06515] [protein=DUF4331 domain-containing protein] [protein_id=WP_146614694.1] [location=<1..501] [gbkey=CDS]"                            
#> [2] "lcl|NZ_QLLY01000003.1_cds_WP_111662221.1_2 [locus_tag=K377_RS06520] [protein=hypothetical protein] [protein_id=WP_111662221.1] [location=583..1914] [gbkey=CDS]"                                       
#> [3] "lcl|NZ_QLLY01000003.1_cds_WP_111662644.1_3 [locus_tag=K377_RS06525] [protein=nickel transporter] [protein_id=WP_111662644.1] [location=2067..3539] [gbkey=CDS]"                                        
#> [4] "lcl|NZ_QLLY01000003.1_cds_WP_146614695.1_4 [locus_tag=K377_RS06530] [protein=hypothetical protein] [protein_id=WP_146614695.1] [location=complement(3622..3861)] [gbkey=CDS]"                          
#> [5] "lcl|NZ_QLLY01000003.1_cds_WP_111662645.1_5 [locus_tag=K377_RS06535] [protein=ABC transporter permease] [protein_id=WP_111662645.1] [location=complement(3995..4783)] [gbkey=CDS]"                      
#> [6] "lcl|NZ_QLLY01000003.1_cds_WP_111662223.1_6 [locus_tag=K377_RS06540] [protein=ATP-binding cassette domain-containing protein] [protein_id=WP_111662223.1] [location=complement(4822..5922)] [gbkey=CDS]"

## from original header get protein id for each seq and assign it as sequence id 
get_prot_id <- as_mapper(~ gsub(pattern = ".*\\[protein_id=(.*?)\\].*" , replacement = "\\1" ,x = .))

pp_id <-map_chr(fasta_cdna_ids ,get_prot_id)

length(pp_id)
#> [1] 1131303

pp_id %>% sample(10)
#>  [1] "WP_062925461.1"                                                                                                                                                     
#>  [2] "WP_143613249.1"                                                                                                                                                     
#>  [3] "WP_100109264.1"                                                                                                                                                     
#>  [4] "WP_095873712.1"                                                                                                                                                     
#>  [5] "WP_078631405.1"                                                                                                                                                     
#>  [6] "WP_099926778.1"                                                                                                                                                     
#>  [7] "lcl|NZ_CP041609.1_cds_8438 [locus_tag=FNV67_RS44310] [protein=hypothetical protein] [pseudo=true] [partial=3'] [location=complement(<9384513..9384704)] [gbkey=CDS]"
#>  [8] "WP_143607681.1"                                                                                                                                                     
#>  [9] "WP_067427611.1"                                                                                                                                                     
#> [10] "WP_097242740.1"

names(fasta_cdna) <- pp_id

## keep only unique sequences  
fasta_cdna_uniq <- fasta_cdna[ ! (names(fasta_cdna) %>% duplicated())] 

## get your sequences for your query id 
query_fasta_cdna <- fasta_cdna_uniq [names(fasta_cdna_uniq) %in% query_ids$V1] 
length(query_fasta_cdna)
#> [1] 9

## ids for which sequences not found 
not_found <- query_ids$V1[! query_ids$V1 %in% names(query_fasta_cdna)]

not_found %>% length()
#> [1] 1

## write fasta 
Biostrings::writeXStringSet(query_fasta_cdna , filepath = "./query_id_cdna.fasta")

## write not found ids
write_delim(path =  "./query_id_not_found.txt" , x = not_found %>% as.data.frame() , delim = "\t")
ADD COMMENTlink written 4 weeks ago by Chirag Parsania1.5k
1

You can get the protein sequence easily using Entrezdirect:

$ efetch -db protein -id "WP_030938735.1" -format fasta
>WP_030938735.1 serine protease [Streptomyces violaceoruber]
MFGFNRAKKTAAAVAATAAAAATVLIGAPAAVAAPQPIVGGTTTTTTAYPFMMQITDASQNQFCGGTLVS
ATKVVTAAHCMVGESTSSVRVVGGRTYLNGTNGTVSRVSKIWINPSYTDATNGDDVAVLTLSTSMPYTKA
SYVSSSQTSVYAAGTTARIIGWGTTSSNGSSSNQLRTATVPIVSDSSCASSYGSEYVKSDMVCAGYTSGG
TDTCQGDSGGPLLIGGVLAGITSWGEGCAQAGYPGVYTRLTTFSSLVTAQVNS
ADD REPLYlink written 4 weeks ago by genomax71k

I need cDNA (nt) sequences not protein.

ADD REPLYlink written 4 weeks ago by Chirag Parsania1.5k

Did you not get those directly from @Sej's answer? Perhaps I am missing something simple. These are bacterial sequences so there should be no introns. Are you just making a non-redundant sequence file to remove dups?

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by genomax71k

yes, I got from @Sej's answer. However, its partial solution as we further need to filter all_cds.fasta. And that's what I mentioned in the Rscript.

ADD REPLYlink written 4 weeks ago by Chirag Parsania1.5k

I wrote complete solution here in step by step manner

ADD REPLYlink written 12 days ago by Chirag Parsania1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 751 users visited in the last hour