How to extract genomic upstream region of a protein identified by its NCBI accession number?
11 weeks ago
mrj ▴ 70

I have a list of NCBI protein accession numbers. For each one, I would like to extract the genomic region upstream of the corresponding gene's nucleotide sequence. I would be grateful if someone could show me how to do this.

For example, here are some of the protein accession numbers whose upstream genomic regions I would like to extract:

EET74829.1

VEI24834.1

AYW77996.1

EJD65589.1

EFM49534.1

bedtools extract_upstream_region genomic_sequence NCBI • 384 views

Always provide example accession numbers when asking questions about them.


GenoMax, thanks for the advice. I have added the list of accession numbers to my original post and am pasting them below as well.

EET74829.1

VEI24834.1

AYW77996.1

EJD65589.1

EFM49534.1

11 weeks ago
GenoMax 110k

These appear to be protein accession numbers that point to various assemblies, so there are no direct gene associations. It may be best to do this as a three-step process.

Using Entrez Direct (EDirect):

Get the accession number of the nucleotide assembly/genome record that encodes the protein:

$ esearch -db protein -query AYW77996 | elink -target nuccore | efetch -format acc
CP033719.1
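
Since the question involves a list of accessions, the lookup above could be wrapped in a loop. The following is only a sketch: `strip_version` and `protein_to_nuccore` are hypothetical helper names, the version suffix is dropped simply because the query above uses `AYW77996` without `.1`, and it assumes EDirect is on the PATH with network access. A protein may link to more than one nuccore record, in which case several lines come back.

```shell
# strip_version and protein_to_nuccore are hypothetical helper names.

# Drop the ".N" version suffix, as the query above does (AYW77996.1 -> AYW77996).
strip_version() {
  printf '%s\n' "${1%%.*}"
}

# Print "protein_accession<TAB>nuccore_accession" for each argument.
# Requires EDirect (esearch, elink, efetch) on PATH and network access.
protein_to_nuccore() {
  for acc in "$@"; do
    # </dev/null keeps esearch from reading the caller's stdin.
    nuc=$(esearch -db protein -query "$(strip_version "$acc")" </dev/null |
          elink -target nuccore |
          efetch -format acc)
    printf '%s\t%s\n' "$acc" "$nuc"
  done
}

# Example call (not run here):
# protein_to_nuccore EET74829.1 VEI24834.1 AYW77996.1 EJD65589.1 EFM49534.1
```

Passing the accessions as arguments rather than reading them in a `while read` loop sidesteps the problem of a command inside the loop consuming the rest of the input.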

Get the nucleotide start/stop coordinates of the CDS:

$ efetch -db nuccore -id CP033719.1 -format fasta_cds_na | grep AYW77996
>lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]
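
To reuse those coordinates in a script, the `[location=...]` field of the header can be parsed. This is a minimal sketch: `parse_location` is a hypothetical helper, it assumes a minus-strand CDS is written as `complement(...)`, and for a `join(...)` location it simply takes the first and last numbers.

```shell
# Header line copied from the efetch -format fasta_cds_na output above.
header='>lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]'

# parse_location (hypothetical helper): print "start stop strand" for one header line.
parse_location() {
  # Pull out the text between "[location=" and the next "]".
  loc=$(printf '%s\n' "$1" | sed -n 's/.*\[location=\([^]]*\)\].*/\1/p')
  # Assumption: minus-strand features are wrapped in complement(...).
  case "$loc" in
    complement*) strand=minus ;;
    *)           strand=plus ;;
  esac
  # First and last integers in the location string.
  start=$(printf '%s\n' "$loc" | grep -o '[0-9][0-9]*' | head -n 1)
  stop=$(printf '%s\n' "$loc" | grep -o '[0-9][0-9]*' | tail -n 1)
  printf '%s %s %s\n' "$start" "$stop" "$strand"
}

parse_location "$header"   # prints: 1885267 1887939 plus
```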

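Use the location coordinates to get the sequence you want (e.g. 200 bp upstream). Pay attention to the strand: this CDS is on the plus strand, so the upstream region ends just before the start coordinate; for a location written as complement(...), the upstream region lies after the stop coordinate instead. Note that the range below (1885067 to 1885267) is 201 bp and includes the first base of the CDS; use -seq_stop 1885266 for exactly 200 bp upstream.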

$ efetch -db nuccore -id CP033719.1 -format fasta -seq_start 1885067 -seq_stop 1885267
>CP033719.1:1885067-1885267 Propionibacterium acidifaciens strain FDAARGOS_576 chromosome, complete genome
GGCTCCGAGCACTGGCGCCAGGTGGGCGGCCTGGGCAACATCGCAGCCCTGCTCGGTCTCGTCGCCGTGG
CCGTCTGGTCGTCCGTGGTCCGGGACGCCGCCGAGGCCGAGCGGCCCCCGTCCGCGCGGGGCGGCCCCGG
CCCGGTCGGCGGGGGAGCCCCCGACAACCCGCCCGCCATGACGATCCCGAGGACCGACGCA
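
The window arithmetic for either strand can be sketched as follows. `upstream_window` is a hypothetical helper, not part of EDirect. It excludes the CDS itself, clips at position 1, and does not clip at the end of the sequence or wrap around a circular chromosome. The third value it prints is meant for efetch's `-strand` option (1 = forward, 2 = reverse complement), assuming the installed EDirect version supports that flag.

```shell
# upstream_window (hypothetical helper):
#   usage: upstream_window CDS_START CDS_STOP STRAND [FLANK]
#   prints "seq_start seq_stop efetch_strand" for the FLANK bp upstream of the CDS.
upstream_window() {
  cds_start=$1
  cds_stop=$2
  strand=$3
  flank=${4:-200}
  if [ "$strand" = plus ]; then
    # Upstream lies before the start coordinate; stop one base short of the CDS.
    win_start=$((cds_start - flank))
    win_stop=$((cds_start - 1))
    if [ "$win_start" -lt 1 ]; then
      win_start=1
    fi
    printf '%s %s 1\n' "$win_start" "$win_stop"
  else
    # Minus strand: upstream lies after the stop coordinate and needs
    # reverse-complementing (strand value 2). Not clipped to sequence length.
    win_start=$((cds_stop + 1))
    win_stop=$((cds_stop + flank))
    printf '%s %s 2\n' "$win_start" "$win_stop"
  fi
}

upstream_window 1885267 1887939 plus 200   # prints: 1885067 1885266 1

# The three values could then be passed on, for example:
# efetch -db nuccore -id CP033719.1 -format fasta \
#   -seq_start 1885067 -seq_stop 1885266 -strand 1
```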

GenoMax, this is great and works well for me. Thank you very much.
