How to extract genomic upstream region of a protein identified by its NCBI accession number?
28 days ago
mrj ▴ 60

I have a list of NCBI protein accession numbers. For each protein, I would like to extract the genomic region upstream of the corresponding gene's nucleotide sequence. I would be grateful if someone could show me how to do this.

For example, here are some of the protein accession numbers:

EET74829.1

VEI24834.1

AYW77996.1

EJD65589.1

EFM49534.1


GenoMax, Thanks for the advice. I have added the list of accession numbers in my original post. I will paste them below as well.

EET74829.1

VEI24834.1

AYW77996.1

EJD65589.1

EFM49534.1

28 days ago
GenoMax 107k

These appear to be protein accession numbers that point to various assemblies, so there are no direct gene associations. It may be best to do this as a three-step process.

Using Entrez Direct (EDirect):

Get the accession number of nucleotide assembly/genome

$ esearch -db protein -query AYW77996 | elink -target nuccore | efetch -format acc
CP033719.1

Get the nucleotide start/stops for CDS

$ efetch -db nuccore -id CP033719.1 -format fasta_cds_na | grep AYW77996
>lcl|CP033719.1_cds_AYW77996.1_1542 [locus_tag=EGX94_07890] [protein=copper oxidase] [protein_id=AYW77996.1] [location=1885267..1887939] [gbkey=CDS]


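Use the location coordinates to get the sequence you want (e.g. 200 bp upstream). Pay attention to the strand: if the location is reported as complement(start..stop), the gene is on the minus strand, so its upstream region lies after the stop coordinate and must be reverse-complemented. The range is inclusive, so the example below also returns the first base of the CDS (position 1885267).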

$ efetch -db nuccore -id CP033719.1 -format fasta -seq_start 1885067 -seq_stop 1885266
>CP033719.1:1885067-1885267 Propionibacterium acidifaciens strain FDAARGOS_576 chromosome, complete genome
GGCTCCGAGCACTGGCGCCAGGTGGGCGGCCTGGGCAACATCGCAGCCCTGCTCGGTCTCGTCGCCGTGG
CCGTCTGGTCGTCCGTGGTCCGGGACGCCGCCGAGGCCGAGCGGCCCCCGTCCGCGCGGGGCGGCCCCGG
CCCGGTCGGCGGGGGAGCCCCCGACAACCCGCCCGCCATGACGATCCCGAGGACCGACGCA
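
If you have many accessions, working out the coordinates by hand gets tedious. Here is a rough sketch of a helper that reads the [location=...] field from a fasta_cds_na header and prints the upstream window. The function name upstream_range and its output format are made up for illustration and are not part of EDirect. It assumes simple start..stop or complement(start..stop) locations; join(...) locations are reduced to their outermost coordinates, which may not be what you want. Unlike the example above, it stops one base before the CDS start, so it returns exactly the requested number of bases.

```shell
# upstream_range HEADER [LENGTH]
# Hypothetical helper (not an EDirect command): parses the [location=...]
# field of a fasta_cds_na header line and prints "seq_start seq_stop strand"
# for the LENGTH bases upstream of the CDS (default 200).
upstream_range() {
    local header="$1" len="${2:-200}"
    local loc start stop from

    # Extract the text between "[location=" and the next "]".
    loc=$(printf '%s\n' "$header" | sed -n 's/.*\[location=\([^]]*\)\].*/\1/p')
    [ -n "$loc" ] || return 1

    # First and last numbers in the location are taken as start and stop.
    start=$(printf '%s\n' "$loc" | grep -o '[0-9][0-9]*' | head -n 1)
    stop=$(printf '%s\n' "$loc" | grep -o '[0-9][0-9]*' | tail -n 1)

    case "$loc" in
        complement*)
            # Minus strand: upstream lies after the stop coordinate.
            # (Not clipped to the end of the sequence; length is unknown here.)
            echo "$((stop + 1)) $((stop + len)) minus"
            ;;
        *)
            # Plus strand: upstream lies before the start coordinate.
            [ "$start" -gt 1 ] || return 1   # nothing upstream of position 1
            from=$((start - len))
            if [ "$from" -lt 1 ]; then from=1; fi
            echo "$from $((start - 1)) plus"
            ;;
    esac
}
```

For the header shown above this prints "1885067 1885266 plus", and the first two values can be passed to efetch as -seq_start and -seq_stop. For a "minus" result you would also need the reverse complement; I believe efetch has a -strand option for that, but please check efetch -help on your installed version before relying on it.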


GenoMax, this is great and works for me. Thank you very much.