Question

Downloading multiple gene sequence exon-wise using commandline

0

Entering edit mode

2.5 years ago

Saumya Gupta • 0

hello all,

I want to download exon-wise gene sequence of around 200 genes from NCBI. Is there anyway to do this from terminal? I have the transcript id for each sequence and I am using efetch command to download their CDS.

eg: efetch -db nuccore -format fasta_cds_na -id NM_000125.4

However, I need their exon-wise sequence as well. Is there any way in which we can give the NCBI transcript id as an input and get exon-wise sequence as an output?

exon-wise Gene sequence commandline • 866 views

ADD COMMENT • link updated 2.5 years ago by GenoMax 141k • written 2.5 years ago by Saumya Gupta • 0

score 1 · Answer 1 · 2021-10-26

You can get the feature table and then parse that to retrieve the data you need using start and stop co-ordinate.

$  efetch -db nuccore -format ft -id NM_000125.4 | grep exon
1   683 exon
684 874 exon
875 991 exon
992 1327    exon
1328    1466    exon
1467    1600    exon
1601    1784    exon
1785    6327    exon

Practically this can be used as following to retrieve the sequence. You will get the same header for all exons (sequence truncated for space).

$  efetch -db nuccore -format ft -id NM_000125.4 | grep exon | awk '{print $1,$2}' | xargs -n 2 sh -c 'efetch -db nuccore -id NM_000125.4 -seq_start "$0" -seq_stop "$1" -format fasta'
>NM_000125.4:1-683 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
AGCTGGCGGAGGGCGTTCGTCCTGGGACTGCACTTGCTCCCGTCGGGTCGCCCGGCTTCACCGGACCCGC
AGGCTCCCGGGGCAGGGCCGGGGCCAGAGCTCGCGTGTCGGCGGGACATGCGCTGCGTCGCCTCTAACCT
>NM_000125.4:684-874 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
GCCAAATTCAGATAATCGACGCCAGGGTGGCAGAGAAAGATTGGCCAGTACCAATGACAAGGGAAGTATG
GCTATGGAATCTGCCAAGGAGACTCGCTACTGTGCAGTGTGCAATGACTATGCTTCAGGCTACCATTATG
GAGTCTGGTCCTGTGAGGGCTGCAAGGCCTTCTTCAAGAGAAGTATTCAAG
>NM_000125.4:875-991 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
GACATAACGACTATATGTGTCCAGCCACCAACCAGTGCACCATTGATAAAAACAGGAGGAAGAGCTGCCA
GGCCTGCCGGCTCCGTAAATGCTACGAAGTGGGAATGATGAAAGGTG

Or you could look in gene database. This will retrieve all transcript variants. You will need to parse the one you want.

$ esearch -db gene -query NM_000125.4 | efetch -format gene_table

Exon table for  mRNA  NM_000125.4 and protein NP_000116.2
Genomic Interval Exon       Genomic Interval Coding     Gene Interval Exon      Gene Interval Coding        Exon Length Coding Length   Intron Length
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
151807682-151808364     151807913-151808364     153535-154217       153766-154217       683     452     34232
151842597-151842787     151842597-151842787     188450-188640       188450-188640       191     191     37867
151880655-151880771     151880655-151880771     226508-226624       226508-226624       117     117     63401
151944173-151944508     151944173-151944508     290026-290361       290026-290361       336     336     67147
152011656-152011794     152011656-152011794     357509-357647       357509-357647       139     139     49196
152060991-152061124     152060991-152061124     406844-406977       406844-406977       134     134     33260
152094385-152094568     152094385-152094568     440238-440421       440238-440421       184     184     4163
152098732-152103274     152098732-152098966     444585-449127       444585-444819       4543        235