Downloading multiple gene sequence exon-wise using commandline
1
hello all,
I want to download exon-wise gene sequence of around 200 genes from NCBI. Is there anyway to do this from terminal? I have the transcript id for each sequence and I am using efetch command to download their CDS.
eg: efetch -db nuccore -format fasta_cds_na -id NM_000125.4
However, I need their exon-wise sequence as well. Is there any way in which we can give the NCBI transcript id as an input and get exon-wise sequence as an output?
exon-wise
Gene
sequence
commandline
• 866 views
You can get the feature table and then parse that to retrieve the data you need using start and stop co-ordinate.
$ efetch -db nuccore -format ft -id NM_000125.4 | grep exon
1 683 exon
684 874 exon
875 991 exon
992 1327 exon
1328 1466 exon
1467 1600 exon
1601 1784 exon
1785 6327 exon
Practically this can be used as following to retrieve the sequence. You will get the same header for all exons (sequence truncated for space).
$ efetch -db nuccore -format ft -id NM_000125.4 | grep exon | awk '{print $1,$2}' | xargs -n 2 sh -c 'efetch -db nuccore -id NM_000125.4 -seq_start "$0" -seq_stop "$1" -format fasta'
>NM_000125.4:1-683 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
AGCTGGCGGAGGGCGTTCGTCCTGGGACTGCACTTGCTCCCGTCGGGTCGCCCGGCTTCACCGGACCCGC
AGGCTCCCGGGGCAGGGCCGGGGCCAGAGCTCGCGTGTCGGCGGGACATGCGCTGCGTCGCCTCTAACCT
>NM_000125.4:684-874 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
GCCAAATTCAGATAATCGACGCCAGGGTGGCAGAGAAAGATTGGCCAGTACCAATGACAAGGGAAGTATG
GCTATGGAATCTGCCAAGGAGACTCGCTACTGTGCAGTGTGCAATGACTATGCTTCAGGCTACCATTATG
GAGTCTGGTCCTGTGAGGGCTGCAAGGCCTTCTTCAAGAGAAGTATTCAAG
>NM_000125.4:875-991 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
GACATAACGACTATATGTGTCCAGCCACCAACCAGTGCACCATTGATAAAAACAGGAGGAAGAGCTGCCA
GGCCTGCCGGCTCCGTAAATGCTACGAAGTGGGAATGATGAAAGGTG
Or you could look in gene database. This will retrieve all transcript variants. You will need to parse the one you want.
$ esearch -db gene -query NM_000125.4 | efetch -format gene_table
Exon table for mRNA NM_000125.4 and protein NP_000116.2
Genomic Interval Exon Genomic Interval Coding Gene Interval Exon Gene Interval Coding Exon Length Coding Length Intron Length
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
151807682-151808364 151807913-151808364 153535-154217 153766-154217 683 452 34232
151842597-151842787 151842597-151842787 188450-188640 188450-188640 191 191 37867
151880655-151880771 151880655-151880771 226508-226624 226508-226624 117 117 63401
151944173-151944508 151944173-151944508 290026-290361 290026-290361 336 336 67147
152011656-152011794 152011656-152011794 357509-357647 357509-357647 139 139 49196
152060991-152061124 152060991-152061124 406844-406977 406844-406977 134 134 33260
152094385-152094568 152094385-152094568 440238-440421 440238-440421 184 184 4163
152098732-152103274 152098732-152098966 444585-449127 444585-444819 4543 235
Login before adding your answer.
Traffic: 1855 users visited in the last hour
Thank you so much for your answer. The script that you have shared works perfectly if the id starts with "NM".
However most of my ids start with "XM" and in that case the exon start and stop coordinate is not present in the feature table.
eg:
I am unable to figure out how to use gene-table to extract the exons. Should I use bedtools getFasta and provide genomic intervals from the gene table as an input? or is there any easier way in efetch?
These are computationally predicted ID's. If they don't have exon entries your only option would be to select
gene
feature instead.