Downloading multiple gene sequence exon-wise using commandline
1
0
Entering edit mode
2.5 years ago

hello all,

I want to download exon-wise gene sequence of around 200 genes from NCBI. Is there anyway to do this from terminal? I have the transcript id for each sequence and I am using efetch command to download their CDS.

eg: efetch -db nuccore -format fasta_cds_na -id NM_000125.4

However, I need their exon-wise sequence as well. Is there any way in which we can give the NCBI transcript id as an input and get exon-wise sequence as an output?

exon-wise Gene sequence commandline • 866 views
ADD COMMENT
1
Entering edit mode
2.5 years ago
GenoMax 141k

You can get the feature table and then parse that to retrieve the data you need using start and stop co-ordinate.

$  efetch -db nuccore -format ft -id NM_000125.4 | grep exon
1   683 exon
684 874 exon
875 991 exon
992 1327    exon
1328    1466    exon
1467    1600    exon
1601    1784    exon
1785    6327    exon

Practically this can be used as following to retrieve the sequence. You will get the same header for all exons (sequence truncated for space).

$  efetch -db nuccore -format ft -id NM_000125.4 | grep exon | awk '{print $1,$2}' | xargs -n 2 sh -c 'efetch -db nuccore -id NM_000125.4 -seq_start "$0" -seq_stop "$1" -format fasta'
>NM_000125.4:1-683 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
AGCTGGCGGAGGGCGTTCGTCCTGGGACTGCACTTGCTCCCGTCGGGTCGCCCGGCTTCACCGGACCCGC
AGGCTCCCGGGGCAGGGCCGGGGCCAGAGCTCGCGTGTCGGCGGGACATGCGCTGCGTCGCCTCTAACCT
>NM_000125.4:684-874 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
GCCAAATTCAGATAATCGACGCCAGGGTGGCAGAGAAAGATTGGCCAGTACCAATGACAAGGGAAGTATG
GCTATGGAATCTGCCAAGGAGACTCGCTACTGTGCAGTGTGCAATGACTATGCTTCAGGCTACCATTATG
GAGTCTGGTCCTGTGAGGGCTGCAAGGCCTTCTTCAAGAGAAGTATTCAAG
>NM_000125.4:875-991 Homo sapiens estrogen receptor 1 (ESR1), transcript variant 1, mRNA
GACATAACGACTATATGTGTCCAGCCACCAACCAGTGCACCATTGATAAAAACAGGAGGAAGAGCTGCCA
GGCCTGCCGGCTCCGTAAATGCTACGAAGTGGGAATGATGAAAGGTG

Or you could look in gene database. This will retrieve all transcript variants. You will need to parse the one you want.

$ esearch -db gene -query NM_000125.4 | efetch -format gene_table

Exon table for  mRNA  NM_000125.4 and protein NP_000116.2
Genomic Interval Exon       Genomic Interval Coding     Gene Interval Exon      Gene Interval Coding        Exon Length Coding Length   Intron Length
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
151807682-151808364     151807913-151808364     153535-154217       153766-154217       683     452     34232
151842597-151842787     151842597-151842787     188450-188640       188450-188640       191     191     37867
151880655-151880771     151880655-151880771     226508-226624       226508-226624       117     117     63401
151944173-151944508     151944173-151944508     290026-290361       290026-290361       336     336     67147
152011656-152011794     152011656-152011794     357509-357647       357509-357647       139     139     49196
152060991-152061124     152060991-152061124     406844-406977       406844-406977       134     134     33260
152094385-152094568     152094385-152094568     440238-440421       440238-440421       184     184     4163
152098732-152103274     152098732-152098966     444585-449127       444585-444819       4543        235
ADD COMMENT
0
Entering edit mode

Thank you so much for your answer. The script that you have shared works perfectly if the id starts with "NM".

 efetch -db nuccore -format ft -id NM_000125.4 | grep exon | awk '{print $1,$2}' | xargs -n 2 sh -c 'efetch -db nuccore -id NM_000125.4 -seq_start "$0" -seq_stop "$1" -format fasta'

However most of my ids start with "XM" and in that case the exon start and stop coordinate is not present in the feature table.

eg:

efetch -db nuccore -format ft -id XM_006883548

>Feature ref|XM_006883548.1|
1   738 gene
            gene    C1QA
            db_xref GeneID:102876553
1   738 CDS
            product complement C1q subcomponent subunit A
            protein_id  ref|XP_006883610.1|
            db_xref GeneID:102876553

I am unable to figure out how to use gene-table to extract the exons. Should I use bedtools getFasta and provide genomic intervals from the gene table as an input? or is there any easier way in efetch?

ADD REPLY
1
Entering edit mode

These are computationally predicted ID's. If they don't have exon entries your only option would be to select gene feature instead.

ADD REPLY

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6