Question: Getting exon sequences for all human genes
gravatar for valerie
3.3 years ago by
valerie80 wrote:

Hi everyone,

I would be very grateful if you could help me.

I want to download the sequences of all the exons for each human gene. I went to ensembl biomart and tried to do it for BRCA2 first (this was a random choice). First I selected the following attributes: Unspliced(gene), Exon start, Exon end, strand, Gene start. I thought that this information will be enough to 'cut out' all the exons. The first thing I noticed is that the number of exons is almost twice larger then the number I see on the Wiki. Then I tried to download directly Exon sequences, but the number of sequences was again larger the the number of exons should be and moreover several exon ids correspond to the same sequence. I also tried to download coding sequence and cDNA, but the length of both sequences is not consistent with 'official' BRCA2 length!

These all drives me crazy and I have absolutely no idea on what to do. All I want is to get for each gene the sequences of its exons. Help me please!


exon sequence gene genome • 1.6k views
ADD COMMENTlink modified 3.3 years ago by sacha1.9k • written 3.3 years ago by valerie80
gravatar for sacha
3.3 years ago by
sacha1.9k wrote:

Download refGene.txt.gz from UCSC :


Then keep only uniq Gene name ( column 13 ) and extract coordinate ( chrom - exonstart - exonend) to a bed file. Column 10 and 11 contains exonsStart and exonEnd position separated by a comma.

zcat refGene.txt.gz|sort -u -k13,13|cut -f3,10,11|awk 'BEGIN{OFS="\t"}{split($2,start,",");split($3,end,","); for(i=1;i<length(start);++i){print $1,start[i],end[i]}}' > exons.bed

Finally, you can get exons sequence using bedtools getFasta :

bedtools getfasta -fi hg19.fa -bed exons.bed -fo exon.fa

paid attention to work only with chromsome name in range 1-22,X,Y.

ADD COMMENTlink written 3.3 years ago by sacha1.9k

I just count the size of exome using the following commands :

zcat refGene.txt.gz|sort -u -k 13,13|awk '{SUM=0;split($10,s,","); split($11,e,",");for(i=1;i<length(s);i++){SUM+=e[i] - s[i]};print SUM}'|paste -sd "+"|bc

Divided by the human genom size from hg19, I get : 2.32 %

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by sacha1.9k

Dear Sacha, thank you very much, this is exactly what I need! May I also ask you one more question? I have headers in exons.fa file like "chr19:58346805-58347029". Is there any possibility to have a heading in a format "BRCA exon1"? Thank you in advance!

ADD REPLYlink written 3.3 years ago by valerie80

Solved this issue myself :) In case anyone needs this, you should run:

zcat refGene.txt.gz|sort -u -k13,13|cut -f3,10,11,13|awk 'BEGIN{OFS="\t"}{split($2,start,",");split($3,end,","); for(i=1;i<length(start);++i){print $1,start[i],end[i],$4="" "_"="" i}}'="" &gt;="" exons.bed<="" p="">

to add geneName_exodId to the bed file and then use -name option when running bedtools

ADD REPLYlink written 3.3 years ago by valerie80
gravatar for WouterDeCoster
3.3 years ago by
WouterDeCoster43k wrote:

The issue you have with more sequences and exons is most likely due to alternative transcripts. I think bedtools getfasta can do what you want, if you supply a reference fasta and corresponding gtf or bed file of all exons (

ADD COMMENTlink written 3.3 years ago by WouterDeCoster43k

Thank you very much!!

ADD REPLYlink written 3.3 years ago by valerie80
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 774 users visited in the last hour