Getting exon sequences for all human genes
2
1
Entering edit mode
6.1 years ago
valerie ▴ 100

Hi everyone,

I would be very grateful if you could help me.

I want to download the sequences of all the exons for each human gene. I went to ensembl biomart and tried to do it for BRCA2 first (this was a random choice). First I selected the following attributes: Unspliced(gene), Exon start, Exon end, strand, Gene start. I thought that this information will be enough to 'cut out' all the exons. The first thing I noticed is that the number of exons is almost twice larger then the number I see on the Wiki. Then I tried to download directly Exon sequences, but the number of sequences was again larger the the number of exons should be and moreover several exon ids correspond to the same sequence. I also tried to download coding sequence and cDNA, but the length of both sequences is not consistent with 'official' BRCA2 length!

These all drives me crazy and I have absolutely no idea on what to do. All I want is to get for each gene the sequences of its exons. Help me please!

Thanks!

genome gene sequence exon • 3.2k views
4
Entering edit mode
6.1 years ago
sacha ★ 2.4k

 wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/refGene.txt.gz


Then keep only uniq Gene name ( column 13 ) and extract coordinate ( chrom - exonstart - exonend) to a bed file. Column 10 and 11 contains exonsStart and exonEnd position separated by a comma.

zcat refGene.txt.gz|sort -u -k13,13|cut -f3,10,11|awk 'BEGIN{OFS="\t"}{split($2,start,",");split($3,end,","); for(i=1;i<length(start);++i){print $1,start[i],end[i]}}' > exons.bed  Finally, you can get exons sequence using bedtools getFasta : bedtools getfasta -fi hg19.fa -bed exons.bed -fo exon.fa  paid attention to work only with chromsome name in range 1-22,X,Y. ADD COMMENT 0 Entering edit mode I just count the size of exome using the following commands : zcat refGene.txt.gz|sort -u -k 13,13|awk '{SUM=0;split($10,s,","); split($11,e,",");for(i=1;i<length(s);i++){SUM+=e[i] - s[i]};print SUM}'|paste -sd "+"|bc  Divided by the human genom size from hg19, I get : 2.32 % ADD REPLY 0 Entering edit mode Dear Sacha, thank you very much, this is exactly what I need! May I also ask you one more question? I have headers in exons.fa file like "chr19:58346805-58347029". Is there any possibility to have a heading in a format "BRCA exon1"? Thank you in advance! ADD REPLY 0 Entering edit mode Solved this issue myself :) In case anyone needs this, you should run: zcat refGene.txt.gz|sort -u -k13,13|cut -f3,10,11,13|awk 'BEGIN{OFS="\t"}{split($2,start,",");split($3,end,","); for(i=1;i<length(start);++i){print$1,start[i],end[i],\$4="" "_"="" i}}'="" &gt;="" exons.bed<="" p="">

to add geneName_exodId to the bed file and then use -name option when running bedtools

1
Entering edit mode
6.1 years ago

The issue you have with more sequences and exons is most likely due to alternative transcripts. I think bedtools getfasta can do what you want, if you supply a reference fasta and corresponding gtf or bed file of all exons (http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html).

0
Entering edit mode

Thank you very much!!

0
Entering edit mode

in bedtools getfasta it seems that the only option to have the exons concatenated is using -split which is available for bed12 format. my bed file is not in bed12 and each exon is in a separate row (though the name column of the bed file is the gene ID). how can I get the concatenated sequence?