Question: How to get number of exons for each transcript in biomart
19 months ago, lauren.fehrman wrote:

I would like to get a table from Ensembl that includes the number of exons for each transcript, but I can't find any option in BioMart that corresponds to this.


I don't think BioMart stores such aggregate data. If you are not tied to Ensembl, you should be able to write a custom query against the UCSC MySQL tables. If you need Ensembl, you might need to get the CDS information and count exons yourself.

written 19 months ago by RamRS

In addition to RamRS's suggestion, if you can convert the GTF to exons (for example, using this), you can groupBy the transcript and count the number of exons per transcript.

written 19 months ago by geek_y

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

modified 11 months ago by genomax • written 19 months ago by RamRS
19 months ago, Benn (Netherlands) wrote:

I wouldn't use biomaRt; I would try AWK instead. If you download the annotation GTF file from Ensembl, you can try something like this:

awk '$3=="exon" {print $0}' Homo_sapiens.GRCh38.78.gtf | awk '{ count[$10]++ } END { for (word in count) print word, count[word]}' > numberOfExonsPerGene.txt
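With whitespace-separated fields, $10 in an Ensembl GTF is the gene_id value, so the command above counts exon lines per gene (as its output file name suggests). As a minimal sketch of the per-transcript variant, the Python below reads the transcript_id attribute by name instead of by column position, which avoids depending on the attribute order in a given release. The function name `count_exons_per_transcript` and the sample records are made up for illustration.

```python
import re
from collections import Counter

# Matches the transcript_id attribute inside column 9 of a GTF line.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_exons_per_transcript(lines):
    """Count 'exon' features per transcript_id in an iterable of GTF lines.

    Hypothetical helper for illustration; not part of any library.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep only exon features
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up records in GTF layout (not real annotation data).
sample = [
    '#!genome-build GRCh38\n',
    '1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    '1\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "1";\n',
    '1\thavana\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; exon_number "2";\n',
    '1\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2"; exon_number "1";\n',
]

exon_counts = count_exons_per_transcript(sample)
for transcript_id, n_exons in sorted(exon_counts.items()):
    print(transcript_id, n_exons, sep="\t")
```

For a real annotation file, an open file handle can be passed in place of the sample list, for example `with open("Homo_sapiens.GRCh38.78.gtf") as fh: count_exons_per_transcript(fh)`.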
19 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Using UCSC (not BioMart):

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg38 -e 'select chrom,name,exonCount from wgEncodeGencodeBasicV28;'
+-------+-------------------+-----------+
| chrom | name              | exonCount |
+-------+-------------------+-----------+
| chr1  | ENST00000619216.1 |         1 |
| chr1  | ENST00000473358.1 |         3 |
| chr1  | ENST00000469289.1 |         2 |
| chr1  | ENST00000607096.1 |         1 |
| chr1  | ENST00000417324.1 |         3 |
| chr1  | ENST00000641515.2 |         3 |
| chr1  | ENST00000335137.4 |         1 |
| chr1  | ENST00000466430.5 |         4 |
| chr1  | ENST00000495576.1 |         2 |
| chr1  | ENST00000610542.1 |         4 |
| chr1  | ENST00000493797.1 |         2 |
| chr1  | ENST00000484859.1 |         2 |
| chr1  | ENST00000466557.6 |         8 |
| chr1  | ENST00000410691.1 |         1 |
| chr1  | ENST00000496488.1 |         2 |
| chr1  | ENST00000612080.1 |         1 |
| chr1  | ENST00000635159.1 |         2 |
| chr1  | ENST00000426406.3 |         1 |
(...)
+-------+-------------------+-----------+
19 months ago, benformatics (ETH Zurich) wrote:

Or use R with a transcript database (you can make your own from any GTF using the makeTxDbFromGFF function from the GenomicFeatures package):

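## load the packages that provide exonsBy() and the mm10 Ensembl-gene TxDb
library(GenomicFeatures)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)

## setup transcriptDb
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene
## get exon locations for each gene
## (for per-transcript counts, use exonsBy(txdb, by = 'tx', use.names = TRUE))
exons <- exonsBy(txdb, 'gene')

## print number of exons for each gene - just look at the top 6 with 'head'
head(sapply(exons, length))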
ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028 ENSMUSG00000000031 
                 9                  9                 24                 15 
ENSMUSG00000000037 ENSMUSG00000000049 
                41                 16
19 months ago, finswimmer (Germany) wrote:

I cannot see a way to get the number of exons in each transcript directly via BioMart either.

But you can choose Exon stable ID as an output attribute and then use a simple awk script to count the exons per transcript. Let's assume you have chosen the attributes Gene stable ID, Transcript stable ID and Exon stable ID for the output. Make sure you tick the option "Unique results only" and download the results as a TSV.

$ awk -v FS="\t" -v OFS="\t" 'NR>1 {transcript[$2]++;} END { for(t in transcript) print t, transcript[t] }' mart_export.txt

This builds an associative array keyed by the transcript ID in column 2 and increments the count for every line that carries that transcript ID. At the end, we iterate over the array and print each transcript with its count. If the transcript ID is in a different column in your output, change the $2 to whatever it is.
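For anyone who prefers Python, the same counting logic can be sketched as below. It assumes, like the awk command, a tab-separated export with one header row and the transcript ID in the second column; the function name `exons_per_transcript`, the `transcript_col` parameter and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def exons_per_transcript(handle, transcript_col=1):
    """Count rows per transcript ID in a tab-separated BioMart-style export.

    transcript_col is the 0-based column index of the transcript ID
    (1 corresponds to awk's $2). Hypothetical helper for illustration.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row, like NR>1 in the awk command
    counts = Counter()
    for row in reader:
        if len(row) > transcript_col:  # ignore blank or short lines
            counts[row[transcript_col]] += 1
    return counts


# Made-up rows standing in for mart_export.txt (unique results only).
sample = io.StringIO(
    "Gene stable ID\tTranscript stable ID\tExon stable ID\n"
    "G1\tT1\tE1\n"
    "G1\tT1\tE2\n"
    "G1\tT2\tE1\n"
    "G2\tT3\tE3\n"
)

tsv_counts = exons_per_transcript(sample)
for transcript_id, n_exons in sorted(tsv_counts.items()):
    print(transcript_id, n_exons, sep="\t")
```

As with the awk version, each exon must appear only once per transcript in the export for the row count to equal the exon count, which is why "Unique results only" matters.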

fin swimmer


Thank you. This worked perfectly for what I needed to do.

written 19 months ago by lauren.fehrman
11 months ago, asrmpr wrote:

Exon Number Finder v.1

A tool to find genes of user-specific exon number

https://github.com/CyPH3R-ASR/exonNum


Thanks for contributing, but please note:

1) this does not answer the top-level question, as OP asked about BioMart, and

2) it is sufficient to post your tool once, not as both an answer and a comment. I removed the comment.

modified 11 months ago • written 11 months ago by ATpoint