Hello,
I was wondering if there is any way to download genes according to the number of introns they have. For example:
download all the genes with 2 introns, all the genes with 3 introns, etc., in mouse.
Any ideas are really appreciated!
Cheers
download all the genes with 2 introns
Using UCSC data (column 9 is the number of exons, column 13 is the gene name). A transcript with 2 introns has 3 exons, hence the test $9==3:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz" | gunzip -c | awk -F '\t' '($9==3)' | cut -f 13 | sort | uniq
(...)
Zg16
Zic1
Zic2
Zic3
Zkscan14
Zkscan4
Znrd1as
Znrf1
Zscan2
Zscan22
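To generalise the one-liner above to any intron count, here is a minimal sketch. The function name genes_with_n_introns is made up for illustration; it assumes refGene-format rows on stdin, with column 9 = exon count and column 13 = gene name, as in the command above.

```shell
# Hypothetical helper (not part of any UCSC tool): reads refGene-format
# rows on stdin and prints the unique gene names that have at least one
# transcript with exactly N introns, i.e. N + 1 exons.
# Assumes column 9 = exon count and column 13 = gene name.
genes_with_n_introns() {
    awk -F '\t' -v n="$1" '($9 == n + 1) { print $13 }' | sort -u
}

# Usage (needs network access, so left commented out):
# curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz" \
#     | gunzip -c | genes_with_n_introns 2
```

Since the count is per transcript, a gene is listed if any one of its transcripts has N introns.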
Probably the easiest would be to download the entire annotation database and process that for what you need.
What do you mean by 'download all the genes'? Do you need just the names, the sequences, the locations, ...?
Something like this:
for all the mouse genes, how many of them have 2 introns? Then get the names of those genes, and the same for 3, 4, 5, ..., n introns.
I was thinking of writing a script that works from the annotation file, but I was wondering if there is already a platform, tool, R package, etc. that does this.
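I don't know of a ready-made tool, but a table of intron count versus gene name for every n can be sketched in a few lines of shell. This assumes the UCSC refGene.txt layout (column 9 = exon count, column 13 = gene name); the function name intron_count_table is just an illustration.

```shell
# Hypothetical helper: reads refGene-format rows on stdin and prints
# "intron_count<TAB>gene_name" pairs, de-duplicated and sorted by count.
# Assumes column 9 = exon count and column 13 = gene name.
intron_count_table() {
    tab=$(printf '\t')
    awk -F '\t' 'BEGIN { OFS = "\t" } { print $9 - 1, $13 }' \
        | sort -t "$tab" -k1,1n -k2,2 -u
}

# Usage (needs network access, so left commented out):
# curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/refGene.txt.gz" \
#     | gunzip -c | intron_count_table > introns_vs_gene.txt
# cut -f 1 introns_vs_gene.txt | uniq -c    # number of genes per intron count
```

A gene whose transcripts differ in exon number will appear under several intron counts.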
I'd really appreciate it if someone proof-checks this:
Download the mouse gtf. For protein coding genes only:
grep "protein_coding" mousefile.gtf > protein_coding.gtf
grep -e $'\texon\t' protein_coding.gtf | tr ' ' '\t' | tr -d '"' | tr -d ';' > exons.gtf
cut -f 1-5,16 exons.gtf | sort | uniq | cut -f 6 | sort | uniq -c > Maximum_number_of_exons_for_each_gene.txt
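This doesn't really get the number of introns. It counts the distinct exon intervals (chromosome, start, end) per gene, pooled across all of that gene's transcripts, so it is not a per-transcript count either. It also relies on field 16 holding the gene name, which depends on the order of the attributes in your GTF. Still, maybe it'll help. If you use it, you should run each step separately to see if anything weird happens.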
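A variant that does not depend on the attribute order and counts per transcript could look like the sketch below. It assumes a standard GTF in which every exon line carries gene_id and transcript_id attributes in column 9 and each exon is listed once per transcript; the function name introns_per_transcript_gtf is made up for illustration.

```shell
# Hypothetical helper: reads a GTF on stdin and prints
# "gene_id<TAB>transcript_id<TAB>intron_count" for every transcript.
# Intron count is taken as (number of exon lines for the transcript) - 1.
introns_per_transcript_gtf() {
    awk -F '\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # skip the 15 characters of: transcript_id "
            tid = substr($9, RSTART + 15, RLENGTH - 16)
            gid = "NA"
            if (match($9, /gene_id "[^"]+"/))
                gid = substr($9, RSTART + 9, RLENGTH - 10)
            exons[tid]++
            gene[tid] = gid
        }
        END {
            for (t in exons)
                printf "%s\t%s\t%d\n", gene[t], t, exons[t] - 1
        }' | sort
}

# Usage, with mousefile.gtf as in the commands above:
# grep "protein_coding" mousefile.gtf | introns_per_transcript_gtf > introns.tsv
# awk -F '\t' '$3 == 2 { print $1 }' introns.tsv | sort -u   # genes with a 2-intron transcript
```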
It should be pointed out that this counts introns per transcript, not introns per gene. "Introns per gene" is rather poorly defined, since a gene can have several transcripts with different numbers of exons.