extracting exome from genome
1
0
Entering edit mode
6.4 years ago
zeelo ▴ 10

Hello Biostars community,

Time to end my lurker-existence. I have a quick question about extracting particular exonic regions from annotated genomes and I would appreciate any help or comments!

I want to extract all exons from the honeybee genome that fulfill certain categories. The stringency of these categories will probably change at some point, as I want to explore the data under different parameters. For example, I would like to get all exons of amel 4.5 that fulfill the following: (1) The respective LOC has at least 5 exons, (2) 2 exons of at least 500 bp length. Ideally as a .fasta file.

To be honest, I am a little overwhelmed by all the tools/databases that are out there. Just a tip what tool could accomplish this task would help me a lot.

Thanks!

Z

genome Assembly alignment gene • 2.3k views
ADD COMMENT
0
Entering edit mode

what is "LOC" ?

ADD REPLY
0
Entering edit mode

Pierre,

Thanks for your fast reply. LOC describes the gene locus of the honeybee genes. E.g., LOC726504 corresponds to GB41140. I am looking for gene loci that have at least a certain number of exons. I also just saw your solution down on the page. I am excited to try it out. Thanks!

ADD REPLY
2
Entering edit mode
6.4 years ago

get all exons,from gtf, extract name-of-the-transcript+length

    curl -s "ftp://ftp.ensemblgenomes.org/pub/metazoa/release-37/gff3/apis_mellifera/Apis_mellifera.Amel_4.5.37.gff3.gz" | gunzip -c | awk '($3=="exon") {n=split($9,a,/[;\:=]/);for(i=1;i<n;i++) if(a[i]="="transcript")" {printf("%s\t%d\n",a[i+1],1+int($5)-int($4));}}'="" |="" sort="" |="" uniq=""> tmp1.tsv

count the number of transcripts having 5 exons, extract those transcripts, extract, the exons (length>500), sort+uniq, get those having a count>2

    join -t $'\t' -1 1 -2 1 tmp1.tsv <(cut -f 1 tmp1.tsv  | sort | uniq -c | awk '($1==5) {print($2);}'  | sort | uniq) | awk '(int($2)>500)' | cut -f 1 | sort | uniq -c | awk '(int($1)>2) {print $2;}'
GB40269-RA
GB41718-RA
GB41966-RA
GB42025-RA
GB42112-RA
GB42527-RA
GB42613-RA
GB43158-RA
GB44220-RA
GB44690-RA
GB44885-RA
GB44888-RA
GB44987-RA
GB45023-RA
GB45437-RA
GB47779-RA
GB50302-RA
GB50387-RA
GB50561-RA
GB50769-RA
GB51160-RA
GB51985-RA
GB52153-RA
GB52307-RA
GB52435-RA
GB53160-RA
GB54131-RA
GB54351-RA
GB54395-RA
GB54551-RA
GB54715-RA
GB54937-RA
GB55357-RA
GB55624-RA
GB55826-RA

eg.: GB40269-RA http://metazoa.ensembl.org/Apis_mellifera/Transcript/Exons?db=core;g=GB40269;r=7:18936-22560;t=GB40269-RA

GB41718-RA : http://metazoa.ensembl.org/Apis_mellifera/Transcript/Exons?db=core;g=GB41718;r=14:9600024-9604327;t=GB41718-RA

ADD COMMENT
0
Entering edit mode

Pierre,

Thanks for the suggested solution. That was fast. I see the idea, however, I get a syntax error in the curl-

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Hi Pierre,

Thank you for your follow-up. Of course I'd be happy to provide some more detail:

awk: cmd. line:1: ($3=="exon") {n=split($9,a,/[;\:=]/);for(i=1;i<n;i++) if(a[i]="="transcript")" {printf("%s\t%d\n",a[i+1],1+int($5)-int($4));}}= awk: cmd. line:1: ^ syntax error and the ^syntax error matches the { before the printf

awk: cmd. line:1: ($3=="exon") {n=split($9,a,/[;\:=]/);for(i=1;i<n;i++) if(a[i]="="transcript")" {printf("%s\t%d\n",a[i+1],1+int($5)-int($4));}}= awk: cmd. line:1: ^ syntax error and the ^syntax error matches the last =

=: command not found =: command not found

Maybe that helps. Thanks again

zeelo

ADD REPLY
0
Entering edit mode

ah ! I see the biostars engine has messed-up the code ! sorry for this: I'll post a gist in a few minutes.

ADD REPLY
0
Entering edit mode

here is the gist :

ADD REPLY
0
Entering edit mode

Hi Pierre,

Wow that works like a charm. Thanks again for your help. Neat!

ADD REPLY
0
Entering edit mode

Hi Pierre,

It's me again. I'm trying to explore the join command, however, I am having some troubles. Given the case I want to parse out all transcript that have at least 20 exons. Using your code I'd do:

join -t $'\t' -1 1 -2 1 tmp1.tsv <(cut -f 1 tmp1.tsv | sort | uniq -c | awk '($1=>20) {print($2);}' | sort | uniq) >> test.tsv

The output table is a little confusing to me, several lines appear multiple times. Do you have a hint what I'm doing wrong?

Thanks again,

Z

ADD REPLY
0
Entering edit mode

this

>> test.tsv

looks wrong to me

ADD REPLY
0
Entering edit mode

I see. Another thanks, Pierre.

ADD REPLY

Login before adding your answer.

Traffic: 1817 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6