Question: Extract random number of transcript and exons from gencode annotation
0
gravatar for zoegward
3.0 years ago by
zoegward100
zoegward100 wrote:

Hi, I want to extract say 1000 transcripts and related exons from my gencode.gtf. Does anyone know how to do this? I'm a beginner and have used grep to extract 'transcripts' and 'exons' only but as there are not a set number of exons connected to the transcripts I'm not sure how to count to extract 1000 transcripts??

At the moment I have:

 grep 'protein_coding' gencode.v27.gtf | awk '{if($3=="transcript" || $3=="exon")print$0}'

Many thanks!

sequencing genome • 1.0k views
ADD COMMENTlink modified 3.0 years ago by michael.ante3.6k • written 3.0 years ago by zoegward100
1
gravatar for zoegward
3.0 years ago by
zoegward100
zoegward100 wrote:

Okay kind of worked out a dirty way of doing it so any more elegant solutions welcome. I found the id of the 1000th transcript and then used sed to print out everything before this id:

> grep 'protein_coding' gencode.v27.gtf | awk '{if($3=="transcript")print$0}'  > test.gtf
## to get the id of the 1000th transcript
 > cat test.gtf | sed -n -e '1000p'
### print everything from the first line to the 1000th transcript id
> sed -n '1,/<TRANSCRIPT_ID FROM THE PREVIOUS LINE OF CODE>/ p' test.gtf
ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by zoegward100
0
gravatar for michael.ante
3.0 years ago by
michael.ante3.6k
Austria/Vienna
michael.ante3.6k wrote:

Hi Zoegward,

I think you need to extract 1000 transcript IDs from the gtf. Something like

awk '$3=="transcript" && match ($0, /protein_coding/){for(i=1;i<NF;i++){if(match($i,/transcript_id/)){print $(i+1)}}}' gencode.v27.gtf > transcript-list.txt

To extract all Transcript IDs from protein coding genes. In order to get 1000 IDs just uses head -n 1000, tail -n 1000, and if you want to shuffle it shuf it before using tail /head.

The resulting list is used for a grep search:

grep -f 1000-transcript-ids.txt gencode.v27.gtf > out.gtf

Cheers,

Michael

ADD COMMENTlink written 3.0 years ago by michael.ante3.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1020 users visited in the last hour