Question: Extracting entries from GTF file
EVR (Earth) wrote, 3.2 years ago:

Hi,

I am quite new to bioinformatics. I have a GTF file and a BED file that lists each transcript's name, start and end position. For every transcript, I would like to extract all the exons that lie between the transcript's start and end coordinates. For example, imagine a GTF file like the following

Scaffold1 cuff transcript 344  540  100 + geneid "cuff_45"
Scaffold1 cuff exon 344  400  100 + geneid "cuff_45"
Scaffold1 cuff exon 484  540  100 + geneid "cuff_45"
Scaffold1 cuff transcript 800  1200  100 + geneid "cuff_46"
Scaffold1 cuff exon 800  928  100 + geneid "cuff_46"
Scaffold1 cuff exon 980  1100  100 + geneid "cuff_46"
Scaffold1 cuff exon 1100  1200  100 + geneid "cuff_46"
Scaffold2 cuff transcript 1 500 1000 - gene_id "cuff_47"
Scaffold2 cuff exon 1 500 1000 - gene_id "cuff_47"

and a BED file like the following

Scaffold1 344 540

Then I would like to extract the matching Scaffold1 transcript and its exons from the GTF file, like the following

Scaffold1 cuff transcript 344  540  100 + geneid "cuff_45"
Scaffold1 cuff exon 344  400  100 + geneid "cuff_45"
Scaffold1 cuff exon 484  540  100 + geneid "cuff_45"

Can someone suggest a tool to achieve this?

Thanks in advance.

Use the Unix utility grep: grep "Scaffold1" your.gtf > scaf1.gtf

Do you also want a BED file from your.gtf or you already have that?

(comment by genomax, 3.2 years ago)

Thanks for your reply. I already tried grep, but it returns every entry that contains Scaffold1, whereas I need only the entries between the transcript start and end. I have updated the GTF example in my question. Kindly take a look and guide me.

(reply by EVR, 3.2 years ago)

There are 7 lines that have Scaffold1 in your example. grep above would get all 7. So instead of all 7, you just want the lines that match the interval in your BED file?

(reply by genomax, 3.2 years ago)

Exactly. I just need to extract all the exons confined within the transcript start and end, as in the example above.

(reply by EVR, 3.2 years ago)
Alex Reynolds (Seattle, WA USA) wrote, 3.2 years ago:

Given a GTF file called annotations.gtf with your annotations, and a sorted BED file called intervals.bed that contains your regions of interest, you could do the following with BEDOPS:

$ gtf2bed < annotations.gtf | bedops -e 1 - intervals.bed > answer.bed

The file answer.bed contains BED-formatted annotations from the GTF file, which overlap the provided intervals by one or more bases.
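
Since the question asks for features confined within each interval rather than merely overlapping it, a minimal Python sketch of that stricter filter may help to make the difference concrete. It is not part of BEDOPS; the function names `parse_bed` and `filter_gtf` and the `contained` flag are made up for illustration. It assumes the interval coordinates were copied straight from the GTF (1-based, inclusive), as in the question, rather than following the 0-based BED convention.

```python
def parse_bed(lines):
    """Return (chrom, start, end) tuples from whitespace-delimited BED-like lines."""
    intervals = []
    for line in lines:
        fields = line.split()
        # Skip blank, comment, and header lines.
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def filter_gtf(gtf_lines, intervals, contained=True):
    """Yield GTF lines that fall within (or overlap) any of the intervals.

    ASSUMPTION: interval coordinates use the same 1-based inclusive
    convention as the GTF, as in the question's example.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        # The first five columns never contain whitespace, so a plain
        # split() handles both tab- and space-delimited input.
        fields = line.split()
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        for i_chrom, i_start, i_end in intervals:
            if chrom != i_chrom:
                continue
            if contained:
                hit = i_start <= start and end <= i_end  # fully inside
            else:
                hit = start <= i_end and i_start <= end  # any overlap
            if hit:
                yield line.rstrip("\n")
                break


# Example data taken from the question.
GTF = [
    'Scaffold1 cuff transcript 344 540 100 + geneid "cuff_45"',
    'Scaffold1 cuff exon 344 400 100 + geneid "cuff_45"',
    'Scaffold1 cuff exon 484 540 100 + geneid "cuff_45"',
    'Scaffold1 cuff transcript 800 1200 100 + geneid "cuff_46"',
    'Scaffold1 cuff exon 800 928 100 + geneid "cuff_46"',
    'Scaffold2 cuff exon 1 500 1000 - gene_id "cuff_47"',
]
BED = ["Scaffold1 344 540"]

# Prints the three cuff_45 lines (the transcript and its two exons).
for hit_line in filter_gtf(GTF, parse_bed(BED)):
    print(hit_line)
```

For a real BED file with 0-based starts, the interval start would need to be shifted by one before comparing; for large files, an indexed tool is likely the better choice than this nested loop.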

cmdcolin (United States) wrote, 3.2 years ago:

You may like to use tabix. Tabix has a preset that works well with GFF, so I would convert the GTF to GFF first (probably using gffread from the Cufflinks package, for example gffread -E merged.gtf -o- > merged.gff3).

Then you can bgzip and tabix-index the output (named myfile.sorted.gff here, after sorting):

bgzip myfile.sorted.gff
tabix -p gff myfile.sorted.gff.gz

Then you can retrieve the features in a given region with

tabix myfile.sorted.gff.gz Scaffold1:344-540

Note that you may also need to sort the GFF before compressing and indexing it. See some additional tips here: http://gmod.org/wiki/JBrowse_FAQ#How_do_I_create_a_Tabix_indexed_GFF
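
As a rough illustration of the sorting step, the sketch below orders GFF lines by sequence name and then by numeric start, which is the position-sorted layout tabix expects. The helper name `sort_gff_lines` is hypothetical, and the sketch assumes tab-delimited input whose comment or directive lines (starting with `#`) can all be moved to the top.

```python
def sort_gff_lines(lines):
    """Return '#' header lines first, then features sorted by (seqid, start)."""
    headers = [ln for ln in lines if ln.startswith("#")]
    features = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def sort_key(line):
        cols = line.split("\t")
        # Column 1 is the sequence name, column 4 the 1-based start.
        return (cols[0], int(cols[3]))

    return headers + sorted(features, key=sort_key)


# Made-up example records, deliberately out of order.
GFF = [
    "##gff-version 3",
    "Scaffold2\tcuff\texon\t1\t500\t.\t-\t.\tID=e4",
    "Scaffold1\tcuff\texon\t484\t540\t.\t+\t.\tID=e2",
    "Scaffold1\tcuff\texon\t344\t400\t.\t+\t.\tID=e1",
]

for gff_line in sort_gff_lines(GFF):
    print(gff_line)
```

The sorted lines can then be written to a file and passed to bgzip and tabix as shown above.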
