Question: Extract gene coordinates from GTF
4.8 years ago, int11ap1 (Barcelona) wrote:

Hello,

I have a GTF file with only exon features. Is there a way to extract the gene coordinates, or should I write a script?

- INPUT: GTF file.

- OUTPUT: the gene coordinates, whatever the format is.

Thanks. 

coordinates gtf • 6.2k views
modified 4.1 years ago by Alex Reynolds27k • written 4.8 years ago by int11ap1320

Please make this clearer by showing your input file and the desired output.

written 4.8 years ago by ancient_learner610

Done, I cannot be more clear.

written 4.8 years ago by int11ap1320
4.8 years ago, Devon Ryan (Freiburg, Germany) wrote:

The question is what exactly you want in terms of coordinates for a gene. I'm guessing that you just want the 5'-most and 3'-most positions along with the strand and chromosome, but perhaps you have something else in mind.

Presuming you do want what I mentioned, you could easily do this in R with GenomicFeatures.

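library(GenomicFeatures)
# Later GenomicFeatures releases renamed makeTranscriptDbFromGFF to makeTxDbFromGFF.
txdb <- makeTxDbFromGFF("some_file.gtf", format="gtf")
genes <- genes(txdb)
# The write.table argument is col.names (colnames gives an "unused argument" error).
write.table(as.data.frame(genes)[,-4], file="Just_genes.txt", col.names=FALSE, sep="\t")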

The -4 just removes the width column.

modified 4.8 years ago • written 4.8 years ago by Devon Ryan88k

Years later, a small update that may save someone two minutes: in recent versions of the GenomicFeatures package, the function makeTranscriptDbFromGFF is called makeTxDbFromGFF.

written 5 months ago by caggtaagtat470

I'm getting this error:

Error in write.table(as.data.frame(genes)[, -4], file = "Just_genes.txt",  :
  unused argument (colnames = F)

written 23 months ago by krushnach80470

Try again with

write.table(as.data.frame(genes)[,-4], file="Just_genes.txt", col.names=F, sep="\t")
written 22 months ago by thomas musielak0
4.8 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Using awk and sqlite:

curl -sL "https://rseqflow.googlecode.com/files/mouse_refseq_anno.gtf"   |\
awk -F '\t' 'BEGIN {printf("create temp table T(chrom,start,end,gene); begin transaction;\n");} $3=="exon" {n=split($9,a,/[ ;]+/);for(i=1;i+1<n;i++) if(a[i]=="gene_id") printf("insert into T(chrom,start,end,gene) values (\"%s\",%s,%s,%s);\n",$1,$4,$5,a[i+1]);} END {printf("commit; select chrom,gene,min(start),max(end) from T group by chrom,gene;\n");}' |\
sqlite3 tmp.db
(...)
chrY|Rbm31y|12688110|17402718
chrY|Rbmy1a1|2830680|3783271
chrY|Sly|55213720|75222053
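
For anyone who wants the same min/max grouping without sqlite, here is a minimal awk-only sketch. It is not the command above, just a variant under a few assumptions: the GTF is tab-delimited, every exon line carries a `gene_id "..."` attribute, and grouping by chromosome, gene and strand is what you want. The function name `gene_spans` is made up for illustration.

```shell
# Sketch: collapse the exon records of a GTF into one span per gene.
# gene_spans is a hypothetical helper name; it reads GTF on stdin.
gene_spans() {
  awk -F '\t' -v OFS='\t' '
    $3 == "exon" {
      # Pull the gene_id value out of the attribute column (field 9);
      # skip lines that have no gene_id "..." attribute.
      if (!match($9, /gene_id "[^"]+"/)) next
      id = substr($9, RSTART + 9, RLENGTH - 10)
      key = $1 OFS id OFS $7            # chromosome, gene, strand
      # Track the smallest start and the largest end seen for this key.
      if (!(key in lo) || $4 + 0 < lo[key]) lo[key] = $4 + 0
      if (!(key in hi) || $5 + 0 > hi[key]) hi[key] = $5 + 0
    }
    END {
      # One line per gene: chromosome, gene, strand, start, end.
      for (key in lo) print key, lo[key], hi[key]
    }
  ' | sort
}
```

Usage would be something like `gene_spans < some_file.gtf > gene_spans.tsv`. Coordinates stay 1-based and inclusive, as in the GTF. Note that a gene name with several copies on the same chromosome and strand collapses into a single wide span with this kind of grouping.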
modified 4.8 years ago • written 4.8 years ago by Pierre Lindenbaum118k
4.1 years ago, Alex Reynolds (Seattle, WA USA) wrote:

Using gtf2bed (part of BEDOPS):

$ gtf2bed < foo.gtf | cut -f1-3 > foo_coords.bed3

If you want strand information:

$ gtf2bed < foo.gtf | cut -f1-6 > foo_coords.bed6
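
gtf2bed writes one BED line per GTF feature, so on an exon-only file the commands above give per-exon intervals. If one interval per gene is wanted, the BED6 output could be collapsed afterwards. The sketch below assumes that column 4 of the gtf2bed output holds the gene_id (worth checking on your own file) and that the input is tab-delimited BED6; `bed_gene_spans` is a made-up helper name, not a BEDOPS tool.

```shell
# Sketch: merge per-feature BED6 lines into one interval per name.
# bed_gene_spans is a hypothetical helper name; it reads BED6 on stdin.
bed_gene_spans() {
  awk -F '\t' -v OFS='\t' '
    {
      key = $1 OFS $4 OFS $6            # chromosome, name, strand
      # Track the smallest start and the largest end seen for this key.
      if (!(key in lo) || $2 + 0 < lo[key]) lo[key] = $2 + 0
      if (!(key in hi) || $3 + 0 > hi[key]) hi[key] = $3 + 0
    }
    END {
      for (key in lo) {
        split(key, k, OFS)
        # Emit BED6 again, with a placeholder score of 0.
        print k[1], lo[key], hi[key], k[2], 0, k[3]
      }
    }
  ' | sort -k1,1 -k2,2n
}
```

For example: `gtf2bed < foo.gtf | cut -f1-6 | bed_gene_spans > foo_genes.bed6`. BED starts are 0-based, so they are one less than the corresponding GTF start.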
modified 4.1 years ago • written 4.1 years ago by Alex Reynolds27k