Question

Extract information from gtf file

0

2.2 years ago

Human • 0

Hi fellow humans,

I'm rather new to this field and am wondering how to extract the information I need from my GTF file in an elegant way.

I want only columns 1, 3, 4, 5 and the gene_id from column 9. Columns 1, 3, 4 and 5 are easy, but I'm lacking an efficient way to also add only the gene_id "xy" for each row...

The aim is to write the result into a simple .txt file with 5 columns. I'd appreciate any help. The GTF input looks like this:

NC_004593.1     RefSeq  exon    1       68      .       +       .       gene_id ""; transcript_id "unknown_transcript_1"; anticodon "(pos:31..33)"; gbkey "tRNA"; product "tRNA-Phe"; exon_number "1";

GTF • 2.9k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 2.2 years ago by Human • 0

0

May want to try this tool: Extracting genomic feature sequences from GTF/GFF files with AGAT

ADD REPLY • link 2.2 years ago by GenoMax 141k

score 1 · Answer 1 · 2022-02-09

1

2.2 years ago

Juke34 8.5k

As the 9th column can be uneven and its attributes are not always in the same order, depending on the feature (line), it might be easier to first use agat_convert_sp_gff2tsv.pl from AGAT; then you can use an awk command to print only the columns you want.

ADD COMMENT • link 2.2 years ago by Juke34 8.5k

0

Hi,

I only use Linux on a cluster, and R on my computer. I installed Anaconda and tried to run your code, but it doesn't work, probably because I don't really understand what I am doing here. Isn't there a more straightforward approach for this, maybe? :D

ADD REPLY • link 2.2 years ago by Human • 0

0

Did the installation go well with conda? If so, it is fairly straightforward to use:

agat_convert_sp_gff2tsv.pl -h
agat_convert_sp_gff2tsv.pl --gff file.gtf -o file.tsv

Then open file.tsv and check which column holds the gene_id (let's say it is column 11). Then use awk to print the columns you are interested in, 1, 3, 4, 5 and 11 (splitting on tabs only, so that empty fields or fields containing spaces do not shift the column numbers):
awk -F'\t' -v OFS='\t' '{print $1, $3, $4, $5, $11}' file.tsv
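
If checking the column number by eye feels error-prone, the lookup could also be done by header name. This is only a sketch: it assumes that file.tsv starts with a header row containing a column literally named gene_id, which should be verified against the actual file first. print_with_named_column is a made-up helper name, not part of AGAT.

```shell
# Hypothetical helper (not part of AGAT): find a column by its header name
# in a tab-separated file, then print columns 1, 3, 4, 5 plus that column.
# Assumption: the first line of the file is a header row.
print_with_named_column() {
    awk -F '\t' -v OFS='\t' -v name="$2" '
        NR == 1 {
            # scan the header for the requested column name
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            next    # the header line itself is not printed
        }
        { print $1, $3, $4, $5, $col }
    ' "$1"
}
```

It would be called as print_with_named_column file.tsv gene_id > output.txt, with the five columns written tab-separated.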

ADD REPLY • link 2.2 years ago by Juke34 8.5k

0

Well, I now have Anaconda Navigator open and I don't really know where to type this code... :D It opens the Windows terminal when I click on base (root), and I tried it there, but it didn't work.

Where should I type this?

agat_convert_sp_gff2tsv.pl -h
agat_convert_sp_gff2tsv.pl --gff file.gtf -o file.tsv

ADD REPLY • link 2.2 years ago by Human • 0

0

Download Miniconda for Linux from https://docs.conda.io/en/latest/miniconda.html#linux-installers (copy the link of the installer you want). In a terminal, run wget <link>, where <link> is the link you just copied. Then run bash Miniconda3-latest-Linux-x86_64.sh
Follow the instructions and agree to all questions, then run source ~/.bashrc

conda create -n agat -c conda-forge -c bioconda agat
conda activate agat
agat_convert_sp_gff2tsv.pl -h

ADD REPLY • link 2.2 years ago by Juke34 8.5k

0

Thanks a lot! It will be useful in future for sure:)

ADD REPLY • link 2.2 years ago by Human • 0

score 0 · Answer 2 · 2022-02-08

0

2.2 years ago

supertech ▴ 180

It would be easy with regular expressions. However, you can do it with minimal or no coding. If your file size allows it, print the columns to a CSV file (put commas between columns). Open it in Excel and, in the last column, which contains the attributes, split the text at ";" using the "Text to Columns" feature. The first field is what you want. If you like, you can split that field further at the space. I hope it helps.

ADD COMMENT • link 2.2 years ago by supertech ▴ 180

0

Hi, and thank you very much! Unfortunately, the file is way too big for Excel. Do you know of any straightforward method to do the same thing in Linux or R? Everything I can find looks rather elaborate... Thank you in advance.

ADD REPLY • link 2.2 years ago by Human • 0

score 0 · Answer 3 · 2022-02-09

0

2.2 years ago

cpad0112 21k

Assuming that gene_id is the first attribute in column 9 of every exon entry, try this:

$ cat test.gtf 

NC_004593.1 RefSeq  exon    1   68  .   +   .   gene_id "xy"; transcript_id "unknown_transcript_1"; anticodon "(pos:31..33)"; gbkey "tRNA"; product "tRNA-Phe"; exon_number "1";


$ awk -F '\t| |"' -v OFS="\t" '$3 ~ /exon/ {print $1, $3, $4,$5,$11}' test.gtf

NC_004593.1 exon    1   68  xy
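
If gene_id is not guaranteed to be the first attribute, a position-independent variant might look like the sketch below. It is not the command from the answer above: it splits column 9 at semicolons and searches for the gene_id key, so it assumes that no attribute value itself contains a semicolon. gtf_to_five_columns is a made-up helper name.

```shell
# Hypothetical helper: print seqname, feature, start, end and gene_id as five
# tab-separated columns, wherever gene_id sits inside column 9.
# Assumption: attribute values never contain a semicolon.
gtf_to_five_columns() {
    awk -F '\t' -v OFS='\t' '
        /^#/ { next }                      # skip comment/header lines
        {
            gene_id = ""                   # stays empty if the attribute is absent
            n = split($9, attrs, ";")      # attributes are separated by semicolons
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ *gene_id "/) {
                    gene_id = attrs[i]
                    sub(/^ *gene_id "/, "", gene_id)   # drop the key and opening quote
                    sub(/".*$/, "", gene_id)           # drop the closing quote
                }
            }
            print $1, $3, $4, $5, gene_id
        }
    ' "$1"
}
```

Running gtf_to_five_columns test.gtf > output.txt would write the five columns to a plain text file. Unlike the one-liner, this version prints every feature type, not only exons; adding a $3 == "exon" condition would restrict it again.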