Extract information from gtf file
2.2 years ago
Human • 0

Hi fellow humans,

I'm rather new to this field and am wondering how to pull the information I need out of my GTF file in an elegant way.

I only want columns 1, 3, 4, 5 and the gene_id from column 9. Columns 1, 3, 4 and 5 are easy, but I'm lacking an efficient way to also add just the gene_id value ("xy") to each row.

The aim is to write this into a simple .txt file with 5 columns. I'd appreciate any help. The GTF input looks like this:

NC_004593.1     RefSeq  exon    1       68      .       +       .       gene_id ""; transcript_id "unknown_transcript_1"; anticodon "(pos:31..33)"; gbkey "tRNA"; product "tRNA-Phe"; exon_number "1";
GTF • 2.9k views
2.2 years ago
Juke34 8.5k

Because the 9th column can hold a different number of attributes, in a different order, depending on the feature (line), it may be easier to first run agat_convert_sp_gff2tsv.pl from AGAT and then use an awk command to print only the columns you want.


Hi,

I only use Linux on a cluster and R on my own computer. I installed Anaconda and tried to run your code, but it doesn't work, probably because I don't really understand what I'm doing here. Is there maybe a more straightforward approach? :D


Did the installation go well with conda? If so, it is fairly straightforward to use:

agat_convert_sp_gff2tsv.pl -h
agat_convert_sp_gff2tsv.pl --gff file.gtf -o file.tsv

Then open file.tsv and check which column holds the gene_id (let's say it is column 11). Then use awk to print the columns you are interested in, 1, 3, 4, 5 and 11. Setting the field separator to a tab keeps values that contain spaces in one piece:

awk -F'\t' -v OFS='\t' '{print $1, $3, $4, $5, $11}' file.tsv
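If you would rather not count columns by eye, a sketch like the one below looks the column up by its header name. It assumes that the first line of file.tsv is a tab-separated header row and that the column you want is labelled exactly gene_id; both are assumptions, so check the header of your own file first. The function name gene_id_by_header is made up for this example.

```shell
# Hypothetical helper: print columns 1, 3, 4, 5 plus the column whose header is "gene_id".
# Assumes a tab-separated file whose first line is a header row (the header is not printed).
gene_id_by_header() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 {
            # Locate the gene_id column from the header row.
            for (i = 1; i <= NF; i++) if ($i == "gene_id") col = i
            if (!col) { print "no gene_id column found" > "/dev/stderr"; exit 1 }
            next
        }
        { print $1, $3, $4, $5, $col }
    ' "$1"
}
```

You would call it as gene_id_by_header file.tsv > result.txt.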


Well, I now have Anaconda Navigator open and I don't really know where to type this code... :D It opens the Windows terminal when I click on base (root), and I tried it there, but it didn't work.

Where should I write this?

agat_convert_sp_gff2tsv.pl -h
agat_convert_sp_gff2tsv.pl --gff file.gtf -o file.tsv


Download Miniconda for Linux from https://docs.conda.io/en/latest/miniconda.html#linux-installers (copy the link of the installer you want). In a terminal on the cluster, run wget <link>, where <link> is the link you just copied. Then run:

bash Miniconda3-latest-Linux-x86_64.sh

Follow the instructions and agree to all the questions, then run source ~/.bashrc. After that, create an environment with AGAT (it is distributed through the bioconda channel), activate it, and check that the script is available:

conda create -n agat -c conda-forge -c bioconda agat
conda activate agat
agat_convert_sp_gff2tsv.pl -h


Thanks a lot! It will be useful in future for sure:)

2.2 years ago
supertech ▴ 180

It would be easy with regular expressions. However, you can also do it with minimal or no coding. If your file size allows it, print the columns to a CSV file (put commas between columns) and open it in Excel. In the last column, which contains the attributes, split the text at ";" with the "Text to Columns" feature. The first field is the one you want. If you like, you can split that field further at the space. I hope it helps.
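For the regular-expression route, an awk sketch along these lines may do the job on Linux. It assumes a tab-separated GTF in which every attribute has the form key "value";, and it pulls gene_id out of column 9 wherever it sits in the attribute list. The function name gtf_gene_ids is only for illustration.

```shell
# Hypothetical helper: print columns 1, 3, 4, 5 and the gene_id value of a GTF file.
# Assumes tab-separated columns and attributes written as: key "value";
# Note: the pattern would also match a longer key that ends in gene_id.
gtf_gene_ids() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }    # skip comment lines
        {
            id = ""
            # Find gene_id "..." anywhere in the attribute column.
            if (match($9, /gene_id "[^"]*"/)) {
                # Drop the leading 9 characters (gene_id, space, quote) and the closing quote.
                id = substr($9, RSTART + 9, RLENGTH - 10)
            }
            print $1, $3, $4, $5, id
        }
    ' "$1"
}
```

Calling gtf_gene_ids my.gtf > genes.txt would write the five columns to a plain text file; rows without a gene_id get an empty last column.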


Hi, and thank you very much! Unfortunately the file is way too big for Excel. Do you maybe know a straightforward method to do the same thing in Linux or R? Everything I can find looks rather elaborate... Thank you in advance.

2.2 years ago

Assuming that gene_id is the first attribute in column 9 of every exon entry, try this:

$ cat test.gtf 

NC_004593.1 RefSeq  exon    1   68  .   +   .   gene_id "xy"; transcript_id "unknown_transcript_1"; anticodon "(pos:31..33)"; gbkey "tRNA"; product "tRNA-Phe"; exon_number "1";


$ awk -F '\t| |"' -v OFS="\t" '$3 ~ /exon/ {print $1, $3, $4,$5,$11}' test.gtf

NC_004593.1 exon    1   68  xy
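Before relying on $11, it may be worth checking that the assumption holds for your file. This sketch counts exon lines whose 9th column does not start with gene_id; a result of 0 suggests the one-liner above is safe to use. It assumes a tab-separated file, and the function name count_odd_exons is invented for this example.

```shell
# Hypothetical helper: count exon lines whose attribute column does not begin with gene_id.
# Assumes a tab-separated GTF file given as the first argument.
count_odd_exons() {
    awk -F'\t' '
        $3 == "exon" && $9 !~ /^gene_id "/ { n++ }
        END { print n + 0 }    # n + 0 prints 0 when no such line was seen
    ' "$1"
}
```

Run it as count_odd_exons test.gtf.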

Exactly what I needed! Thanks a lot! :)
