Question

Split GTF file by gene id

9.5 years ago

kma • 0

Hi,

I want to split an Ensembl GTF file into individual files by gene id. I tried using awk to little avail:

awk '{gsub(/"|;/, "", $10); print >> ($10".gtf")}' Homo_sapiens.GRCh38.82.gtf

This splits the file as expected but also strips the quotes and semicolon from the gene_id entry in each GTF file:

gene_id ENSG00000142733

instead of the correct

gene_id "ENSG00000142733";

This was fixed easily:

find . -type f -name "*.gtf" | xargs sed -i "" 's/\(ENSG[0-9]\{11\}\)/"\1";/'

However, the remaining problem is that awk also replaces all tabs separating the nine columns/fields with spaces, rendering the resulting GTF files invalid.

If anyone can point me to a possible solution, that would be fantastic!

Thanks, Kemal

text processing • 2.8k views

score 2 · Answer 1 · 2016-06-10

9.5 years ago

Devon Ryan 105k

The tab issue can be fixed with BEGIN{OFS="\t"}. Having said that, I'd encourage you to use Python's shlex module, which can split the last column of GTF files properly. You can then do this more easily in a few lines.
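
A minimal sketch of the shlex approach, not a definitive implementation. The function names `gene_id_of` and `split_gtf_by_gene` are made up for illustration. It assumes Ensembl-style attributes of the form `key "value";` separated by a space, and it writes each input line unchanged, so the tabs, quotes and semicolons all survive.

```python
import shlex
from pathlib import Path


def gene_id_of(attr_field):
    """Return the gene_id from GTF column 9, or None if it is absent."""
    # shlex.split honours the double quotes, so a value containing spaces
    # stays in one token. The quotes are removed, but the trailing ';'
    # stays attached to the value and has to be stripped by hand.
    tokens = shlex.split(attr_field)
    # Assumption: tokens strictly alternate key, value.
    attrs = {key: value.rstrip(";")
             for key, value in zip(tokens[::2], tokens[1::2])}
    return attrs.get("gene_id")


def split_gtf_by_gene(gtf_path, out_dir):
    """Append every feature line to <out_dir>/<gene_id>.gtf."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(gtf_path) as gtf:
        for line in gtf:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                continue  # not a well-formed GTF record
            gene_id = gene_id_of(fields[8])
            if gene_id is None:
                continue
            # Append mode: one file per gene, the original line untouched.
            with open(out_dir / f"{gene_id}.gtf", "a") as out:
                out.write(line)
```

Calling `split_gtf_by_gene("Homo_sapiens.GRCh38.82.gtf", "by_gene")` would put one file per gene into a `by_gene` directory (the directory name is arbitrary). Reopening the output file for every line is slow but avoids holding tens of thousands of file handles open. Because the files are opened in append mode, clear the output directory before rerunning.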

Thanks, Devon. I did not see your comment until now. I tried setting the field separator to tabs explicitly (with FS/OFS), but that also separated the attributes in column 9 by tabs, at least as far as I remember. I will give it another try and check out the shlex module. Thanks for the suggestion!

ADD REPLY • link 9.5 years ago by kma • 0