Split GTF file by gene id
9.5 years ago
kma • 0

Hi,

I want to split an Ensembl GTF file into individual files, one per gene id. I tried awk, with only partial success:

awk '{gsub(/"|;/, "", $10); print >> ($10".gtf")}' Homo_sapiens.GRCh38.82.gtf

This splits the file as expected but also strips the quotes and semicolon from the gene_id entry in each GTF file:

gene_id ENSG00000142733

instead of the correct

gene_id "ENSG00000142733";

This was fixed easily:

find . -type f -name "*.gtf" | xargs sed -i "" 's/\(ENSG[0-9]\{11\}\)/"\1";/'

The remaining problem is that awk also replaces the tabs separating the nine columns/fields with spaces, which makes the resulting GTF files invalid.

If anyone can point me to a possible solution, that would be fantastic!

Thanks, Kemal

text processing • 2.8k views
9.5 years ago

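The tab issue can be fixed with BEGIN{OFS="\t"}: awk rebuilds the line with OFS (a single space by default) whenever a field is modified. With the default whitespace field splitting this also puts tabs between the attributes, so you would need to set FS to a tab as well and take the gene id from $9. Having said that, I'd encourage you to use the shlex module from Python, which can split the last column of GTF files properly. You can then do this more easily in a few lines.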
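As a rough sketch of the shlex idea (not a tested pipeline): the function names `parse_attributes`, `split_gtf_by_gene` and `write_groups` below are made up for illustration, and the code assumes attribute values are double-quoted as in Ensembl GTF files. Each line is written back unchanged, so tabs, quotes and semicolons all survive.

```python
import os
import shlex
from collections import defaultdict


def parse_attributes(attr_field):
    """Parse GTF column 9, e.g. 'gene_id "X"; gene_name "Y";', into a dict.

    Assumes values are double-quoted; a key that occurs more than once
    (such as "tag") keeps only its last value here.
    """
    lexer = shlex.shlex(attr_field, posix=True)
    lexer.whitespace_split = True  # split on whitespace characters only
    lexer.whitespace += ";"        # treat ';' outside quotes as a separator
    lexer.commenters = ""          # do not treat '#' as a comment start
    tokens = list(lexer)
    return dict(zip(tokens[0::2], tokens[1::2]))


def split_gtf_by_gene(lines):
    """Group GTF lines by gene_id without modifying the lines themselves."""
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete GTF record
        gene_id = parse_attributes(columns[8]).get("gene_id")
        if gene_id is not None:
            groups[gene_id].append(line)
    return groups


def write_groups(groups, out_dir="."):
    """Write one <gene_id>.gtf file per gene into out_dir."""
    for gene_id, gene_lines in groups.items():
        path = os.path.join(out_dir, gene_id + ".gtf")
        with open(path, "w") as handle:
            handle.writelines(gene_lines)
```

It could be used along the lines of `with open("Homo_sapiens.GRCh38.82.gtf") as fh: write_groups(split_gtf_by_gene(fh))`. Note that this keeps every line in memory before writing, which may be too much for a full human annotation; appending to each gene's file line by line would trade memory for many more file opens.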


Thanks, Devon. I did not see your comment until now. I tried setting the field separator to tabs explicitly (with FS/OFS), but that also separated the attributes in column 9 by tabs, at least as far as I remember. I will give it another try and check out the shlex module. Thanks for the suggestion!
