Hi,
I want to split an Ensembl GTF file into one file per gene ID. I tried awk, with only partial success:
awk '{gsub(/"|;/, "", $10); print >> ($10".gtf")}' Homo_sapiens.GRCh38.82.gtf
This splits the file as expected but also strips the quotes and semicolon from the gene_id entry in each GTF file:
gene_id ENSG00000142733
instead of the correct
gene_id "ENSG00000142733";
This was easily fixed (BSD/macOS sed syntax; GNU sed takes -i without the empty "" argument):
find . -type f -name "*.gtf" | xargs sed -i "" 's/\(ENSG[0-9]\{11\}\)/"\1";/'
However, the remaining problem is that awk also replaces all tabs separating the nine columns/fields with spaces, which renders the resulting GTF files invalid.
If anyone can point me to a possible solution, that would be fantastic!
Thanks, Kemal
Thanks, Devon. I did not see your comment until now. I tried setting the field separator to tabs explicitly (with FS/OFS), but that also separated the attributes in column 9 by tabs, at least as far as I remember. I will give it another try and check out the shlex module. Thanks for the suggestion!
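
For reference, here is a minimal sketch of one way around the tab problem without touching FS/OFS. awk rebuilds the whole line with OFS (a single space by default) as soon as any field is assigned to, and gsub() on $10 counts as an assignment. Copying $10 into a variable and cleaning the copy leaves the original line, tabs and quotes included, untouched. It assumes gene_id is the first attribute in column 9 (as the original command does). The file sample.gtf and its records are made up for illustration and stand in for Homo_sapiens.GRCh38.82.gtf.

```shell
# Work in a scratch directory so the per-gene files do not clutter the current one.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Tiny made-up GTF sample (tab-separated, nine columns) plus one header line.
printf '#!genome-build GRCh38\n' > sample.gtf
printf '1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG00000142733"; gene_version "1";\n' >> sample.gtf
printf '1\thavana\texon\t100\t150\t.\t+\t.\tgene_id "ENSG00000142733"; exon_number "1";\n' >> sample.gtf
printf '2\thavana\tgene\t300\t400\t.\t-\t.\tgene_id "ENSG00000000001"; gene_version "2";\n' >> sample.gtf

awk '
  /^#/ { next }            # skip header/comment lines, which have no gene_id
  {
    id = $10               # copy the field; $10 itself is never modified
    gsub(/"|;/, "", id)    # strip quotes and semicolon from the copy only
    out = id ".gtf"
    print >> out           # $0 was never rebuilt, so tabs and quotes survive
    close(out)             # avoid running out of open file handles on large inputs
  }
' sample.gtf
```

Because the lines are written out unchanged, the follow-up sed step to restore the quotes and semicolon should no longer be needed. The close() call costs some speed but avoids hitting the open-file limit that some awk implementations run into with tens of thousands of genes.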