I have a GTEx variant file, the head of which looks as follows:
phenotype_id variant_id
chr1:15947:16607:clu_36198:ENSG00000227232.5 chr1_13550_G_A_b38 ...
chr1:15947:16607:clu_36198:ENSG00000227232.5 chr1_14671_G_C_b38 ...
chr1:15947:16607:clu_36198:ENSG00000227232.5 chr1_14677_G_A_b38 ...
chr1:15947:16607:clu_36198:ENSG00000227232.5 chr1_16841_G_T_b38 ...
I would like to index it with tabix to make lookups faster. It is gzip-compressed, so I will first have to recompress it with bgzip:
zcat gtex.txt.gz | bgzip > gtex.txt.bgz
However, I am not quite sure how to proceed from there, given that the data is not tab-delimited.
As a trial, I tried indexing the first 1000 lines:
zcat gtex.txt.gz | head -n 1000 | bgzip > gtex_1000.txt.bgz
./tabix -p bed gtex_1000.gz #index as a bed file
[get_intv] the following line cannot be parsed and skipped: chr1:15947:16607:clu_36198:ENSG00000227232.5 chr1_13550_G_A_b38 ......
./tabix -p vcf gtex_1000.gz #index as a vcf file
Indexing as a BED file produces a warning, while indexing as a VCF file gives none. Either way, when I try to retrieve a region:
./tabix test.gz chr1:15000:17000
It returns nothing.
I am starting to think that I will just have to write a script that splits the fields on ':', writes the data to a new file, and then index that file. Does anyone know of a trick for indexing files with unconventional delimiters?
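For reference, here is a minimal sketch of what such a reformatting script might look like. It assumes that the second column (variant_id) always has the form chrom_pos_ref_alt_build, as in the head shown above, and it takes the coordinates from that column rather than from the ':'-delimited phenotype_id. The function name `add_coordinate_columns` is hypothetical, and the rows are sorted in memory, which would only be practical for a subset of the file.

```python
def add_coordinate_columns(lines):
    """Prepend chrom and pos columns parsed from variant_id, sorted by position.

    Assumes the first line is a header and that column 2 (variant_id)
    looks like chr1_13550_G_A_b38. Returns a list of tab-delimited strings.
    """
    header = None
    rows = []
    for line in lines:
        fields = line.split()  # splits on any whitespace (tabs or spaces)
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields
            continue
        # Assumption: variant_id is chrom_pos_ref_alt_build
        chrom, pos = fields[1].split("_")[:2]
        rows.append((chrom, int(pos), fields))

    if header is None:
        return []

    # tabix expects rows grouped by sequence name and ordered by position
    rows.sort(key=lambda row: (row[0], row[1]))

    # A leading '#' marks the header as a comment line
    out = ["\t".join(["#chrom", "pos"] + header)]
    for chrom, pos, fields in rows:
        out.append("\t".join([chrom, str(pos)] + fields))
    return out


# Small demonstration on two rows from the head of the file
# (only the two columns visible in the post are used here).
sample = [
    "phenotype_id\tvariant_id\n",
    "chr1:15947:16607:clu_36198:ENSG00000227232.5\tchr1_14671_G_C_b38\n",
    "chr1:15947:16607:clu_36198:ENSG00000227232.5\tchr1_13550_G_A_b38\n",
]
for row in add_coordinate_columns(sample):
    print(row)
```

The output of such a script could then be compressed with bgzip and indexed by telling tabix which columns hold the sequence name and position (`tabix -s 1 -b 2 -e 2`), instead of using the bed or vcf presets. For the full file, an external sort on the first two columns would probably be a better fit than sorting in memory.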