Question: tabix indexing a GTEx variant file- ':' delimited
0
gravatar for chrisclarkson100
6 months ago by
European Union
chrisclarkson10090 wrote:

I have a gtex variant file- the head of which looks as follows:

phenotype_id                                    variant_id     
chr1:15947:16607:clu_36198:ENSG00000227232.5    chr1_13550_G_A_b38  ...
chr1:15947:16607:clu_36198:ENSG00000227232.5    chr1_14671_G_C_b38  ...
chr1:15947:16607:clu_36198:ENSG00000227232.5    chr1_14677_G_A_b38  ...
chr1:15947:16607:clu_36198:ENSG00000227232.5    chr1_16841_G_T_b38  ...

I would like to tabix index it to make lookups faster. It is gzip compressed and I will firstly have to do:

zcat gtex.txt.gz | bgzip > gtex.txt.bgz

However I am not quite sure how to proceed from there given that the data is not tab-delimited.

As a trial I tried the first 1000 lines:

zcat gtex.txt.gz | head -n 1000 | bgzip > gtex_1000.txt.bgz

./tabix -p bed gtex_1000.gz #index as a bed file
[get_intv] the following line cannot be parsed and skipped: chr1:15947:16607:clu_36198:ENSG00000227232.5        chr1_13550_G_A_b38     ......

./tabix -p vcf gtex_1000.gz #index as a vcf file

Indexing as a bed file results in a warning while indexing as a vcf gives no warning yet either way when I try to retrieve a sequence:

./tabix test.gz chr1:15000:17000

It returns nothing.

I am starting to think that I will just have to write a script that splits on the ':' and writes the data to a new file.... and then index that file... Does anyone know of a trick to index the files with unconventional delimiting?

snp next-gen assembly • 209 views
ADD COMMENTlink modified 6 months ago • written 6 months ago by chrisclarkson10090

./tabix -p bed gtex_1000.gz #index as a vcf file

Is that a typo? You're still indexing as bed format there.

ADD REPLYlink written 6 months ago by Ram32k

thanks yes that was a typo- corrected now

ADD REPLYlink written 6 months ago by chrisclarkson10090
1
gravatar for chrisclarkson100
6 months ago by
European Union
chrisclarkson10090 wrote:

I proceeded as follows:

import gzip
import sys

qtls=['sqtls','eqtls']
tissues=['Adipose_Subcutaneous', 'Adipose_Visceral_Omentum', 'Adrenal_Gland', 'Artery_Aorta', 'Artery_Coronary', 'Artery_Tibial', 'Brain_Amygdala', 'Brain_Anterior_cingulate_cortex_BA24', 'Brain_Caudate_basal_ganglia', 'Brain_Cerebellar_Hemisphere', 'Brain_Cerebellum', 'Brain_Cortex', 'Brain_Frontal_Cortex_BA9', 'Brain_Hippocampus', 'Brain_Hypothalamus', 'Brain_Nucleus_accumbens_basal_ganglia', 'Brain_Putamen_basal_ganglia', 'Brain_Spinal_cord_cervical_c-1', 'Brain_Substantia_nigra', 'Breast_Mammary_Tissue', 'Cells_Cultured_fibroblasts', 'Cells_EBV-transformed_lymphocytes', 'Colon_Sigmoid', 'Colon_Transverse', 'Esophagus_Gastroesophageal_Junction', 'Esophagus_Mucosa', 'Esophagus_Muscularis', 'Heart_Atrial_Appendage', 'Heart_Left_Ventricle', 'Kidney_Cortex', 'Liver', 'Lung', 'Minor_Salivary_Gland', 'Muscle_Skeletal', 'Nerve_Tibial', 'Ovary', 'Pancreas', 'Pituitary', 'Prostate', 'Skin_Not_Sun_Exposed_Suprapubic', 'Skin_Sun_Exposed_Lower_leg', 'Small_Intestine_Terminal_Ileum', 'Spleen', 'Stomach', 'Testis', 'Thyroid', 'Uterus', 'Vagina', 'Whole_Blood']

gene_old=''


def index(tissue,qtl,gene_old):
        counter_old,counter=1,1
        if qtl=='sqtls':
                path='/path/to/file/'+tissue+'.v8.sqtl_allpairs.txt.gz'
        else:
                path='/path/to/file/'+tissue+'.allpairs.txt.gz'
        f = gzip.open(path, 'r')
        next(f)
        for line in f:
                try:
                        line=line.decode()
                        line=line[:line.find('\t')]
                        gene=line[line.rfind('ENS'):].split('.')[0]
                        #print(gene,gene_old,counter_old,counter)
                        with open(qtl+'_'+tissue+'.idx','a') as indexed:          
                                indexed.write(str(gene_old)+'\t'+str(counter_old)+'\t'+str(counter)+'\n')
                        gene_old=gene
                        counter +=1
                        counter_old=counter
                except IndexError:
                        print('tissue,qtl,gene_old,counter')
                        counter+=1
        f.close()


for qtl in qtls:
        for tissue in tissues:
               qtl=row.qtls
               tissue=row.tissues
               print(qtl, tissue)
               index(tissue,qtl,gene_old)

The above script writes an index for each gene which should look as follows (took 2 days to finish all files):

ensid    row_start    row_end
......
ENSG00000230337     1526391 1534329
ENSG00000171819 1534330 1542321
ENSG00000198793    1542322 1550178
ENSG00000120942  1550179 1558010
.....

And to get a gene of interest you will have to look up the coordinates for an ensemblid/gene and for example:

zcat Brain_Nucleus_accumbens_basal_ganglia.v8.sqtl_allpairs.txt.gz | sed -n '1550179,1558010p;1558011q' > ENSG00000120942.txt
ADD COMMENTlink modified 6 months ago • written 6 months ago by chrisclarkson10090

If this answers your question, accept it as an answer so your question is marked as solved.

ADD REPLYlink written 6 months ago by Ram32k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1820 users visited in the last hour
_