I am trying to do quality control of gene expression data. I am running a script to extract the sample IDs, tissues, and sequencing method from the GTEx file GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt. The script:
#!/usr/bin/env python3
from collections import defaultdict
import gzip

def get_ids_from_vcf(vcf_file):
    with gzip.open(vcf_file) as fh:
        for line in fh:
            line = line.decode('utf-8')
            if not line.startswith("##"):
                break
    sample_ids = line.strip().split()[9:]
    return set(sample_ids)

vcf_file = '/new_rna-seq/quality_Control/chr22_annotate.vcf.gz'
vcf_sample_ids = get_ids_from_vcf(vcf_file)
tissue_donors = defaultdict(list)
samples_file = '/new_rna-seq/quality_Control/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt'

with open(samples_file) as fh:
    hdr = fh.readline()  # Drop header
    for line in fh:
        if line == '' or line == '\n':
            continue
        fields = line.strip().split('\t')
        sample_id = fields[0]
        tissue = fields[13]
        frz = fields[26]
        donor_id = '-'.join(sample_id.split('-')[:2])
        if donor_id in vcf_sample_ids and frz == 'RNASEQ':
            tissue_donors[tissue].append(sample_id + '\n')

for tissue in tissue_donors:
    with open(tissue.replace(' - ', '_').replace(' ', '_') + '_donors.txt', 'w') as fh:
        fh.writelines(tissue_donors[tissue])
When I run this script, I get the following error:
python get_donors_by_tissue.py
Traceback (most recent call last):
  File "/home/TWAS/data_download/new_rna-seq/quality_Control/get_donors_by_tissue.py", line 28, in <module>
    frz = fields[30]
IndexError: list index out of range
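An IndexError like this arises when a row has fewer tab-separated fields than the index asks for. One possible cause, which I have not confirmed against this file, is that `line.strip()` removes trailing tabs as well as the newline, so rows whose last columns are empty produce a shorter list. A minimal sketch with a made-up row (the values are hypothetical, not taken from the GTEx file):

```python
# Hypothetical row with 5 tab-separated columns; the last two are empty.
row = "GTEX-XXXX-0001-SM-AAAAA\tBlood\tRNASEQ\t\t\n"

# strip() removes the trailing tabs as well as the newline,
# so the empty trailing columns disappear.
stripped_fields = row.strip().split('\t')

# Removing only the line terminator keeps every column, empty or not.
kept_fields = row.rstrip('\r\n').split('\t')

print(len(stripped_fields))  # 3
print(len(kept_fields))      # 5
```

If this is what is happening, replacing `line.strip()` with `line.rstrip('\r\n')` would keep the field count constant across rows, although a fixed index can still point at the wrong column if the file layout differs from what the script expects.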
The error relates to the file GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt, which I downloaded via https://www.gtexportal.org/home/datasets with:
wget https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt
Could anyone help me with this? I do not understand how to get this information from the file.
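One way to avoid depending on column positions is to look the columns up by header name. The sketch below assumes the header contains the columns `SAMPID`, `SMTSD` (detailed tissue) and `SMAFRZE` (analysis freeze); these names are my assumption and should be checked against the first line of the downloaded file. The function name `group_samples_by_tissue` and the demo rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_samples_by_tissue(fh, donor_ids, tissue_col="SMTSD", freeze_col="SMAFRZE"):
    """Group RNASEQ sample IDs by tissue, looking columns up by header name."""
    # DictReader takes the field names from the first line of the file.
    reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    tissue_samples = defaultdict(list)
    for row in reader:
        sample_id = row["SAMPID"]
        # The donor ID is the first two dash-separated parts of the sample ID.
        donor_id = "-".join(sample_id.split("-")[:2])
        # Short rows get None for missing columns, so this comparison is
        # simply False instead of raising IndexError.
        if donor_id in donor_ids and row[freeze_col] == "RNASEQ":
            tissue_samples[row[tissue_col]].append(sample_id)
    return dict(tissue_samples)


# Tiny made-up table standing in for the real annotation file.
demo = io.StringIO(
    "SAMPID\tSMTSD\tSMAFRZE\n"
    "GTEX-AAAA-0001-SM-00001\tWhole Blood\tRNASEQ\n"
    "GTEX-AAAA-0002-SM-00002\tLung\tWGS\n"
    "GTEX-BBBB-0001-SM-00003\tLung\tRNASEQ\n"
    "GTEX-CCCC-0001-SM-00004\tLung\tRNASEQ\n"
)
result = group_samples_by_tissue(demo, donor_ids={"GTEX-AAAA", "GTEX-BBBB"})
print(result)
```

With the real file, the same function could be called as `group_samples_by_tissue(open(samples_file, newline=''), vcf_sample_ids)`, and the per-tissue lists written out as in the original script.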