Hello! I haven't seen any similar question, so:

I have a list of genes and several VCF files. I would like to check every VCF file in a directory against the gene list and, for each match, return the full line in a single table (e.g. Excel). The first column should hold the name of the file the match came from.

At the moment I have a filter script that handles each file separately, but I don't know how to search a directory tree and return everything in a single table.

import sys
from glob import glob
from pandas import DataFrame

# Read the gene list (first column of a tab-separated file), skipping the header line.
with open("./genes_rp.txt", 'r') as handle:
    gene_list = handle.readlines()[1:]
final_list = list()
for gene in gene_list:
    gene = gene.strip('\n').split('\t')
    final_list.append(gene[0].strip())

# sys.argv[1] is the sample directory and must end with a path separator.
sample_folder = glob(sys.argv[1] + '*prefiltered.txt')
# Loop over every matching file (the original [1:] slice skipped the first one).
for sample_path in sample_folder:
    with open(sample_path, 'r') as handle:
        sample = handle.readlines()
    header = sample[0].strip('\n').split('\t')
    output = list()
    output.append(header)
    # Keep only the variants whose gene (first column) is in the gene list.
    for variant in sample[1:]:
        variant = variant.strip('\n').split('\t')
        variant_gene = variant[0]
        if variant_gene in final_list:
            output.append(variant)
    # Write one Excel file per sample.
    df = DataFrame(output)
    df.to_excel(sample_path + '_rp.xlsx', sheet_name='sheet1', header=False, index=False)

The script above is useful if you have a VCF with a lot of genes and you want to see only a few of them.

Use the standard Linux tools. Something like:

find /path/to/dir/ -type f -name "*.vcf" | while read F ; do grep -H -w -o -f genes.txt "$F" | uniq ; done

And please, don't use Excel. Excel is bad: it silently converts some gene symbols (e.g. SEPT2, MARCH1) into dates.
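If you would rather stay in Python, here is a minimal sketch of the single-table version asked for in the question. It assumes the same layout as the script in the question: tab-separated files with a header line and the gene name in the first column, and file names ending in `prefiltered.txt`. The function names (`load_genes`, `collect_matches`, `write_table`) are made up for illustration, and the output is a plain tab-separated file rather than Excel.

```python
import csv
import os


def load_genes(path):
    """Return the gene names in the first column of a tab-separated file."""
    with open(path) as handle:
        next(handle, None)  # skip the header line, as the original script does
        return {line.split('\t')[0].strip() for line in handle if line.strip()}


def collect_matches(root, genes, suffix='prefiltered.txt'):
    """Walk root recursively; yield [file name, col1, col2, ...] per matching line."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            with open(os.path.join(dirpath, name)) as handle:
                next(handle, None)  # skip the per-file header
                for line in handle:
                    fields = line.rstrip('\n').split('\t')
                    # Assumption: the gene name is the first column.
                    if fields[0] in genes:
                        yield [name] + fields


def write_table(rows, out_path):
    """Write all rows to one tab-separated file."""
    with open(out_path, 'w', newline='') as out:
        csv.writer(out, delimiter='\t', lineterminator='\n').writerows(rows)
```

You would call it as `write_table(collect_matches('/path/to/dir/', load_genes('./genes_rp.txt')), 'all_matches.tsv')`, where `all_matches.tsv` is just a placeholder output name. Keeping the genes in a set makes each lookup cheap even for long gene lists.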


