I have a vcf file with chromosome and positions contained therein and a bigwig file of expression across a genome. I want to filter those SNPs which fall within a gene with RNA expression i.e. above 200. The file is in the format of: chromosome, start, stop, RNA exp.
I am currently using a perl script to read the big wig line by line (I'm thinking would be quicker searching the file using grep) and every line searching for a match using every line of my vcf to match chromosome and position within 5 because there is a skipping of positions by 1 or two (realised that this can be larger so need to match within start and stop positions). The end file I can remove resulting duplicates by a perl one liner. The perl script looks like it may take between 7-14 days, although the bigwig file is 1GB so expected.
Does anyone know of a simplier way to search big wig file using VCF (all ordered in sequence found in genome) and match but only once, then if exp above 200 print to new file?
vcf file (usual, first two columns only those want to match RNA expression of line that contains this position, then print the entire line):
S1ch00 245 G GA
bigwig file (realised my offset position of -5 and +5 is not enough if do it this way:
chromosome start position Stop position Expression
S1ch00 238 240 1
S1ch00 240 275 2
S1ch00 275 278 1
S1ch00 299 303 1
S1ch00 303 349 2
S1ch00 349 353 250
It will be useful for others if you can paste first five lines from both the files so that it would be easy for them to understand. Also, do you know how to code? You can use hash in perl or directory in python to store the expression data such that you have chromosomes as keys. This way your search space would be reduced. If you dont know coding, you can split your files by chromosomes and run the jobs in parallel.
See: What is the quickest algorithm for range overlap? I think this shouldn't take more than a few minutes.
I agree, this should take a few minutes with Perl, not weeks. It would be helpful to see some of the code so we can better understand what is being tried. That way we could identify the issues and maybe find a solution.
updated. thanks
Can you try what Alex explained below. It should be pretty straight forward and fast to do. I don't think there is any other way until you want to write your own code.