Question

File Searching Efficiency Using Perl Or Grep

0

Entering edit mode

10.4 years ago

rob234king ▴ 610

I have a vcf file with chromosome and positions contained therein and a bigwig file of expression across a genome. I want to filter those SNPs which fall within a gene with RNA expression i.e. above 200. The file is in the format of: chromosome, start, stop, RNA exp.

I am currently using a perl script to read the big wig line by line (I'm thinking would be quicker searching the file using grep) and every line searching for a match using every line of my vcf to match chromosome and position within 5 because there is a skipping of positions by 1 or two (realised that this can be larger so need to match within start and stop positions). The end file I can remove resulting duplicates by a perl one liner. The perl script looks like it may take between 7-14 days, although the bigwig file is 1GB so expected.

Does anyone know of a simplier way to search big wig file using VCF (all ordered in sequence found in genome) and match but only once, then if exp above 200 print to new file?

vcf file (usual, first two columns only those want to match RNA expression of line that contains this position, then print the entire line):

S1ch00 245 G GA

bigwig file (realised my offset position of -5 and +5 is not enough if do it this way:

chromosome start position Stop position Expression

S1ch00 238 240 1

S1ch00 240 275 2

S1ch00 275 278 1

S1ch00 299 303 1

S1ch00 303 349 2

S1ch00 349 353 250

perl • 3.6k views

ADD COMMENT • link updated 10.4 years ago by Pierre Lindenbaum 161k • written 10.4 years ago by rob234king ▴ 610

1

Entering edit mode

It will be useful for others if you can paste first five lines from both the files so that it would be easy for them to understand. Also, do you know how to code? You can use hash in perl or directory in python to store the expression data such that you have chromosomes as keys. This way your search space would be reduced. If you dont know coding, you can split your files by chromosomes and run the jobs in parallel.

ADD REPLY • link 10.4 years ago by Ashutosh Pandey 12k

0

Entering edit mode

See: What is the quickest algorithm for range overlap? I think this shouldn't take more than a few minutes.

ADD REPLY • link 10.4 years ago by Michael 54k

0

Entering edit mode

I agree, this should take a few minutes with Perl, not weeks. It would be helpful to see some of the code so we can better understand what is being tried. That way we could identify the issues and maybe find a solution.

ADD REPLY • link 10.4 years ago by SES 8.6k

0

Entering edit mode

updated. thanks

ADD REPLY • link 10.4 years ago by rob234king ▴ 610

0

Entering edit mode

Can you try what Alex explained below. It should be pretty straight forward and fast to do. I don't think there is any other way until you want to write your own code.

ADD REPLY • link 10.4 years ago by Ashutosh Pandey 12k

score 1 · Answer 1 · 2013-12-02

It is not clear to me what your rate limiting step is with the Perl script. But if you convert both your reference set (SNPs) and query set (genes with RNA expression data) to sorted BED data, you can use BEDOPS tools which are written to take advantage of sorted inputs, in order to calculate very fast answers to overlap and other set operation-based questions.

First, prepare an RNA expression file to query against, which is in sorted BED format.

Start with the Kent tool bigWigToBedGraph. Threshold the output with awk and sort it with BEDOPS sort-bed:

$ bigWigToBedGraph rnaExp.bw rnaExp.bed
$ awk '{if ($4 > 200) print $0}' rnaExp.bed > rnaExp.thresholded.bed
$ sort-bed rnaExp.thresholded.bed > rnaExp.thresholded.sorted.bed

Then convert the VCF data to sorted BED:

$ vcf2bed < mySNPs.vcf > mySNPs.bed

Then apply set operations with bedops:

$ bedops -e -1 mySNPs.bed rnaExp.thresholded.sorted.bed > mySNPsThatMeetRNAExpressionThreshold.bed

The file mySNPsThatMeetRNAExpressionThreshold.bed is your answer: positions of SNPs that are contained within genes with expression values greater than 200.

If you instead ever need to associate specific expression values with the SNP, the bedmap tool can help here:

$ sort-bed rnaExp.bed > rnaExp.sorted.bed
$ bedmap --echo --echo-map-score mySNPs.bed rnaExp.sorted.bed > mySNPsWithAllRNAExpressionValues.bed

As opposed to bedops, which just does set operations, the bedmap operation will also tell you the distribution of expression values that associate with your SNPs. For instance, if you have SNPs with multiple genes spanning each of them, and there is a lot of variance in the expression levels in those overlapping genes, then you might think about how you are thresholding your expression data.

Links to information for UCSC and BEDOPS tools:

score 1 · Answer 2 · 2013-12-02

I wrote a tool named vcfbigwig : https://github.com/lindenb/jvarkit/wiki/VCFBigWig

I takes as input a VCF and a Bigwig file, annotate the VCF with the overlapping data in the bigwig.

 java -jar dist/vcfbigwig.jar \
     IN=input.vcf.gz \
     INFOTAG=GERP \
     BW=gerp.bw

##INFO=<ID=GERP,Number=1,Type=Float,Description="Values from bigwig file: com.github.lindenb.jvarkit.tools.vcfbigwig.VCFBigWig BIGWIG=gerp.bw TAG=GERP IN=input.vcf.gz    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO(...)
A    33926    .    G    A    182    .    GERP=-6.35(...)
A    45365    .    A    G    222    .    GERP=-3.55(...)

you can then filter the resulting VCF with GATK http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_filters_VariantFiltration.html or with my tool https://github.com/lindenb/jvarkit/wiki/VCFFilterJS