Question: Quick Search For Overlaps With A Mask File
0
gravatar for diviya.smith
5.5 years ago by
diviya.smith50
United States
diviya.smith50 wrote:

Hi there,

I am interested in computing some statistics for a subset of the genome. I have compiled a list of 28M sites and I am now checking if this site passed the 1000 genomes filters. In order to do this, I am using samtools to access the mask file (in fasta format) for the particular position using -

samtools faidx maskfile chr_pos

However, for 28M sites, this is very slow. Does anyone have any suggestions to speed up the search? I am fairly good at Perl programming but I dont think using the standard search methods that I am familiar with will significantly speed things up. Any suggestions would be very helpful?

Also as step2, I need to lookup the results from a table for which I am currently using perl hashes but again too slow for such large amount of data. I would be most grateful if you have any suggestions to more the speed.

-Diviya

search • 1.3k views
ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by diviya.smith50
1

What are the attributes for these 28M sites ? How you are matching with genomes ? Are you trying to match based on some interval range etc?

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by always_learning960
0
gravatar for Jorge Amigo
5.5 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

you could always try using ANNOVAR filtering option, which would give you 1000 genomes' population frequencies, if by "checking if this site passed the 1000 genomes filters" you mean that you want to want to know if the 1000 genomes project detected those sites as variants:

annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 YOUR_FILE humandb/
ADD COMMENTlink written 5.5 years ago by Jorge Amigo11k

Pardon my ignorance, But Can you please explain that how this is going help above posed question ? This is purely for my knowledge only !

ADD REPLYlink written 5.5 years ago by always_learning960

I think this is an attempt to be more specific about "this site passed the 1000 genomes filters".

ADD REPLYlink written 5.5 years ago by Michael Dondrup46k

thanks Michael, you got it. I was just suggesting to filter (or to highlight) 1000 genomes' variants, if the original 28M sites are to be queried for known variation.

ADD REPLYlink written 5.5 years ago by Jorge Amigo11k
0
gravatar for Michael Dondrup
5.5 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

You should calculate range overlaps between query regions and 1kg regions to approach this in the most efficient way. Therefore, convert both files to ranges either in GFF or BED format and use bedtools or Bioconductor/IRanges to calculate overlaps. See What is the quickest algorithm for range overlap?

To do this you need to face the following challenges:

  • Is the mask file the correct and optimal way of getting all the "site passed the 1000 genomes filters" and what does that mean?
  • How is the fact encoded in the mask file and how can this be converted to BED or GFF? If you show us where you got that file from would certainly help.
ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by Michael Dondrup46k
0
gravatar for diviya.smith
5.5 years ago by
diviya.smith50
United States
diviya.smith50 wrote:

Sorry, perhaps I was not very clear. I am interested in extracting all the CG sites across the genome (~28M) and checking if these sites passed the 1000 genomes filters (information for which is included in a mask file in fasta format). I am currently using samtools to pull out the sites from the fasta file but the search is very slow. Any suggestions on how I can improve the search time?

ADD COMMENTlink written 5.5 years ago by diviya.smith50

this should be a comment, not an answer. anyway, I would go for Michael's suggestion: getting 1000genomes' masked regions in bed format and intersecting them with your sites.

ADD REPLYlink written 5.5 years ago by Jorge Amigo11k

Thanks all for the suggestions. This is very helpful.

ADD REPLYlink written 5.5 years ago by diviya.smith50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1818 users visited in the last hour