Question: How to extract specific reads from mapped BED
cfarmeri150 wrote, 3.8 years ago (Japan):

Hi.

I have a BED file (about 500 MB) produced by bowtie2 paired-end mapping. Here is the head of this BED:

chr14   8835890     8835954     HWUSI-EAS476:3:2:4670:1030#0/1  42      -
chr14   8835750     8835806     HWUSI-EAS476:3:2:4670:1030#0/2  42      +
chr3    135330299   135330319   HWUSI-EAS476:3:2:6478:1025#0/1  42      +
chr3    135330468   135330508   HWUSI-EAS476:3:2:6478:1025#0/2  42      -
...

I also have a corresponding read ID list (5,973,053 IDs). Here is the head of this list:

HWUSI-EAS476:3:100:10000:10332
HWUSI-EAS476:3:100:10000:1189
HWUSI-EAS476:3:100:10001:12090
...

I would like to extract BED lines from the original BED according to the list of read IDs.

I have tried grep in a shell script and re.search() in Python, but neither worked correctly because of the file size.

Can anyone solve this problem? Thank you for your help!

Alex Reynolds28k wrote, 3.8 years ago (Seattle, WA USA):

See grep -f <filename>, which applies a grep search using the patterns listed in the file <filename>.
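As a minimal sketch of this approach (the file names here are placeholders; substitute your real pattern list and BED file), note also that adding -F treats the patterns as fixed strings rather than regular expressions, which is usually much faster for plain read IDs:

```shell
# Toy inputs standing in for the real pattern list and BED file.
printf 'HWUSI-EAS476:3:2:4670:1030\n' > ids.txt
printf 'chr14\t8835890\t8835954\tHWUSI-EAS476:3:2:4670:1030#0/1\t42\t-\n'  > data.bed
printf 'chr3\t135330299\t135330319\tHWUSI-EAS476:3:2:6478:1025#0/1\t42\t+\n' >> data.bed

# -f ids.txt: read one pattern per line from ids.txt
# -F: patterns are fixed strings, not regexes (faster, and avoids
#     any regex surprises from characters like ':' in the IDs)
grep -F -f ids.txt data.bed > filtered.bed
```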

cfarmeri150 wrote, 3.7 years ago:

Thanks Alex, I tried your advice.

But the process aborted because it ran out of memory.

I assigned 20 GB of memory to this process, but it eventually aborted anyway.

Does anyone know how to accomplish this without using a lot of memory?

Please help me!


Split up your file containing ~6M patterns into smaller subsets using split -l <lines> <pattern_filename> <split_result_prefix> (see man split for more information). Perhaps start with 100k lines per subset, which would give you about 60 smaller files, each containing a subset of patterns.

Run separate grep -f <filename> processes, where each <filename> is one of the pattern subsets.

If you have access to a computational cluster, you can run each of the separate grep calls on its own node. Otherwise, run them in series on your computer, one after the other.

When all the searches are done, union all the search results with cat.
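The split/grep/cat steps above can be sketched like this (file names and the chunk size are placeholders; a real run would use something like -l 100000):

```shell
# Toy pattern list and BED file standing in for the real inputs.
printf 'id1\nid2\nid3\nid4\n' > patterns.txt
printf 'chr1\t0\t10\tid2#0/1\t42\t+\n'  > data.bed
printf 'chr1\t20\t30\tid9#0/1\t42\t-\n' >> data.bed

# 1) Split the pattern list into small chunks (2 lines each here).
split -l 2 patterns.txt patterns.split.

# 2) One grep per chunk; each process only holds one chunk in memory.
for chunk in patterns.split.*; do
    grep -F -f "$chunk" data.bed > "$chunk.hits"
done

# 3) Union the per-chunk results.
cat patterns.split.*.hits > data.filtered.bed
```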

(Reply by Alex Reynolds28k, 3.7 years ago.)
Alex Reynolds28k wrote, 3.7 years ago:

An alternative to using grep is to use a hash table in a scripting language like Perl or Python. In Python, hash tables are called dictionaries. I'll show an example script below that uses Perl. 

Hash tables use a lot of memory, but they are fast (often faster than grep) because looking up a value in a hash table takes constant time on average.

You say your BED data looks like this:

chr14   8835890     8835954     HWUSI-EAS476:3:2:4670:1030#0/1  42      -
chr14   8835750     8835806     HWUSI-EAS476:3:2:4670:1030#0/2  42      +
chr3    135330299   135330319   HWUSI-EAS476:3:2:6478:1025#0/1  42      +
chr3    135330468   135330508   HWUSI-EAS476:3:2:6478:1025#0/2  42      -
...

The patterns you are searching for correspond to everything up to the hash character (#) in the ID field of this BED file. So if we parse the BED file line by line and turn its ID field into a key that matches your pattern of interest, we can do a quick lookup in the hash table.

Let's assume that your patterns are in a text file called patterns.txt and your BED elements are in a file called data.bed.

Here is an example Perl script called filter.pl that you could use to filter your BED file against your patterns:

#!/usr/bin/env perl

use strict;
use warnings;

my $patternFn = $ARGV[0];
if (! defined $patternFn) { die "Error: Specify pattern filename\n"; }

# build hash table of keys
print STDERR "Building hash table from patterns...\n";
my $keyring = {};
open KEYS, "<", $patternFn or die "Error: Could not open $patternFn\n";
while (<KEYS>) {
    chomp $_;
    $keyring->{$_} = 1;
}
close KEYS;

# look up key from each line of standard input
print STDERR "Filtering standard input against patterns...\n";
while (<STDIN>) {
    chomp $_;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    if (! defined $id) { die "Error: BED input is missing ID field\n"; }
    (my $key = $id) =~ s/#.*//; # copy id to key, then strip '#' and everything after it
    if (defined $keyring->{$key}) { print "$_\n"; }
}

You could use this like so:

$ filter.pl patterns.txt < data.bed > data.filtered.bed

Again, if your list of patterns is too large to build a hash table that fits into system memory, you can run split on patterns.txt and then run filter.pl with each smaller patterns file. Then cat the filtered BED files together into one result.
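The same hash-table idea can also be written as a short awk one-liner, since awk's associative arrays play the role of the Perl hash (a sketch, reusing the patterns.txt and data.bed placeholder names from above):

```shell
# Toy inputs standing in for patterns.txt and data.bed.
printf 'HWUSI-EAS476:3:2:4670:1030\n' > patterns.txt
printf 'chr14\t8835890\t8835954\tHWUSI-EAS476:3:2:4670:1030#0/1\t42\t-\n'  > data.bed
printf 'chr3\t135330299\t135330319\tHWUSI-EAS476:3:2:6478:1025#0/1\t42\t+\n' >> data.bed

# NR==FNR is true only while reading the first file (the patterns),
# so those lines are loaded into the associative array `ids`.
# For each BED line, strip everything from '#' in field 4 and
# print the line if the remaining key is in the array.
awk -F'\t' 'NR==FNR { ids[$1] = 1; next }
            { key = $4; sub(/#.*/, "", key); if (key in ids) print }' \
    patterns.txt data.bed > data.filtered.bed
```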

Powered by Biostar version 2.3.0