Question

How to extract specific reads from mapped BED

0

Entering edit mode

9.6 years ago

cfarmeri ▴ 210

Hi.

I have mapped BED file(about 500MB) by bowtie2 paired mapping. Following is head of this BED.

chr14   8835890 8835954 HWUSI-EAS476:3:2:4670:1030#0/1  42      -
chr14   8835750 8835806 HWUSI-EAS476:3:2:4670:1030#0/2  42      +
chr3    135330299       135330319       HWUSI-EAS476:3:2:6478:1025#0/1  42      +
chr3    135330468       135330508       HWUSI-EAS476:3:2:6478:1025#0/2  42      -
..
..

And I have also corresponding read ID list(5973053 IDs). Following is head of this BED.

HWUSI-EAS476:3:100:10000:10332
HWUSI-EAS476:3:100:10000:1189
HWUSI-EAS476:3:100:10001:12090
..
..

I would like to extract BED lines from the original BED according to the list of read IDs.

I have used grep in shell script and re.search() in python, but it didn't move correctly because of the file size.

Anyone can solve this problem? Thank you for your help!

bed • 2.3k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.6 years ago by cfarmeri ▴ 210

Ram · Answer 1 · 2015-12-11

5

Entering edit mode

9.6 years ago

Alex Reynolds 36k

See grep -f <filename> to apply a grep search from a file <filename> containing the patterns of interest.

ADD COMMENT • link 9.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks Alex, I tried your advice.

But the process has aborted because of a shortage of the memory.

I assigned 20G memory to this process but eventually it aborted.

Someone knows how to attain this purpose without a lot of memory.

Please help me!

ADD REPLY • link 9.6 years ago by cfarmeri ▴ 210

0

Entering edit mode

Split up your file containing ~6M patterns into smaller subsets using

split -l <lines> <pattern_filename> <split_result_prefix>

(see man split for more information). Perhaps start with 100k lines per subset, which will give you 60 smaller files containing a subset of patterns.

Run separate grep -f <filename> processes, where each <filename> is one of the pattern's subsets.

If you have access to a computational cluster, you can run each of the separate grep calls on each node. Otherwise, run them in series on your computer, one after the other.

When all the searches are done, union all the search results with cat.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Alex Reynolds 36k

Ram · Answer 2 · 2015-12-22

An alternative to using grep is to use a hash table in a scripting language like Perl or Python. In Python, hash tables are called dictionaries. I'll show an example script below that uses Perl.

Hash tables use a lot of memory, but they are fast - often faster than grep - because looking up a value in a hash table takes constant time.

You say your BED data looks like this:

chr14   8835890     8835954     HWUSI-EAS476:3:2:4670:1030#0/1  42      -
chr14   8835750     8835806     HWUSI-EAS476:3:2:4670:1030#0/2  42      +
chr3    135330299   135330319   HWUSI-EAS476:3:2:6478:1025#0/1  42      +
chr3    135330468   135330508   HWUSI-EAS476:3:2:6478:1025#0/2  42      -
...

The patterns you are searching for look like everything up to the hash character (#) in the ID field of this BED file. So if we parse the BED file line by line and turn its ID field into a key that matches your pattern of interest, then we can do a quick lookup of the hash table.

Let's assume that your patterns are in a text file called patterns.txt and your BED elements are in a file called data.bed.

Here is an example Perl script called filter.pl that you could use to filter your BED file against your patterns:

#!/usr/bin/env perl

use strict;
use warnings;

my $patternFn = $ARGV[0];
if (! defined $patternFn) { die "Error: Specify pattern filename\n"; }

# build hash table of keys
print STDERR "Building hash table from patterns...\n";
my $keyring;
open KEYS, "<", $patternFn;
while (<KEYS>) {
    chomp $_;
    $keyring->{$_} = 1;
}
close KEYS;

# look up key from each line of standard input
print STDERR "Filtering standard input against patterns...\n";
while (<STDIN>) {
    chomp $_;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    if (! defined $id) { die "Error: BED input is missing ID field\n"; }
    (my $key = $id) =~ s/#.*+//g; # copy id to key and modify key
    if (defined $keyring->{$key}) { print "$_\n"; }
}

You could use this like so:

$ filter.pl patterns.txt < data.bed > data.filtered.bed

Again, if your list of patterns is too large to make a hash table that fits into system memory, you can use split on patterns.txt and then run filter.pl on each smaller patterns file. Then run cat on each filtered BED file to bring them all together into one result.