Hadoop Streaming for .bam file giving errors
9.2 years ago

Hello Everyone,

I am running BSMAP's methratio.py on Hadoop using Hadoop Streaming, but I get an error when I run the command below.

Hadoop command:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -libjars /usr/local/hadoop-bam/hadoop-bam-7.0.0-jar-with-dependencies.jar -D org.apache.hadoop.mapreduce.lib.input.FileInputFormat=org.seqdoop.hadoop_bam.BAMInputFormat -file './mad.cmd' -file '../fadata/test.fa' -mapper './mad.cmd' -input ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam -output ./outfile

The mad.cmd script contains:

python methratio.py --ref=../fadata/test.fa -r -g --out=bsmap_out_sample1.txt ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam

The error I am getting:

15/01/22 15:52:17 INFO mapreduce.Job: Job job_1418762215449_0033 running in uber mode : false
15/01/22 15:52:17 INFO mapreduce.Job:  map 0% reduce 0%
15/01/22 15:52:23 INFO mapreduce.Job: Task Id : attempt_1418762215449_0033_m_000016_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

15/01/22 15:52:23 INFO mapreduce.Job: Task Id : attempt_1418762215449_0033_m_000011_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
        [stack trace identical to the previous attempt]

Can someone tell me what I am doing wrong here?

python genome hadoop-streaming bam bsmap
9.2 years ago

Isn't it waiting for methratio.py to output something for each of the input alignments (or at least a group of them)? That'll never happen. While you could use a map-reduce framework with that python script, it'd have to be a completely custom one that understands the output. Also, it's likely that you want --out /dev/stdout or something like that.
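
To make the "custom mapper" point concrete, here is a minimal sketch of the shape Hadoop Streaming expects: a program that reads records from stdin and writes a key/value line to stdout for each one. It assumes every alignment arrives as a single SAM-formatted text line, which you would have to verify for BAMInputFormat, and the file name toy_mapper.py is hypothetical:

    #!/usr/bin/env python
    # toy_mapper.py -- hypothetical sketch of a Hadoop Streaming mapper.
    # Assumes each input record reaches stdin as one SAM-formatted text
    # line; verify how your InputFormat actually serializes records.
    import sys

    for line in sys.stdin:
        if line.startswith('@'):                # skip any header lines
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 11:                    # not a full SAM record
            continue
        chrom = fields[2]                       # RNAME (reference name)
        # Emit one key/value pair per alignment; Hadoop Streaming expects
        # steady stdout output from the mapper subprocess.
        sys.stdout.write('%s\t1\n' % chrom)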

Consider the fact that you've been asking about how to wrap this stuff in hadoop for much longer than it would have taken to simply run it on a single cluster node...

Hi Devon,

Can you please tell me what my options are to achieve this? I am trying to execute methratio.py on a 32-node Hadoop cluster. Is it possible to wrap it with Hadoop, or are there alternatives? Please advise me on this.

The simple options are:

  1. Just execute this without Hadoop on a single node, using whatever scheduler your cluster already has. Your cluster is probably already using Hadoop via HDFS for file storage.
  2. Split the BAM file into parts and run each one on a different node (see the sketch below). Again, this does not require any overly complicated wrapping of things with a Hadoop interface to BAM files.
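
If you go with option 2, one conventional route is a per-chromosome split with pysam, so every part keeps the complete header. This is just a sketch assuming the BAM is coordinate-sorted and indexed; input.bam and the part_*.bam names are placeholders:

    #!/usr/bin/env python
    # split_bam.py -- sketch: split an indexed BAM into per-chromosome parts.
    # Assumes input.bam is coordinate-sorted and input.bam.bai exists.
    import pysam

    with pysam.AlignmentFile('input.bam', 'rb') as bam:
        for chrom in bam.references:
            # template=bam copies the full header into every part, so
            # downstream tools (e.g. methratio.py) can still read them.
            part = pysam.AlignmentFile('part_%s.bam' % chrom, 'wb', template=bam)
            for read in bam.fetch(chrom):
                part.write(read)
            part.close()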

BTW, I recall that methratio.py is somewhat limited in what it can do. If this is actually important data, then use something that can deal with methylation bias, like PileOMeth (before you ask, no, we never wrote that to interact with hadoop) or Bis-SNP (it also doesn't use hadoop and it's pretty weak in handling methylation bias).

So there is no way to use Hadoop for this? When you say split the BAM file (option 2), did you mean load it into HDFS?

Is there any other way to do this with map reduce?

Well, there are ways to do this with map-reduce, but it's more trouble than it's worth, since you'll have to write the code to make it happen yourself.

Loading something into HDFS just means "copy it to the file system using HDFS". There's nothing special there. Splitting a BAM file literally means splitting it into multiple files. Of course, in the time taken to do the splitting in (2), method (1) would have mostly completed.
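
For instance, the "copy" really is just a copy; a minimal illustration via the standard hdfs client (the paths here are placeholders) looks like:

    # "Loading into HDFS" is only a copy; paths are placeholders.
    import subprocess
    subprocess.check_call(['hdfs', 'dfs', '-put', 'input.bam', '/user/you/input.bam'])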

Out of curiosity, why do you want to use hadoop for this? Your cluster will be perfectly happy without you doing that and you're unlikely to get the results any faster if you're using a local cluster.

The use of Hadoop is for learning purposes: to see how biology and computer science can relate in this project.

So you are saying to just load the .bam file into HDFS and run methratio.py separately on each machine with some scheduler like Oozie (in the Hadoop ecosystem)? Am I correct?

If I do so, then I also need to reduce the output it produces, am I correct?

If you split the BAM file first then yes, you'll need to reduce the split output to produce a consolidated output. If you run individual samples/files on different (or even the same) node then there's nothing to really reduce, given that the target is per-file output. Granted, you then have to process those output files, but you probably want to have a look at them first before proceeding.
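
As a sketch of that reduce step: assuming methratio.py writes tab-delimited text with a single header line (check your actual output) and the parts follow a hypothetical bsmap_out_part_*.txt naming scheme, consolidating can be as simple as:

    #!/usr/bin/env python
    # merge_methratio.py -- sketch: concatenate per-part methratio outputs.
    # Assumes each part is tab-delimited with one header line on top.
    import glob

    parts = sorted(glob.glob('bsmap_out_part_*.txt'))   # hypothetical names
    with open('bsmap_out_merged.txt', 'w') as merged:
        for i, path in enumerate(parts):
            with open(path) as part:
                header = part.readline()
                if i == 0:                  # keep the header only once
                    merged.write(header)
                for line in part:
                    merged.write(line)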

Is it okay to split the BAM files with HDFS? Will it create any problems because of chromosome numbers? (While splitting, HDFS will create duplicates of data chunks.) Will that cause problems while reducing to a final output after processing?

It depends on how they're split. You can't just arbitrarily chop up a BAM file and have it work.

Thanks, Devon. Can you suggest some ways to split the BAM file? It would be really helpful.

Presumably Hadoop-BAM provides an API for that. Just so you know, anything you do like this with Hadoop is going to involve at least some amount of programming on your part.

So do I need to use the Hadoop-BAM API before splitting (before loading the data into HDFS), or after processing with methratio.py, while reducing (combining) the results? Can you please explain the steps? I'm new to this, and it would really help me proceed down the correct path.

You're going to have to figure this out for yourself; I'm not familiar with the inner workings of Hadoop-BAM.

Thanks for the info, Devon.

Currently we are executing it without Hadoop on a single node, but we are trying to use a Hadoop cluster to save processing time and also for learning purposes.
