Question: Hadoop Streaming for .bam file giving errors
shalini.ravishankar (United States) asked 4.9 years ago:

Hello Everyone,

I am running bsmap's methratio.py (https://code.google.com/p/bsmap/) on Hadoop using Hadoop Streaming, but I get an error when I run the command:

Hadoop command:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar \
  -libjars /usr/local/hadoop-bam/hadoop-bam-7.0.0-jar-with-dependencies.jar \
  -D org.apache.hadoop.mapreduce.lib.input.FileInputFormat=org.seqdoop.hadoop_bam.BAMInputFormat \
  -file './mad.cmd' -file '../fadata/test.fa' \
  -mapper './mad.cmd' \
  -input ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam \
  -output ./outfile

 

The mad.cmd script contains:

python methratio.py --ref=../fadata/test.fa -r -g --out=bsmap_out_sample1.txt ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam

 

The error I get:

 

15/01/22 15:52:17 INFO mapreduce.Job: Job job_1418762215449_0033 running in uber mode : false
15/01/22 15:52:17 INFO mapreduce.Job:  map 0% reduce 0%
15/01/22 15:52:23 INFO mapreduce.Job: Task Id : attempt_1418762215449_0033_m_000016_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

15/01/22 15:52:23 INFO mapreduce.Job: Task Id : attempt_1418762215449_0033_m_000011_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
        (same stack trace as above)

 

Can someone tell me what I am doing wrong here?
Answer by Devon Ryan (Freiburg, Germany), 4.9 years ago:

Isn't it waiting for methratio.py to output something for each of the input alignments (or at least a group of them)? That'll never happen. While you could use a map-reduce framework with that python script, it'd have to be a completely custom one that understands the output. Also, it's likely that you want --out /dev/stdout or something like that.

Consider the fact that you've been asking about how to wrap this stuff in hadoop for much longer than it would have taken to simply run it on a single cluster node...
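The /dev/stdout suggestion can be sketched concretely. A hypothetical rewrite of mad.cmd (assuming methratio.py accepts /dev/stdout as an output path, which ordinary file writes in Python do, and that the -file option ships ../fadata/test.fa into the task's working directory under its basename):

```shell
#!/usr/bin/env bash
# Hypothetical mapper wrapper for Hadoop Streaming: a streaming mapper must
# emit its results on stdout, so point methratio.py's --out at /dev/stdout
# instead of a local file that the framework never collects.
python methratio.py --ref=test.fa -r -g --out=/dev/stdout \
    ./wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam
```

Even with this change, the larger problem Devon describes remains: the framework feeds the mapper alignment records on stdin, which this script ignores.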


Hi Devon,

Can you please tell me what my options are to achieve this? I am trying to execute methratio.py on a 32-node Hadoop cluster. Is it possible to wrap it with Hadoop, or are there alternatives? Please advise me on this.

Reply by shalini.ravishankar

The simple options are:

  1. Just execute this without hadoop on a single node. Use whatever scheduler your cluster uses. Your cluster is probably already using hadoop via hdfs for file storage.
  2. Split the BAM file into parts and run each one on a different node. Again, this does not require any overly complicated wrapping with a Hadoop interface to BAM files.
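Option 2 could be sketched with samtools (an assumption on my part: samtools is installed and the BAM is coordinate-sorted so it can be indexed). Each per-chromosome file samtools produces is a complete, self-contained BAM:

```shell
# Sketch: split a BAM into one file per reference sequence using samtools.
bam=wgEncodeSydhRnaSeqK562Ifna6hPolyaAln.bam
mkdir -p split
samtools index "$bam"                       # region queries need an index
for chrom in $(samtools idxstats "$bam" | cut -f1 | grep -v '^\*'); do
    # samtools view -b writes a full BAM (header included) for the region,
    # so each piece can be fed to methratio.py independently.
    samtools view -b "$bam" "$chrom" > split/"$chrom".bam
done
```

Each split/<chrom>.bam can then be dispatched to a different node by whatever scheduler the cluster already uses.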

BTW, I recall that methratio.py is somewhat limited in what it can do. If this is actually important data, then use something that can deal with methylation bias, like PileOMeth (before you ask, no, we never wrote that to interact with hadoop) or Bis-SNP (it also doesn't use hadoop and it's pretty weak in handling methylation bias).

Reply by Devon Ryan

So there is no way to use Hadoop for this? When you say split the BAM file (option 2), did you mean load it into HDFS?

Is there any other way to do this with MapReduce?

Reply by shalini.ravishankar

Well, there are ways to do this with map reduce, but you're going to be going through more trouble than it's worth since you're going to have to write the code to make it happen.

Loading something into hdfs just means, "copy it to the file system using hdfs". There's nothing special there. Splitting a BAM file literally means splitting it into multiple files. Of course, in the time taken to do the splitting in (2), method (1) would have mostly completed.

Out of curiosity, why do you want to use hadoop for this? Your cluster will be perfectly happy without you doing that and you're unlikely to get the results any faster if you're using a local cluster.

Reply by Devon Ryan

The point of using Hadoop is to learn how biology and computer science can relate in this project.

So you are saying to just load the .bam file into HDFS and run methratio.py separately on each machine with a scheduler like Oozie (in the Hadoop ecosystem)? Am I correct?

If I do so, then I also need to reduce the output it produces, am I correct?

Reply by shalini.ravishankar

If you split the BAM file first then yes, you'll need to reduce the split output to produce a consolidated output. If you run individual samples/files on different (or even the same) node then there's nothing to really reduce, given that the target is per-file output. Granted, you then have to process those output files, but you probably want to have a look at them first before proceeding.

Reply by Devon Ryan
Is it okay to split the BAM files with HDFS? Will it create any problems because of chromosome boundaries? (While splitting, HDFS will create duplicates of data chunks.) Will it create a problem while reducing to a final output after processing?

Reply by shalini.ravishankar

It depends on how they're split. You can't just arbitrarily chop up a BAM file and have it work.

Reply by Devon Ryan
Thanks, Devon. Can you suggest some ways to split the BAM file? It would be really helpful.

Reply by shalini.ravishankar

Presumably Hadoop-BAM provides an API for that. Just so you know, anything you do like this with Hadoop is going to involve at least some amount of programming on your part.

Reply by Devon Ryan
So do I need to use the Hadoop-BAM API before the split (before loading the data into HDFS), or after processing with methratio.py, while reducing (combining) the results? Can you please explain the steps? I'm new to this, and it would really help me proceed on the correct path.

Reply by shalini.ravishankar

You're going to have to figure this out for yourself; I'm not familiar with the inner workings of Hadoop-BAM.

Reply by Devon Ryan
Thanks for the info, Devon.

Reply by shalini.ravishankar

Currently we are executing it without Hadoop on a single node, but we are trying to use a Hadoop cluster to save processing time and also for learning purposes.

Reply by shalini.ravishankar
Powered by Biostar version 2.3.0