[DRMAA-WG] normal exit status causes drmaa_wifaborted

Tim Harsch harsch1 at llnl.gov
Thu Mar 29 12:13:34 CDT 2007


Thanks Rayson, as always, you're a great help!

Well, I've narrowed down the problem.  I was worried that Schedule::DRMAAc
might not be working correctly, but now I'm not so sure...  I think it may
be specific to SGE.  I noticed that page 137 of the User's Guide
( http://192.18.109.11/817-6117/817-6117.pdf ) lists exit code 99 as
having a specific meaning w.r.t. rescheduling.  That got me wondering
whether other exit codes have a specific meaning, or are being interpreted
in some way I don't understand.  So I wrote the two attached scripts;
output is below.  As you can see: exit codes below 100 work as expected,
exit code 100 returns wifaborted, and the larger exit codes I tried (1000
and 10000) come back mangled as 232 and 16.  (NOTE: I was having
difficulty getting my previous method of using /bin/csh -c 'exit 100' to
work as expected, so I switched to a simple Perl wrapper script [ also
attached ].)

I think a valid next step would be to write this script in the Java binding 
and see what happens.

[harsch1 at xber1 DRMAA_JavaTest]$ Test.pl
Test.pl
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 1' to grid with 
Job ID: '85064'
Exited: 1
  Exit value: 1
Aborted: 0
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 60' to grid 
with Job ID: '85065'
Exited: 1
  Exit value: 60
Aborted: 0
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 80' to grid 
with Job ID: '85066'
Exited: 1
  Exit value: 80
Aborted: 0
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 100' to grid 
with Job ID: '85067'
Exited: 0
Aborted: 1
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 1000' to grid 
with Job ID: '85068'
Exited: 1
  Exit value: 232
Aborted: 0
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 10000' to grid 
with Job ID: '85069'
Exited: 1
  Exit value: 16
Aborted: 0
Signaled: 0
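
A note on the last two results: 232 and 16 are exactly what is left when
1000 and 10000 are reduced to their low 8 bits, which is all a POSIX wait
status carries for an exit value.  So those two cases look like ordinary
truncation rather than anything SGE-specific; the exit 100 case is a
separate question.  Below is a minimal sketch of the arithmetic, in Java
to match the binding test in this thread.  The class and method names are
illustrative only and are not part of DRMAA or SGE.

```java
public class ExitStatusDemo {
    // Hypothetical helper: keep only the low 8 bits of the requested exit
    // code, which is the range a POSIX wait status can report.
    static int reportedExitValue(int requested) {
        return requested & 0xFF;
    }

    public static void main(String[] args) {
        // The exit codes from the test run above, minus the special case 100.
        int[] requested = {1, 60, 80, 1000, 10000};
        for (int code : requested) {
            System.out.println("exit " + code + " -> " + reportedExitValue(code));
        }
    }
}
```

Values that already fit in 8 bits (1, 60, 80) pass through unchanged, which
matches the first three jobs in the output.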


Thanks,
Tim Harsch

----- Original Message ----- 
From: "Rayson Ho" <rayrayson at gmail.com>
To: "DRMAA-WG" <drmaa-wg at gridforum.org>
Sent: Wednesday, March 28, 2007 6:28 PM
Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted


> It uses JNI (Java Native Interface):
>
> http://blogs.sun.com/templedf/entry/porting_the_drmaa_java_language
>
> Rayson
>
>
>
> On 3/28/07, Tim Harsch <harsch1 at llnl.gov> wrote:
>> Daniel,
>>    By what method does the Java binding bind to the C binding?  (E.g. the
>> Perl binding uses SWIG...)
>>
>> I'm diving into the Perl binding now, but it's been about 4 years since I
>> wrote it... so it's gonna take me some time, I think.
>>
>> PS It's really odd: the problem showed up in code I've had in regular use
>> for a long time.  I know bugs don't just introduce themselves, but this
>> part of the binding worked fine, and I haven't upgraded SGE or Perl or
>> recompiled Schedule::DRMAAc, yet the problem just appeared.  I'm thinking
>> the sysadmins ran up2date on my RH4 box and a library that the C binding
>> depends on changed.  But if your Java binding is actively using it, then
>> that would rule that out...
>>
>>
>> ----- Original Message -----
>> From: "Daniel Templeton" <Dan.Templeton at Sun.COM>
>> To: "Tim Harsch" <harsch1 at llnl.gov>
>> Cc: "DRMAA-WG" <drmaa-wg at gridforum.org>
>> Sent: Wednesday, March 28, 2007 10:48 AM
>> Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted
>>
>>
>> > Tim,
>> >
>> > Looks like something localized to the Perl binding or your
>> > configuration.  I did the same test on the Java language binding, which
>> > is also based on the C binding, and it worked fine for me.  Output
>> > below, program attached.
>> >
>> > Could the problem be that you're sending the full command line as the
>> > remote command and "1" as the args, instead of "csh" as the remote
>> > command and "-c", "'exit 1'" as the args?  What is the meaning of
>> > setting the args to "1"?
>> >
>> > ---
>> >
>> > % java -cp /sge/lib/drmaa.jar:. -d64 Test
>> > Exited: true
>> > Aborted: false
>> > Signaled: false
>> >
>> > ---
>> >
>> > Daniel
>> >
>> > Tim Harsch wrote:
>> >> I don't understand why causing a simple non-zero exit status is
>> >> causing drmaa_wifaborted to be set.
>> >>
>> >> The easiest way for me to demo this is to change line 38 of
>> >> t/08_posix_tests.t of the Schedule::DRMAAc CPAN module to be
>> >> my $remote_cmd = "csh -c 'exit 1'";
>> >>
>> >> And then running "make test TEST_VERBOSE=1", which would produce:
>> >> <SNIP>
>> >> ok 12 - drmaa_wait says jobid did not change?
>> >> #     Failed test (t/08_posix_tests.t at line 83)
>> >> not ok 13 - drmaa_wait should say there is more info available in
>> >> POSIX funcs
>> >> ok 15 - drmaa_wifaborted error?
>> >> #     Failed test (t/08_posix_tests.t at line 90)
>> >> not ok 16 - normal job should not abort.
>> >> ok 17 - drmaa_wifexited returned 3 of 3 args
>> >> ok 18 - drmaa_wifexited error?
>> >> #     Failed test (t/08_posix_tests.t at line 97)
>> >> not ok 19 - normal job should exit.
>> >> <SNIP>
>> >>
>> >> I've attached test 8 to this email, in case you want to see how the
>> >> calls are made in Perl.
>> >>
>> >> Any ideas?
>> >>
>> >> Thanks,
>> >> Tim Harsch
>> >> ------------------------------------------------------------------------
>> >>
>> >> --
>> >>   drmaa-wg mailing list
>> >>   drmaa-wg at ogf.org
>> >>   http://www.ogf.org/mailman/listinfo/drmaa-wg
>> >>
>> >
>> >
>>
>>
>> --------------------------------------------------------------------------------
>>
>>
>> > import org.ggf.drmaa.*;
>> >
>> > public class Test {
>> >     public static void main(String[] args) throws Exception {
>> >         // Open a DRMAA session (empty contact string).
>> >         Session s = SessionFactory.getFactory().getSession();
>> >         s.init("");
>> >
>> >         // Submit csh as the remote command, with "-c" and "'exit 1'" as args.
>> >         JobTemplate jt = s.createJobTemplate();
>> >         jt.setRemoteCommand("/usr/bin/csh");
>> >         jt.setArgs(new String[] {"-c", "'exit 1'"});
>> >         String job = s.runJob(jt);
>> >
>> >         // Block until the job finishes, then report how it ended.
>> >         JobInfo ji = s.wait(job, Session.TIMEOUT_WAIT_FOREVER);
>> >         System.out.println("Exited: " + ji.hasExited());
>> >         System.out.println("Aborted: " + ji.wasAborted());
>> >         System.out.println("Signaled: " + ji.hasSignaled());
>> >
>> >         // Clean up the template and close the session.
>> >         s.deleteJobTemplate(jt);
>> >         s.exit();
>> >     }
>> > }
>> >
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Test.pl
Type: application/octet-stream
Size: 2051 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/drmaa-wg/attachments/20070329/1777f344/attachment.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: exit_script.pl
Type: application/octet-stream
Size: 32 bytes
Desc: not available
Url : http://www.ogf.org/pipermail/drmaa-wg/attachments/20070329/1777f344/attachment-0001.obj 

