[DRMAA-WG] normal exit status causes drmaa_wifaborted

Rayson Ho rayrayson at gmail.com
Thu Mar 29 12:32:23 CDT 2007


In sge_shepherd(8):

  100   Job script, prolog and epilog: When FORBID_APPERROR is not
        set in the configuration (see sge_conf(5)), the job gets
        requeued. Otherwise see "Other".


On the other hand, on Unix (including Linux), only the low 8 bits of the
exit value are reported to the parent, so valid exit codes are 0-255
(exit code 1000 is too large and shows up as 1000 mod 256 = 232):

http://tldp.org/LDP/abs/html/exitcodes.html
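
As a rough illustration of that limit (a sketch only, not SGE or DRMAA code; the class and method names are made up for this example), the truncation can be reproduced by masking the requested code to its low 8 bits. Run against the codes from Tim's test, it gives the same 232 and 16 he saw for 1000 and 10000:

```java
public class ExitStatusDemo {
    /**
     * Hypothetical helper: the status a parent process sees after a child
     * calls exit(code) on Unix, where only the low 8 bits are kept.
     */
    public static int reportedExitStatus(int code) {
        return code & 0xFF;
    }

    public static void main(String[] args) {
        // Exit codes used in the test run quoted below.
        int[] requested = {1, 60, 80, 100, 1000, 10000};
        for (int code : requested) {
            System.out.println("exit " + code + " -> reported "
                    + reportedExitStatus(code));
        }
    }
}
```

Note that 100 passes through the mask unchanged, so the aborted state for exit code 100 is not explained by truncation; it fits the special handling of 100 described in sge_shepherd(8) above.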

Rayson






On 3/29/07, Tim Harsch <harsch1 at llnl.gov> wrote:
> Thanks Rayson, as always, you're a great help!
>
> Well, I've narrowed down the problem.  I was worried that Schedule::DRMAAc
> may not be working correctly, but now I'm not so sure...  I think it may be
> specific to SGE.  I noticed that on page 137 of the User's guide (
> http://192.18.109.11/817-6117/817-6117.pdf ), it lists exit code 99 as
> having specific meaning w.r.t. rescheduling.  It got me wondering if other
> exit codes have specific meaning, or are getting interpreted in some way I
> don't understand.  So I wrote the two attached scripts, output below.  As
> you can see: exit codes below 100 work as expected, exit code 100 returns
> wifaborted, and exit codes above 100 get mangled.  (NOTE: I was having
> difficulty getting my previous method of using /bin/csh -c 'exit 100' to
> work as expected and so switched to a simple perl wrapper script [ also
> attached ] )
>
> I think a valid next step would be to write this script in the Java binding
> and see what happens.
>
> [harsch1 at xber1 DRMAA_JavaTest]$ Test.pl
> Test.pl
> Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 1' to grid with Job ID: '85064'
> Exited: 1
>  Exit value: 1
> Aborted: 0
> Signaled: 0
> Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 60' to grid with Job ID: '85065'
> Exited: 1
>  Exit value: 60
> Aborted: 0
> Signaled: 0
> Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 80' to grid with Job ID: '85066'
> Exited: 1
>  Exit value: 80
> Aborted: 0
> Signaled: 0
> Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 100' to grid with Job ID: '85067'
> Exited: 0
> Aborted: 1
> Signaled: 0
> Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 1000' to grid with Job ID: '85068'
> Exited: 1
>  Exit value: 232
> Aborted: 0
> Signaled: 0
> Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 10000' to grid with Job ID: '85069'
> Exited: 1
>  Exit value: 16
> Aborted: 0
> Signaled: 0
>
>
> Thanks,
> Tim Harsch
>
> ----- Original Message -----
> From: "Rayson Ho" <rayrayson at gmail.com>
> To: "DRMAA-WG" <drmaa-wg at gridforum.org>
> Sent: Wednesday, March 28, 2007 6:28 PM
> Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted
>
>
> > It uses JNI (Java Native Interface):
> >
> > http://blogs.sun.com/templedf/entry/porting_the_drmaa_java_language
> >
> > Rayson
> >
> >
> >
> > On 3/28/07, Tim Harsch <harsch1 at llnl.gov> wrote:
> >> Daniel,
> >>    By what method does the Java binding bind to the C binding? (E.g. the
> >> Perl binding uses SWIG...)
> >>
> >> I'm diving into the Perl binding now, but it's been about 4 years since I
> >> wrote it... so it's gonna take me some time, I think.
> >>
> >> PS: It's really odd. The problem showed up in code I've had in regular use
> >> for a long time. I know bugs don't just introduce themselves, but this part
> >> of the binding worked fine, and I haven't upgraded SGE or Perl or recompiled
> >> Schedule::DRMAAc, yet the problem just appeared. I'm thinking the sysadmins
> >> ran up2date on my RH4 box and a dependency library of the C binding changed.
> >> But if your Java binding is actively using it, then that would rule that
> >> out...
> >>
> >>
> >> ----- Original Message -----
> >> From: "Daniel Templeton" <Dan.Templeton at Sun.COM>
> >> To: "Tim Harsch" <harsch1 at llnl.gov>
> >> Cc: "DRMAA-WG" <drmaa-wg at gridforum.org>
> >> Sent: Wednesday, March 28, 2007 10:48 AM
> >> Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted
> >>
> >>
> >> > Tim,
> >> >
> >> > Looks like something localized to the Perl binding or your
> >> > configuration.  I did the same test on the Java language binding, which
> >> > is also based on the C binding, and it worked fine for me.  Output
> >> > below, program attached.
> >> >
> >> > Could the problem be that you're sending the full command line as the
> >> > remote command and "1" as the args, instead of "csh" as the remote
> >> > command and "-c", "'exit 1'" as the args?  What is the meaning of
> >> > setting the args to "1"?
> >> >
> >> > ---
> >> >
> >> > % java -cp /sge/lib/drmaa.jar:. -d64 Test
> >> > Exited: true
> >> > Aborted: false
> >> > Signaled: false
> >> >
> >> > ---
> >> >
> >> > Daniel
> >> >
> >> > Tim Harsch wrote:
> >> >> I don't understand why causing a simple non-zero exit status is
> >> >> causing drmaa_wifaborted to be set.
> >> >>
> >> >> The easiest way for me to demo this is to change line 38 of
> >> >> t/08_posix_tests.t of the Schedule::DRMAAc CPAN module to be
> >> >> my $remote_cmd = "csh -c 'exit 1'";
> >> >>
> >> >> And then running "make test TEST_VERBOSE=1", which would produce:
> >> >> <SNIP>
> >> >> ok 12 - drmaa_wait says jobid did not change?
> >> >> #     Failed test (t/08_posix_tests.t at line 83)
> >> >> not ok 13 - drmaa_wait should say there is more info available in POSIX funcs
> >> >> ok 15 - drmaa_wifaborted error?
> >> >> #     Failed test (t/08_posix_tests.t at line 90)
> >> >> not ok 16 - normal job should not abort.
> >> >> ok 17 - drmaa_wifexited returned 3 of 3 args
> >> >> ok 18 - drmaa_wifexited error?
> >> >> #     Failed test (t/08_posix_tests.t at line 97)
> >> >> not ok 19 - normal job should exit.
> >> >> <SNIP>
> >> >>
> >> >> I've attached test 8 to this email, in case you want to see how the
> >> >> calls are made in Perl.
> >> >>
> >> >> Any ideas?
> >> >>
> >> >> Thanks,
> >> >> Tim Harsch
> >> >> ------------------------------------------------------------------------
> >> >>
> >> >> --
> >> >>   drmaa-wg mailing list
> >> >>   drmaa-wg at ogf.org
> >> >>   http://www.ogf.org/mailman/listinfo/drmaa-wg
> >> >>
> >> >
> >> >
> >>
> >>
> >> --------------------------------------------------------------------------------
> >>
> >>
> >> > import org.ggf.drmaa.*;
> >> >
> >> > public class Test {
> >> >     public static void main(String[] args) throws Exception {
> >> >         Session s = SessionFactory.getFactory().getSession();
> >> >         s.init("");
> >> >         JobTemplate jt = s.createJobTemplate();
> >> >         jt.setRemoteCommand("/usr/bin/csh");
> >> >         jt.setArgs(new String[] {"-c", "'exit 1'"});
> >> >         String job = s.runJob(jt);
> >> >         JobInfo ji = s.wait(job, Session.TIMEOUT_WAIT_FOREVER);
> >> >         System.out.println("Exited: " + ji.hasExited());
> >> >         System.out.println("Aborted: " + ji.wasAborted());
> >> >         System.out.println("Signaled: " + ji.hasSignaled());
> >> >         s.deleteJobTemplate(jt);
> >> >         s.exit();
> >> >     }
> >> > }
> >> >

