[drmaa-wg] DRMAA TEST SUITE

Roger Brobst rogerb at cadence.com
Mon Mar 27 10:46:47 CST 2006



The "terminated normally" terminology was borrowed
from (and attributed to) the POSIX spec for wait3().
Although I'm not enamored with the terminology,
I would be opposed to changing the semantics based upon:
 >   [...] "normal" termination might have a completely
 >   different meaning in different DRM's

As I understand the proposed text for drmaa_wifexited,
 > "Evaluates into 'exited' a non-zero value if stat was
 > returned for a ended job that either failed (DRMAA_PS_FAILED)
 > or finished (DRMAA_PS_DONE). [...]"
if a job went directly from the "Queued" state to the
"Failed" state (without entering the "Running" state),
drmaa_wifexited would output non-zero?

I'd be opposed to ~that~ !

It occurred to me that the "ended job" terminology
might have been intended to disallow this situation ...
but I discarded that thought since "ended job" is not
in the job state transition diagram.

-Roger



In a previous e-mail, Peter Troeger wrote:
> I think Ruben and I both disliked the statement about "normal
> termination":
> 
> --- snip
> 
> "Evaluates into 'exited', a non-zero value if stat was returned for a
> job that terminated normally. A zero value can also indicate that
> although the job has terminated normally an exit status is not available
> or that it is not known whether the job terminated normally. In both
> cases drmaa_wexitstatus() SHALL NOT provide exit status information.
> A non-zero 'exited' value indicates more detailed diagnosis can be
> provided by means of drmaa_wifsignaled(),
> drmaa_wtermsig(), drmaa_wexitstatus(), and drmaa_wcoredump()."
> 
> -- snip
> 
> We discussed that "normal" termination might have a completely different
> meaning in different DRMs. Therefore, DRMAA should only rely on its
> own job state transition concept, instead of using new words such as
> "termination". A first rough proposal for a different text:
> 
> -- snip
> 
> "Evaluates into 'exited' a non-zero value if stat was returned for a
> ended job that either failed (DRMAA_PS_FAILED) or finished
> (DRMAA_PS_DONE). More detailed diagnosis can be provided by means of
> drmaa_wifsignaled(), drmaa_wtermsig(), drmaa_wexitstatus(), and
> drmaa_wcoredump().
> A zero result for the 'exited' parameter either indicates that 1.)
> although the job is known to be ended more information is not available
> or 2.) that it is not known whether the job ended. In both cases
> drmaa_wexitstatus() SHALL NOT provide exit status information."
> 
> -- snip
> 
> Just a proposal, the other fixes are fine.
> 
> Peter.
> 
> Hrabri Rajic schrieb:
> > Hi Ruben, Peter,
> > 
> > It might be a good idea for the two of you to check the drmaa_wif*
> > functions for correctness from your standpoint.  Tracker 1125,
> > https://forge.gridforum.org/tracker/?aid=1125 could explain the reasons for
> > many changes those routines went through.
> > 
> > Attached is the up to date DRMAA spec.
> > 
> > Thx
> > 
> > 	Hrabri
> > 
> > 
> > 
> >>-----Original Message-----
> >>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
> >>Ruben Santiago Montero
> >>Sent: Thursday, March 23, 2006 4:55 AM
> >>To: Peter Tröger
> >>Cc: DRMAA Working Group
> >>Subject: Re: [drmaa-wg] DRMAA TEST SUITE
> >>
> >>Hi Peter,
> >>
> >>On Tuesday 21 March 2006 21:43, you wrote:
> >>
> >>>>Sorry, I do not agree. In the DRMS context, job life cycle comprises all
> >>>>the job execution stages since the job enters the DRM system. In this
> >>>>sense, whenever a job is submitted there should be a termination (either
> >>>>it actually ran or not). I can give you an example, if you submit a job
> >>>>(qsub) and then you kill it (qdel), it is obvious that the job terminated
> >>>>abnormally (it has been killed), although the job never entered the
> >>>>running state.
> >>>
> >>>This is one possible interpretation, I agree. The DRMAA spec is aligned
> >>>to POSIX semantics here - it is only possible to have something
> >>>terminated which was running (== executed) before.
> >>
> >>OK!!
> >>
> >>>>There is no relation between whether the job terminated normally and
> >>>>whether further information is available from the DRM. In the previous
> >>>>example (a job that has been killed) there could or could not be more
> >>>>information from the DRMS. But in any case, it is clear that the job
> >>>>terminated abnormally. The drmaa_wifexited description should
> >>>>concentrate on one aspect since there is no obvious (or general)
> >>>>relation between job termination and getting further information from
> >>>>the DRM.
> >>>
> >>>You are right. The main intention of drmaa_wifexited() is to tell you if
> >>>additional information about the job execution ending is available. The
> >>>final status of the job is provided by drmaa_job_ps(), and nothing else.
> >>
> >>OK, We will fix the drmaa_wifexited() in GridWay DRMAA according to this.
> >>
> >>
> >>>The confusion might eventually be solvable by a slight reformulation of
> >>>the first sentences in the drmaa_wif...() descriptions, in order to
> >>>avoid the word "termination". This would not lead to a change of
> >>>semantics.
> >>
> >>>I have no good proposal - DRMAA group ?
> >>>
> >>>
> >>>>>( Note: The testsuite assumes here that unusable input files are
> >>>>>detected by the DRM before the job starts. This seems to be realistic,
> >>>>>since file staging operations are usually not part of the job
> >>>>>execution.)
> >>>>
> >>>>I do not think so. Usually job preparation stages are part of the job
> >>>>execution, for example:
> >>>
> >>>...
> >>>
> >>>
> >>>>Therefore I suggest removing the ST_INPUT_FILE_FAILURE,
> >>>>ST_OUTPUT_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official test
> >>>>suite. In the previous DRMs at least, you can submit a job with output
> >>>>file /etc/passwd or an unusable input file, the job is queued, runs and
> >>>>fails.
> >>>
> >>>During the last phone call, the group went through the code. We agree with
> >>>your impression that the 3 tests are currently not sufficient. The
> >>>descriptions for "input / output / error stream" job template parameters
> >>>say that an invalid value should result in the job state
> >>>DRMAA_PS_FAILED - and nothing more. There is no description of what that
> >>>means for drmaa_wif...() calls, but the testsuite expects a particular
> >>>behavior. If you look at DRMAA section 2.6, it is clearly shown that
> >>>DRMAA_PS_FAILED is possible both for queued and running jobs.
> >>>
> >>>Our proposal is to remove the call of drmaa_wifaborted() for
> >>>ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE.
> >>>The drmaa_wait() call does not hurt (since all submitted jobs must be
> >>>waitable), but the crucial part is the testing for the result of
> >>>drmaa_synchronize(). After this change, I would expect the test cases to
> >>>be successful also on your system. In case of malicious input / output /
> >>>error files, the DRMAA implementation would only be expected to state a
> >>>job failure. This should work for all GridWay-supported systems, right ?
> >>>Could you accept this proposal ?
> >>>
> >>
> >>Sure. It makes sense to me also.
> >>
> >>There is also a validator in the state diagram (Section 2.6). I am just
> >>wondering if a DRMAA implementation could just reject the jobs in these
> >>tests
> >>at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
> >>
> >>
> >>>BTW: Condor is one example of a system where the existence of input
> >>>files is checked before the job is started. But at least your GRAM
> >>>example convinced me that the opposite is also true ;-) ...
> >>>
> >>>
> >>>>Sure. The problem is that the code is not clear either. From DRMAA 1.0 C
> >>>>bindings example:
> >>>
> >>>...
> >>>
> >>>
> >>>>From this code it seems that a signaled job should end with a zero exited
> >>>>value from wifexited (as if it did not terminate normally), as opposed to
> >>>>your comments in the previous mails and the code in the DRMAA test suite.
> >>
> >>>You are right, as already said above. drmaa_wifexited() mainly indicates
> >>>the availability of additional information.
> >>
> >>OK
> >>
> >>>Regards,
> >>>Peter.
> >>
> >>Best Regards,
> >>Rubén
> >>--
> >>+-----------------------------------------------------------+
> >> Dr. Ruben Santiago Montero
> >> Assistant Professor
> >> Dpto. Arquitectura de Computadores y Automatica
> >> Facultad de Informatica
> >> Universidad Complutense      phone  : +34 91 394 75 38
> >> 28040 Madrid                 fax    : +34 91 394 75 27
> >> Spain                        email  : rubensm at dacya.ucm.es
> >> http://asds.dacya.ucm.es/
> >>+-----------------------------------------------------------+
> >>
> >>GridWay, The Way to Grid! http://www.gridway.org




