[drmaa-wg] DRMAA TEST SUITE

Peter Troeger peter.troeger at hpi.uni-potsdam.de
Mon Mar 27 15:43:59 CST 2006


>
> The "terminated normally" terminology was borrowed
> from (and attributed to) the POSIX spec for wait3().
> Although I'm not enamored with the terminology,
> I would be opposed to changing the semantics based upon:
>>   [...] "normal" termination might have a completely
>>   different meaning in different DRM's
>
> As I understand the proposed text for drmaa_wifexited,
>> "Evaluates into 'exited' a non-zero value if stat was
>> returned for an ended job that either failed (DRMAA_PS_FAILED)
>> or finished (DRMAA_PS_DONE)."
> if a job went directly from the "Queued" state to the
> "Failed"  state (without entering the "Running" state),
> drmaa_wifexited would output non-zero ?
>
> I'd be opposed to ~that~ !
>
> It occurred to me that the "ended job" terminology
> might have been intended to disallow this situation ...

Arrggh - correct. PS_FAILED could mean both things. What about this
text, does it still reflect the original POSIX idea (and is it good English):

"Evaluates into 'exited' a non-zero value if stat was returned for an
ended job that either failed after running or finished after running
(see section 2.6).
More detailed diagnosis can be provided by means of
drmaa_wifsignaled(), drmaa_wtermsig(), drmaa_wexitstatus(), and
drmaa_wcoredump().
A zero result for the 'exited' parameter indicates either 1.) that,
although it is known that the job was running, more information is
not available, or 2.) that it is not known whether the job was running.
In both cases drmaa_wexitstatus() SHALL NOT provide exit status
information."
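
To make the intended semantics concrete, here is a minimal C sketch of how the
text above could be read. It is only a model and does not call any DRMAA
library: job_history, model_wifexited() and model_exit_status_available() are
hypothetical names that are not part of the DRMAA C binding. Treating a job
that is known never to have run (queued, then failed) as 'exited == 0' is an
assumption taken from the objection quoted above, not something the proposed
text spells out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of what is known about an ended job.
 * Illustrative only; these names do not exist in the DRMAA C binding. */
typedef enum {
    JOB_NEVER_RAN,      /* e.g. queued -> failed, never entered running (assumed case) */
    JOB_RAN_NO_DETAILS, /* known to have run, but no further information (case 1) */
    JOB_RUN_UNKNOWN,    /* not known whether the job was running (case 2) */
    JOB_RAN_DETAILS     /* failed or finished after running, details available */
} job_history;

/* Mirrors the calling shape of drmaa_wifexited(): the flag is written
 * through 'exited'. Returns 0 on success. */
int model_wifexited(int *exited, job_history h)
{
    assert(exited != NULL);
    /* Non-zero only for a job that ended after running and for which
     * further diagnosis is available. */
    *exited = (h == JOB_RAN_DETAILS) ? 1 : 0;
    return 0;
}

/* The SHALL NOT sentence: exit status may only be queried when the
 * 'exited' flag is non-zero. */
int model_exit_status_available(job_history h)
{
    int exited = 0;
    model_wifexited(&exited, h);
    return exited != 0;
}
```

In this reading a caller checks the flag first and only then asks for an exit
status; for all three zero cases the exit status is simply not consulted.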


Regards,
Peter.



>
>
> In a previous e-mail, Peter Troeger wrote:
>> I think neither Ruben nor I liked the statement about "normal
>> termination":
>>
>> --- snip
>>
>> "Evaluates into 'exited', a non-zero value if stat was returned for a
>> job that terminated normally. A zero value can also indicate that
>> although the job has terminated normally an exit status is not
>> available or that it is not known whether the job terminated normally.
>> In both cases drmaa_wexitstatus() SHALL NOT provide exit status
>> information. A non-zero 'exited' value indicates more detailed
>> diagnosis can be provided by means of drmaa_wifsignaled(),
>> drmaa_wtermsig(), drmaa_wexitstatus(), and drmaa_wcoredump()."
>>
>> -- snip
>>
>> We discussed that "normal" termination might have a completely
>> different meaning in different DRMs. Therefore, DRMAA should only rely
>> on its own job state transition concept, instead of using new words
>> such as "termination". A first rough proposal for a different text:
>>
>> -- snip
>>
>> "Evaluates into 'exited' a non-zero value if stat was returned for an
>> ended job that either failed (DRMAA_PS_FAILED) or finished
>> (DRMAA_PS_DONE). More detailed diagnosis can be provided by means of
>> drmaa_wifsignaled(), drmaa_wtermsig(), drmaa_wexitstatus(), and
>> drmaa_wcoredump().
>> A zero result for the 'exited' parameter either indicates that 1.)
>> although the job is known to be ended more information is not
>> available or 2.) that it is not known whether the job ended. In both
>> cases drmaa_wexitstatus() SHALL NOT provide exit status information."
>>
>> -- snip
>>
>> Just a proposal, the other fixes are fine.
>>
>> Peter.
>>
>> Hrabri Rajic schrieb:
>>> Hi Ruben, Peter,
>>>
>>> It might be a good idea for the two of you to check the drmaa_wif*
>>> functions for correctness from your standpoint. Tracker 1125,
>>> https://forge.gridforum.org/tracker/?aid=1125 could explain the
>>> reasons for the many changes those routines went through.
>>>
>>> Attached is the up to date DRMAA spec.
>>>
>>> Thx
>>>
>>> 	Hrabri
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of Ruben Santiago Montero
>>>> Sent: Thursday, March 23, 2006 4:55 AM
>>>> To: Peter Tröger
>>>> Cc: DRMAA Working Group
>>>> Subject: Re: [drmaa-wg] DRMAA TEST SUITE
>>>>
>>>> Hi Peter,
>>>>
>>>> On Tuesday 21 March 2006 21:43, you wrote:
>>>>
>>>>>> Sorry, I do not agree. In the DRMS context, the job life cycle
>>>>>> comprises all the job execution stages from the moment the job
>>>>>> enters the DRM system. In this sense, whenever a job is submitted
>>>>>> there should be a termination (whether it actually ran or not).
>>>>>> For example, if you submit a job (qsub) and then kill it (qdel),
>>>>>> it is obvious that the job terminated abnormally (it has been
>>>>>> killed), although the job never entered the running state.
>>>>>
>>>>> This is one possible interpretation, I agree. The DRMAA spec is
>>>>> aligned to POSIX semantics here - it is only possible to have
>>>>> something terminated which was running (== executed) before.
>>>>
>>>> OK!!
>>>>
>>>>>> There is no relation between whether the job terminated normally
>>>>>> and whether there is further information from the DRM. In the
>>>>>> previous example (a job that has been killed) there may or may not
>>>>>> be more information from the DRMS, but in any case it is clear
>>>>>> that the job terminated abnormally. The drmaa_wifexited
>>>>>> description should concentrate on one aspect, since there is no
>>>>>> obvious (or general) relation between job termination and getting
>>>>>> further information from the DRM.
>>>>>
>>>>> You are right. The main intention of drmaa_wifexited() is to tell
>>>>> you if additional information about the job execution ending is
>>>>> available. The final status of the job is provided by
>>>>> drmaa_job_ps(), and nothing else.
>>>>
>>>> OK, we will fix the drmaa_wifexited() in GridWay DRMAA according to
>>>> this.
>>>>
>>>>
>>>>> The confusion might eventually be solvable by a slight
>>>>> reformulation of the first sentences in the drmaa_wif...()
>>>>> descriptions, in order to avoid the word "termination". This would
>>>>> not lead to a change of semantics.
>>>>> I have no good proposal - DRMAA group ?
>>>>>
>>>>>
>>>>>>> ( Note: The testsuite assumes here that unusable input files are
>>>>>>> detected by the DRM before the job starts. This seems to be
>>>>>>> realistic, since file staging operations are usually not part of
>>>>>>> the job execution.)
>>>>>>
>>>>>> I do not think so. Usually job preparation stages are part of the
>>>>>> job execution, for example:
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>>>> Therefore I suggest removing ST_INPUT_FILE_FAILURE,
>>>>>> ST_OUTPUT_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official
>>>>>> test suite. In the previous DRMs at least, you can submit a job
>>>>>> with output file /etc/passwd or an unusable input file; the job is
>>>>>> queued, runs and fails.
>>>>>
>>>>> During the last phone call, the group went through the code. We
>>>>> agree with your impression that the 3 tests are currently not
>>>>> sufficient. The descriptions for the "input / output / error
>>>>> stream" job template parameters say that an invalid value should
>>>>> result in the job state DRMAA_PS_FAILED - and nothing more. There
>>>>> is no description of what that means for drmaa_wif...() calls, but
>>>>> the testsuite expects a particular behavior. If you look at DRMAA
>>>>> section 2.6, it is clearly shown that DRMAA_PS_FAILED is possible
>>>>> both for queued and running jobs.
>>>>>
>>>>> Our proposal is to remove the call of drmaa_wifaborted() for
>>>>> ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE /
>>>>> ST_OUTPUT_FILE_FAILURE. The drmaa_wait() call does not hurt (since
>>>>> all submitted jobs must be waitable), but the crucial part is the
>>>>> testing for the result of drmaa_synchronize(). After this change, I
>>>>> would expect the test cases to be successful also on your system.
>>>>> In case of malicious input / output / error files, the DRMAA
>>>>> implementation would only be expected to state a job failure. This
>>>>> should work for all GridWay-supported systems, right ?
>>>>> Could you accept this proposal ?
>>>>>
>>>>
>>>> Sure. It makes sense to me as well.
>>>>
>>>> There is also a validator in the state diagram (Section 2.6). I am
>>>> just wondering if a DRMAA implementation could just reject the jobs
>>>> in these tests at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
>>>>
>>>>
>>>>> BTW: Condor is one example of a system where the existence of
>>>>> input files is checked before the job is started. But at least your
>>>>> GRAM example convinced me that the opposite is also true ;-) ...
>>>>>
>>>>>
>>>>>> Sure. The problem is that the code is not clear either. From the
>>>>>> DRMAA 1.0 C bindings example:
>>>>>
>>>>> ...
>>>>>
>>>>>
>>>>>> From this code it seems that a signaled job should end with a zero
>>>>>> exited value from wifexited (as if it did not terminate normally),
>>>>>> as opposed to your comments in the previous mails and the code in
>>>>>> the DRMAA test suite.
>>>>
>>>>> You are right, as already said above. drmaa_wifexited() mainly
>>>>> indicates the availability of additional information.
>>>>
>>>> OK
>>>>
>>>>> Regards,
>>>>> Peter.
>>>>
>>>> Best Regards,
>>>> Rubén
>>>> --
>>>> +-----------------------------------------------------------+
>>>> Dr. Ruben Santiago Montero
>>>> Assistant Professor
>>>> Dpto. Arquitectura de Computadores y Automatica
>>>> Facultad de Informatica
>>>> Universidad Complutense      phone  : +34 91 394 75 38
>>>> 28040 Madrid                 fax    : +34 91 394 75 27
>>>> Spain                        email  : rubensm at dacya.ucm.es
>>>> http://asds.dacya.ucm.es/
>>>> +-----------------------------------------------------------+
>>>>
>>>> GridWay, The Way to Grid! http://www.gridway.org




