[drmaa-wg] DRMAA TEST SUITE

Daniel Templeton Dan.Templeton at Sun.COM
Mon Mar 27 08:39:19 CST 2006


Peter,

Works for me.

Daniel

Peter Troeger wrote On 03/27/06 14:17,:

>I think neither Ruben nor I liked the statement about "normal
>termination":
>
>--- snip
>
>"Evaluates into 'exited', a non-zero value if stat was returned for a
>job that terminated normally. A zero value can also indicate that
>although the job has terminated normally an exit status is not available
>or that it is not known whether the job terminated normally. In both
>cases drmaa_wexitstatus() SHALL NOT provide exit status information.
>A non-zero 'exited' value indicates more detailed diagnosis can be
>provided by means of drmaa_wifsignaled(),
>drmaa_wtermsig(),drmaa_wexitstatus(), and drmaa_wcoredump()."
>
>-- snip
>
>We discussed that "normal" termination might have a completely different
>meaning in different DRMs. Therefore, DRMAA should only rely on its
>own job state transition concept, instead of using new words such as
>"termination". A first rough proposal for a different text:
>
>-- snip
>
>"Evaluates into 'exited' a non-zero value if stat was returned for an
>ended job that either failed (DRMAA_PS_FAILED) or finished
>(DRMAA_PS_DONE). More detailed diagnosis can be provided by means of
>drmaa_wifsignaled(), drmaa_wtermsig(), drmaa_wexitstatus(), and
>drmaa_wcoredump().
>A zero result for the 'exited' parameter either indicates that 1.)
>although the job is known to be ended more information is not available
>or 2.) that it is not known whether the job ended. In both cases
>drmaa_wexitstatus() SHALL NOT provide exit status information."
>
>-- snip
>
>Just a proposal, the other fixes are fine.
>
>Peter.
>
>Hrabri Rajic schrieb:
>  
>
>>Hi Ruben, Peter,
>>
>>It might be a good idea for the two of you to check the drmaa_wif* functions
>>for correctness from your standpoint.  Tracker 1125,
>>https://forge.gridforum.org/tracker/?aid=1125 could explain the reasons for
>>the many changes those routines went through.
>>
>>Attached is the up to date DRMAA spec.
>>
>>Thx
>>
>>	Hrabri
>>
>>
>>
>>    
>>
>>>-----Original Message-----
>>>From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of
>>>Ruben Santiago Montero
>>>Sent: Thursday, March 23, 2006 4:55 AM
>>>To: Peter Tröger
>>>Cc: DRMAA Working Group
>>>Subject: Re: [drmaa-wg] DRMAA TEST SUITE
>>>
>>>Hi Peter,
>>>
>>>On Tuesday 21 March 2006 21:43, you wrote:
>>>
>>>      
>>>
>>>>>Sorry, I do not agree. In the DRMS context, job life cycle comprises all
>>>>>the job execution stages since the job enters the DRM system. In this
>>>>>sense, whenever a job is submitted there should be a termination (either
>>>>>it actually ran or not). I can give you an example, if you submit a job
>>>>>(qsub) and then you kill it (qdel), it is obvious that the job terminated
>>>>>abnormally (it has been killed), although the job never entered the
>>>>>running state.
>>>>>
>>>>This is one possible interpretation, I agree. The DRMAA spec is aligned
>>>>to POSIX semantics here - it is only possible to have something
>>>>terminated which was running (== executed) before.
>>>>        
>>>>
>>>OK!!
>>>
>>>      
>>>
>>>>>There is no relation between if the job terminated normally and if there
>>>>>is no further information from the DRM. In the previous example (a job
>>>>>that has been killed) could or could not be more information from the
>>>>>DRMS.  But in any case, it is clear that the job terminated abnormally.
>>>>>drmaa_wifexited description should concentrate in one aspect since there
>>>>>is no obvious (or general) relation between job termination and getting
>>>>>further information from DRM.
>>>>>
>>>>You are right. The main intention of drmaa_wifexited() is to tell you if
>>>>additional information about the job execution ending is available. The
>>>>final status of the job is provided by drmaa_job_ps(), and nothing else.
>>>>        
>>>>
>>>OK, We will fix the drmaa_wifexited() in GridWay DRMAA according to this.
>>>
>>>
>>>      
>>>
>>>>The confusion might eventually be solvable by a slight reformulation of
>>>>the first sentences in the drmaa_wif...() descriptions, in order to
>>>>avoid the word "termination". This would not lead to a change of semantics.
>>>
>>>      
>>>
>>>>I have no good proposal - DRMAA group ?
>>>>
>>>>
>>>>        
>>>>
>>>>>>( Note: The testsuite assumes here that unusable input files are
>>>>>>detected by the DRM before the job starts. This seems to be realistic,
>>>>>>since file staging operations are usually not part of the job
>>>>>>execution.)
>>>>>>
>>>>>I do not think so. Usually job preparation stages are part of the job
>>>>>execution, for example:
>>>>>          
>>>>>
>>>>...
>>>>
>>>>
>>>>        
>>>>
>>>>>Therefore I suggest removing the ST_INPUT_FILE_FAILURE,
>>>>>ST_OUTPUT_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official test
>>>>>suite. In the previous DRMs at least, you can submit a job with output
>>>>>file /etc/passwd or an unusable input file, the job is queued, runs and
>>>>>fails.
>>>>>
>>>>During the last phone call, the group went through the code. We agree with
>>>>your impression that the 3 tests are currently not sufficient. The
>>>>descriptions for "input / output / error stream" job template parameters
>>>>say that an invalid value should result in the job state
>>>>DRMAA_PS_FAILED - and nothing more. There is no description of what that
>>>>means for drmaa_wif...() calls, but the testsuite expects a particular
>>>>behavior. If you look at DRMAA section 2.6, it is clearly shown that
>>>>DRMAA_PS_FAILED is possible both for queued and running jobs.
>>>>
>>>>Our proposal is to remove the call of drmaa_wifaborted() for
>>>>ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE.
>>>>The drmaa_wait() call does not hurt (since all submitted jobs must be
>>>>waitable), but the crucial part is the testing for the result of
>>>>drmaa_synchronize(). After this change, I would expect the test cases to
>>>>be successful also on your system. In case of malicious input / output /
>>>>error files, the DRMAA implementation would only be expected to state a
>>>>job failure. This should work for all GridWay-supported systems, right ?
>>>>Could you accept this proposal ?
>>>>
>>>>        
>>>>
>>>Sure. It makes sense to me also.
>>>
>>>There is also a validator in the state diagram (Section 2.6). I am just
>>>wondering if a DRMAA implementation could just reject the jobs in these
>>>tests at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
>>>
>>>
>>>      
>>>
>>>>BTW: Condor is one example for a system where the existence of input
>>>>files is checked before the job is started. But at least your GRAM
>>>>example convinced me that the opposite is also true ;-) ...
>>>>
>>>>
>>>>        
>>>>
>>>>>Sure. The problem is that the code is not clear either. From DRMAA 1.0 C
>>>>>bindings example:
>>>>>
>>>>...
>>>>
>>>>
>>>>>From this code it seems that a signaled job should end with a zero exited
>>>>>value from wifexited (as if it did not terminate normally), as opposed to
>>>>>your comments in the previous mails and the code in the DRMAA test suite.
>>>>>
>>>>You are right, as already said above. drmaa_wifexited() mainly indicates
>>>>the availability of additional information.
>>>>        
>>>>
>>>OK
>>>
>>>      
>>>
>>>>Regards,
>>>>Peter.
>>>>        
>>>>
>>>Best Regards,
>>>Rubén
>>>--
>>>+-----------------------------------------------------------+
>>>Dr. Ruben Santiago Montero
>>>Assistant Professor
>>>Dpto. Arquitectura de Computadores y Automatica
>>>Facultad de Informatica
>>>Universidad Complutense      phone  : +34 91 394 75 38
>>>28040 Madrid                 fax    : +34 91 394 75 27
>>>Spain                        email  : rubensm at dacya.ucm.es
>>>http://asds.dacya.ucm.es/
>>>+-----------------------------------------------------------+
>>>
>>>GridWay, The Way to Grid! http://www.gridway.org
>>>      
>>>
>>    
>>
>
>  
>

-- 
******************************************************
*         Daniel Templeton   UMPK18 x83749           *
*        Staff Engineer, Sun N1 Grid Engine          *
******************************************************
* "What's the sense in never thinkin' 'bout the tomb *
*  When you're much too busy returning to the womb?" *
*                  -They Might Be Giants             *
******************************************************

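For readers of the archive, the proposed drmaa_wifexited() wording quoted
above can be sketched as a small self-contained C model. This is only one
possible reading of the proposal, not the DRMAA C binding: every model_*
name is hypothetical, and only the state names DRMAA_PS_DONE and
DRMAA_PS_FAILED come from the thread. The model also simplifies by treating
"more information is available" as a single flag.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for job states; not the real DRMAA constants. */
typedef enum {
    MODEL_PS_UNDETERMINED, /* it is not known whether the job ended */
    MODEL_PS_DONE,         /* stands in for DRMAA_PS_DONE */
    MODEL_PS_FAILED        /* stands in for DRMAA_PS_FAILED */
} model_job_state;

/* Hypothetical summary of what the DRM reported for one job. */
typedef struct {
    model_job_state state;
    bool details_available; /* assumption: DRM reported more than the state */
    int exit_status;        /* meaningful only if details_available is true */
} model_job_result;

/* 'exited' under the proposed wording: non-zero only for an ended job
 * (done or failed) for which more detailed information is available. */
int model_wifexited(const model_job_result *r)
{
    bool ended = (r->state == MODEL_PS_DONE || r->state == MODEL_PS_FAILED);
    return (ended && r->details_available) ? 1 : 0;
}

/* Stores the exit status and returns 0 only when 'exited' is non-zero.
 * Otherwise returns -1 and leaves *exit_status untouched, mirroring
 * "SHALL NOT provide exit status information". */
int model_wexitstatus(const model_job_result *r, int *exit_status)
{
    if (!model_wifexited(r)) {
        return -1;
    }
    *exit_status = r->exit_status;
    return 0;
}
```

In this reading the final job state still comes from the job state query
alone; 'exited' only says whether asking for further details is worthwhile.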

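The test suite change agreed on in the quoted discussion (drop the
drmaa_wifaborted() check for the three file failure tests and rely on the
reported job failure) might look roughly like the following self-contained
C sketch. All sketch_* names are hypothetical, and the "old" check is an
assumed reconstruction of what the suite expected, not code taken from it.

```c
#include <stdbool.h>

/* Hypothetical final states; SKETCH_PS_FAILED stands in for DRMAA_PS_FAILED. */
typedef enum { SKETCH_PS_DONE, SKETCH_PS_FAILED } sketch_final_state;

/* What a test run observed for one job with an unusable input, output or
 * error file. All fields are modelling assumptions. */
typedef struct {
    bool synchronize_ok;      /* stands in for drmaa_synchronize() succeeding */
    sketch_final_state state; /* stands in for the state from drmaa_job_ps() */
    bool aborted;             /* stands in for drmaa_wifaborted(): the job
                                 ended before entering the running state */
} sketch_observation;

/* Assumed old behavior: the job must fail AND must never have run. */
bool sketch_old_check_passes(const sketch_observation *o)
{
    return o->synchronize_ok && o->state == SKETCH_PS_FAILED && o->aborted;
}

/* Proposed behavior: only a reported job failure is required. */
bool sketch_proposed_check_passes(const sketch_observation *o)
{
    return o->synchronize_ok && o->state == SKETCH_PS_FAILED;
}
```

With the proposed check, a DRM that detects the bad file before the job
starts and a DRM that queues the job, runs it and then fails it both pass,
which is the outcome the thread was aiming for.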