[drmaa-wg] DRMAA TEST SUITE

Peter Tröger peter.troeger at hpi.uni-potsdam.de
Tue Mar 21 14:43:26 CST 2006


> Sorry, I do not agree. In the DRMS context, job life cycle comprises all the 
> job execution stages since the job enters the DRM system. In this sense, 
> whenever a job is submitted there should be a termination (either it actually 
> ran or not). I can give you an example, if you submit a job (qsub) and then 
> you kill it (qdel), it is obvious that the job terminated abnormally (it has 
> been killed), although the job never entered the running state.

This is one possible interpretation, I agree. The DRMAA spec is aligned 
with POSIX semantics here: only something that was running (== executed) 
before can be terminated.
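
For readers less familiar with the POSIX side, here is a minimal sketch 
of the wait semantics I am referring to, using plain fork() / waitpid() 
and no DRMAA at all. The helper name run_and_classify() and the end_kind 
codes are made up for this illustration; they are not part of DRMAA or 
the test suite. It only shows that the status macros describe a process 
that actually existed and then ended.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result codes, used only in this sketch. */
enum end_kind { END_ERROR = -1, END_EXITED = 1, END_SIGNALED = 2, END_OTHER = 3 };

/*
 * Fork a child process and report how it ended.
 * If kill_child is non-zero, the child blocks until the parent kills it
 * with SIGKILL; otherwise it exits on its own with exit_code.
 * On return, *detail holds the exit code or the terminating signal.
 */
enum end_kind run_and_classify(int kill_child, int exit_code, int *detail)
{
    int status = 0;
    pid_t pid;

    assert(detail != NULL);

    pid = fork();
    if (pid < 0)
        return END_ERROR;

    if (pid == 0) {                 /* child */
        if (kill_child)
            for (;;)
                pause();            /* wait here until killed */
        _exit(exit_code);
    }

    /* parent */
    if (kill_child)
        kill(pid, SIGKILL);
    if (waitpid(pid, &status, 0) != pid)
        return END_ERROR;

    if (WIFEXITED(status)) {        /* child called _exit() itself */
        *detail = WEXITSTATUS(status);
        return END_EXITED;
    }
    if (WIFSIGNALED(status)) {      /* child was ended by a signal */
        *detail = WTERMSIG(status);
        return END_SIGNALED;
    }
    return END_OTHER;
}
```

In both branches there is a process that ran before it ended. POSIX has 
no counterpart to a job that is removed from the queue before it starts.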

> There is no relation between if the job terminated normally and if there is no 
> further information from the DRM. In the previous example (a job that has 
> been killed) could or could not be more information from the DRMS.  But in any 
> case, it is clear that the job terminated abnormally.
> 
> drmaa_wifexited description should concentrate in one aspect since there is no 
> obvious (or general) relation between job termination and getting further 
> information from DRM.

You are right. The main intention of drmaa_wifexited() is to tell you if 
additional information about the job execution ending is available. The 
final status of the job is provided by drmaa_job_ps(), and nothing else.

The confusion could perhaps be resolved by slightly rewording the first 
sentences of the drmaa_wif...() descriptions so that they avoid the 
word "termination". This would not change the semantics.

I have no good proposal. DRMAA group?

>> ( Note: The testsuite assumes here that unusable input files are
>> detected by the DRM before the job starts. This  seems to be realistic,
>> since file staging operations are usually not part of the job execution.)
>>
> 
> I do not think so. Usually job preparation stages are part of the job 
> execution, for example:
...
> Therefore I suggest removing the ST_INPUT_FILE_FAILURE, ST_ERROR_FILE_FAILURE 
> and  ST_OUTPUT_FILE_FAILURE from the official test suite. In the previous DRMs 
> at least, you can submit a job with output file /etc/passwd or an unusable 
> input file , the job is queued, runs and fails.

During the last phone call, the group went through the code. We agree 
with your impression that the 3 tests are currently not sufficient. The 
descriptions of the "input / output / error stream" job template 
parameters say that an invalid value should result in the job state 
DRMAA_PS_FAILED - and nothing more. There is no description of what that 
means for drmaa_wif...() calls, but the testsuite expects a particular 
behavior. DRMAA section 2.6 clearly shows that DRMAA_PS_FAILED is 
possible for both queued and running jobs.

Our proposal is to remove the call of drmaa_wifaborted() for 
ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE. 
The drmaa_wait() call does not hurt (since all submitted jobs must be 
waitable), but the crucial part is the testing for the result of 
drmaa_synchronize(). After this change, I would expect the test cases 
to succeed on your system as well. In case of unusable input / output / 
error files, the DRMAA implementation would only be expected to report 
a job failure. This should work for all GridWay-supported systems, 
right? Could you accept this proposal?
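
To make the proposal concrete, here is a rough sketch of the reduced 
pass condition. It is not the test suite code. The function name 
file_failure_test_passes() and the SKETCH_* constants are made up for 
this mail; the constants are local stand-ins that mirror the values of 
DRMAA_ERRNO_SUCCESS, DRMAA_PS_DONE and DRMAA_PS_FAILED in drmaa.h. I am 
assuming that the test obtains the job state via drmaa_job_ps() after 
the synchronize call.

```c
#include <stdbool.h>

/* Local stand-ins for the drmaa.h values (assumption of this sketch). */
enum {
    SKETCH_ERRNO_SUCCESS = 0,    /* mirrors DRMAA_ERRNO_SUCCESS */
    SKETCH_PS_DONE       = 0x30, /* mirrors DRMAA_PS_DONE */
    SKETCH_PS_FAILED     = 0x40  /* mirrors DRMAA_PS_FAILED */
};

/*
 * Reduced pass condition for the three file failure tests:
 * the synchronize call must have succeeded and the job must be reported
 * as failed. No drmaa_wifaborted() / drmaa_wifexited() result is
 * consulted, so it does not matter whether the DRM detected the bad
 * file before or during job execution.
 */
bool file_failure_test_passes(int sync_errno, int remote_ps)
{
    if (sync_errno != SKETCH_ERRNO_SUCCESS)
        return false;
    return remote_ps == SKETCH_PS_FAILED;
}
```

With this condition, a system that queues the job, runs it and then 
fails passes the test in the same way as a system that rejects the job 
before it starts.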

BTW: Condor is one example of a system where the existence of input 
files is checked before the job is started. But at least your GRAM 
example convinced me that the opposite is also true ;-) ...

> Sure. The problem is that the code is not clear either. From DRMAA 1.0 C 
> bindings example:
...
> From this code it seems that a signaled job should end with a zero exited 
> value from wifexited (as if it did not terminate normally), as opposed to 
> your comments in the previous mails and the code in the DRMAA test suite.

You are right, as already said above. drmaa_wifexited() mainly indicates 
the availability of additional information.

Regards,
Peter.




