[drmaa-wg] DRMAA TEST SUITE

Ruben Santiago Montero rubensm at dacya.ucm.es
Thu Mar 23 04:54:56 CST 2006


Hi Peter,

On Tuesday 21 March 2006 21:43, you wrote:
> > Sorry, I do not agree. In the DRMS context, the job life cycle comprises
> > all the job execution stages from the moment the job enters the DRM
> > system. In this sense, whenever a job is submitted there should be a
> > termination (whether it actually ran or not). For example, if you submit
> > a job (qsub) and then kill it (qdel), it is obvious that the job
> > terminated abnormally (it has been killed), although it never entered
> > the running state.
>
> This is one possible interpretation, I agree. The DRMAA spec is aligned
> to POSIX semantics here - it is only possible to have something
> terminated which was running (== executed) before.

OK!!
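
To make the POSIX analogy concrete, here is a minimal sketch of how a POSIX
wait status distinguishes a normal exit from a kill. It uses Python's os
module, which exposes the same WIFEXITED / WIFSIGNALED predicates as C. The
helper names (classify_wait_status, run_child) and the exit code 3 are my own
assumptions for illustration; nothing here is DRMAA or GridWay code.

```python
import os
import signal
import time


def classify_wait_status(status):
    """Map a raw POSIX wait status to a (kind, detail) pair."""
    if os.WIFEXITED(status):
        # The process ran and called exit(): an exit code is available.
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        # The process was terminated by a signal: no exit code exists.
        return ("signaled", os.WTERMSIG(status))
    return ("unknown", None)


def run_child(kill_child):
    """Fork a child, optionally kill it, and classify how it ended.

    Hypothetical helper for this sketch only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: either wait to be killed or exit right away with code 3.
        try:
            if kill_child:
                time.sleep(30)
        finally:
            os._exit(3)
    if kill_child:
        os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    return classify_wait_status(status)


if __name__ == "__main__":
    print(run_child(kill_child=False))
    print(run_child(kill_child=True))
```

On a POSIX system, run_child(False) reports a normal exit with code 3, while
run_child(True) reports termination by SIGKILL and no exit code. Note that in
both cases a process actually existed, which is the point about POSIX
semantics made above.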
>
> > There is no relation between whether the job terminated normally and
> > whether further information is available from the DRM. In the previous
> > example (a job that has been killed) there may or may not be more
> > information from the DRMS. But in any case, it is clear that the job
> > terminated abnormally.
> >
> > The drmaa_wifexited description should concentrate on one aspect, since
> > there is no obvious (or general) relation between job termination and
> > getting further information from the DRM.
>
> You are right. The main intention of drmaa_wifexited() is to tell you if
> additional information about the job execution ending is available. The
> final status of the job is provided by drmaa_job_ps(), and nothing else.

OK, we will fix drmaa_wifexited() in GridWay DRMAA accordingly.

>
> The confusion could perhaps be resolved by slightly rewording the first
> sentences of the drmaa_wif...() descriptions, in order to avoid the word
> "termination". This would not lead to a change of semantics.
>
> I have no good proposal - DRMAA group?
>
> >> ( Note: The testsuite assumes here that unusable input files are
> >> detected by the DRM before the job starts. This seems to be realistic,
> >> since file staging operations are usually not part of the job
> >> execution.)
> >
> > I do not think so. Usually job preparation stages are part of the job
> > execution, for example:
>
> ...
>
> > Therefore I suggest removing ST_INPUT_FILE_FAILURE,
> > ST_OUTPUT_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official test
> > suite. In the previous DRMs at least, you can submit a job with output
> > file /etc/passwd or an unusable input file; the job is queued, runs and
> > fails.
>
> During the last phone call, the group went through the code. We agree with
> your impression that the 3 tests are currently not sufficient. The
> descriptions of the "input / output / error stream" job template parameters
> say that an invalid value should result in the job state
> DRMAA_PS_FAILED - and nothing more. There is no description of what that
> means for drmaa_wif...() calls, but the testsuite expects a particular
> behavior. If you look at DRMAA section 2.6, it is clearly shown that
> DRMAA_PS_FAILED is possible both for queued and running jobs.
>
> Our proposal is to remove the call of drmaa_wifaborted() for
> ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE.
> The drmaa_wait() call does not hurt (since all submitted jobs must be
> waitable), but the crucial part is the testing for the result of
> drmaa_synchronize(). After this change, I would expect the test cases to
> succeed on your system as well. In case of malicious input / output /
> error files, the DRMAA implementation would only be expected to state a
> job failure. This should work for all GridWay-supported systems, right?
> Could you accept this proposal?
>
Sure. It makes sense to me as well.

The state diagram (Section 2.6) also contains a validator. I am just
wondering whether a DRMAA implementation could simply reject the jobs in
these tests at submission with DRMAA_ERRNO_DENIED_BY_DRM.
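
As a rough sketch of what the relaxed check for the three file-failure tests
could look like: the test only looks at the final job state and no longer
consults drmaa_wifaborted(). The enum classes and the function
file_failure_case_passes are hypothetical stand-ins written for this mail, not
test suite or DRMAA library code, and the allow_submit_rejection switch models
my open question above rather than anything the group has agreed on.

```python
from enum import Enum


class Errno(Enum):
    """Stand-ins for the DRMAA error codes discussed here (assumption)."""
    SUCCESS = "DRMAA_ERRNO_SUCCESS"
    DENIED_BY_DRM = "DRMAA_ERRNO_DENIED_BY_DRM"


class JobState(Enum):
    """Stand-ins for the final job states from drmaa_job_ps() (assumption)."""
    DONE = "DRMAA_PS_DONE"
    FAILED = "DRMAA_PS_FAILED"


def file_failure_case_passes(submit_errno, final_state=None,
                             allow_submit_rejection=False):
    """Decide whether a file-failure test case passes under the proposal.

    submit_errno: result of the job submission.
    final_state:  job state after synchronize/wait, if the job was accepted.
    allow_submit_rejection: whether a rejection at submission also counts
        as a pass (open question, not decided by the group).
    """
    if submit_errno is Errno.DENIED_BY_DRM:
        # Job rejected by the validator before it ever entered the queue.
        return allow_submit_rejection
    if submit_errno is not Errno.SUCCESS:
        return False
    # Accepted job: only the final state matters, whether the failure
    # happened while queued or while running.
    return final_state is JobState.FAILED


if __name__ == "__main__":
    print(file_failure_case_passes(Errno.SUCCESS, JobState.FAILED))
    print(file_failure_case_passes(Errno.DENIED_BY_DRM,
                                   allow_submit_rejection=True))
```

With this shape, a DRM that queues, runs and then fails the job passes the
same check as one that detects the bad file before the job starts.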

> BTW: Condor is one example of a system where the existence of input
> files is checked before the job is started. But at least your GRAM
> example convinced me that the opposite is also true ;-) ...
>
> > Sure. The problem is that the code is not clear either. From DRMAA 1.0 C
> > bindings example:
>
> ...
>
> > From this code it seems that a signaled job should yield a zero "exited"
> > value from drmaa_wifexited (as if it did not terminate normally), as
> > opposed to
> > your comments in the previous mails and the code in the DRMAA test suite.
>
> You are right, as already said above. drmaa_wifexited() mainly indicates
> the availability of additional information.

OK
>
> Regards,
> Peter.

Best Regards,
Rubén
-- 
+-----------------------------------------------------------+
 Dr. Ruben Santiago Montero
 Assistant Professor
 Dpto. Arquitectura de Computadores y Automatica
 Facultad de Informatica
 Universidad Complutense      phone  : +34 91 394 75 38
 28040 Madrid                 fax    : +34 91 394 75 27
 Spain                        email  : rubensm at dacya.ucm.es
 http://asds.dacya.ucm.es/
+-----------------------------------------------------------+

GridWay, The Way to Grid! http://www.gridway.org