[drmaa-wg] DRMAA-WG April 4, 2006 call

Peter Troeger peter.troeger at hpi.uni-potsdam.de
Wed Apr 5 03:36:45 CDT 2006


Meeting minutes for April 4 phone conference:

- March 21 meeting minutes accepted without changes

- Upcoming SGE 6.0U8 release will still be DRMAA 0.95 compliant
   - DRMAA 1.0 compliance with SGE 6.0U9 release (3-6 months from now)
     or with the CVS main trunk

- Small problem with strtok_r() in the test suite under Solaris,
  Dan will commit a patched version to SourceForge CVS

- Latest text proposal for drmaa_wifexited() discussed (tracker #1125),
  accepted on condition that the term "ended" is removed
   - Peter adds updated text to the tracker
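
To make the discussed wording concrete, here is a minimal Python sketch of how a caller might interpret a wait status under the proposed text. It is a model only, not the DRMAA C binding: `WaitStatus` and `describe()` are hypothetical names that stand in for the opaque stat value and for calls to drmaa_wifexited(), drmaa_wifsignaled(), drmaa_wtermsig() and drmaa_wexitstatus().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WaitStatus:
    """Hypothetical stand-in for the opaque 'stat' value returned by drmaa_wait()."""
    exited: bool                        # what drmaa_wifexited() would report
    signaled: bool = False              # what drmaa_wifsignaled() would report
    term_signal: Optional[str] = None   # what drmaa_wtermsig() would report
    exit_status: Optional[int] = None   # what drmaa_wexitstatus() would report


def describe(stat: WaitStatus) -> str:
    """Interpret a wait status in the order the proposed text suggests."""
    if not stat.exited:
        # Zero 'exited': either the job ran but nothing more is known, or it
        # is unknown whether it ran. No exit status may be reported.
        return "no exit information available"
    if stat.signaled:
        return f"terminated by signal {stat.term_signal}"
    return f"exit status {stat.exit_status}"
```

Note that the exit status field is deliberately ignored whenever 'exited' is zero, which mirrors the SHALL NOT clause in the tracker text.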

- Job state after resuming from suspend state
   - Rough agreement that the Condor and GridWay approach of restarting
     the job is something different than suspend ("rescheduling")
   - Added as post 1.0 DRMAA feature (tracker #1787)
   - Suspend feature and the corresponding state transition back to
     PS_RUNNING remain mandatory for DRMAA 1.0 (no test suite changes)
   - Peter informs Ruben
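
The distinction between suspend/resume and rescheduling can be sketched as a small transition table. This is an illustrative Python model under stated assumptions, not part of any DRMAA binding: the state names loosely follow the DRMAA program states, and the "reschedule" action with its target state is a hypothetical reading of the post 1.0 feature in tracker #1787.

```python
from enum import Enum


class JobState(Enum):
    QUEUED_ACTIVE = "queued, active"
    RUNNING = "running"
    USER_SUSPENDED = "user suspended"


# (current state, action) -> next state
TRANSITIONS = {
    (JobState.RUNNING, "suspend"): JobState.USER_SUSPENDED,
    (JobState.USER_SUSPENDED, "resume"): JobState.RUNNING,
    # Hypothetical post-1.0 "reschedule" action (assumption): the job is
    # restarted, so it goes back to the queue instead of being paused.
    (JobState.RUNNING, "reschedule"): JobState.QUEUED_ACTIVE,
}


def control(state: JobState, action: str) -> JobState:
    """Return the state that follows `action`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not valid in state {state.name}") from None
```

In this model a suspended job that is resumed is running again, while a rescheduled job is not; that is the difference the call agreed on.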

- Discussion about job rejection in case of invalid job template
   - Would ease the Condor implementation, since that system
     detects invalid input files at job submission
   - Agreement that early rejection of invalid jobs should
     always be possible (e.g. compute centre checks)
   - Proposal for text change in new tracker #1786
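
The two behaviours under discussion, rejecting at submission versus accepting the job and letting it fail, can be contrasted in a short Python sketch. Everything here is hypothetical and for illustration only: `run_job()`, `DeniedByDrm`, the `input_path` key and the `known_files` set are invented names, and the real validation rules are up to each DRM system.

```python
import itertools


class DeniedByDrm(Exception):
    """Hypothetical stand-in for a DRMAA_ERRNO_DENIED_BY_DRM error."""


_job_ids = itertools.count(1)


def run_job(template: dict, known_files: set, reject_early: bool):
    """Submit a job described by `template`; return (job_id, state)."""
    input_path = template.get("input_path")
    bad_input = input_path is not None and input_path not in known_files

    if bad_input and reject_early:
        # Early rejection: no job id is created, so there is no job state.
        raise DeniedByDrm(f"input file not found: {input_path}")

    job_id = f"job-{next(_job_ids)}"
    # Late detection: the job exists and ends up in a failed state.
    state = "FAILED" if bad_input else "QUEUED_ACTIVE"
    return job_id, state
```

With `reject_early=True` the caller never receives a job id for an invalid template, which is why a text that demands a failed job state cannot be satisfied in that case.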

- Document submission to GGF on Friday
   - Pending SGE experience report (Dan)
   - Pending updated Condor experience report (Peter)
   - Pending final DRMAA spec (Hrabri)

Regards,
Peter.

> *** new phone numbers ***
> 
> 
> The bi-weekly DRMAA call is scheduled for 16:00 UTC (8:00 PDT, Pacific
> Daylight Time / 10:00 CDT / 17:00 Central Europe). All participants should use
> the following information to reach the conference call:
> 
> ------------------------------------
> * Toll Free Dial In Number for North America:   1 800 867-8609
> * Toll Free Dial In Number for Germany:         0 800 101-4546
> * Int'l Access/Caller Paid Dial In Number:      +49 069509594678
> * ACCESS CODE: 7223898
> ------------------------------------
> 
> Attachments to this email:
> 
>       - March 21 meeting minutes
> 
> 
> Meeting Agenda:
> 
> A. Meeting secretary for this meeting?
> 
> B. Acceptance of the March 21, 2006 meeting minutes
> 
> C. Admin
> 	- third chair update
> 
> F. Open/general issues discussion
>   - experience documents
>   - #1125 Tracker - see the included text at the end of the agenda
>   - Job suspension is different from triggering job rescheduling in Condor
>            (see attached "Re: GridWay Experience Report" mail)
>   - Status of the test suite
>       - post ver 1.0 issues
>       - handling exit status for bad input / output / error streams
>            (see attached "Re: [drmaa-wg] DRMAA TEST SUITE" mail)
>   - misc
> 	
> 
> Cheers,
> 	Hrabri
> 
> 
> ------------------------- Tracker #1125 proposed change -------------------------
> 
> Currently we have:
> "Evaluates into 'exited', a non-zero value if stat was returned for a job
> that terminated normally. A zero value can also indicate that although the
> job has terminated normally an exit status is not available or that it is
> not known whether the job terminated normally. In both cases
> drmaa_wexitstatus() SHALL NOT provide exit status information.
> A non-zero 'exited' value indicates more detailed diagnosis can be provided
> by means of drmaa_wifsignaled(), drmaa_wtermsig(),drmaa_wexitstatus(), and
> drmaa_wcoredump()."
> 
> It was proposed (Hrabri's adaptation of Peter's latest proposal) to change
> it to 
> 
> "Evaluates into 'exited' a non-zero value if stat was returned for a ended job
> that either failed after running or finished after running (see section 2.6).
> A non-zero 'exited' value indicates more detailed diagnosis can be provided by
> means of drmaa_wifsignaled(), drmaa_wtermsig(),drmaa_wexitstatus(), and
> drmaa_wcoredump() functions.
> A zero result for the 'exited' parameter either indicates that
>    1) although it is known that the job was running, more information is not
>       available
>    2) it is not known whether the job was running
>
> In both cases drmaa_wexitstatus() SHALL NOT provide exit status information."
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> Subject: Re: [drmaa-wg] DRMAA TEST SUITE
> From: Peter Tröger <peter.troeger at hpi.uni-potsdam.de>
> Date: Thu, 23 Mar 2006 16:00:06 -0500
> To: "Ruben Santiago Montero" <rubensm at dacya.ucm.es>
> CC: "DRMAA Working Group" <drmaa-wg at gridforum.org>
> 
> 
>>> Our proposal is to remove the call of drmaa_wifaborted() for
>>> ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE.
>>> The drmaa_wait() call does not hurt (since all submitted jobs must be
>>> waitable), but the crucial part is the testing for the result of
>>> drmaa_synchronize(). After this change, I would expect the test cases to
>>> be successful also on your system. In case of malicious input / output /
>>> error files, the DRMAA implementation would only be expected to state a
>>> job failure. This should work for all GridWay-supported systems, right ?
>>> Could you accept this proposal ?
>>>
>> Sure. It makes sense to me also.
>>
>> There is also a validator in the state diagram (Section 2.6). I am just
>> wondering if a DRMAA implementation could just reject the jobs in these tests
>> at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
> 
> The spec is unclear here, since the description of the input / output /
> error parameters demands a particular job state - DRMAA_PS_FAILED. You
> can only have a job state when you have a job id. You can only have a
> job id when drmaa_run() was successful. I really would like to have the
> opportunity of DRMAA_ERRNO_DENIED_BY_DRM also in this case, but then we
> have to relax the description of the corresponding job template attributes.
> 
> Sounds like another issue for the next phone call. Hrabri ?
> 
> Regards,
> Peter.
> 
> 
> ------------------------------------------------------------------------
> 
> Subject: [drmaa-wg] Minutes for DRMAA WG con-call 03/21/2006
> From: "Andreas Haas" <Andreas.Haas at Sun.COM>
> Date: Tue, 21 Mar 2006 12:58:11 -0500
> To: "DRMAA Working Group" <drmaa-wg at gridforum.org>
> 
> 
> Attendees: Roger, Peter, Daniel, Hrabri and Andreas
> 
> Last meeting minutes accepted without corrections.
> 
> * Hrabri proposes to add Peter as 3rd chair for DRMAA WG.
>   Peter says he would be willing to do it. Result of the
>   election is 5 votes pro and 0 votes against!
> 
> * Discussion about ST_INPUT_FILE_FAILURE test case
>   brought up by Ruben Santiago Montero. There is agreement
>   that the testing procedure needs to be changed to comply
>   with the specification, as proposed by Ruben.
> 
> * Andreas to review change in spec for tracker item 1125
> 
> 
> ------------------------------------------------------------------------
> 
> Subject: Re: GridWay Experience Report
> From: "Peter Troeger" <peter.troeger at hpi.uni-potsdam.de>
> Date: Thu, 23 Mar 2006 10:33:19 -0500
> To: "Andreas Haas" <Andreas.Haas at Sun.COM>
> CC: "Ruben Santiago Montero" <rubensm at dacya.ucm.es>, "Hrabri Rajic"
>     <hrabri at sbcglobal.net>, Ignacio Martín Llorente <llorente at dacya.ucm.es>,
>     "Roger Brobst" <rbrobst at cadence.com>, "Daniel Templeton"
>     <Dan.Templeton at Sun.COM>
> 
> 
>>>>- State of jobs after suspension: I loved to read this, since I had
>>>>exactly the same problem in the Condor DRMAA implementation. I ended up
>>>>with marking such jobs as "was suspended before", in order to give the
>>>>right active state afterwards. If we want to change the spec according
>>>>to this, we have a post 1.0 issue.
>>>
>>>Great! I think I can just do the same thing in GridWay DRMAA.
>>
>>
>> Hm ... I doubt this is a good idea. Job suspension is different
>> from triggering job rescheduling. If implementing job suspension
>> is a severe problem for DRM vendors, I believe that should be
>> an argument for not making it mandatory rather than for deviating
>> from the standard.
> 
> Even though we are running out of time for spec changes, this should be
> a topic for the next DRMAA phone conference. Hrabri, could you put this
> on the agenda ?
> 
> Regards,
> Peter.
> 




