[saga-rg] Use Case Mapping...

Andre Merzky andre at merzky.net
Thu Sep 8 03:09:22 CDT 2005


Hi all, 

below are some notes for mapping the current SAGA API spec
to one of the use cases, i.e. the GridLab use case for
application migration.  I attach the use case for reference
as well.

Cheers, Andre.

+-----------------------------------------------------------------+


The SAGA API allows migrating any job it can handle with the
job class, using the migrate method.  That provides an easy
solution for the GridLab migration use case, if migration is
supported by the implementation/middleware/backend:

--------------------------------------------------------------
  #include <saga.hpp>
  #include <iostream>
  #include <vector>
  #include <string>

  using namespace std;
  
  int main ()
  {
    // start the job on the original resource
    saga::job_server js;
    saga::job j = js.run_job ("remote.host.net", "my_app");
  
    // reuse the job's own description for the migration
    saga::job_definition jd = j.get_job_definition ();
    
    vector <string> hosts;
    vector <string> files;
    hosts.push_back (string ("near.host.net"));
    files.push_back (string 
         ("http://remote.host.net/file > http://near.host.net/file"));

    // new target host, and the files to stage there
    jd.set_vector_attribute ("SAGA_HostList",     hosts);
    jd.set_vector_attribute ("SAGA_FileTransfer", files);
  
    j.migrate (jd);
  
    cout << "Heureka!" << endl;
  
    return (0);
  } 
--------------------------------------------------------------

(Question: does the SAGA migrate call move checkpoint files
 automatically, or do they need to be specified in the new 
 job description as above?)
  
However, for the complete use case to be implemented on
application level, a number of steps cannot be implemented in
SAGA.  The call sequence would be:

  In the application instance which performs the migration on
  the other job:
      - trigger migration for the remote job
      - discover new resource
      + move checkpoint data to new resource
      + schedule application on new resource
      + continue computation (and discontinue old job)
  
  In the application instance which gets migrated
      - get triggered for checkpointing
      = perform application level checkpointing
      - report checkpoint file location(s)

Items marked with + are possible to implement in SAGA, items
marked with - aren't.  The item marked with '=' is (currently)
not related to SAGA.
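
A rough sketch of how the driving instance could glue the sequence
above together.  The two hooks stand for the steps marked with '-'
and are purely hypothetical (nothing like them exists in the API);
the names migration_hooks, migration_plan and plan_migration are
made up for illustration, and the target URL layout is an
assumption as well.  The result is what would go into the
SAGA_HostList and SAGA_FileTransfer attributes before calling
migrate:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical hooks for the steps marked '-' above.
struct migration_hooks
{
  // ask the remote instance to checkpoint, get checkpoint file URLs
  std::function <std::vector <std::string> ()> trigger_checkpoint;
  // pick a new host for the job
  std::function <std::string ()>               discover_host;
};

// Values to be set on the job description before migrating.
struct migration_plan
{
  std::vector <std::string> hosts;   // for "SAGA_HostList"
  std::vector <std::string> files;   // for "SAGA_FileTransfer"
};

// Part of a URL after the last '/', e.g. "file" for ".../file".
std::string base_name (const std::string & url)
{
  std::string::size_type pos = url.rfind ('/');
  return (pos == std::string::npos) ? url : url.substr (pos + 1);
}

migration_plan plan_migration (const migration_hooks & hooks)
{
  assert (hooks.trigger_checkpoint && hooks.discover_host);

  std::vector <std::string> checkpoints = hooks.trigger_checkpoint ();
  std::string               host        = hooks.discover_host ();

  migration_plan plan;
  plan.hosts.push_back (host);

  // one "source > target" entry per checkpoint file; the target
  // layout (http://<host>/<name>) is assumed for this sketch
  for (const std::string & src : checkpoints)
    plan.files.push_back (src + " > http://" + host + "/" + base_name (src));

  return plan;
}
```

With a checkpoint hook that reports http://remote.host.net/file
and a discovery hook that returns near.host.net, the plan contains
the same host list and file transfer entry as the example program
above.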

For the complete implementation of the use case, SAGA misses:

  1) means to communicate with the remote application instance
  2) means to discover new resources

Notes:
  1) means of communication are actually given, but not per se
     usable for this use case.  E.g. streams are definite
     overkill for signalling checkpointing requests.  Signals
     (as in job.signal (int signal)) would work, but only if the
     remote job uses signal handling as a checkpoint trigger.
     That also might be difficult to use if the job is running 
     in a wrapper script, or in a virtual machine etc. - that
     might not be transparent to SAGA, and would require direct
     communication.  Also, the signalling method lacks
     feedback about the success of the operation, and cannot
     return information such as the location of checkpoint files.

  2) the current SAGA API covers job submission to specific
     hosts, or lets the middleware choose a suitable host for
     submission.  However, the brokering result is not exposed
     on API level, as would be necessary for this specific use
     case, and possibly for other dynamically active Grid
     applications.  

     One way to implement that is to provide a direct interface
     to Grid information systems, and in that way expose
     information about available resources.  That would actually
     be more flexible, as it e.g. also allows the discovery of
     specific services, but would also require additional
     semantic knowledge on application level.
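
To illustrate note 1: a minimal sketch of the signal-based variant
on the side of the migrated application.  It assumes a POSIX
system and that both sides agree on SIGUSR1 as the checkpoint
trigger; the function names are made up.  The handler only sets a
flag, and the main loop polls that flag between iterations:

```cpp
#include <cassert>
#include <csignal>

// Set by the handler, polled by the main loop.  Only operations
// which are safe inside a signal handler are done there.
volatile std::sig_atomic_t checkpoint_requested = 0;

void on_checkpoint_signal (int)
{
  checkpoint_requested = 1;
}

// Install the trigger; returns false if the handler could not be set.
bool install_checkpoint_trigger ()
{
  return std::signal (SIGUSR1, on_checkpoint_signal) != SIG_ERR;
}

// To be called between iterations of the simulation.  Returns true
// exactly once per received request.
bool consume_checkpoint_request ()
{
  if (!checkpoint_requested) return false;
  checkpoint_requested = 0;
  return true;
}
```

The sketch also shows the limitation mentioned above: the flag
carries no payload, so neither success nor the location of the
checkpoint files can be reported back through this channel.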


+-----------------------------------------------------------------+

-- 
+-----------------------------------------------------------------+
| Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
| Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
| Dept. of Computer Science         | mail: merzky at cs.vu.nl       |
| De Boelelaan 1083a                | www:  http://www.merzky.net |
| 1081 HV Amsterdam, Netherlands    |                             |
+-----------------------------------------------------------------+
-------------- next part --------------

SAGA Use Case Template:
=======================

  Name of use case: Application Migration

  Contact:          Andre Merzky <merzky at cs.vu.nl>


1. General Information:
-----------------------

  This section consists of check-boxes to provide some context in
  which to evaluate the use case.
  
  1.1 Which best describes your organisation:
  
    Industry                     [ ]
    Academic                     [x]
    Other                        [ ]
       Please specify:           ................................
                       
                       
  1.2 Application area:    
                       
    Astronomy                    [ ]
    Particle physics             [ ]
    Bio-informatics              [ ]
    Environmental Sc.            [ ]
    Image analysis               [ ]
    Other                        [ ]
       Please specify: astrophysics, but the use case is generic
  
  
  1.3 Which of the following apply to or best describe this use
      case?  Multiple selections are possible, please prioritize
      with numbers from 1 (low) to 5 (high):
  
    Database                     [ ]
    Remote steering              [3]
    Visualization                [1]
    Security                     [1]
    Resource discovery           [5]
    Resource scheduling          [5]
    Workflow                     [3]
    Data movement                [5]
    High Throughput Computing    [ ]
    High Performance Computing   [1]
    Other                        [ ]
        Please specify:          ................................
  
  
  1.4 Are you an:
  
    Application user             [ ]
    Application developer        [ ]
    System administrator         [ ]
    Service developer            [ ]
    Computer science researcher  [ ]
    Other                        [ ]
        Please specify:   Middleware Developer (higher levels)
  

        
2. Introduction:
----------------
  
  2.1 Provide a paragraph introduction to your use case.
      Background to the project is another alternative.  
      (E.g.  100 words).

    One of the major scenarios targeted by the GridLab project is
    the ability to migrate a running application in a VO.  The
    migration process may get triggered by various means:

      - running out of time on the original resource
      - a more powerful resource becomes available
      - a resource with more memory or local disk space is needed
      - user prefers a different resource and triggers migration
      - migration as part of a larger work flow scenario

    The migration includes the following well-defined steps:

      - trigger migration
      - discover new resource
      - perform application level checkpointing
      - move checkpoint data to new resource
      - schedule application on new resource
      - continue computation (and discontinue old job)

    Several of these operations need to be done on application
    level - the use case specifically describes those operations
    with respect to a Grid API.


  2.2 Is there a URL with more information about the project ?
    
    http://www.gridlab.org/
  
  
3. Use Case to Motivate Functionality Within a Simple API:
----------------------------------------------------------  
  
  Provide a scenario description to explain customers' needs.
  E.g. "move a file from A to B," "start a job."
  
  Please include figures if possible.

  If your use case requires multiple components of functionality,
  please  provide separate descriptions for each component,
  bullet points of 50 words per functionality are acceptable.
  
    Following the list from 2.1:

    - trigger migration
        
      If the application triggers the migration process itself,
      it needs means to communicate with the resource management
      system it got started with, or with any other one which
      knows about its execution environment requirements (exe,
      input files, output files... -> job description).  The
      request basically is:
      
      rms  = Grid.getResourceManagementSystem ();
      self = rms.getMyJobDescription ();
      // perform checkpoint
      // save state
        
      
      If the application migration gets triggered from outside
      the application, the application needs to have means to
      get notified about this - it needs to know when to perform
      checkpointing and to shut down.  There are many ways to do
      that - application-steering-like mechanisms seem the most
      convenient ones:
      
      sub mycallback (userdata) {
        // perform checkpoint
        // save state
      }
      result = Grid.announceCheckpointCallback (mycallback, 
                                                userdata);
      
      
      - discover new resource
      
        If that operation is not performed by the resource
        management system itself, the application needs to
        discover new resources on which it can run.  It 
        needs to provide its own job description.
        
        host = GriResourceManager.discoverNewHost (self);

        
      - perform application level checkpointing
      
        The checkpointing process itself does not need Grid
        support per se, but the application needs to be able to
        announce the location of its checkpoint files.  These
        could be put into a replica catalog, or onto a global
        file system - but the resource manager needs to know
        about them, in order to make them available on the new
        resource:
      
        app.checkpoint (filename);
        grid.replicaCatalog.addFile (replicaname, filename);
        rms.announceCheckpointFile  (replicaname);

        
      - move checkpoint data to new resource

        If that operation is not performed by the resource
        management system itself, the application needs to be
        able to migrate its checkpoint files to the new resource:

        grid.copyFiles (filename,    host);
          or
        grid.replicate (replicaname, host);
        

      - schedule application on new resource

        If that operation is not performed by the resource
        management system itself, the application needs to be
        able to start a copy of itself on the remote resource:
        
        copy = GriResourceManager.runJobOnHost (self, host);

     
      - continue computation (and discontinue old job)

        Both are straightforward.
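
    The externally triggered path from the "trigger migration"
    item could look roughly like the sketch below.  The class
    checkpoint_endpoint and its methods are hypothetical and only
    stand in for the Grid.announceCheckpointCallback call of the
    pseudo code.  In contrast to the pseudo code, the callback is
    assumed to return the locations of the checkpoint files, so
    that the caller gets feedback:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical steering endpoint living inside the application.
class checkpoint_endpoint
{
  public:
    // The callback performs the checkpoint and returns the
    // locations of the files it wrote.
    typedef std::function <std::vector <std::string> ()> callback_t;

    void announce_checkpoint_callback (callback_t cb)
    {
      assert (cb != nullptr);   // an empty callback is a caller error
      callback_ = std::move (cb);
    }

    // Invoked when a migration request arrives from outside.
    // Returns false if no callback was announced; otherwise runs
    // the callback and stores the checkpoint file locations.
    bool request_checkpoint (std::vector <std::string> & files) const
    {
      if (!callback_) return false;
      files = callback_ ();
      return true;
    }

  private:
    callback_t callback_;
};
```

    The application would announce its callback once at startup;
    whatever delivers the external request would then call
    request_checkpoint and pass the returned locations on to the
    resource management system.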
        
  
4. Customers:
-------------

  Describe customers of this use case and their needs. In
  particular, where and how the use case occurs "in nature" and
  for whom it occurs.  E.g. max 40 words

    The customers of the use case are scientific communities 
    with jobs
    a) running for a very long time (~weeks)
    b) with varying computing demands (peaks requiring more 
       powerful resources, or more disk space)
    c) which are part of larger dynamic systems

    Grand Challenge Simulations are specific target applications
    for that use case.

  
5. Involved Resources:
----------------------

  5.1 List all the resources needed: e.g. what hardware, data,
      software might be involved.

        - compute resources
        - data storage systems  
        - resource management systems
        - data replication/movement systems
        - remote steering or monitoring systems

  
  5.2 Are these resources geographically distributed?
    
        potentially yes.

  
  5.3 How many resources are involved in the use case?  E.g. how
      many remote tasks are executing at the same time?

        minimum: 2, maximum: unlimited, only one compute resource
        at the same time.

  
  5.4 Describe your codes and tools: what sort of license is
      available, e.g. open or closed source license; what sort of
      third party tools and libraries do you use, and what is
      their availability;  do you regularly work from source
      code, or use pre-compiled applications; what languages are
      your applications developed in (if relevant), e.g.
      Fortran, C, C++, Java, Perl, or Python.

    Application: C/Fortran code, open source 
                 http://www.cactuscode.org
    API:         C API binding to Grid Services, open source 
                 http://www.gridlab.org/gat/
    Services:    C and Java Services, open source, mostly based 
                 on Globus 
                 http://www.gridlab.org/

  
  5.5 What information sources do you require, e.g. certificate
      authorities, or registries.

    Resource Discovery and state preservation (replica systems
    or similar) are the main requirements for information
    management.
      
  
  5.6 Do you use any resources other than traditional compute or
      data resources, e.g. telescopes, microscopes, medical
      imaging instruments.

    No.

  
  5.7 Please link all the above back to the functionalities
      described in the use case section where possible.

    ...


  5.8 How often is your application used on the grid or grid-like
      systems?

      [ ] Exclusively
      [ ] Often (say 50-50)
      [x] Occasionally on the grid, but mostly stand-alone
      [ ] Not at all yet, but the plan is to.

    The application is actually used in Grids, but does not make
    full use of Grid capabilities (such as the one described here).
      

  
6. Environment:
---------------

  Provide a description of the environment your scenario runs in,
  for example the languages used, the tool-sets used, and the
  user environments (e.g. shell, scripting language, or portal).

    Users work mostly on shells, portals are under development.
    Programmers work on open source solutions, unix only, C, C++,
    Fortran.


  
7. How the resources are selected:
----------------------------------

  7.1 Which resources are selected by users, which are inherent
      in the application, and which are chosen by system
      administrators, or by other means?  E.g. who is specifying
      the architecture and memory to run the remote tasks?  

    Compute Resources are selected manually or automatically (job
    description by users).
  
  
  7.2 How are the resources selected? E.g. by OS, by CPU power,
      by memory, don't care, by cost, frequency of availability
      of information, size of datasets?

    OS, Architecture, Memory, disk space, runtime (when, how
    long)
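
    A minimal sketch of how such criteria could be matched against
    what an information system reports.  All type and field names
    are hypothetical, and the runtime criterion (when, how long)
    is left out, as it would need scheduler information:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical host description, as an information system might
// report it.
struct resource_info
{
  std::string   name;
  std::string   os;
  std::string   arch;
  unsigned long memory_mb;
  unsigned long disk_mb;
};

// The matching part of a job description.
struct job_requirements
{
  std::string   os;
  std::string   arch;
  unsigned long min_memory_mb;
  unsigned long min_disk_mb;
};

bool matches (const resource_info & r, const job_requirements & req)
{
  return r.os        == req.os
      && r.arch      == req.arch
      && r.memory_mb >= req.min_memory_mb
      && r.disk_mb   >= req.min_disk_mb;
}

// Names of all known hosts which satisfy the requirements, in the
// order in which they were reported.
std::vector <std::string> discover_hosts (
    const std::vector <resource_info> & known,
    const job_requirements            & req)
{
  std::vector <std::string> hosts;
  for (const resource_info & r : known)
    if (matches (r, req)) hosts.push_back (r.name);
  return hosts;
}
```

    The returned names could be used directly as a host list for
    the new job.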
  
  
  7.3 Are the resource requirements dynamic or static?
    
    Vary from run to run, but mostly static, sometimes dynamic.
    In the future more dynamic.
  
  
8. Security Considerations:
---------------------------

  8.1 What things are sensitive in this scenario: executable
      code, data, computer hardware?  I.e. at what level are
      security measures used to determine access, if any?

    Data should get only accessed by owner or group.  Resources
    are not to be compromised of course.  --> standard academic
    security requirements.

  
  8.2 Do you have any existing security framework, e.g. Kerberos
      5, Unicore, GSI, SSH, smartcards?
  
    GSI for all communication and resource access.
  
  
  8.3 What are your security needs: authentication,
      authorisation, message protection, data protection,
      anonymisation, audit trail, or others?

    authentication, authorisation, basic data protection

  
  8.4 What are the most important issues which would simplify
      your security solution?  Simple API, simple deployment,
      integration with commodity technologies.  

    simple deployment
  

  
9. Scalability:
---------------

  What are the things which are important to scalability and to
  what scale - compute resources, data, networks ?

    The scenario is not bound by scalability (the application of
    course is).


  
10. Performance Considerations:
-------------------------------

  Explain any relevant performance considerations of the use
  case.

    The total time to migrate to a better resource must result
    in a benefit compared to having the computation simply
    continue on the old resource.  However, on occasions where
    simple continuation is not possible, performance penalties
    are acceptable.

    In general: performance requirements depend on specific
    application/simulation.
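
    As a sketch, the criterion can be written as a simple
    comparison.  The model is an assumption made for illustration:
    a fixed migration overhead, and a constant speedup factor on
    the new resource.  The names are made up:

```cpp
#include <cassert>

// All times in hours.
struct migration_estimate
{
  double remaining_on_old;   // time to finish if the job stays
  double overhead;           // checkpoint + transfer + queueing + restart
  double speedup;            // new vs. old resource, 2.0 = twice as fast
  double time_left_on_old;   // allocation left on the old resource
};

bool should_migrate (const migration_estimate & e)
{
  assert (e.speedup > 0.0);

  // Continuation is not possible: migrate even at a penalty.
  if (e.time_left_on_old < e.remaining_on_old) return true;

  // Otherwise migrate only if the job finishes earlier that way.
  double time_if_migrated = e.overhead + e.remaining_on_old / e.speedup;
  return time_if_migrated < e.remaining_on_old;
}
```

    For example, with 100 hours remaining, 5 hours overhead and a
    speedup of 2, the migrated job needs 55 hours, so migration
    pays off.  With only 10 hours remaining and 8 hours overhead
    it does not, unless the old resource runs out of time first.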

  
11. Grid Technologies currently used:
-------------------------------------

  If you are currently using or developing this scenario, which
  grid technologies are you using or considering?

    - Globus-based services from the GridLab project
    - Grid Application Toolkit from the GridLab project

  
12. What Would You Like an API to Look Like?
--------------------------------------------  
  
  Suggest some functions and their prototypes which you would
  like in an API which would support your scenario.

    An example of a migration in GAT is included in the GAT
    release.
  
  
13. References:
---------------

  List references for further reading.

    http://www.gridlab.org/gat/



