[Pgi-wg] OGF PGI Scope

Etienne URBAH urbah at lal.in2p3.fr
Fri May 14 14:09:49 CDT 2010


Aleksandr,

Concerning OGF PGI Use Cases and Scope:


Complexity
----------
The 'requirement for pre-installed complex software packages to add more 
to complexity' is ALREADY there: this is AR1 (34).


Scope
-----
In fact, execution of Activities or Jobs is only a small part of the 
'Distributed Data Processing' functionality that a Production Grid is 
required to provide.

In particular, we MUST clearly understand the difference in 'Quality of 
Service' required by:
-  Transient  entities (such as Activities or Jobs), which MAY fail at 
any time for any reason,
-  Persistent entities (such as Grid resource descriptors, Security 
descriptors, Data sets, Accounting Records, Log records, ...), which 
SHOULD be securely kept.
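This Quality-of-Service distinction can be sketched as a small 
classification in code (an illustration only: the `Durability` labels 
and the dictionary are my own, built from the entity names listed 
above):

```python
from enum import Enum

class Durability(Enum):
    """QoS class of a Grid entity, per the distinction above."""
    TRANSIENT = "may fail at any time for any reason"
    PERSISTENT = "should be securely kept"

# Classification of the entities named above; purely illustrative.
ENTITY_QOS = {
    "Activity/Job": Durability.TRANSIENT,
    "Grid resource descriptor": Durability.PERSISTENT,
    "Security descriptor": Durability.PERSISTENT,
    "Data set": Durability.PERSISTENT,
    "Accounting record": Durability.PERSISTENT,
    "Log record": Durability.PERSISTENT,
}
```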

I just published at http://forge.gridforum.org/sf/go/doc15977?nav=1 the 
source file of UML Class and Collaboration Diagrams (designed with the 
ArgoUML tool):

-  showing the INTERFACES of Distributed Data Processing which are 
necessary or useful to standardize,

-  permitting assessment of the impact of ARCHITECTURE on the list of 
INTERFACES which absolutely must be standardized to permit minimum 
interoperability.

I will publish the corresponding pictures very soon.


Best regards.

-----------------------------------------------------
Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
                       Bat 200   91898 ORSAY    France
Tel: +33 1 64 46 84 87      Skype: etienne.urbah
Mob: +33 6 22 30 53 27      mailto:urbah at lal.in2p3.fr
-----------------------------------------------------


On Fri, 14/05/2010 20:50, Aleksandr Konstantinov wrote:
> Well done, Oxana. Very thorough description. I couldn't say it better.
> Just add requirement for pre-installed complex software packages
> to add more to complexity.
>
> A.K.
>
>
>
> Oxana Smirnova at Thursday 13 May 2010 22:00 wrote:
>> Hi Andrew, all,
>>
>> allow me to start from the very beginning, to explain the "typical" workload
>> Aleksandr referred to.
>>
>> Both ARC and gLite "grew" from the requirements of the High Energy Physics
>> community, more specifically - those of LHC experiments. I'll come back
>> later to what it means in practice.
>>
>> The basic difference between gLite and ARC starting requirements is that
>> gLite is designed for resources owned or controlled to a large extent by
>> their users, while ARC is designed to support resources shared between
>> different users and controlled by fairly independent resource providers.
>>
>> The immediate difference is policies: while the gLite community is
>> largely expected to comply with policies devised by the LHC Grid
>> (WLCG) Joint Security and Policy Group, the ARC community has no
>> single set of policies. ARC sites that contribute to LHC computing
>> generally tend to respect the WLCG policies, but not too closely,
>> giving priority to the local policies of resource owners. Needless
>> to say, this introduces extra complexity into the requirements and
>> reduces the number of simple use cases.
>>
>> Now, the LHC experiments are huge communities, between 500 and 3000
>> members each: all well-educated, computer-savvy people who never
>> hesitate to come up with their own brilliant solutions. Even within
>> one experiment the divergence is huge, and there are sites that
>> support 4 experiments. This adds to the complexity: not only do we
>> have diverging or contradictory requirements from resource owners,
>> we also have all sorts of diverging requirements from users. The
>> least common denominator is 1, meaning "ping", because even "hello
>> world" is ambiguous - where do you send the output, to a file or to
>> standard output? If it is standard output, do you write it to a log?
>> Who said "hello world", the individual user or the whole experiment?
>> Do we have a right to log individual activities at all? And so on.
>>
>> There are several basic kinds of typical jobs run by the LHC
>> experiments; in general they can be separated into 5 categories:
>>
>> 1. Monte Carlo generation (no input, small resource consumption,
>>    small output)
>> 2. Detector simulation (one input file, moderate resource
>>    consumption, moderate output)
>> 3. Signal reconstruction (multiple input files, moderate resource
>>    consumption, large output)
>> 4. Data merging (very many input files - like 400, large resource
>>    consumption, large output)
>> 5. Data analysis (huge amount of input files not necessarily known
>>    in advance, small resource consumption, small output)
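The five categories above can be sketched as a small table in code (an 
illustrative sketch only; the `JobProfile` structure and the informal 
"small"/"moderate"/"large" labels come from the text, not from any 
official WLCG classification):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobProfile:
    """One of the five typical LHC job categories described above."""
    name: str
    inputs: str      # how many input files the job consumes
    resources: str   # rough CPU/memory consumption
    output: str      # rough size of the produced output

PROFILES = [
    JobProfile("Monte Carlo generation", "none", "small", "small"),
    JobProfile("Detector simulation", "one file", "moderate", "moderate"),
    JobProfile("Signal reconstruction", "multiple files", "moderate", "large"),
    JobProfile("Data merging", "very many files (~400)", "large", "large"),
    JobProfile("Data analysis", "huge, not known in advance", "small", "small"),
]
```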
>>
>> The job can have a number of states, e.g.:
>>
>> 1. Job is defined (may require authorisation)
>> 2. Job matched to a site (requires authorisation, detailed site
>>    information, maybe data availability information)
>> 3. Data are made available to a job (authorisation, probably
>>    delegation of data staging rights to a staging service)
>> 4. Job processes data (CPU, I/O, network access to external
>>    databases requiring authorisation)
>> 5. Job places data at [multiple] pre-defined destinations
>>    (authorisation, contacting external databases, probably
>>    delegation to a staging service)
>> 6. Job is finished
>> 7. Job has failed
>> 8. Job is terminated (requires authorisation)
>> 9. Job is resubmitted (authorisation, information)
>>
>> Each state may have a number of sub-states, depending on the
>> experiment-specific framework.
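The state list above can be read as a simple state machine. A minimal 
sketch follows (the state names and the transition map are my own 
reading of the text, not any middleware's actual model; real 
experiment frameworks add many sub-states):

```python
from enum import Enum, auto

class JobState(Enum):
    """The nine job states listed above, under illustrative names."""
    DEFINED = auto()
    MATCHED = auto()
    DATA_STAGED_IN = auto()
    PROCESSING = auto()
    DATA_STAGED_OUT = auto()
    FINISHED = auto()
    FAILED = auto()
    TERMINATED = auto()
    RESUBMITTED = auto()

# Nominal forward path, plus resubmission after failure.
TRANSITIONS = {
    JobState.DEFINED: {JobState.MATCHED},
    JobState.MATCHED: {JobState.DATA_STAGED_IN},
    JobState.DATA_STAGED_IN: {JobState.PROCESSING},
    JobState.PROCESSING: {JobState.DATA_STAGED_OUT},
    JobState.DATA_STAGED_OUT: {JobState.FINISHED},
    JobState.FAILED: {JobState.RESUBMITTED},
    JobState.RESUBMITTED: {JobState.MATCHED},
}

def can_transition(src: JobState, dst: JobState) -> bool:
    """True if dst is reachable from src in this simplified model.

    Jobs are transient entities: any non-final state may fail at any
    time, and termination is an authorised external action.
    """
    if dst in (JobState.FAILED, JobState.TERMINATED):
        return src not in (JobState.FINISHED, JobState.TERMINATED)
    return dst in TRANSITIONS.get(src, set())
```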
>>
>> Authorisation may be per Virtual Organisation (file reading), per Role
>> and/or Group within a VO (job types 1-4), per person (job type 5), or even
>> per service (some frameworks accept services as VO members).
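The authorisation granularities described above (per VO, per Role 
and/or Group within a VO, per person, per service) can be sketched as 
attribute matching. This is a hypothetical illustration: the attribute 
names loosely follow VOMS-style terminology, but the matching logic is 
mine, not any real middleware's policy engine:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Subject:
    """The credential attributes presented by a requester."""
    vo: str
    role: Optional[str] = None
    group: Optional[str] = None
    person: Optional[str] = None

@dataclass
class Rule:
    """An authorisation rule; unset attributes (None) mean 'any'."""
    vo: str
    role: Optional[str] = None
    group: Optional[str] = None
    person: Optional[str] = None

    def matches(self, s: Subject) -> bool:
        # Every attribute set on the rule must match the subject.
        return (self.vo == s.vo
                and (self.role is None or self.role == s.role)
                and (self.group is None or self.group == s.group)
                and (self.person is None or self.person == s.person))

# Per-VO rule (e.g. file reading) vs per-person rule (e.g. analysis);
# the VO and user names are made up for the example.
read_rule = Rule(vo="atlas")
analysis_rule = Rule(vo="atlas", person="jdoe")
```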
>>
>> Delegation of rights in general is very much needed, because of the large
>> number of auxiliary services, distributed environment and general
>> complexity of the workflow. No-one really knows how to achieve the goals
>> without delegation.
>>
>> Each experiment has its own framework. Most such frameworks
>> circumvent Grid services because those are too generic. This means
>> that jobs are highly non-trivial, as they attempt to re-implement
>> Grid services such as information gathering and publishing, data
>> movement and registration, resubmission etc.; they also tend to
>> tweak authorisation by executing the payload of users not authorised
>> by Grid means. This complicates the job state picture even further.
>>
>> If PGI's outcome makes any of the above-mentioned jobs impossible,
>> most key ARC and gLite customers will not use the PGI specs, and
>> these will have only academic value. That was not the PGI goal, as
>> "P" stands for "Production".
>>
>> Cheers,
>> Oxana
>>
>> 13.05.2010 15:20, Andrew Grimshaw wrote:
>>> Aleksandr,
>>> Referring to your sentence/paragraph
>>>
>>> "Such "simple" job is very far from being "typical". At least in
>>> NorduGrid world AFAIK."
>>>
>>> Could you elaborate? I see in my work basically two "types" of jobs
>>> that dominate - sets of HTC "parameter space" jobs, and true
>>> parallel MPI jobs. In both cases the "job" is basically a command
>>> line - either an mpiexec/mpirun of an application with parameters,
>>> or a script with parameters. The job has known inputs and outputs,
>>> or a directory tree location where it needs to run. The job runs to
>>> completion, or it fails; in either case there are output files and
>>> result codes. Sometimes the job is a workflow, but when you pick
>>> that apart it turns into jobs that have inputs and outputs along
>>> with a workflow engine orchestrating it all.
>>>
>>> What is a typical job that you see? When I say "typical" I mean one
>>> that covers 80% of the jobs.
>>>
>>> A
>>>
>>> -----Original Message-----
>>> From: Aleksandr Konstantinov [mailto:aleksandr.konstantinov at fys.uio.no]
>>> Sent: Sunday, May 02, 2010 3:36 PM
>>> To: pgi-wg at ogf.org
>>> Cc: Andrew Grimshaw; 'Oxana Smirnova'; 'Etienne URBAH'; 'David SNELLING';
>>> lodygens at lal.in2p3.fr
>>> Subject: Re: [Pgi-wg] OGF PGI Requirements - Flexibility and clarity
>>> versus Rigidity and confusion
>>>
>>> Hello,
>>>
>>> I agree that the problem is too difficult to solve. One should take
>>> into account that initially the task was different. Originally,
>>> AFAIR, it was an attempt by a few grid projects to make a common
>>> interface suitable for them. Later those efforts were moved to OGF,
>>> and the problem escalated into an almost unsolvable one.
>>>
>>> Andrew Grimshaw at Saturday 01 May 2010 15:36 wrote:
>>>> Oxana,
>>>> Well said.
>>>>
>>>> I would add that I fear we may be trying too much to solve all
>>>> problems the first time around - to "boil the ocean". To
>>>> completely solve the whole problem is a daunting task indeed as
>>>> there are so many issues.
>>>>
>>>> I personally believe we will make more progress if we solve the
>>>> minimum problem first, e.g., securely run a simple job from
>>>> infrastructure/sw-stack A on infrastructure/sw-stack B.
>>>
>>> This problem is already solved, and it was done in a few ways:
>>> 1. Client stacks supporting multiple service stacks
>>> 2. BES + GSI
>>> 3. Other combinations currently in use
>>> None of these is fully suitable for real production. So unless the
>>> task of PGI is considered to be purely theoretical, this approach
>>> would amount to one more delay.
>>>
>>>> "Infrastructure/sw-stack A" means a set of resources (e.g., true
>>>> parallel-Jugene, clusters, sets of desktops) running a middleware stack
>>>> (e.g., Unicore 6 or Arc) configured a particular way. In the European
>>>> context this might mean an NGI such as D-Grid with Unicore 6 running a
>>>> job on NorduGrid running Arc. (Please forgive me if I have the
>>>> particulars of the NGIs wrong.)
>>>>
>>>> "Simple job" means a job that is typical, not special. This is not
>>>> to say that its resource requirements are simple - it may have
>>>> very particular requirements (cores per socket, interconnect,
>>>> memory); rather, I mean that the job processing required is
>>>> simple: run w/o staging, simple staging,
>>>
>>> Such "simple" job is very far from being "typical". At least in NorduGrid
>>> world AFAIK.
>>>
>>>> perhaps client interaction with the session directory pre, post,
>>>> and during execution.
>>>> Try to avoid complex job state models that will be hard to agree
>>>> on, and difficult to implement in some environments.
>>>>
>>>> "Securely" means sufficient authentication information required at B is
>>>> provided to B in a form it will accept from a policy perspective.
>>>> Further, that we try as much as possible to avoid a delegation
>>>> definition that extends inwards beyond the outer boundary of a
>>>> particular
>>>
>>> I'm lost. Is it delegation or the definition which extends?
>>>
>>>> infrastructure/sw-stack. (The last sentence is a bit awkward, I
>>>> personally think that we will need to have two models of
>>>> authentication and delegation - a legacy transport layer
>>>> mechanism, and a message layer mechanism based on SAML - and that
>>>> inside of a software stack we cannot expect sw-stacks to change
>>>> their internal delegation mechanism.)
>>>>
>>>> I believe authentication/delegation is the most critical item: if
>>>> we cannot get the authentication/delegation issues solved, the
>>>> rest is moot with respect to a PRODUCTION environment. We may be
>>>> able to do demos and stunts while punting on
>>>> authentication/delegation, but we will not integrate production
>>>> systems.
>>>
>>> Wasn't delegation voted down during the last review?
>>>
>>>
>>> A.K.


