[glue-wg] ComputingService and Endpoints, a point of view

Mon Aug 27 06:08:57 EDT 2012

Hi Stephen,

On 2012-08-25 12:12, stephen.burke at stfc.ac.uk wrote:
> Florido Paganelli [mailto:florido.paganelli at hep.lu.se] said:
>>> What would you propose to do with Share, Resource and Manager?
>>
>> Same approach. As I said, this depends if we want to override the
>> associations or not. This cannot be represented in UML, but makes
>> sense in realizations.
>
> And what about the relations between them? And the same for the
> storage classes? I think this would be quite a big change which would
> need a significant advantage to be worthwhile, and so far I don't
> think you've given one.
>

There are no changes. As I said, UML cannot express inheritance very
well, while in an implementation it is straightforward.

But we have the opportunity to fix it in the realization documents that 
are not final yet.

I did not spend time reasoning about the other associations, but if we 
agree on a composition-driven approach (every specification adds, does 
not overload) rather than a bare inheritance-driven approach (every 
specification overloads associations) I see no problem whatsoever. We're 
still fully consistent with the model, everything works as expected.

>> In LDAP, I would scope the search for endpoints starting from the
>> ComputingService,
>
> You can do that if you have chosen one specific ComputingService, but
> in your own example of a delegation endpoint which could serve
> computing or others kind of service, the current definition lets you
> search for all the endpoints which serve computing services but not
> the others.
>

Yes, I understand what you mean. But what if I have a delegation
endpoint that can be used both for computing and for storage? Should I
replicate such an endpoint in a ComputingService and in a
StorageService? In that case the same delegation endpoint would be both
a ComputingEndpoint and a StorageEndpoint, with two different IDs, but
in the end it is the same endpoint! How do I express that it is the
same endpoint? With the same ID? But then the records would have
different objectclasses and associations... It's kind of bad to have
different records with the same ID.

I would rather call it Endpoint, add associations pointing to both the 
StorageService and ComputingService it serves, give it the same ID and 
place it in both Computing and Storage services.
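
To make the idea concrete, here is a minimal sketch of one plain
Endpoint record that carries associations to both services and is
placed under each of them. The class, the field names and the example
URNs are all hypothetical; this is not GLUE2 rendering code, only an
illustration of "same ID, one record, several associations".

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    # Hypothetical, simplified record: one ID, many service associations.
    id: str
    interface_name: str
    service_ids: set = field(default_factory=set)


def place(endpoint, service_id, services):
    """Associate the endpoint with a service and list it under that service."""
    endpoint.service_ids.add(service_id)
    services.setdefault(service_id, {})[endpoint.id] = endpoint


services = {}
delegation = Endpoint("urn:example:endpoint:delegation", "delegation")
place(delegation, "urn:example:service:computing", services)
place(delegation, "urn:example:service:storage", services)

# A client that collects endpoints from both services sees a single ID.
unique_ids = {ep.id for eps in services.values() for ep in eps.values()}
```

The record appears under both services, but since the ID is the same a
client can collapse the two occurrences into one.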

>> but then give me something to relate a local information service
>> and its endpoints (some OpenLDAP service), or an independent
>> delegation Service to the box where the ComputingService is,
>> otherwise I run the
>
> As I already said, I think an information endpoint should be a
> separate Service. For a delegation service I can't say, it would
> depend on how closely it's bound to the computing service and what
> the use cases are.
>
>> risk of quering twice the information system(s) for no reason, and
>> submit jobs twice to the same endpoint because I cannot
>> distinguish between them.
>
> Queries are normally very lightweight compared with real service
> interactions like job submission, unless you're doing a very large
> number of them - querying twice is not a problem. Being able to
> recognise that you have the same Endpoint multiple times obviously is
> important, but I don't see why it would be difficult to recognise
> duplicates.
>

Querying twice is a problem with big numbers. Say I have 20 information
endpoints and 40 submission endpoints in an index, such as EMIR, in
which every Endpoint record also carries the Service.ID of the Service
the endpoint belongs to.

A client retrieves all 60 of them. Then, it might want to query the
information endpoints to scan for submission endpoints.

Scenario 1)
I have Endpoints and ComputingEndpoints in a ComputingService.

I'll make it easy here. A single box might have more than one
information/submission endpoint, which means deciding which of the
information/submission endpoints belonging to the same box one doesn't
want to query. So, let's simplify the scenario and suppose that all
submission endpoints are on different boxes from each other, and
likewise for the information endpoints.

BUT there might be information endpoints on the same box as at least
one submission endpoint.

Then, since Endpoints and ComputingEndpoints are in the same
ComputingService, IF the information endpoint has the same Service.ID
as a submission endpoint, the client might decide not to query it.

Operation cost: at most one comparison for each pair of information
endpoint and submission endpoint, 20*40 = 800 ops.
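
As an illustration of the Scenario 1 check, here is a rough sketch of
the pairwise Service.ID comparison with a counter for the operations.
The record layout (dicts with "id" and "service_id" keys) and the
function name are assumptions made for this example, not something an
existing client is known to implement.

```python
def info_endpoints_to_query(info_endpoints, submission_endpoints):
    """Return (information endpoints still worth querying, comparisons made)."""
    comparisons = 0
    to_query = []
    for info in info_endpoints:
        shares_service = False
        for sub in submission_endpoints:
            comparisons += 1
            if info["service_id"] == sub["service_id"]:
                # Same Service.ID: the submission endpoint is already known,
                # so this information endpoint can be skipped.
                shares_service = True
                break
        if not shares_service:
            to_query.append(info)
    return to_query, comparisons


# Worst case: no information endpoint shares a Service.ID with any
# submission endpoint, so every pair is compared.
info = [{"id": f"info-{i}", "service_id": f"svc-{i}"} for i in range(20)]
subs = [{"id": f"sub-{j}", "service_id": f"other-{j}"} for j in range(40)]
worst_to_query, worst_comparisons = info_endpoints_to_query(info, subs)
```

With 20 information endpoints and 40 submission endpoints the counter
cannot exceed 20*40, which is the bound quoted above.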

Scenario 2) Different services,
Endpoints in an Information Service and ComputingEndpoints in a
ComputingService.

We then have different Service.IDs for each endpoint, because 
information endpoints belong to different services than submission 
endpoints.

The client cannot know which relationship exists between the services,
so it must query the information endpoints.

Suppose every information endpoint outputs 10 submission endpoints, some
registered to the index (i.e. belonging to the set of 40 taken from the
index) and some not (i.e. not in those 40 present in the index), for a
total of ~200 endpoints (20*10).

As said, since there is no information on how information and submission 
endpoints are coupled, I need to scan the information endpoints as I can 
gather more submission endpoints there. A client cannot just suppose 
that all the useful submission endpoints are in the index.

Hence I must check all the 40 submission endpoints in the index against
the 200 retrieved from the information endpoints, in order not to
submit twice to the same endpoint.

In the worst case this is 20 queries to information endpoints + 40*200
= 8000 comparison operations, 8020 operations in total, which is an
order of magnitude more than in Scenario 1.
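
The arithmetic of the two scenarios can be written down as a small
sketch. The function and parameter names are made up for this example,
and the figures are the same arbitrary ones used in the text.

```python
def scenario1_cost(n_info, n_submission):
    """Worst case: compare Service.IDs for every information/submission pair."""
    return n_info * n_submission


def scenario2_cost(n_info, n_submission, found_per_info):
    """Worst case: query every information endpoint, then compare each
    indexed submission endpoint against every endpoint discovered."""
    queries = n_info
    discovered = n_info * found_per_info
    comparisons = n_submission * discovered
    return queries + comparisons


cost1 = scenario1_cost(20, 40)      # 20*40
cost2 = scenario2_cost(20, 40, 10)  # 20 + 40*(20*10)
```

Both costs grow with the product of the endpoint counts, but in
Scenario 2 the second factor is the number of discovered endpoints
rather than the number of information endpoints.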

The numbers are arbitrary, but I can tell you that ARC will have at
least 3 submission endpoints per box, and you know what happens if you
take a site-bdii as an information endpoint (one might easily reach 10
there on big sites).

It is easy to see that as the number of job requests increases we might
incur an incredible amount of work just to submit a single job. Of
course clients can use fancy ranking algorithms and/or dynamic
programming to solve the problem better.

>> In my initial implementation I wanted to use the
>> service-to-service association described in GFD1.47 (page 7, page
>> 13); however I was told that this was not the purpose for it to be
>> there, but it was more to reflect some hierarchy between Services.
>
> I don't see how it could represent a hierarchy unless you had some
> other way to express it - Service-Service is a peer relation, there
> is no directionality (unlike e.g. Domain-Domain). In any case, as
> I've said repeatedly, the question is not what the purpose was when
> the schema was defined (none in particular as far as a I remember)
> but whether it can be used to satisfy whatever requirements you have
> now in a specific case. For the things you're describing this may
> well be sufficient.
>

It might be worth then pushing these association records into an index.
Many developers underestimate these associations in their
implementations, and I tend not to consider them reliable.
I can see that they were meant as an approach to database integrity with
a relational DB in mind.

These things nowadays are better realized via graph databases. Maybe the 
IDs in the associations might be used as a foundation to query and build 
a graph database of relationships between services, but this is dreaming 
of the future :)

>> I think the flaw in such an association based approach would be
>> that the unique ID might be wrong at a certain point in time (for
>> example because of ID renewal) and not refer anymore to the record
>> it points to.
>
> Persistency of IDs is a separate question, and a general one - IDs
> must be persistent for as long as necessary for all the possible
> uses. ServiceIDs in particular should probably change only when
> services are reconfigured in a major way. If references to IDs can't
> be followed the whole schema will be unusable!
>

I agree with both of these comments! We must push for those IDs to be
treated as crucial in implementations. Their value and importance for
making distributed deployments work have been underestimated,
especially regarding the rules regulating their persistence. I guess
this is already part of your EGI profile, Stephen.

Cheers,
-- 
Florido Paganelli
Lund University - Particle Physics
ARC Middleware
EMI Project