[glue-wg] When is data stale?

Tue Apr 21 07:00:38 EDT 2015

Paul Millar [mailto:paul.millar at desy.de] said:
>  From your replies, you appear to have an internal definition of a
> CreationTime that is to yourself clear, self-obvious and almost
> axiomatic.  Unfortunately, you cannot seem to express that idea in the
> terms defined within GLUE-2.

OK, let's have one more try. The concept which you seem to think is missing is "entity instance". That may not be explicitly defined, but it's a general computing concept, and I find it hard to see how you could make much sense of the schema without it. The schema defines entities as collections of attributes with types and definitions; an instance of an entity has specific values for those attributes. One of those attributes is CreationTime. Instances are created in a way completely unspecified by the schema document, but whatever the method, the CreationTime is the time at which that creation occurs (necessarily approximate, since creation will take a finite time). If a new instance is created it gets a new CreationTime even if all the other attributes happen to be the same. However, if an instance is copied, the copy preserves *all* the attribute values including CreationTime; if you change that, it's a new instance and not a copy.
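As a minimal sketch of the instance semantics described above, the distinction between a copy and a newly created instance might look like this in Python. The class EntityInstance and the example ID value are hypothetical and only illustrate the idea; the schema itself says nothing about how instances are created or stored.

```python
import copy
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> datetime:
    # Creation takes a finite time, so this timestamp is only approximate.
    return datetime.now(timezone.utc)


@dataclass
class EntityInstance:
    # Hypothetical stand-in for an entity: a set of attributes with values.
    ID: str
    Name: str
    # Stamped once, at the moment the instance is created.
    CreationTime: datetime = field(default_factory=_now)


original = EntityInstance(ID="urn:example:service:1", Name="example")

# A copy preserves all attribute values, including CreationTime.
duplicate = copy.copy(original)

# Created afresh with the same ID and Name: the constructor stamps it
# with its own creation time, so it is a new instance and not a copy.
recreated = EntityInstance(ID=original.ID, Name=original.Name)
```

Here duplicate compares equal to original attribute for attribute, while recreated carries a timestamp taken at its own construction.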

> The validator should expose bugs, not hide them.  How else are sites
> going to fix these bugs.

The point is that sites can't fix middleware bugs, and hence shouldn't get tickets for them. If tickets were raised for errors which would always occur and can't be fixed until a new middleware release is available, the validator would have been rejected: sites must be able to clear alarms in a reasonably short time. That's also why only ERRORs generate alarms: ERRORs are always wrong, whereas WARNINGs may be correct, so a site may be unable to remove them. Of course, the validator can still be run outside the Nagios framework without the known issues mask.
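The alarm policy just described can be sketched roughly as follows. This is not the validator's actual code: the Finding type, the issue codes and the KNOWN_ISSUES set are all hypothetical, and the sketch only assumes that each finding carries a severity and some identifier that a mask can match against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    severity: str  # "ERROR" or "WARNING"
    code: str      # hypothetical identifier for the kind of problem


def alarms(findings, known_issues=frozenset()):
    """Return the findings that should raise an alarm for a site.

    Only ERRORs count, and ERRORs matching a known middleware issue
    are masked because the site cannot fix them itself.
    """
    return [
        f for f in findings
        if f.severity == "ERROR" and f.code not in known_issues
    ]


# Hypothetical mask of middleware bugs already ticketed with the developers.
KNOWN_ISSUES = frozenset({"MW-BUG-1"})

findings = [
    Finding("ERROR", "MW-BUG-1"),     # known middleware bug: masked
    Finding("ERROR", "SITE-CONFIG"),  # site can fix this: alarm
    Finding("WARNING", "UNUSUAL"),    # may be correct: never an alarm
]

raised = alarms(findings, KNOWN_ISSUES)   # as run under monitoring
unmasked = alarms(findings)               # as run without the mask
```

Calling alarms without a mask corresponds to running the validator standalone, where the known middleware issues show up again.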

> It would be good if we could check this: I think there's a bug in BDII
> where stale data is not being flushed.

Maria has been on maternity leave for several months, so all this has been on hold. I think she should be back fairly soon, but no doubt it will take a while to catch up. A couple of years ago there was a bug where old data wasn't being deleted, but it should be out of the system by now. Also bear in mind that top BDIIs can cache data for up to four days.

> If the validator is hiding bugs, and the policy is to do so whenever
> bugs are found, then it is useless.

The policy is to submit a ticket to the middleware developers and keep track of it. There's no point in repeatedly finding the same bug.

> AFAIK, there's no intrinsic reason why there should be anything beyond a
> 2--3 minute delay: the time taken to fetch the updated information from
> a site-level BDII.

The top BDII has to fetch information from several hundred site BDIIs and the total data volume is large. It takes several minutes to do that. And site BDIIs themselves have to collect information from the resource BDIIs at the site. Back in 2012 Laurence did some tests to see if the top BDII could scale to read from the resource BDIIs directly, but the answer was no: it can cope with O(1000) sources but not O(10000). Also, the resource BDII runs on the service and loads it to some extent, so it can't update too often - a particular issue for the CE, which is the service with the fastest-changing data.

> Yeah, typical grid middleware response: rewrite the software rather than
> fix a bug.

I could say that your response is typical: criticism without understanding.

As far as I'm concerned this correspondence is closed. I've said what I have to say; if you don't understand it I don't propose to make any further attempts to explain, especially since you seem to be resorting to abuse rather than argument.

Stephen