[glue-wg] When is data stale?

Paul Millar paul.millar at desy.de
Tue Apr 21 05:38:20 EDT 2015


Hi Stephen,

On 20/04/15 21:14, stephen.burke at stfc.ac.uk wrote:
> I don't know if you're being deliberately obtuse - *all* the
> attribute names are supposed to be descriptive of what they mean,

*sigh* I'm really not trying to be deliberately obtuse.  I'm trying to 
illustrate an (apparently) difficult concept, and failing to do so.

> if they aren't the name is poorly chosen. Would you be happier if
> the attribute were called RabbitFood and I defined it as a creation
> time?

To some extent, yes --- when writing a description I consider the name 
as if it were written in a foreign language.  This forces me not to be 
"lazy", and to write a description that stands on its own.  I also try 
hard to write the description without using the words contained in the 
name; this helps me avoid the trap of making assumptions about the 
reader's understanding of those words.

From your replies, you appear to have an internal definition of 
CreationTime that is, to you, clear, self-evident and almost axiomatic. 
Unfortunately, you seem unable to express that idea in the terms 
defined within GLUE-2.

My point is that the language of GLUE-2 seems to prevent such 
descriptions: it assumes some kind of steady-state, without describing 
how information is updated.  If so, this makes defining CreationTime 
impossible without introducing a new concept, such as the info-provider.

[...]
>> Do you know if the glue validator is being run against production
>> top-level BDII instances?
>
> Yes, it's run as part of the site Nagios tests, and sites get
> tickets for things marked as ERROR,

Excellent.

> except that known middleware bugs are masked.

Indeed, I discovered only this morning that bugs are being hidden:

	https://its.cern.ch/jira/browse/GRIDINFO-58

I feel this is very wrong!

The validator should expose bugs, not hide them.  How else are sites 
going to fix them?

> I'm not sure offhand if that includes this issue - as
> Florido says, the ARC values were short enough that they always
> failed the test so it may still be masked.

It would be good if we could check this: I think there's a bug in BDII
where stale data is not being flushed.
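One rough way to check would be to compare each entry's 
GLUE2EntityCreationTime against the expected refresh cycle and flag 
anything older.  A minimal sketch (the one-hour threshold and the 
sample timestamps are invented; the timestamp format is the UTC ISO 
8601 form GLUE-2 uses):

```python
from datetime import datetime, timedelta, timezone

def is_stale(creation_time: str, max_age: timedelta, now: datetime) -> bool:
    """Return True if a GLUE2EntityCreationTime value is older than max_age.

    GLUE-2 timestamps are UTC in ISO 8601 form, e.g. "2015-04-21T09:38:20Z".
    """
    created = datetime.strptime(creation_time, "%Y-%m-%dT%H:%M:%SZ")
    created = created.replace(tzinfo=timezone.utc)
    return now - created > max_age

# Invented sample values, as if harvested from a top-level BDII.
now = datetime(2015, 4, 21, 12, 0, 0, tzinfo=timezone.utc)
timestamps = [
    "2015-04-21T11:55:00Z",  # five minutes old: fresh
    "2015-04-20T08:00:00Z",  # over a day old: should have been flushed
]
stale = [t for t in timestamps if is_stale(t, timedelta(hours=1), now)]
```

In practice one would feed this from an ldapsearch against the 
top-level BDII rather than a hard-coded list.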

If the validator is hiding bugs, and the policy is to do so whenever 
bugs are found, then it is useless.


>> One hour!  Why doesn't someone fix this?
>
> It's actually more like 30 minutes, and it's pretty much intrinsic
> to the BDII architecture.

AFAIK, there's no intrinsic reason why there should be anything beyond a
2--3 minute delay: the time taken to fetch the updated information from 
a site-level BDII.

Where's the bug-report for this?

> There have been various attempts to design a new information system
> but none have come to fruition.

Yeah, typical grid middleware response: rewrite the software rather than 
fix a bug.

>> OK, but why 1 hour and not 1 minute or 1 day?
>
> As I said, there's no point in having it much shorter than an hour
> because the system can't update that fast.

OK, but again, this is bad.

Rather than fixing a bug, a work-around is introduced.

> For most dynamic information 1 day would be unrealistically long
> because the dynamic state of most services can change quite a bit
> faster, e.g. services often go from up to down to up within a day.

Personally, I'm still not convinced there's some intrinsic period 
describing how long an object should stay valid.

For example, if an Endpoint is no longer available, that information 
should propagate quickly.  It doesn't matter that the endpoint has been 
available for the past 6 months, or that endpoints are generally stable 
for many days.

> Ideally we'd have a more responsive information system

Absolutely!  30 minutes delay is ridiculous.

> and one which treated different kinds of information differently -
> i.e. fast-changing information like running job counts would be
> updated every few minutes or less, while slowly changing objects
> would update infrequently.

That's merely an optimisation, which might prove useful if we can 
reasonably label such data.

I'm still not convinced about labelling objects as rapidly or slowly 
updating, so I'm not convinced by this optimisation.


> In that case the Validity could be set according to the realistic
> lifetime of the information and the information system could use it
> as a guide to when it should refresh. However that isn't what we have
> at the moment.

True.

I think the immediate focus should be fixing top-level BDIIs so they
provide reasonably up-to-date information.
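For what it's worth, the scheme you describe could be sketched as a 
consumer re-fetching each object at CreationTime + Validity.  The 
attribute names follow GLUE-2, but the per-object-type Validity values 
and the refresh-at-expiry policy below are purely illustrative:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-type Validity values in seconds; GLUE-2 defines the
# attribute, but these particular numbers come from no profile.
VALIDITY = {
    "ComputingShare": 300,   # fast-changing, e.g. running-job counts
    "Endpoint": 3600,        # service availability
    "AdminDomain": 86400,    # slowly changing metadata
}

def next_refresh(kind: str, creation_time: datetime) -> datetime:
    """When a consumer should re-fetch an object: at CreationTime + Validity."""
    return creation_time + timedelta(seconds=VALIDITY[kind])

created = datetime(2015, 4, 21, 9, 38, 20, tzinfo=timezone.utc)
share_due = next_refresh("ComputingShare", created)   # five minutes later
domain_due = next_refresh("AdminDomain", created)     # a day later
```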

Cheers,

Paul.
