[glue-wg] On BDII performance was Re: When is data stale?

Florido Paganelli florido.paganelli at hep.lu.se
Wed Apr 22 05:45:23 EDT 2015


Hi Paul

Without replying point by point to the benchmarking you've done, which
was nice work, I would kindly suggest not benchmarking a technology you
perhaps don't fully understand. As usual, everything is fine in theory,
but not in practice :( .
Your claims are all true: LDAP is very fast at answering queries; this
is why we use it... but that is not why BDII is "slow".

Most of the time spent by the BDII goes into restructuring the LDAP tree.
LDAP's index is tree-structured and backed by a key-value Berkeley DB.
That means that when aggregating data, all of it must be reindexed
("rewriting the DN" in LDAP slang) to fit into the tree. It is also this
tree structure, plus the simplicity of a key-value database, that allows
LDAP to answer queries so quickly.
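
To give a rough idea of what "rewriting the DN" means in practice
(illustrative DNs only, the real suffixes the BDII uses may differ): an
entry published by a resource BDII has to be re-rooted under the
aggregating server's suffix, e.g.

  # as published by the resource BDII (illustrative)
  dn: GLUE2ServiceID=svc1,GLUE2GroupID=resource,o=glue

  # as it must appear after aggregation at a site/top BDII (illustrative)
  dn: GLUE2ServiceID=svc1,GLUE2GroupID=services,GLUE2DomainID=MYSITE,o=glue

and every object underneath gets the same treatment, which is why the
whole set has to be rekeyed.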

Unfortunately, all of this comes at a cost. Updating the DB requires
roughly the following steps (I haven't looked at the code recently, but
this is what I remember; a rough sketch of the cycle follows the list):

1) ldap-query the sources (negligible time, as you discovered)
2) rebuild the new tree(s), generating new LDIF document(s) (very time
consuming; includes rekeying of ALL objects)
3) check differences between the rebuilt tree(s) and the existing
database entries
4) modify existing entries that have changed (one ldap-modify for each
object)
5) remove objects that are not there anymore (ldap-delete)
6) ldap-add new objects -- which boils down to ldap-adding a whole new
LDIF document (that is, the entire DB) in most cases due to -- guess
what -- CreationTime and Validity, which are always changing!!! :D
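
Very roughly, and with invented helper names and placeholder
credentials, the cycle looks something like this (a sketch of the idea
only, not the actual bdii-update code):

  # 1) fetch the sources (fast, as you measured)
  ldapsearch -LLL -x -H ldap://source.example.org:2170 -b o=glue > fetched.ldif

  # 2) rebuild the tree: every entry gets a new DN under the aggregated
  #    suffix -- this rekeying of all objects is the expensive part
  rewrite_dns < fetched.ldif > new.ldif           # hypothetical helper

  # 3) diff against what the local slapd currently holds
  ldapsearch -LLL -x -H ldap://localhost:2170 -b o=glue > current.ldif
  ldif_diff current.ldif new.ldif > changes.ldif  # hypothetical helper

  # 4)-6) apply the changes (modify / delete / add records)
  #       bind DN and password are placeholders
  ldapmodify -x -H ldap://localhost:2170 -D "$ADMIN_DN" -w "$PW" -f changes.ldif

and since CreationTime and Validity change on every run, changes.ldif
tends to contain nearly the whole database.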

As you can see, you have only benchmarked the tip of the iceberg.

Laurence or Maria can correct me if the above is not true. I don't know
the code that well but I had to look into it during EMI times.

Over the years Laurence managed to shorten this update time with several
smart ideas, including enterprise-level techniques like replication, and
probably partial LDIF documents where applicable. I think this is how he
avoided having to run two LDAP servers.

You have to understand that LDAP technology is intended for data that
changes rarely, and we're using it for an almost real-time system. One
more hint that it is a bad monitoring tool...

Trust me, 30 minutes is a great achievement for a technology that was
never meant to do what we use it for. I might have several arguments
against the BDII code, but not about its performance.

The problem we're facing with ARC while trying to move to other
technologies is that LDAP query times are faster than those of the other
technologies we investigated (e.g. REST web services).
Update times are horrible, but it seems people care more about fast
queries than about fresh information...
And I can say this because in ARC we also put jobs in the LDAP database,
which is EXTREME for today's numbers (i.e. O(10000) jobs).
It's nice(?) that these numbers match those that Stephen mentioned.

Cheers,
Florido

On 2015-04-21 20:07, Paul Millar wrote:
> Hi Stephen,
> 
> First, I must apologise if you felt my emails were in any way abusive
> --- they were certainly not intended that way; rather, I would like the
> effort we have all invested in GLUE and the grid infrastructure be used
> properly.
> 
> Currently, I see different groups developing their own information
> systems, running in parallel with GLUE+BDII, because of problems (both
> perceived and actual) with BDII.  I would like these problems addressed
> and find the very slow progress frustrating.
> 
> Onto the specific points...
> 
> On 21/04/15 13:00, stephen.burke at stfc.ac.uk wrote:
>> Paul Millar [mailto:paul.millar at desy.de] said:
>>> From your replies, you appear to have an internal definition of a
>>> CreationTime that is to yourself clear, self-obvious and almost
>>> axiomatic. Unfortunately, you cannot seem to express that idea in
>>> the terms defined within GLUE-2.
>>
>> OK, let's have one more try. The concept which you seem to think is
>> missing is "entity instance". That may not be explicitly defined but
>> it's a general computing concept, and I find it hard to see that you
>> could make much sense of the schema without it. The schema defines
>> entities as collections of attributes with types and definitions; an
>> instance of that entity has specific values for the attributes. One
>> of those attributes is CreationTime.  Instances are created in a way
>> completely unspecified by the schema document, but whatever the
>> method the CreationTime is the time at which that creation occurs
>> (necessarily approximate since creation will take a finite time). If
>> a new instance is created it gets a new CreationTime even if all the
>> other attributes happen to be the same. However, if an instance is
>> copied the copy preserves *all* the attribute values including
>> CreationTime - if you change that it's a new instance and not a
>> copy.
> 
> Thanks, that makes sense.
> 
> Just to confirm: you define two general mechanisms through which data is
> acquired: creating an entity instance and copying an entity instance.
> 
> In concrete terms, resource-level BDII+info-provider creates entity
> instances while site- and top- level BDIIs copy entity instances.  This
> breaks the symmetry, allowing CreationTime to operate only on
> resource-level BDIIs.
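> 
> In LDAP terms (illustrative values; I believe the rendered attribute is
> GLUE2EntityCreationTime), a copy keeps the value it was created with:
> 
>   # at the resource-level BDII, where the instance is created
>   GLUE2EntityCreationTime: 2015-04-21T18:00:00Z
> 
>   # at the site- and top-level BDIIs, which merely copy the instance
>   GLUE2EntityCreationTime: 2015-04-21T18:00:00Z
> 
> whereas regenerating the instance would stamp a new time.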
> 
> Perhaps such a description is trivial or "well known", but it seems to
> me that GLUE-2 when used in a hierarchy (like the WLCG info system)
> would benefit from such a description.  This could go in GLUE-2 itself,
> or perhaps in a hierarchy profile document.
> 
>>> The validator should expose bugs, not hide them.  How else are
>>> sites going to fix these bugs?
>>
>> The point is that sites can't fix middleware bugs [..]
> 
> What you say is correct.  I would also say that only sites can deploy
> the bug-fixes.
> 
>> and hence
>> shouldn't get tickets for them. If tickets were raised for errors
>> which would always occur and can't be fixed until a new middleware
>> release is available the validator would have been rejected - sites
>> must be able to clear alarms in a reasonably short time. That's also
>> why only ERRORs generate alarms - ERRORs are always wrong, WARNINGs
>> may be correct so a site may be unable to remove them. Of course, the
>> validator can still be run outside the Nagios framework without the
>> known issues mask.
> 
> Yes, it's always a bit fiddly dealing with a new test where the
> production instance currently fails.
> 
>>> It would be good if we could check this: I think there's a bug in
>>> BDII where stale data is not being flushed.
>>
>> Maria has been on maternity leave for several months, so all this has
>> been on hold. I think she should be back fairly soon, but no doubt it
>> will take a while to catch up. A couple of years ago there was a bug
>> where old data wasn't being deleted, but it should be out of the
>> system by now. Also bear in mind that top BDIIs can cache data for up
>> to four days.
> 
> Sure, I knew Maria was away; but I was hoping there would be someone
> covering for her, and that the process wasn't based on her heroic
> efforts alone.
> 
>>> If the validator is hiding bugs, and the policy is to do so
>>> whenever bugs are found, then it is useless.
>>
>> The policy is to submit a ticket to the middleware developers and
>> keep track of it. There's no point in repeatedly finding the same
>> bug.
> 
> Yes, that is certainly a sound policy.
> 
>>> AFAIK, there's no intrinsic reason why there should be anything
>>> beyond a 2--3 minute delay: the time taken to fetch the updated
>>> information from a site-level BDII.
>>
>> The top BDII has to fetch information from several hundred site BDIIs
>> and the total data volume is large. It takes several minutes to do
>> that. And site BDIIs themselves have to collect information from the
>> resource BDIIs at the site. Back in 2012 Laurence did some tests to
>> see if the top BDII could scale to read from the resource BDIIs
>> directly, but the answer was no, it can cope with O(1000) sources but
>> not O(10000). Also the resource BDII runs on the service and loads it
>> to some extent so it can't update too often - a particular issue for
>> the CE, which is the service with the fastest-changing data.
> 
> I'm not sure I agree here.
> 
> First, the site-level BDII should cache information from resource-level
> BDIIs, as resource-level BDIIs cache information from info-providers.
> This means that load from top-level BDIIs is only experienced by
> site-level BDIIs.
> 
> Taking a complete (top-level) dump only takes a few seconds.
> 
> paul@celebrimbor:~$ /usr/bin/time -f %e ldapsearch -LLL -x \
>     -H ldap://lcg-bdii.cern.ch:2170 -b o=glue > /dev/null
> 4.49
> 
> paul@celebrimbor:~$ /usr/bin/time -f %e ldapsearch -LLL -x \
>     -H ldap://lcg-bdii.cern.ch:2170 -b o=grid > /dev/null
> 5.15
> 
> 
> Let's say it takes about 10--15 seconds in total.
> 
> A top-level BDII updates via this same process (invoking the ldapsearch
> command).  Assuming the process is bandwidth limited, this should also
> take ~10--15 seconds as the total amount of information sent over the
> network should be about the same.  (Note that this doesn't take into
> account TCP slow-start, so it may be a slight underestimate, but see
> below for why I don't believe this is a real problem.)
> 
> Let's assume instead that the problem isn't bandwidth limited, and that
> the update frequency is limited by the latency of the individual requests
> to site-level BDIIs.
> 
> I surveyed the currently registered site-level BDIIs:
> 
> for url in $(
>     ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
>         $(ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
>               '(GLUE2ServiceType=bdii_site)' GLUE2ServiceID |
>           perl -p00e 's/\n //' |
>           awk 'BEGIN{printf "(|"}
>                /^GLUE2ServiceID/{printf "(GLUE2EndpointServiceForeignKey="$2")"}
>                END{print ")"}') \
>         GLUE2EndpointURL |
>     perl -p00e 's/\n //g' |
>     sed -n 's%^GLUE2EndpointURL: \(ldap://[^:]*:[0-9]*/\).*%\1%p'
>   ); do
>     /usr/bin/time -a -o times.dat -f %e \
>         ldapsearch -LLL -x -H $url -o nettimeout=30 -b o=glue > /dev/null
> done
> 
> This query covered some 318 sites.  The ldapsearch command failed for 5
> endpoints and the query timed out for 3 endpoints.
> 
> Of the remaining 310 sites, the maximum time for ldapsearch to complete
> was about 19.21 seconds and the (median) average was 0.44 seconds.  For
> 82% of sites, ldapsearch completed within a second; for 92% it completed
> within two seconds.
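> 
> For the record, the summary numbers above can be recomputed from times.dat
> with something like the following (a sketch; assumes one elapsed time per
> line):
> 
>   # maximum and (approximate) median completion time
>   sort -n times.dat | awk '{t[NR]=$1}
>     END{print "max:", t[NR]; print "median:", t[int((NR+1)/2)]}'
> 
>   # fraction of sites answering within one / two seconds
>   awk '$1<=1{a++} $1<=2{b++}
>     END{printf "<=1s: %.0f%%  <=2s: %.0f%%\n", 100*a/NR, 100*b/NR}' times.dat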
> 
> Repeating this for GLUE-1.3 showed similar statistics.
> 
> This suggests to me that information from responsive sites could be
> maintained with a lag of order 10 seconds to a minute (depending);
> information from sites with badly performing site-level BDIIs would be
> updated less often.
> 
> I haven't investigated injecting this information: BDII now generates an
> LDIF diff which is injected into the slapd.  This is distinct from the
> original approach, which employed a "double-buffer" with two slapd
> instances.
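> 
> To be concrete about what such a diff looks like (an illustrative record
> only; the DN and value are invented for the example):
> 
>   dn: GLUE2ServiceID=svc1,GLUE2GroupID=grid,o=glue
>   changetype: modify
>   replace: GLUE2EntityCreationTime
>   GLUE2EntityCreationTime: 2015-04-21T18:05:00Z
>   -
> 
> which a running slapd can absorb via ldapmodify without being restarted.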
> 
> Still, I currently don't see why a top-level BDII must lag by some 30
> minutes.
> 
>>> Yeah, typical grid middleware response: rewrite the software rather
>>> than fix a bug.
>>
>> I could say that your response is typical: criticism without
>> understanding.
> 
> Perhaps, but I have reviewed the BDII code-base in the past and I know
> roughly how it works.
> 
> My simple investigation suggests maintaining a top-level BDII with
> sub-minute latencies is possible with at least 80--90% of site-level BDIIs.
> 
> Of course I may be missing something here, but it certainly seems
> feasible to achieve much better than is currently being done.
> 
> Cheers,
> 
> Paul.


-- 
==================================================
 Florido Paganelli
   ARC Middleware Developer - NorduGrid Collaboration
   System Administrator
 Lund University
 Department of Physics
 Division of Particle Physics
 BOX118
 221 00 Lund
 Office Location: Fysikum, Hus B, Rum B313
 Office Tel: 046-2220272
 Email: florido.paganelli at REMOVE_THIShep.lu.se
 Homepage: http://www.hep.lu.se/staff/paganelli
==================================================

