[glue-wg] Some thoughts on storage objects

Jensen, J (Jens) j.jensen at rl.ac.uk
Thu Apr 3 05:30:37 CDT 2008


Paul Millar wrote:

 > Hi all,

Hi Paul, *,

[..]
 > I don't know how useful this is.  It's just my point-of-view of
 > things as stand now.  I'm sure there are bits that are "wrong"
 > (either I've misunderstood and/or this description breaks a
 > use-case), but if so, helpfully people can point which bits are
 > wrong and (perhaps) it will stimulate some discussion.

IMO it is useful.  We know from past experiences that different
communities interpret different things, er, differently.

Sometimes that is useful - it makes the schema reusable and adaptable -
but for a service provider, publishing the same attributes with
different semantics is a nightmare.

I think it's an excellent attempt and I found it useful (it just
took me a while to find the time to parse it :-)

[..]
 >
 > UserDomain:

 > A collection of one or more end-users; a VO is an instance of a
 > UserDomain.  All end-users that interact with the physical
 > storage are a member of a UserDomain and, in general, derive
 > their authorisation from that membership.

If we do it like this then we should use the hierarchical feature in
UserDomain:

UserDomain : WLCG	UserDomain : NGS	UserDomain : Diamond
     |			    |            	    |
UserDomain : LHCb	UserDomain : biomed	UserDomain : Pr234
     |			    |            	    |
UserDomain : prod	UserDomain : NHS	UserDomain : Beamline


In general, I think the high level entity is more useful than a low
level entity such as the VO.

Within WLCG for example, most SE info is available to all VOs in the
sense that they (SEs) publish information that the VOs know how to
make sense of.

Outside WLCG (yes, there are people not in WLCG), the same schema
could be used but e.g. with different attributes published (e.g. some
req'd by WLCG could be left blank for NGS).  The UserDomain could be
consulted by the information consumer to check whether they should
even probe further.
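To make this concrete, here is a tiny sketch (purely illustrative -
the class, attribute and domain names are mine, not proposed schema)
of how an information consumer could walk a hierarchical UserDomain
up to its root before deciding whether to probe further:

```python
# Hypothetical sketch of hierarchical UserDomains; not normative GLUE.

class UserDomain:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # None for a top-level domain

    def top_level(self):
        """Walk up the hierarchy to the root domain (e.g. WLCG, NGS)."""
        d = self
        while d.parent is not None:
            d = d.parent
        return d

wlcg = UserDomain("WLCG")
lhcb = UserDomain("LHCb", parent=wlcg)
prod = UserDomain("prod", parent=lhcb)

# A consumer checks the root domain first, then decides whether the
# published attributes are even meaningful to it:
assert prod.top_level().name == "WLCG"
```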


 >
 > StorageCapacity:
 >
 > A StorageCapacity object describes the ability to store data
 > within a homogeneous storage technology.  This storage
 > technology provides a common access latency.

We used to call this a StorageComponent.
http://storage.esc.rl.ac.uk/GLUE-SE-1.3-input-1.03.pdf

It was debated when we discussed 1.3 whether it was even useful to
expose this level of detail to users via the information system, but I
would suggest that in the interim we do have some users who care.

 >
 > All StorageCapacity objects are specified within a certain
 > context.  The context is determined by an association between
 > the StorageCapacity object and precisely one other higher-level
 > object.  These associations are not listed here, but are
 > described in later sections.

What if I have an

 >
 > In general, a StorageCapacity object will record some
 > context-specific information.  Examples of such information
 > include the total storage capacity of the underlying technology
 > and how much of that total has been used.
 >
 > The underlying storage technology may affect which of the
 > context-specific attributes are available.  For example, tape
 > storage may be considered semi-infinite, so the total and free
 > attributes have no meaning.  If this is so, then it affects all
 > StorageCapacity objects with the same underlying technology,
 > independent of their context.

That is too prescriptive (OK, it was just an example).  For another
example, some customers pay directly for the media used, and they
will want to know how much space is available on the tapes.  Better
just to leave the space thingies as optional attributes.

In any case, we know the concepts of "free" and "used" (etc) are so
difficult to pin down that each UserDomain may well have its own
interpretation.  Which is why the UserDomain attr could be handy.

 >
 > Different contexts may also affect what context-specific
 > attributes are recorded.  This is a policy decision when
 > implementing GLUE, as recording all possible information may be
 > costly and provide no great benefit.

Mm hm.  Which I think is why the User Domain is a good thing if we can
use it for that.

 > [Aside: these two reasons are why many of the attributes within
 > StorageCapacity are optional.  Rather than explicitly
 > subclassing the objects and making the values required, it is
 > left deliberately vague which attributes are published.]

Yep, domains will define what they want and what it means.

 > A StorageCapacity may represent a logical aggregation of
 > multiple underlying storage technology instances; for example, a
 > StorageCapacity might represent many disk storage nodes, or many
 > tapes stored within a tape silo.  GLUE makes no effort to record
 > information at this deeper level; but by not doing so, it
 > requires that the underlying storage technology be
 > homogeneous. Homogeneous means that the underlying storage
 > technology is either identical or sufficiently similar that the
 > differences don't matter.

All that is really required is homogeneity in attributes.  For
example, a community that does not care about AccessLatency being
published (if one such exists) would see disk and tape as
homogeneous.

 > In most cases, the homogeneity is fairly obvious (e.g., tape
 > storage vs disk-based storage), but there may be times where
 > this distinction becomes contentious and judgement may be
 > required; for example, the quality of disk-based storage might
 > indicate that one subset is useful for a higher-quality service.
 > If this is so, then it may make sense to represent the different
 > classes of disk by different StorageCapacities.

Yes, but the GLUE schema should not prescribe how to do this - provide
attributes for communities to publish the most common capabilities but
leave it to the communities to define what they mean.

GLUE could include examples but they must not be normative.

 > StorageEnvironment:
 >
 > A StorageEnvironment is a collection of one or more
 > StorageCapacities with a set of associated (enforced) storage
 > management policies.  Examples of these policies are Type
 > (Volatile, Durable, Permanent) and RetentionPolicy (Custodial,
 > Output, Replica).

I completely agree with Maarten here: we need to get away from the old
overloaded names (even as examples! unless we state in big fat
letters that their use is deprecated).

ExpirationMode : releaseWhenExpired, warnWhenExpired, neverExpire.

RetentionPolicy : Replica, Output, Custodial.

Note that these come from the SRM world (it is a bug in SRM2.2 that
the old volatile etc names were retained).  These names may again be
meaningless to other communities, who may still wish to publish values
for these.

It may be better to provide attributes for ExpirationMode and
RetentionPolicy (etc) and explain what they're for, but not to
prescribe the values.
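A toy illustration of that last point - GLUE defines the attribute
names, each community defines (and documents) its own value
vocabulary.  Every value below is a hypothetical example, not a
proposed normative list:

```python
# Same schema (attribute names), different vocabularies (values).

srm_environment = {
    "ExpirationMode": "releaseWhenExpired",  # SRM 2.2 vocabulary
    "RetentionPolicy": "custodial",
}

other_environment = {
    "ExpirationMode": "never",               # some other community's terms
    "RetentionPolicy": "bronze",
}

# Both publish the same attributes, so generic consumers can parse
# them; only the community-specific values differ:
assert set(srm_environment) == set(other_environment)
```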

 > StorageEnvironments act as a logical aggregation of
 > StorageCapacities, so each StorageEnvironment must have at least
 > one associated StorageCapacity.  It is the associated
 > StorageCapacities that allow a StorageEnvironment to store data
 > with its advertised policies; for example, to act as (Permanent,
 > Custodial) storage of data.

OK

 > Since a StorageEnvironment may contain multiple
 > StorageCapacities, it may describe a heterogeneous environment.
 > An example of this is "tape storage", which has both tape
 > back-end and disk front-end into which users can pin files.
 > Such a StorageEnvironment would have two associated
 > StorageCapacities: one describing the disk storage and another
 > describing the tape.

We've had this case before.  In that case, the StorageEnvironment
should publish the minimal capabilities which it can support (or leave
them blank, leaving it to the client to go and query the
StorageCapacities).

For example, if a StorageEnvironment contains both tape and disk, its
AccessLatency should be Nearline - as it is the lowest common
denominator.

This only makes sense if you have a partial order on capabilities.
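For AccessLatency the order happens to be total, so the lowest common
denominator is easy to compute.  A minimal sketch (the ordering below
is my assumption of Online < Nearline < Offline; only illustrative):

```python
# Aggregate AccessLatency of a StorageEnvironment as the weakest
# guarantee offered by any of its StorageCapacities.

LATENCY_ORDER = {"online": 0, "nearline": 1, "offline": 2}

def environment_latency(capacity_latencies):
    """Return the worst (highest) latency among the Capacities."""
    return max(capacity_latencies, key=lambda lat: LATENCY_ORDER[lat])

# Disk (online) plus tape (nearline) yields Nearline, as in the
# example above:
assert environment_latency(["online", "nearline"]) == "nearline"
```

For capabilities that are only partially ordered, a unique lowest
common denominator may not exist, which is why leaving the attribute
blank must remain an option.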


 > If a StorageCapacity is associated with a StorageEnvironment, it
 > is associated with only one.  A StorageCapacity may not be
 > shared between different StorageEnvironments.

OK

 > StorageCapacities associated with a StorageEnvironment must be
 > non-overlapping with any other such StorageCapacity and the set
 > of all such StorageCapacities must represent the complete
 > storage available to end-users.  Each physical storage device
 > (e.g., individual disk drive or tape) that an end-user can
 > utilise must be represented by (some part of) precisely one
 > StorageCapacity associated with a StorageEnvironment.
 >

OK, except we shouldn't say "it must represent the complete storage
available to end users" - it's up to the information publisher.  We
may have "secret" storage available, via endpoints communicated by
other means.

 > Nevertheless, the StorageCapacities associated with
 > StorageEnvironments may be incomplete as a site may deploy
 > physical storage devices that are not directly under end-user
 > control; for example, disk storage used to cache incoming
 > transfers.  GLUE makes no effort to record information about
 > such storage.

Of course.

 >
 >
 > StorageResource:

 > A StorageResource is an aggregation of one or more
 > StorageEnvironments and describes the hardware that a particular
 > software instance has under its control.

Ummm.  I would avoid using the word "hardware" here, or at least I
find it confusing.  For example, at RAL, CASTOR does not have
exclusive use of the tapestore.

 > A StorageResource must have at least one StorageEnvironment,
 > otherwise there wouldn't be much point publishing information
 > about it. [This isn't a strict requirement, but I think it makes
 > sense to include it.]

I would be less concerned about publishing an empty StorageResource -
if a top level BDII publishes the resource and gathers the
Environments from lower level BDIIs, it may at some point find itself
publishing an empty StorageResource, e.g. during maintenance.

 > All StorageEnvironments must be part of precisely one
 > StorageResource.  StorageEnvironments may not be shared between
 > StorageResources.  This means that all physical hardware must be
 > published under precisely one StorageResource.
 >

OK.

 >
 > StorageShare:
 >
 > A StorageShare is a logical partitioning of one or more
 > StorageEnvironments.

OK.

 > Perhaps the simplest example of a StorageShare is one associated
 > with a single StorageEnvironment with a single associated
 > StorageCapacity, and that represents all the available storage
 > of that StorageCapacity.  An example of a storage that could be
 > represented by this trivial StorageShare is the classic-SE.
 >
 > StorageSpaces must have one or more associated
 > StorageCapacities.  These StorageCapacities provide a complete
 > description of the different homogeneous underlying technologies
 > that are available under the space.

OK.

 > In general, the number of StorageCapacities associated with a
 > StorageShare is the sum of the number of StorageCapacities
 > associated with each of the StorageShare's associated
 > StorageEnvironments.

This para is somewhat contradicted by some of the paras below, so I
have tried to summarise the conclusion:

Observation [**]

1. StorageEnvironments fully partition the StorageCapacities

    [that is: each StorageCapacity belongs to one and only one
     StorageEnvironment]

2. Each StorageShare contains one or more StorageCapacities

3. A StorageShare is associated with a StorageEnvironment if and only
    if they contain a common StorageCapacity.
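Observation [**] can be checked mechanically.  A small sketch with
made-up identifiers (Capacities as strings, Environments and Shares
as sets of them - nothing here is normative):

```python
# Hypothetical layout: one tape Environment, one disk Environment,
# two (overlapping) Shares.

environments = {"env-tape": {"cap-tape"},
                "env-disk": {"cap-d1", "cap-d2"}}
shares = {"share-a": {"cap-tape", "cap-d1"},
          "share-b": {"cap-d1"}}

# 1. Environments fully partition the Capacities: pairwise disjoint.
caps = [c for cs in environments.values() for c in cs]
assert len(caps) == len(set(caps))

# 2. Each Share contains one or more Capacities.
assert all(shares[s] for s in shares)

# 3. A Share and an Environment are associated iff they contain a
#    common Capacity.
def associated(share, env):
    return bool(shares[share] & environments[env])

assert associated("share-a", "env-tape")
assert not associated("share-b", "env-tape")
```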


 > Following from this, there is an implicit association between
 > the StorageCapacity associated with a StorageShare and the
 > corresponding StorageCapacity associated with a
 > StorageEnvironment.  Intuitively, this association is from the
 > fact that the two StorageCapacities share the same underlying
 > physical storage.  This implicit association is not recorded in
 > GLUE.

Well except the StorageShare has a unique id.  Also I understood there
is a 1..* - 1..* association between the Share and the Environment.

s/StorageSpace/StorageShare/, cf Maarten's email.
(6 occurrences)

 > StorageShares may overlap.  Specifically, given a
 > StorageCapacity (SC_E) that is associated with some
 > StorageEnvironment and which has totalSize TS_E, let the sum of
 > the totalSize attributes for all StorageCapacities that are:
 > 1. associated with a StorageShare, and 2. that are implicitly
 > associated with SC_E be TS_S.  If the StorageShares are covering
 > then TS_S = TS_E.  If the StorageShares overlap, then TS_S >
 > TS_E.  [sorry, I couldn't easily describe this with just words
 > without it sounding awful!]

See [**] above

Note 2 is still a cover - even if the StorageShares overlap, they still
_cover_ the set of all StorageCapacities.  If $X$ is a space
(e.g. a topological space) then $U := \{A_i \subseteq X \mid i \in I\}$
is said to be a cover of $X$ if $\bigcup_{i \in I} A_i = X$.


 > StorageShares may be incomplete.  Following the same definitions
 > as above, this is when TS_S < TS_E.  Intuitively, this happens
 > if the site-admin has not yet assigned all available storage.

See [**] above.

 > End-users within a UserDomain may wish to store or retrieve
 > files.  The StorageShares provides a complete, abstract
 > description of the underlying storage at their disposal.  No
 > member of a UserDomain may interact with the physical hardware
 > except through a StorageShare.

In this case it makes sense to use hierarchical UserDomains.

 > The partitioning is persistent through file creation and
 > deletion.

Which partitioning?  The StorageShares do not partition anything in
general.

 > The totalSize attributes (of a StorageShare's associated
 > StorageCapacties) do not change as a result of file creation or
 > deletion.  [Does GLUE need to stipulate this, or should we leave
 > this vague?]

This is actually not necessarily true: if I start adding files to the
share, it may expand because the storage system chooses to add more
Capacities to it.

 > A single StorageShare may allow multiple UserDomains to access
 > storage; if so, the StorageShare is "shared" between the
 > different UserDomains.  Such a shared StorageShare is typical if
 > a site provides storage described by the trivial StorageShare
 > (one that covers a complete StorageEnvironment) whilst
 > supporting multiple UserDomains.
 >

This is getting complicated if the UserDomains themselves are
hierarchical!

 >
 > StorageMappingPolicy:
 >
 > The StorageMappingPolicy describes how a particular UserDomain
 > is allowed to access a particular StorageShare.  No member of a
 > UserDomain may interact with a StorageShare except as described
 > by a StorageMappingPolicy.

This is also too prescriptive.  Surely it is not up to GLUE to mandate
rules for how storage systems are used.

For example, I may publish a general read-only access rule for the VO,
but a subset of the VO may have write access.  I should not have to
publish that explicitly.

 > The StorageMappingPolicies may contain information that is
 > specific to that UserDomain, such as one or more associated
 > StorageCapacities.  If provided, these provide a
 > UserDomain-specific view of their usage of the underlying
 > physical storage technology as a result of their usage within
 > the StorageShare.

I agree, this is consistent with how I think UserDomains should be
used.

 > If StorageCapacities are associated with a StorageMappingPolicy,
 > there will be the same number as are associated with the
 > corresponding StorageShare.
 >

This probably needs more careful checking.  For example, a
StorageCapacity can be contained in more than one StorageShare, and
those StorageShares can themselves be associated with different
UserDomains.

 >
 > StorageEndpoint:
 >
 > A StorageEndpoint specifies that storage may be controlled
 > through a particular interface.  The SRM protocol is an example
 > of such an interface and a StorageEndpoint would be advertised
 > for each instance of SRM.

Yep.

 > The access policies describing which users of a UserDomain may
 > use the StorageEndpoint are not published.  On observing that a
 > site publishes a StorageEndpoint, one may deduce only that it is
 > valid for at least one user of one supported UserDomain.
 >

That should be OK - as long as the endpoints themselves are
interpreted the same way by all users - which seems reasonable.

 >
 > StorageAccessProtocol:
 >
 > A StorageAccessProtocol describes how data may be sent or
 > received.  The presence of a StorageAccessProtocol indicates
 > that data may be fetched or stored using this interface.

Yep.

 > Access to the interface may be localised; that is, only
 > available from certain computers.  It may also be restricted to
 > specified UserDomains.  However, neither policy restriction is
 > published in GLUE.  On observing a StorageAccessProtocol, one may
 > deduce only that it is valid for at least one user of one
 > supported UserDomain.

Where did the network description go?  We used to have one.

The idea is that certain protocols can be used only locally, or on
certain networks.

For example, a single StorageElement can have a range of GridFTP data
movers on the WAN, a LAN protocol internally, and an OPN link which
accepts UDP-based high speed data transfer like the astronomers use.

If you are a local job you can ask it "do you support gridftp" and it
would say yes, but you cannot necessarily access the GridFTP data
movers from the worker nodes - and it would be less efficient than the
LAN protocol.

I think we need to put it back, and StorageAccessProtocol seems to me
the more obvious location.
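Something like the following sketch is what I have in mind.  The
"scope" attribute and its values (WAN/LAN/OPN) are hypothetical - the
point is only that a client should be able to filter protocols by
where it sits on the network:

```python
# Illustrative network scoping of StorageAccessProtocols; not a
# proposed schema, just the shape of the idea.

class StorageAccessProtocol:
    def __init__(self, protocol, scope):
        self.protocol = protocol
        self.scope = scope  # e.g. "WAN", "LAN", "OPN" (hypothetical)

protocols = [
    StorageAccessProtocol("gsiftp", "WAN"),  # GridFTP data movers
    StorageAccessProtocol("rfio", "LAN"),    # internal LAN protocol
    StorageAccessProtocol("udt", "OPN"),     # high-speed UDP-based link
]

def usable_from(network):
    """Protocols a client on the given network should prefer."""
    return [p.protocol for p in protocols if p.scope == network]

# A local job on the worker nodes should pick the LAN protocol rather
# than going through the WAN GridFTP movers:
assert usable_from("LAN") == ["rfio"]
```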

 > StorageService:
 > A StorageService is an aggregation of StorageEndpoints,
 > StorageAccessProtocols and StorageResources.  It is the
 > top-level description of the ability to transfer files to and
 > from a site, and manipulate the files once stored.

OK.

Cheers
--jens
