[UR-WG] Draft/suggestion for Storage Accounting Record (2010/10/07)

Fri Oct 22 06:21:20 CDT 2010

Hi Henrik,

Below are some personal comments that we can discuss, agree or disagree 
on, etc.

> 1. Discrete vs continuous
>
> First we discussed how storage differs from jobs in the sense that a job is a
> discrete unit whereas storage has a more continuous nature, where the usage more
> or less constantly varies (bytes used), but typically in relatively small
> scale.
>
> A storage record can however only describe something discrete, so the
> continuous nature of storage will have to be split into something discrete
> (similar to integration). Of course the granularity of this process should not
> be dictated by the standard. The first suggestion is then to have a start and
> an end time for the storage measurement where a measurement is taken. This
> allows creating per-day or per-hour resolution, or something else, depending on
> the needs.
>
> Result:
> "StartTime" and "EndTime" element for describing the time interval for when the
> record is valid. Both of them are DateTime values.
>    

In the discussion we had in Munich at OGF we talked about something 
like that and I think it is the right thing to do. Even if we account 
for a single file, it should be possible to create URs that specify a 
start and end date, so that you can calculate the integral you 
mentioned. This holds even if the file is still present. The site could, 
for example, create those records regularly.
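
To make the integration idea concrete, here is a minimal sketch in Python. 
It assumes that each record carries StartTime, EndTime and UsedSpace and 
that the used space is treated as constant within one record's interval; 
the names Sample and byte_seconds are only illustrative and not part of 
any proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    start_time: datetime  # StartTime of the record
    end_time: datetime    # EndTime of the record
    used_space: int       # UsedSpace in bytes, assumed constant over the interval


def byte_seconds(samples):
    """Approximate the integral of used space over time as a sum of bytes * seconds."""
    total = 0
    for s in samples:
        seconds = int((s.end_time - s.start_time).total_seconds())
        total += s.used_space * seconds
    return total


# Two hourly records for the same storage area (made-up numbers).
t0 = datetime(2010, 10, 22, 0, 0, 0)
hourly = [
    Sample(t0, t0 + timedelta(hours=1), 1000),
    Sample(t0 + timedelta(hours=1), t0 + timedelta(hours=2), 3000),
]
total = byte_seconds(hourly)
```

The same function works for per-day or any other granularity, since only 
the interval length of each record enters the sum.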

> 2. What to measure
>
> The most obvious metric here is the amount of used space. Another metric is the
> amount of reserved space, i.e., space not taken by actual files/objects, but
> which cannot be used by other parties. A third metric is the amount of space
> which is not allocated, and can be used. Initially we called this "free space",
> but this term is not exactly accurate as it refers more to a file system
> metric, which can be very different from what can actually be used due to
> quotas. Instead we chose to have an element describing how much unallocated
> space there is, i.e., how much space one can expect to be able to use. The
> actual free space is a metric, which can be very tricky to measure, and is
> typically only of interest to the storage system administrator.
>    

Here I'm not so sure that we really need to define attributes for a 
record that describes free space. In fact, what we want to know is only 
the used space over time. System administrators are then free to analyse 
the URs and compare them with the available space.
I think the UR should focus on the used space, not on the available 
space.

> A secondary issue is how to report the numbers. We quickly decided on using
> bytes, as it is the fundamental unit, and saves us from deciding on whether to
> use 1000 or 1024 bytes as a base. Having multiple options (say KB, MB) for
> reporting would complicate the standard unnecessarily without any real benefits,
> so we suggest to keep it to bytes, and bytes only. This would also ensure that
> the number is always an integer (when using KB and up, floats could be used to
> be exact) and saves silly conversion routines when parsing the record.
>    

I agree.

> We briefly considered if reporting should be per file, but this was quite
> quickly shot down, as it would make the records unreasonably large, without
> providing any real value. We did end with an element describing the number of
> files, which are using the space reported.
>
> Result:
> "UsedSpace", "ReservedSpace", "UnAllocatedSpace" metrics for describing how
> space is used, reserved, and available for use. Reserved and unallocated are
> probably not overlapping (as reserved space is technically used). The
> measurement is in bytes.
> "FileCount" for describing the number of files using the space.
>    

I wouldn't discard per-file accounting since, I think, it is the only 
way to account precisely for the used space (we have also had requests 
on this point from people interested in knowing how often a specific 
file is accessed). I agree that it can make the number of records huge, 
but this problem might be solved by aggregating records, at the cost of 
losing the per-file detail. This need not be mandatory, but I think it 
is important to give the option of accounting per file as well.
I agree that the reserved space should be accounted for, because even 
if it isn't really used it might be "locked".
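
As a rough illustration of the aggregation I mean, here is a sketch in 
Python. The per-file records are plain dictionaries; the "FileName" key 
and the function name are my own assumptions for the example, while 
Site, StartTime, EndTime, UsedSpace and FileCount follow the element 
names in the draft.

```python
from collections import defaultdict


def aggregate_file_records(file_records):
    """Collapse per-file records into one summary per (Site, StartTime, EndTime)."""
    totals = defaultdict(lambda: {"UsedSpace": 0, "FileCount": 0})
    for rec in file_records:
        key = (rec["Site"], rec["StartTime"], rec["EndTime"])
        totals[key]["UsedSpace"] += rec["UsedSpace"]  # bytes
        totals[key]["FileCount"] += 1                 # the file detail is dropped here
    return [
        {"Site": site, "StartTime": start, "EndTime": end, **sums}
        for (site, start, end), sums in totals.items()
    ]


# Hypothetical per-file records for one day at one site.
per_file = [
    {"Site": "example.org", "FileName": "/data/a",
     "StartTime": "2010-10-21T00:00:00Z", "EndTime": "2010-10-22T00:00:00Z",
     "UsedSpace": 1024},
    {"Site": "example.org", "FileName": "/data/b",
     "StartTime": "2010-10-21T00:00:00Z", "EndTime": "2010-10-22T00:00:00Z",
     "UsedSpace": 2048},
]
summary = aggregate_file_records(per_file)
```

A site that does not need the per-file detail would publish only the 
summary records; a site that does need it keeps both.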

> 3. Site and Storage Concepts
>
> The issue of how to describe the site and what storage the accounted data is
> stored is perhaps the most complex issue in defining this format. The
> discussion to achieve this was very non-linear, so I will just describe the
> result:
>
> A site is considered a top level container for storage. The site name should be
> globally unique (which probably means an FQDN).
> A storage system is an independent system on a site.
> A storage system partition is a part of storage system (similar to dCache pools)
> A storage type describes the storage type where the data is stored (disk, tape)
>
> None of these elements are mandatory (though we couldn't find a valid use case
> for a record without a site). E.g., it is perfectly valid to just use site, site
> and storage type, or site and storage system. If multiple of the elements
> exist, they are considered to have hierarchical tree-like structure, i.e.,
> site ->  storage system ->  storage system partition ->  storage type. How to
> structure this is described in the "Record Structure" section.
>
> This allows a site with a simple setup to report just for the site, whereas
> more complex installations can report per storage system, and how much is
> placed on tape and disk respectively for different storage systems.
>
> A single institution can easily report multiple sites, we do not interfere in
> this, but it is very likely that a site would have several storage element /
> systems and would like to be able to aggregate it under a single site.
>
> Finally we also add the possibility for a storage class, which describes the
> class of storage accounted for. This could be "precious", "deletable", "pinned",
> etc. It is not considered a part of the hierarchy described above.
>
>
> 4. Identity
>
> To describe who is using the space, an identity block or group, similar to the
> one in usage record must be supported. As with usage record it must be possible
> to specify both a local user name and global user identity. However the common
> case is likely to be a VO or a VO group, so being able to describe this is
> extremely important. Another use case is a local group at the site who owns the
> data (not everyone uses grid). How exactly to describe the virtual organization
> is not quite clear, but the following elements are needed as a base: VO name,
> VO issuer (DN of the VO issuer, somewhat VOMS specific), VO group, and VO role.
> There might be use cases for being able to have multiple VO blocks (though I
> suspect that will be messy).
>
> Resulting elements:
> VO Name
> VO Issuer
> VO group
> VO role
> Global User Identity
> Local User Name
> Local User Group
>
> All elements are strings. The Global User Identity is to be considered globally
> unique. Furthermore the combination of VO issuer and VO name should be globally
> unique.
>    

I agree in general with these points. Maybe the names could be more 
general, without the reference to VO, which may be too Grid specific. 
To identify a user I would use the same scheme as in the current UR, to 
keep everything uniform.

>
> 5. Record Structure
>
> Having identified the identity block and the site/storage structure, we tried
> to find a good way to structure the individual records. We considered the
> possibility of having a tree structure in the storage accounting record, with
> the site, storage system, storage system partition, and storage type as
> possible levels in the tree. The leaves would then each have an identity and
> space usage block. This however would be quite complicated to construct and
> parse, so a simpler - more flat - structure was quickly preferred. Here each
> record would have at most one site, storage system, storage system partition,
> and storage type (but all still optional). If a system needs to report
> for multiple storage systems / storage system partitions / storage types,
> several records would have to be generated. This format should still be easy
> to aggregate together into site or storage system to get complete numbers. The
> flat format is also much easier to explain which probably means that it should
> be preferred. The two structures can describe exactly the same thing, so there is no
> limitation by choosing the flat format.
>    

I also agree with this.
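
To illustrate why nothing is lost with the flat format, here is a small 
Python sketch that turns a tree into flat records and then sums them 
back per site. The nested-dictionary layout of the tree and the names 
LEVELS and flatten are assumptions made only for this example; the 
level names are those of the draft.

```python
LEVELS = ("Site", "StorageSystem", "StorageSystemPartition", "StorageType")


def flatten(tree, path=()):
    """Yield one flat record per leaf of a nested {name: subtree-or-bytes} dict."""
    for name, sub in tree.items():
        here = path + (name,)
        if isinstance(sub, dict):
            # Descend one level in the hierarchy.
            yield from flatten(sub, here)
        else:
            # Leaf: the value is the used space in bytes.
            record = dict(zip(LEVELS, here))
            record["UsedSpace"] = sub
            yield record


# Hypothetical site with one detailed and one simple storage system.
tree = {
    "example.org": {
        "dcache": {"pool1": {"disk": 500, "tape": 1500}},
        "xrootd": 200,
    }
}
records = list(flatten(tree))
site_total = sum(r["UsedSpace"] for r in records if r["Site"] == "example.org")
```

The record for "xrootd" carries only Site and StorageSystem, which 
matches the idea that the lower levels are optional.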

>
> 6. Record Overview
>
> Given the previous section, we can now present an overview of the elements in a
> storage accounting record. We reuse the recordId and createTime elements from
> the usage record standard (just the element names, not the namespaces).
>
>
> StorageAccountingRecord
>
>    recordId (considered globally unique)
>    createTime (timestamp when the record was created)
>    StartTime (datetime value)
>    EndTime (datetime value)
>    Site
>    StorageSystem
>    StorageSystemPartition
>    StorageType
>    StorageClass
>    IdentityBlock
>     VO Block
>      VO Name
>      VO Issuer
>      VO Group
>      VO Role
>     GlobalUserIdentity
>     LocalUserName
>     LocalGroupName
>    UsedSpace
>    ReservedSpace
>    UnallocatedSpace
>    FileCount
>
> The only enforced element is the recordId.
>
>
> 7. Name Issues
>
> Someone brought up that SAR is not the most fortunate name. It sounds a bit
> like SARS, and the phonetic sound is apparently close to a not-too-fortunate
> Hungarian word.
>
> We could consider SR (storage record) or SUR (storage usage record)
> This really doesn't matter.
>
>
> 8. Sharing Elements with Usage Record Standard
>
> Some of the elements are identical in both name and semantics to the ones in
> usage record. We do not suggest to share the elements as such (same namespace),
> as it would make the standard rely on the UR standard, and hence make it less
> self-contained. The UR standard is only used in a few systems, and is likely to
> be replaced with a new standard sometime. Furthermore the implementation gains
> of sharing the names are very small, if they even exist.
>    

Even if it's decided to have two separate URs for resources used and 
storage, I think that, where possible, we should reuse the same naming 
conventions, for example "Global User Identity", "Local User", etc.
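
As a sketch of how the record overview in section 6 could look in code 
with the identity names kept as in the draft, here is one possible 
Python model. The class layout, the snake_case field names and the 
example values are my assumptions; only the set of elements and the 
rule that recordId is the sole mandatory one come from the proposal.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class IdentityBlock:
    vo_name: Optional[str] = None
    vo_issuer: Optional[str] = None
    vo_group: Optional[str] = None
    vo_role: Optional[str] = None
    global_user_identity: Optional[str] = None  # GlobalUserIdentity
    local_user_name: Optional[str] = None       # LocalUserName
    local_group_name: Optional[str] = None      # LocalGroupName


@dataclass
class StorageAccountingRecord:
    record_id: str                              # recordId, the only enforced element
    create_time: Optional[datetime] = None
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    site: Optional[str] = None
    storage_system: Optional[str] = None
    storage_system_partition: Optional[str] = None
    storage_type: Optional[str] = None
    storage_class: Optional[str] = None
    identity: IdentityBlock = field(default_factory=IdentityBlock)
    used_space: Optional[int] = None            # bytes
    reserved_space: Optional[int] = None        # bytes
    unallocated_space: Optional[int] = None     # bytes
    file_count: Optional[int] = None


# Hypothetical record reporting only the site, a user and the used space.
record = StorageAccountingRecord(
    record_id="example.org/sr/0001",
    site="example.org",
    used_space=2200,
    identity=IdentityBlock(global_user_identity="/O=Example/CN=Some User"),
)
```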

Andrea

>
> We have definitely missed something in this, but we hope this can be a start
> for the discussion in Brussels. If you see problems or issues with this record
> please let us know.
>
>
>       Best regards, Henrik Thostrup Jensen & Jon Kerr Nilsen
>
>

-- 
Andrea Cristofori
INFN-CNAF
Viale Berti Pichat 6/2
40127 Bologna
Italy
Tel. : +39-051-6092920
Skype: andrea-cnaf