[Pgi-wg] Execution Service Strawman - Data Staging
Etienne URBAH
urbah at lal.in2p3.fr
Fri Apr 10 18:34:48 CDT 2009
Moreno,
Concerning the work on DATA STAGING inside OGF PGI:
I have carefully read:
- The GROMACS use case at http://forge.gridforum.org/sf/go/doc15580?nav=1
  From my point of view, it contains 2 different use cases of Input
  Data Staging.
- The section about 'Data Staging' of the 'Execution Service Strawman'
  document at http://forge.gridforum.org/sf/go/doc15590?nav=1
  I am afraid that it contains overly restrictive implicit assumptions
  about:
  - the permanent location of a file,
  - the destination of Input Data Staging.
So I would like to clarify the following concepts:
Permanent location of 1 file
----------------------------
As far as I know, from the point of view of the grid Computing Resource,
a permanent file can be stored on 5 different types of location:

CLIENT: A location accessible by the Client (grid User or Workflow
        Engine), but NOT by Grid Services.
        This requires Data Staging managed by the Client.

WEB:    A location accessible by anybody holding adequate credentials.
        In this case, Data Staging SHOULD be managed by the Client.
        If not, the Client must transmit (beware of security issues)
        or delegate the credentials to a Grid Service.

TAPE:   A tape on a grid Storage Resource.
        This requires Data Staging (which can be managed by the Client
        or by a Grid Service).

DISK:   A disk on a grid Storage Resource.
        If the network bandwidth is high enough, and if the file is
        accessed only sequentially, this does NOT require Data Staging.
        But the Client may prefer to perform Data Staging anyway.

LIST:   A list of grid Storage Resources (storing replicas having
        identical content). The Client does NOT know in advance on
        which media (tape or disk) each Storage Resource stores its
        replica.

According to best practices, a disk on the Computing Resource should NOT
be a permanent location for a file.
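
To make the five location types easier to compare, here is a minimal Python sketch of the list above. The names Location and staging_required are hypothetical and do not come from any PGI document. Treating WEB as always requiring staging, and LIST as undecidable in advance, is an assumed reading of the list, not a statement from the Strawman.

```python
from enum import Enum
from typing import Optional


class Location(Enum):
    """The five types of permanent file location (names from the list above)."""
    CLIENT = "CLIENT"
    WEB = "WEB"
    TAPE = "TAPE"
    DISK = "DISK"
    LIST = "LIST"


def staging_required(loc: Location,
                     sequential_only: bool = True,
                     high_bandwidth: bool = True) -> Optional[bool]:
    """Return True/False when the list above decides it, None when unknown.

    Assumption: WEB is treated like CLIENT and TAPE (staging needed).
    """
    if loc in (Location.CLIENT, Location.WEB, Location.TAPE):
        return True
    if loc is Location.DISK:
        # No staging needed only for sequential access over a fast network.
        return not (sequential_only and high_bandwidth)
    # LIST: the media of each replica is not known in advance.
    return None
```

For example, staging_required(Location.DISK, sequential_only=False) says that a job needing random access to a file on a grid disk should still stage it.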
Destination of INPUT Data Staging
---------------------------------
As far as I know, Input Data Staging can be performed toward 3
different types of Storage Resources:
- Disk on a (possibly remote) grid Storage Resource (from tape).
  This permits remote sequential reads.
- Disk on a close grid Storage Resource (from tape or from a remote
  Storage Resource).
  This permits close sequential reads.
- Local disk of the Computing Resource.
  This permits quick sequential or random reads and writes.
Destination of OUTPUT Data Staging
----------------------------------
As far as I know, Output Data Staging can be performed from the local
disk of the Computing Resource to 4 different types of Storage Resources:
- A single grid Storage Resource
  (The decision to choose disk or tape for permanent storage of the
  file is OUTSIDE the scope of PGI).
- LIST (see definition above).
- WEB (see definition above).
- CLIENT (see definition above).
USE CASES
---------
The possibilities listed above lead to many different use cases. I
describe below 4 canonical use cases (of course, any combination is
possible):
Pre and Post Staging by the Client (EGEE standard use case)
----------------------------------
1) Only if the input files are NOT already in the same Storage Resource,
the Client transfers the input files to disks in the same grid Storage
Resource (pre stage in),
2) The Client submits a Job (with the grid locations of input and output
files, all in the same grid Storage Resource) to a Computing Resource
close to the Storage Resource,
3) The Job sequentially reads from and writes to files inside the close
Storage Resource, and uses the local disk only for temporary files,
4) The Client receives notification that the Job is finished,
5) Only if necessary, the Client pulls the desired output files from the
Storage Resource (post stage out).
I like this use case, because it is very simple, and does NOT REQUIRE
staging at all inside a well-designed global workflow (like a Unix
pipe).
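
The condition "only if the input files are NOT already in the same Storage Resource" can be sketched as follows. This is only an illustration: the function name plan_pre_stage_in and the representation of file locations as a dictionary are assumptions made for the example.

```python
def plan_pre_stage_in(input_locations, target_storage):
    """Return the input files that must be transferred before submission.

    input_locations maps a file name to the Storage Resource currently
    holding it (hypothetical representation). Files already on
    target_storage need no pre stage in.
    """
    return sorted(name for name, storage in input_locations.items()
                  if storage != target_storage)
```

When the returned list is empty, step 1 is skipped entirely, which is the situation of a well-designed workflow whose jobs all read and write in the same Storage Resource.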
Pre and Post Staging in a STORAGE Resource by the Execution Service
-------------------------------------------------------------------
1) The Client submits a Job (with the grid locations of input and output
files) to the Execution Service,
2) Only if the input files are NOT already in the same Storage Resource,
the Execution Service transfers the input files to disks in the same
grid Storage Resource (pre stage in),
3) The Execution Service sends the Job for execution to a Computing
Resource close to the Storage Resource,
4) The Job sequentially reads from and writes to files inside the close
Storage Resource, and uses the local disk only for temporary files,
5) The Execution Service receives (from the Computing Resource) the
notification that the Job is finished inside the Computing Resource,
6) Only if necessary, the Execution Service transfers the desired output
files from the close Storage Resource to their final grid locations
(post stage out),
7) The Client receives notification that the Job is finished.
If the Client is a grid User, this use case is attractive because it is
very simple for the User, but it requires complex processing by the
Execution Service.
Just In Time Staging in the COMPUTING Resource by the Execution Service
-----------------------------------------------------------------------
1) The Client submits a Job (with the grid locations of input and output
files) to the Execution Service,
2) The Execution Service sends the Job with a 'do not start' attribute
to a Computing Resource,
3) The Execution Service receives (from the Computing Resource) the
locations of input and output files inside the Computing Resource,
4) The Execution Service transfers the input files to their locations
inside the Computing Resource (just in time stage in),
5) The Execution Service starts the Job,
6) The Job sequentially or randomly reads from and writes to files only
inside its local disk,
7) The Execution Service receives (from the Computing Resource) the
notification that the Job is finished inside the Computing Resource,
8) The Execution Service transfers the output files from the Computing
Resource to their final grid locations (just in time stage out),
9) The Client receives notification that the Job is finished.
This use case requires some processing by the Execution Service, but it
is the best one when the Job really needs random access to files.
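
The 9 steps above can be sketched as a small in-memory simulation. Everything here is hypothetical: the ComputingResource class, the /scratch path layout, the toy payload (upper-casing the input) and the dictionary standing in for grid storage are assumptions made only to show the ordering of the steps, not a proposal for the PGI interface.

```python
class ComputingResource:
    """Hypothetical in-memory stand-in for a Computing Resource."""

    def __init__(self):
        self.local_disk = {}  # local path -> file content
        self.state = "IDLE"
        self.job = None

    def accept_held(self, job):
        """Steps 2-3: accept the Job with 'do not start', return local paths."""
        self.job = job
        self.state = "HELD"
        names = list(job["inputs"]) + list(job["outputs"])
        return {n: "/scratch/%s/%s" % (job["id"], n) for n in names}

    def start(self, paths):
        """Steps 5-7: run a toy payload that only touches the local disk."""
        if self.state != "HELD":
            raise RuntimeError("Job is not held")
        self.state = "RUNNING"
        data = "".join(self.local_disk[paths[n]] for n in self.job["inputs"])
        for n in self.job["outputs"]:
            self.local_disk[paths[n]] = data.upper()
        self.state = "FINISHED"


def run_just_in_time(job, grid_storage, resource):
    """Play the role of the Execution Service for one Job."""
    log = []
    paths = resource.accept_held(job)                 # steps 2-3
    log.append("held")
    for name, url in job["inputs"].items():           # step 4: stage in
        resource.local_disk[paths[name]] = grid_storage[url]
    log.append("stage-in")
    resource.start(paths)                             # steps 5-7
    log.append("finished")
    for name, url in job["outputs"].items():          # step 8: stage out
        grid_storage[url] = resource.local_disk[paths[name]]
    log.append("stage-out")
    log.append("client-notified")                     # step 9
    return log
```

The point of the sketch is the ordering: the Job cannot start before stage in completes, and the Client is notified only after stage out, so the Client never sees a finished Job whose output is still on the Computing Resource.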
Just In Time Staging in the COMPUTING Resource by the Client
------------------------------------------------------------
The Client :
1) submits a Job with a 'do not start' attribute,
2) receives the locations of input and output files inside the Computing
Resource,
3) pushes the input files to their locations inside the Computing
Resource (just in time stage in),
4) starts the Job,
5) receives notification that the Job is finished,
6) pulls the output files from the Computing Resource (just in time
stage out),
7) purges the Job (from the Computing Resource).
I personally do NOT like this use case, because it is complicated, and
because the Computing Resource must keep the (possibly huge) output
files until they are pulled by the Client.
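
The ordering constraint of the 7 Client steps, and the period during which the Computing Resource is forced to keep the output files, can be sketched as below. The step names and both function names are hypothetical labels chosen for this example.

```python
# Hypothetical labels for the 7 Client steps listed above, in order.
CLIENT_STEPS = [
    "submit_held",
    "receive_locations",
    "push_inputs",
    "start",
    "notified_finished",
    "pull_outputs",
    "purge",
]


def next_step(done):
    """Return the next step to perform, or None when all are done.

    Raises ValueError if the completed steps are not a prefix of the
    expected sequence (for example, purging before pulling the outputs).
    """
    if done != CLIENT_STEPS[:len(done)]:
        raise ValueError("steps performed out of order")
    if len(done) == len(CLIENT_STEPS):
        return None
    return CLIENT_STEPS[len(done)]


def outputs_held_on_resource(done):
    """True while the Computing Resource must keep the output files."""
    return "notified_finished" in done and "purge" not in done
```

Between "notified_finished" and "purge", the duration is entirely under the Client's control, which is why the local disk of the Computing Resource can stay occupied for an unbounded time.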
Can you please:
- Check whether the above concepts are relevant, and whether the
  associated lists are complete,
- Check whether the above use cases are relevant, and whether there
  could be completely different use cases,
- Propose, from the above concepts and use cases, which possibilities
  the PGI Working Group should take into account, and which should be
  considered out of scope?
Thank you very much in advance.
Best regards.
----------------------------------
Etienne URBAH IN2P3 - LAL
Bat 200 91898 ORSAY France
Tel: +33 1 64 46 84 87
Mob: +33 6 22 30 53 27
Skype: etienne.urbah
mailto:urbah at lal.in2p3.fr
----------------------------------
On Wed, 01 Apr 2009, Moreno Marzolla wrote:
> Dear all,
>
> I just uploaded a new document into the "Input Documents/Execution
> Service" folder:
>
> http://forge.ogf.org/sf/go/doc15590?nav=1
>
> This document is a somehow polished version of the notes taken during
> the pre-PGI Geneva meeting. It is basically a "wish list" for what we
> called "Geneva Execution Service (GES)". The name means that in its
> current status, the document covers the requirements emerged during the
> Geneva meeting. Please note that the document has some funny artifacts
> emerged during OpenOffice->Word conversion. Sorry for that, I hope we
> will be able to fix that in the next revisions.
>
> We now ask people to have a look at the document, comment it and also
> propose additional requirements arising from their middlewares and use
> cases. Thus, the document will eventually evolve into a requirements
> document for the "PGI Execution Service" (and not GES).
> Note that at this point we would like to concentrate on requirements
> only. How these requirements can be mapped onto existing or brand new
> specifications/profiles will be the next step, after the requirements
> have been finalized.
>
> We propose to have a teleconference on wednesday, april 8th at 16:00CET
> (note that now also Europe entered daylight saving time, so we are back
> in sync). Agenda and call in details will be circulated next week.
> Meanwhile, feel free to discuss the document using the mailing list.
>
> Moreno.