[jsdl-wg] Process Topology

Christopher Smith csmith at platform.com
Tue Apr 19 11:53:05 CDT 2005


Just to follow up on one issue I've identified.

The current specification gives ResourceCount a default value of 1. For
TotalCPUCount to work properly, the default should be undefined (much like
CPUCount).

By the way ... a default TotalCPUCount of 1 implies a ResourceCount of 1
when allocation is done.
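
A rough Python sketch of why the default matters. The function name and the
"every allocated host contributes at least one cpu" rule are assumptions made
for illustration; neither comes from the JSDL specification.

```python
from typing import Optional, Tuple


def allowed_host_counts(total_cpu_count: int = 1,
                        resource_count: Optional[int] = None) -> Tuple[int, int]:
    """Return the (min, max) number of hosts an allocation may span.

    Assumption (not from the spec): every allocated host contributes
    at least one cpu, so a job never spans more hosts than cpus.
    """
    if resource_count is not None:
        # An explicit (or defaulted) ResourceCount pins the host count.
        return (resource_count, resource_count)
    # Undefined ResourceCount: the scheduler may spread the cpus freely.
    return (1, total_cpu_count)


# ResourceCount defaulting to 1 forces a 32-cpu job onto a single host.
print(allowed_host_counts(32, resource_count=1))
# ResourceCount undefined leaves the spread up to the scheduler.
print(allowed_host_counts(32))
# A default TotalCPUCount of 1 can only ever land on one host.
print(allowed_host_counts())
```

With an undefined default, TotalCPUCount alone says "just give me cpus"; with
a default of 1, the same request is silently restricted to one host.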

-- Chris


On 19/4/05 09:40, "Christopher Smith" <csmith at platform.com> wrote:

> Ok ... here are my thoughts on process topology and what's currently
> expressible in JSDL.
> 
> First, I'll list some use cases (they're all parallel jobs):
> 
> 1. Simple MPI job. Wants 32 processors with 1 processor per resource (in
> JSDL, a host is a "resource").
> 
> 2. OpenMPI job. Wants 32 processors with 8 processors per resource.
> 
> 3. An OpenMP job. Wants 32 processors. Shared mem of course, so one
> resource.
> 
> 4. A "homegrown" master/slave parallel job (say a ligand docking job). Wants
> 32 processors. No tiling constraints at all.
> 
> * Note that I'm specifically leaving out the Naregi "coupled simulation" use
> case (sorry guys), since we determined at the last GGF that it was a case
> which could be decomposed into multiple JSDL documents.
> 
> Second ... what is process topology? It provides the user a way to express
> how resources should be _allocated_ given the characteristics of the job
> (usually in terms of IO patterns ... e.g. network communication, disk IO
> channel contention, etc). Thus, it's used when the resource manager is
> _allocating_ the resources, not when the job is being started/launched.
> Therefore, none of the elements used to express process topology should be
> in the POSIXApplication section.
> 
> What we have in JSDL now:
> 
> ResourceCount (how many "resources" i.e. hosts I want)
> CPUCount (how many processors _per resource_)
> TileSize (how many processors to allocate per resource as a unit)
> ProcessCount (total number of _processes_ that the job will use to
> execute)
> 
> I will argue that ProcessCount is useless for the purposes of process
> topology, since a) it isn't about allocation, and b) there isn't enough
> information to tell me how to start/launch a parallel job. It isn't about
> allocation since it is irrelevant to the scheduler whether I'll be computing
> using threads or processes. It isn't useful for launching because it doesn't
> tell me how to spread the ProcessCount processes given a particular
> allocated topology.
> 
> So that leaves the rest of them.
> 
> TileSize and CPUCount are pretty much the same thing, at least for 80% (or
> more) of the uses I've seen. The only thing that might cause them to differ
> is that I could possibly allocate more than one tile on a host. Given that
> CPUCount is a range and that we could express step values in the range (we
> can express step values in the range, right?), we don't need TileSize any
> more. 
> 
> Note: I'm making an assumption here that CPUCount is the number of cpus that
> I want from the resource, rather than an expression of how many cpus the
> host needs to have configured. If it is the latter, then we do need
> TileSize, and replace CPUCount in my examples below with TileSize.
> 
> So let's see how these map to the use cases.
> 
> 1. ResourceCount == 32, CPUCount == 1
>   -> LSF : "-n 32 -R span[ptile=1]"
>   -> PBS : "-l nodes=32:ppn=1"     (ppn=1 might be the default)
> 
> 2. ResourceCount == 4,  CPUCount == 8
>   -> LSF : "-n 32 -R span[ptile=8]"
>   -> PBS : "-l nodes=4:ppn=8"
> 
> 3. ResourceCount == 1, CPUCount == 32
>   -> LSF : "-n 32 -R span[hosts=1]"  (hosts=1 equivalent to ptile=<-n val>)
>   -> PBS : "-l nodes=1:ppn=32"
> 
> 4. ResourceCount == 32, CPUCount == 1
>   -> oops ... it doesn't care about tiling
>    ResourceCount == 1, CPUCount == 32
>   -> hmm ... artificial constraint ... would suck on a blade cluster
>    ResourceCount == 1-32, CPUCount == 1,32
>   -> oops again ... I might get a total allocation of 32*32 cpus
> 
>   * there seems to be a gap!
> 
> If we had a term called "TotalCPUCount" for the entire job, I could do:
> 
> 4. TotalCPUCount == 32
>   -> LSF : "-n 32"
>   -> PBS : "not sure how to express"
> 
> It means grab 32 cpus, regardless of how they are spread; I just need
> cpus. This is used a whole hell of a lot within our customer base.
> 
> So ... in summary ... I propose:
> 
> CPUCount (as is if it's allocated cpus per resource)
> TileSize (iff CPUCount is an expression of configured cpus in a host)
> ResourceCount (as is ... hmmm ... maybe the default value needs to change)
> TotalCPUCount (how many cpus this job needs to run in total)
> 
> -- Chris
> 
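
The mapping in the quoted message can be sketched in Python as below. The
function names (lsf_args, pbs_args, max_total_cpus) are made up for
illustration, and the option strings are only the ones listed for use cases
1 to 4; this is not meant as a complete LSF or PBS translator.

```python
from typing import Optional, Tuple


def lsf_args(resource_count: Optional[int] = None,
             cpu_count: Optional[int] = None,
             total_cpu_count: Optional[int] = None) -> str:
    """Build the LSF options shown for use cases 1-4."""
    if total_cpu_count is not None and resource_count is None and cpu_count is None:
        # Use case 4: just grab N cpus, no tiling constraint.
        return f"-n {total_cpu_count}"
    if resource_count is None or cpu_count is None:
        raise ValueError("need ResourceCount and CPUCount, or TotalCPUCount alone")
    slots = resource_count * cpu_count
    if resource_count == 1:
        # Use case 3: everything on one host.
        return f"-n {slots} -R span[hosts=1]"
    # Use cases 1 and 2: cpu_count processors per host.
    return f"-n {slots} -R span[ptile={cpu_count}]"


def pbs_args(resource_count: Optional[int] = None,
             cpu_count: Optional[int] = None,
             total_cpu_count: Optional[int] = None) -> Optional[str]:
    """Build the PBS options shown for use cases 1-3."""
    if resource_count is None or cpu_count is None:
        # The message leaves the PBS form of a bare TotalCPUCount open.
        return None
    return f"-l nodes={resource_count}:ppn={cpu_count}"


def max_total_cpus(resource_range: Tuple[int, int],
                   cpu_range: Tuple[int, int]) -> int:
    """Worst-case total when both counts are given as (low, high) ranges."""
    return resource_range[1] * cpu_range[1]


print(lsf_args(4, 8), "|", pbs_args(4, 8))
print(lsf_args(total_cpu_count=32), "|", pbs_args(total_cpu_count=32))
# The "gap": two independent 1-32 ranges can multiply out far past 32.
print(max_total_cpus((1, 32), (1, 32)))
```

The last call shows why ranges on ResourceCount and CPUCount cannot stand in
for TotalCPUCount: nothing ties the two ranges together, so the product is
unbounded by the 32 cpus the job actually wants.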




