[DRMAA-WG] About obtaining the machines names in a parallel job

Yves Caniou yves.caniou at ens-lyon.fr
Wed Mar 24 23:35:54 CDT 2010


Hi,

The case where the master task starts the slaves by relying on the DRM may 
not be the most frequent one. Furthermore, even in the master/slave paradigm, 
the master has to know the names of the slaves; that's why Daniel's line 
"tells it where all the slaves are" is really important to me: at least one 
node should be able to learn the names of the resources involved in the 
reservation. As we discussed during the OGF session, the identity of the 
nodes is generally stored in a file whose name depends on the deployed DRM.
What I suggest is at least one of these two things:
- the possibility for at least one node to know the identity of the others, 
by using a meta-DRM-DRMAA name for example.
- the possibility to copy the file to all nodes, at the user's request, in 
the prologue (this should be possible since the master knows the file anyway).

My preference naturally goes to the second, since the user then doesn't have 
to take care of distributing the information when it is needed (which could 
otherwise force him to wrap his application in a fake MPI program only to 
dispatch the information, or to fork an "scp" or something no better...)
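
To illustrate what "a filename that depends on the deployed DRM" means in 
practice, here is a minimal Python sketch of the lookup a job has to do 
today. It is only an illustration under assumptions: the environment variable 
names (PE_HOSTFILE for SGE, PBS_NODEFILE for PBS/Torque, LSB_HOSTS for LSF) 
come from those DRM systems and not from DRMAA, and the function names are 
hypothetical.

```python
import os


def hosts_from_text(text):
    """Return the unique hostnames found in a hostfile, in order.

    Takes the first whitespace-separated field of each non-empty line.
    This covers the SGE format ("host slots queue range") as well as
    the PBS format (one hostname per line, repeated once per slot).
    """
    names = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            names.append(fields[0])
    # dict.fromkeys removes duplicates while keeping the first-seen order.
    return list(dict.fromkeys(names))


def allocated_hosts(env=None):
    """Guess the hosts allocated to the current job (assumed variable names)."""
    if env is None:
        env = os.environ
    # SGE and PBS/Torque: the variable holds the path of a hostfile.
    for var in ("PE_HOSTFILE", "PBS_NODEFILE"):
        if var in env:
            with open(env[var]) as hostfile:
                return hosts_from_text(hostfile.read())
    # LSF: the variable holds the hostnames themselves, space separated.
    if "LSB_HOSTS" in env:
        return list(dict.fromkeys(env["LSB_HOSTS"].split()))
    # Unknown DRM, or not running inside a parallel job.
    return []


if __name__ == "__main__":
    print(allocated_hosts())
```

The sketch only works on a node where the DRM has set one of these 
variables, which is typically the master; this is exactly why I would like 
the information to be reachable from the other nodes as well.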

Peter, I also saw (at least!) one thing that was really interesting in your 
report, concerning the two classes of parallel job support. Does this mean 
that the people involved in DRMAA are considering the possibility of 
submitting not only command-line programs but scripts as well?

Cheers.

.Yves.

On Wednesday 24 March 2010 14:01:54, Daniel Templeton wrote:
> The way SGE (and I think LSF) handles parallel jobs is that there is
> always a master/slave concept.  The DRM system allocates the nodes,
> starts the master task, and tells it where all the slaves are.  The
> master task is then responsible for starting the slave tasks, usually
> via the DRM.
>
> Maybe I'm missing some context, but this conversation sounds *way*
> outside of the context of DRMAA to me.  DRMAA has nothing to do with how
> a job is launched.  DRMAA is purely on the job management side:
> submission, monitoring, and control.
>
> Daniel
>
> On 03/24/10 05:54, Peter Tröger wrote:
> > Hi Yves,
> >
> > thanks for a good discussion in Munich, I hope we can rely on your
> > user perspective also in the future.
> >
> >> I understand why you don't want to provide a means to get the
> >> hostnames file for an MPI code, since it should be done transparently
> >> in the configName (the correct name, if I remember well).
> >>
> >> But I thought of a different use case: a code is just launched on all
> >> machines. This code is socket-based, thus it needs to know the other
> >> machine names to be able to run correctly.
> >> Of course, this could be bypassed with the use of an external machine
> >> where a daemon runs, and where running codes can register (I think of
> >> something like a running omniNames, for example). Another solution is
> >> to encapsulate applications in an MPI code just to, maybe, have that
> >> information.
> >
> > For me, it sounds like getting the information about allocated
> > machines (for a job) on each of the execution hosts. I wonder if this
> > information is provided by the different DRM systems. Does that depend
> > on the parallelization technology, such as the chosen MPI library?
> >
> > Best,
> > Peter.
> >
> >> But don't you think that the cost is very high (if it is possible at
> >> all: many sites have a policy of not letting user code run on the
> >> front end, and a machine only knows that it is itself taking part in
> >> the parallel run) compared to at least having the possibility to copy
> >> the file containing the hostnames to all reserved nodes?
> >>
> >> Good luck with the discussions today!
> >> Cheers.
> >>
> >> .Yves.
> >>
> >> --
> >> Yves Caniou
> >> Associate Professor at Université Lyon 1,
> >> Member of the team project INRIA GRAAL in the LIP ENS-Lyon,
> >> Délégation CNRS in Japan French Laboratory of Informatics (JFLI),
> >>   * in Information Technology Center, The University of Tokyo,
> >>     2-11-16 Yayoi, Bunkyo-ku, Tokyo 113-8658, Japan
> >>     tel: +81-3-5841-0540
> >>   * in National Institute of Informatics
> >>     2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
> >>     tel: +81-3-4212-2412
> >> http://graal.ens-lyon.fr/~ycaniou/
> >> --
> >>   drmaa-wg mailing list
> >>   drmaa-wg at ogf.org
> >>   http://www.ogf.org/mailman/listinfo/drmaa-wg



-- 
Yves Caniou
Associate Professor at Université Lyon 1,
Member of the team project INRIA GRAAL in the LIP ENS-Lyon,
Délégation CNRS in Japan French Laboratory of Informatics (JFLI),
  * in Information Technology Center, The University of Tokyo,
    2-11-16 Yayoi, Bunkyo-ku, Tokyo 113-8658, Japan
    tel: +81-3-5841-0540
  * in National Institute of Informatics
    2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
    tel: +81-3-4212-2412 
http://graal.ens-lyon.fr/~ycaniou/

