[drmaa-wg] perl DRMAA, SGE and working directory

Tue Dec 14 08:50:39 CST 2004

Hi,

If we take the string /home/msarachu/.showdb.04.12.09:17.37.42
and read it using the following syntax (which we also need to add to the spec for the working directory)

[hostname]:file_path

then it is not surprising that the runtime looks for the directory 17.37.42.
Unfortunately, the second error, namely that the host /home/msarachu/.showdb.04.12.09 cannot be found, is not displayed.
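
To make the [hostname]:file_path reading concrete, here is a minimal Python sketch. It assumes the value is simply split at the first colon; split_host_path is a hypothetical helper written for this illustration, not SGE or DRMAA code, and the real runtime may parse the value differently.

```python
def split_host_path(value):
    """Split a '[hostname]:file_path' style value at the first colon.

    Illustrative assumption only: this is not the actual SGE/DRMAA parser.
    Returns (hostname or None, file_path).
    """
    host, sep, path = value.partition(":")
    if not sep:
        # No colon at all: the whole value is a plain path.
        return None, value
    # An empty host part (":/some/path") means "no host given".
    return (host or None), path


wd = "/home/msarachu/.showdb.04.12.09:17.37.42"
host, path = split_host_path(wd)
print(host)  # /home/msarachu/.showdb.04.12.09
print(path)  # 17.37.42
```

Under that assumption, everything before the colon is taken as a host name and only the remainder is treated as the directory.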

I hope this helps from the standard's point of view.

Regards,
    -Hrabri

-----Original Message-----
From: owner-drmaa-wg at ggf.org [mailto:owner-drmaa-wg at ggf.org] On Behalf Of Martín Sarachu
Sent: Tuesday, December 14, 2004 8:28 AM
To: drmaa-wg at ggf.org
Subject: [drmaa-wg] perl DRMAA, SGE and working directory

Dear list,

I'm using Schedule-DRMAAc-0.81 and SGE to be able to queue jobs from a web
interface.

Here's my problem: when I launch a job with something like /home/msarachu as
the $DRMAA_WD it runs OK, but when I use a directory like
/home/msarachu/.showdb.04.12.09:17.37.42 as $DRMAA_WD the script does not run
and the error reported by SGE is "28  : changing into working directory".
I also tried passing the directory "escaped" (\.showdb\.04\.12\.09\:17\.37\.42)
and got the same error. Passing the string
"/home/msarachu/.showdb.04.12.09:17.37.42/job.sh" as the $DRMAA_REMOTE_COMMAND
argument works fine, however: the job is sent to the queue.
Is there any way to mask this directory name so that changing into the working
directory succeeds?
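
One possible workaround, offered only as a sketch and not confirmed anywhere in this thread, is to avoid the colon when the directory name is generated. The helper colon_free_dirname and the "_" separator are hypothetical choices made for this illustration.

```python
from datetime import datetime


def colon_free_dirname(prefix, when):
    """Build a timestamped directory name that contains no ':'.

    Hypothetical helper: the '_' between date and time is an arbitrary
    choice made for this example.
    """
    return f"{prefix}.{when:%y.%m.%d}_{when:%H.%M.%S}"


name = colon_free_dirname(".showdb", datetime(2004, 12, 9, 17, 37, 42))
print(name)  # .showdb.04.12.09_17.37.42
```

A name built this way carries the same timestamp but cannot be mistaken for a host:path pair.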

Below is an email from a failed job I tried to run with
DRMAA_WD = /home/msarachu/wProjects/tope/.showdb.04.12.14:16.30.47
Look at the shepherd error: apparently it is truncating the directory just
before the ':'.

If I submit the job from /home/msarachu/wProjects/tope/.showdb.04.12.14:16.30.47
with the command 'qsub -cwd job.sh', it works OK.

-----
Job 123 caused action: Job 123 set to ERROR
 User        = msarachu
 Queue       = all.q@pentiumIV.embnet-ar.org
 Host        = pentiumIV.embnet-ar.org
 Start Time  = <unknown>
 End Time    = <unknown>
failed changing into working directory:can't read usage file for job 123.1

Shepherd trace:
12/13/2004 16:31:04 [502:24622]: shepherd called with uid = 0, euid = 502
12/13/2004 16:31:04 [502:24622]: starting up 6.0u1
12/13/2004 16:31:04 [502:24622]: setpgid(24622, 24622) returned 0
12/13/2004 16:31:04 [502:24622]: no prolog script to start
12/13/2004 16:31:04 [502:24623]: pid=24623 pgrp=24623 sid=24623 old pgrp=24622 getlogin()=<no login set>
12/13/2004 16:31:04 [502:24623]: setosjobid: uid = 0, euid = 502
12/13/2004 16:31:04 [502:24623]: RLIMIT_CPU setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
12/13/2004 16:31:04 [502:24623]: RLIMIT_FSIZE setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
12/13/2004 16:31:04 [502:24623]: RLIMIT_DATA setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
12/13/2004 16:31:04 [502:24623]: RLIMIT_STACK setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
12/13/2004 16:31:04 [502:24623]: RLIMIT_CORE setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
12/13/2004 16:31:04 [502:24623]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
12/13/2004 16:31:04 [502:24623]: RLIMIT_RSS setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
12/13/2004 16:31:04 [500:24623]: closing all filedescriptors
12/13/2004 16:31:04 [500:24623]: further messages are in "error" and "trace"
12/13/2004 16:31:04 [502:24622]: forked "job" with pid 24623
12/13/2004 16:31:04 [502:24622]: child: job - pid: 24623
12/13/2004 16:31:04 [502:24622]: wait3 returned 24623 (status: 7168; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 28)
12/13/2004 16:31:04 [502:24622]: job exited with exit status 28
12/13/2004 16:31:04 [502:24622]: reaped "job" with pid 24623
12/13/2004 16:31:04 [502:24622]: job exited not due to signal
12/13/2004 16:31:04 [502:24622]: job exited with status 28
12/13/2004 16:31:04 [502:24622]: now sending signal KILL to pid -24623
12/13/2004 16:31:04 [502:24622]: no tasker to notify
12/13/2004 16:31:04 [502:24622]: failed starting job
12/13/2004 16:31:04 [502:24622]: no epilog script to start

Shepherd error:
12/13/2004 16:31:04 [500:24623]: error: can't chdir to :16.30.47: No such file or directory

Shepherd pe_hostfile:
pentiumIV.embnet-ar.org 1 all.q@pentiumIV.embnet-ar.org UNDEFINED
-----
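
As a side note on the trace above, the raw wait3 status 7168 and the reported exit status 28 are the same value in two forms. The sketch below assumes the traditional Unix encoding of a wait status (terminating signal in the low 7 bits, exit code in the next byte); decode_wait_status is a hypothetical helper, not part of SGE.

```python
def decode_wait_status(status):
    """Decode a raw wait status under the traditional Unix layout.

    Assumption: low 7 bits hold the terminating signal (0 means a normal
    exit) and bits 8-15 hold the exit code. Returns (exited, exit_code).
    """
    term_signal = status & 0x7F
    exited = term_signal == 0
    exit_code = (status >> 8) & 0xFF if exited else None
    return exited, exit_code


print(decode_wait_status(7168))  # (True, 28)
```

This matches the trace: the job exited normally (WIFEXITED: 1) with status 28, the code SGE reports for "changing into working directory".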

I sent this same email to Tim and to the SGE users list; Tim suggested that I
also send it to this list.

Thanks in advance.

Best regards,

Martin 

-- 
Martín Sarachu
msarachu at biol.unlp.edu.ar
EMBnet Argentina
http://www.ar.embnet.org