[Carpet] MPI_WAITALL : Error code is in status

Jonathan Thornburg jthorn at aei.mpg.de
Fri May 26 19:45:41 CEST 2006


Hi, Erik,

I have a Cactus par file which runs fine on 1 processor, but dies as
follows on 2 processors:

% cactus_test-moving-excision -np 2 try-mpi-error.par
[[many lines of output schnipped]]
INFO (AHFinderDirect): setting initial guess for horizon 1/1
INFO (AHFinderDirect):    setting ellipsoid: center=(0,0,0)
INFO (AHFinderDirect):                       radius=(2,2,2)
INFO (AHFinderDirect): proc 0: searching for horizon 1/1
INFO (AHFinderDirect):    proc 0/horizon 1:it 1 r_grid=2.00 ||Theta||=6.6e-02
INFO (AHFinderDirect):    proc 0/horizon 1:it 2 r_grid=1.80 ||Theta||=3.8e-02
INFO (AHFinderDirect):    proc 0/horizon 1:it 3 r_grid=1.80 ||Theta||=1.8e-04
INFO (AHFinderDirect):    proc 0/horizon 1:it 4 r_grid=1.80 ||Theta||=7.4e-10
INFO (AHFinderDirect): AH 1/1: r=1.79939 at (-0.000000,0.000000,0.000000)
INFO (AHFinderDirect): AH 1/1: area=45.21198259 irreducible_mass=0.9484006614
INFO (AHFinderDirect): writing h to "try-mpi-error/Kerr.h.t0.ah1.gp"
INFO (AHFinderDirect): setting old-style (CCTK_REAL) mask grid function SpaceMask::emask
INFO (MovingExcision): stage 1 (phase 1): 2 operators ==> 37, 12 point(s)
INFO (MovingExcision): stage 2 (phase 1): 2 operators ==> 30, 15 point(s)
0 - MPI_WAITALL : Error code is in status
[0]  Aborting program !
[0] Aborting program!
% 

This is using MPICH 1.2.6, the ch_shmem device, Intel 8.0 compilers,
configured DEBUG=yes OPTIMISE=no on an AEI Xeon.

The error message is clearly telling me that MPI_Waitall() died,
and put some information about what went wrong in its status structure.
Alas, a brief grep through Carpet/CarpetLib/src/* shows that each and
every call on MPI_Waitall() passes MPI_STATUSES_IGNORE as the 3rd
argument, so there's no status structure to look at.

Is there a deep reason you didn't get status back from MPI_Waitall(),
or was it just for convenience?  And are there any reasonably easy ways
(short of hacking CarpetLib -- and every other MPI-using thorn) to find
what's going on here?  As it is, I sort of suspect that my thorn
MovingExcision might be doing something wrong... but the most exotic
thing it does is call CCTK_SyncGroup().


ciao,

-- 
-- Jonathan Thornburg <jthorn at aei.mpg.de>      
   Max-Planck-Institut fuer Gravitationsphysik (Albert-Einstein-Institut),
   Golm, Germany, "Old Europe"     http://www.aei.mpg.de/~jthorn/home.html      
   "Washing one's hands of the conflict between the powerful and the
    powerless means to side with the powerful, not to be neutral."
                                      -- quote by Freire / poster by Oxfam



