[Carpet] MPI_WAITALL : Error code is in status
Erik Schnetter
schnetter at cct.lsu.edu
Fri May 26 20:31:16 CEST 2006
On May 26, 2006, at 12:45:41, Jonathan Thornburg wrote:
> Hi, Erik,
>
> I have a Cactus par file which runs fine on 1 processor, but dies as
> follows on 2 processors:
>
> % cactus_test-moving-excision -np 2 try-mpi-error.par
> [[many lines of output schnipped]]
> INFO (AHFinderDirect): setting initial guess for horizon 1/1
> INFO (AHFinderDirect): setting ellipsoid: center=(0,0,0)
> INFO (AHFinderDirect): radius=(2,2,2)
> INFO (AHFinderDirect): proc 0: searching for horizon 1/1
> INFO (AHFinderDirect): proc 0/horizon 1:it 1 r_grid=2.00 ||
> Theta||=6.6e-02
> INFO (AHFinderDirect): proc 0/horizon 1:it 2 r_grid=1.80 ||
> Theta||=3.8e-02
> INFO (AHFinderDirect): proc 0/horizon 1:it 3 r_grid=1.80 ||
> Theta||=1.8e-04
> INFO (AHFinderDirect): proc 0/horizon 1:it 4 r_grid=1.80 ||
> Theta||=7.4e-10
> INFO (AHFinderDirect): AH 1/1: r=1.79939 at
> (-0.000000,0.000000,0.000000)
> INFO (AHFinderDirect): AH 1/1: area=45.21198259
> irreducible_mass=0.9484006614
> INFO (AHFinderDirect): writing h to "try-mpi-error/Kerr.h.t0.ah1.gp"
> INFO (AHFinderDirect): setting old-style (CCTK_REAL) mask grid
> function SpaceMask::emask
> INFO (MovingExcision): stage 1 (phase 1): 2 operators ==> 37, 12
> point(s)
> INFO (MovingExcision): stage 2 (phase 1): 2 operators ==> 30, 15
> point(s)
> 0 - MPI_WAITALL : Error code is in status
> [0] Aborting program !
> [0] Aborting program!
> %
Note that this is the output from only one processor. Use the "-r"
option to Cactus to get the output from the other processor as well.
> This is using 1.2.6, the ch_shmem device, Intel 8.0 compilers,
> configured DEBUG=yes OPTIMISE=no on an AEI xeon.
>
> The error message is clearly telling me that MPI_Waitall() died,
> and put some information about what went wrong in its status
> structure.
> Alas, a brief grep through Carpet/CarpetLib/src/* shows that each and
> every call on MPI_Waitall() passes MPI_STATUSES_IGNORE as the 3rd
> argument, so there's no status structure to look at.
These status arguments are usually not useful for error checking. As
your error message shows, the MPI routine aborted before it returned
to Cactus, so there would be nothing to look at anyway. This is the
default behaviour of MPI: unless told otherwise, it treats errors as
fatal (the MPI_ERRORS_ARE_FATAL error handler).
> Is there a deep reason you didn't get status back from MPI_Waitall(),
> or was it just for convenience? And are there any reasonably easy
> ways (short of hacking CarpetLib -- and every other MPI-using thorn)
> to find what's going on here? As it is, I sort of suspect that my
> thorn MovingExcision might be doing something wrong... but the most
> exotic thing it does is call CCTK_SyncGroup().
The deep reason is that the programme cannot continue anyway, so
adding our own MPI error handler that makes MPI return to Carpet,
then checking the status everywhere, then aborting the code anyway
would add no functionality. For the record, the MPI standard says
that keeping track of requests and statuses may slow things down.
If you want, then there is an elegant way to get these statuses --
you can extend CarpetLib instead of hacking it. This is the
preferred way. "Hacking" implies that the added code is unclean,
which would be up to you if you implement it.
By far the easiest way, however, would be to use a debugger. I
suggest TotalView. It allows you to examine the state of MPI on each
processor, as well as all the MPI messages in the send and receive
queues. If you don't have TotalView installed, go ask for a licence
-- it is really not appropriate to expect people to develop parallel
programmes without a suitable debugger. Alternatively, maybe your
MPI implementation has a way to start each process in a debugger in a
separate xterm. While not as comfortable as TotalView, it still
allows you to see what is going on.
Short of doing that, try using the various "verbose" and
"veryverbose" settings in Carpet and CarpetLib. You can also try
CarpetLib's "barriers" parameter, which may make your programme abort
closer to the real error.
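As a rough sketch, the settings above might look like this in a
parameter file. The parameter names are assumed from the thorn names
and the option names mentioned above; check each thorn's param.ccl for
the exact names and types in your version before relying on them.

```
# Assumed parameter names; verify against Carpet's and CarpetLib's param.ccl
Carpet::verbose     = yes
Carpet::veryverbose = yes
CarpetLib::verbose  = yes
# Extra barriers slow the run down, but may move the abort closer to the cause
CarpetLib::barriers = yes
```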
Quite likely, you do not call CCTK_SyncGroup() on all processors at
the same time, or you do not enable or disable storage on all
processors. Look for statements like "if (CCTK_MyProc(cctkGH)==X)" in
your code; they may point to why this is so. Consider also that the
error may have occurred later, after MovingExcision had finished.
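To illustrate why a rank test around a synchronisation is a problem,
here is a toy model in plain C. It contains no MPI or Cactus calls, and
all names in it (rank_trace, run_rank, first_mismatch, simulate) are
made up for this sketch. Each "rank" records the collective operations
it would perform, and the checker reports the first position at which
the ranks disagree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model, not MPI: a rank's trace is the ordered list of collective
   operations it would call. All names here are hypothetical. */
#define MAX_CALLS 16
#define MAX_RANKS 8

typedef struct {
    const char *calls[MAX_CALLS];
    size_t ncalls;
} rank_trace;

static void record(rank_trace *t, const char *op) {
    if (t->ncalls < MAX_CALLS) t->calls[t->ncalls++] = op;
}

/* Every rank syncs once and reduces once. A second sync sits behind a
   rank test: with guard_rank >= 0 only that rank performs it, with
   guard_rank < 0 every rank performs it. */
static void run_rank(rank_trace *t, int my_rank, int guard_rank) {
    t->ncalls = 0;
    record(t, "sync");
    if (guard_rank < 0 || my_rank == guard_rank) {
        record(t, "sync");
    }
    record(t, "reduce");
}

/* Index of the first call at which some rank differs from rank 0,
   or -1 if all traces are identical. */
static int first_mismatch(const rank_trace *traces, size_t nranks) {
    for (size_t i = 0; i < MAX_CALLS; i++) {
        for (size_t r = 1; r < nranks; r++) {
            int have0 = i < traces[0].ncalls;
            int haver = i < traces[r].ncalls;
            if (have0 != haver) return (int)i;
            if (have0 && strcmp(traces[0].calls[i], traces[r].calls[i]) != 0)
                return (int)i;
        }
    }
    return -1;
}

/* Build the traces for nranks ranks and compare them. */
static int simulate(int nranks, int guard_rank) {
    rank_trace traces[MAX_RANKS];
    assert(nranks >= 1 && nranks <= MAX_RANKS);
    for (int r = 0; r < nranks; r++) run_rank(&traces[r], r, guard_rank);
    return first_mismatch(traces, (size_t)nranks);
}
```

With two ranks and the second sync guarded by a test for rank 0,
simulate(2, 0) reports a mismatch at call index 1: rank 0 is in a sync
while rank 1 has already moved on to the reduction. Without the guard,
simulate(2, -1) reports no mismatch. In a real run the mismatch shows
up as a hang or an abort rather than a tidy index.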
-erik
--
Erik Schnetter <schnetter at cct.lsu.edu>
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from www.keyserver.net.