[Carpet] inefficient recovery from a checkpoint

Erik Schnetter schnetter at cct.lsu.edu
Fri Oct 19 18:28:08 CEST 2007


On Oct 19, 2007, at 03:12:18, Peter Diener wrote:

> Hi,
>
> Some more detailed info (using IO::verbose = "full") from a job (on  
> abe) that fails to restart on the same number of processors as the  
> checkpoint files were produced with. The run producing the  
> checkpoint files used around 800-900 MB per process and abe has 1 GB  
> per core.
>
> With the attached parfile it seems that when recovery is done for  
> levels 0-5 all information on processor 0 is read from checkpoint  
> file 0. Then for level 6, it somehow thinks it needs to read  
> information from additional checkpoint files. At some point while  
> reading those files the run finally dies with the following error:
>
> terminate called after throwing an instance of 'std::bad_alloc'
>   what():  St9bad_alloc
>
> The question then is: why does it need information from multiple  
> checkpoint files, when it is restarted on exactly the same number  
> of processors?

The question could be: why does it think it needs information from  
multiple checkpoint files?  Can you add some debug output, so that we  
know (a) which processors think that, (b) which grid functions are  
involved, and (c) which sets of grid points?
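
A minimal sketch of the kind of tally such debug output could feed, written in Python for brevity. The record layout (rank, file index, variable name, refinement level, bounding box) and all names are hypothetical, not Carpet's actual data structures. It also assumes that checkpoint file i was written by process i, so any read from a different file index is flagged.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadRequest(NamedTuple):
    """One read issued during recovery (hypothetical record layout)."""
    rank: int          # MPI process issuing the read
    file_index: int    # checkpoint file the read goes to
    variable: str      # grid function name
    level: int         # refinement level
    bbox: tuple        # ((lo_x, lo_y, lo_z), (hi_x, hi_y, hi_z)), inclusive


def foreign_reads(requests):
    """Group reads that go to a file other than the process's own, by rank.

    Assumption: checkpoint file i was written by process i.
    """
    by_rank = defaultdict(list)
    for req in requests:
        if req.file_index != req.rank:
            by_rank[req.rank].append(req)
    return dict(by_rank)


def report(requests):
    """Return one line per foreign read: (a) process, (b) variable, (c) points."""
    lines = []
    for rank, reqs in sorted(foreign_reads(requests).items()):
        for r in reqs:
            lines.append(
                f"proc {rank}: level {r.level}, {r.variable}, "
                f"bbox {r.bbox[0]}..{r.bbox[1]}, from file {r.file_index}"
            )
    return lines


# Made-up example data: process 0 reads level 6 partly from file 1.
example = [
    ReadRequest(0, 0, "EXAMPLE::phi", 5, ((0, 0, 0), (31, 31, 31))),
    ReadRequest(0, 0, "EXAMPLE::phi", 6, ((0, 0, 0), (15, 31, 31))),
    ReadRequest(0, 1, "EXAMPLE::phi", 6, ((16, 0, 0), (31, 31, 31))),
    ReadRequest(1, 1, "EXAMPLE::phi", 6, ((16, 0, 0), (31, 31, 31))),
]

for line in report(example):
    print(line)
```

Only the reads that cross file boundaries are listed, which keeps the output short enough to compare against the grid structure of the original run.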

Or is there something inconsistent, and the run would abort anyway,  
after having looked for something non-existent in the other  
checkpoint files?  You can test this by recovering using only 4 or 2  
cores per node (but still the same number of MPI processes), so that  
each process has more memory available.
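
A rough illustration of why fewer cores per node helps with this test, as a small Python calculation. The 1 GB per core figure is from Peter's message; the 8 cores per node is an assumption that is not stated in this thread, and the function name is made up for the example.

```python
def memory_per_process_gb(node_memory_gb, processes_per_node):
    """Memory available to each MPI process if a node's memory is shared evenly."""
    if processes_per_node <= 0:
        raise ValueError("processes_per_node must be positive")
    return node_memory_gb / processes_per_node


CORES_PER_NODE = 8                       # assumption, not stated in this thread
NODE_MEMORY_GB = CORES_PER_NODE * 1.0    # 1 GB per core, from Peter's message

for used_cores in (CORES_PER_NODE, 4, 2):
    per_process = memory_per_process_gb(NODE_MEMORY_GB, used_cores)
    print(f"{used_cores} processes per node: {per_process:.1f} GB per process")
```

If the run still aborts with the extra headroom, the problem is more likely an inconsistency than plain memory exhaustion.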

-erik

-- 
Erik Schnetter <schnetter at cct.lsu.edu>

My email is as private as my paper mail.  I therefore support encrypting
and signing email messages.  Get my PGP key from www.keyserver.net.


