[Carpet] recovery from checkpoint on a different number of nodes
Christian D. Ott
cott at as.arizona.edu
Mon Jun 4 18:44:55 CEST 2007
Hi Bruno,
I ran into a similar problem a while ago. In principle, Carpet has
no problem restarting a run on a different number of CPUs. In
practice, however, and in particular when one is using many cores
and HDF5 files, HDF5 itself becomes a memory hog.
The reason is that each CPU has to look at many checkpoint files
to find the data it is supposed to read in.
Each time HDF5 looks at a dataset (and there are many)
in an HDF5 file, it allocates a small buffer that is not freed by
the system (we recently saw something similar in CarpetIOASCII and
fixed it by forcing the system to free the memory). This is a known
problem with HDF5 when you have many small datasets to iterate over
or read in.
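
To illustrate why a changed CPU count makes each CPU touch more than
one checkpoint file, here is a rough Python sketch. It is only a toy
model: it assumes a 1D domain split into equal contiguous chunks, one
checkpoint file per writing process, and no ghost zones or refinement
levels, which is not how Carpet actually decomposes the grid. The names
split_domain and files_needed are made up for this example.

```python
def split_domain(n_cells, n_procs):
    """Split [0, n_cells) into n_procs contiguous half-open intervals."""
    base, extra = divmod(n_cells, n_procs)
    bounds = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks get one additional cell.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def files_needed(n_cells, n_writers, n_readers):
    """For each reading process, list the writer files it overlaps.

    Assumption: writer rank w stored exactly its own interval in file w.
    """
    writers = split_domain(n_cells, n_writers)
    readers = split_domain(n_cells, n_readers)
    needed = []
    for r_lo, r_hi in readers:
        overlapping = [
            w for w, (w_lo, w_hi) in enumerate(writers)
            if w_lo < r_hi and r_lo < w_hi  # half-open intervals intersect
        ]
        needed.append(overlapping)
    return needed


if __name__ == "__main__":
    # Same process count: every reader needs exactly its own file.
    print(files_needed(12, 3, 3))
    # Different process count: interior readers straddle two files.
    print(files_needed(12, 3, 4))
```

In this toy model a reader only opens the files it overlaps. A reader
that cannot know in advance where its data lives has to scan the
datasets in many more files, and the small per-dataset buffers
described above then add up with every file and dataset visited.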
Which version of HDF5 are you using? If you are using 1.6.x, could
you check whether you still see the problem with 1.8.x?
- Christian
On Mon, Jun 04, 2007 at 06:18:13PM +0200, Bruno Giacomazzo wrote:
> Hi,
> I'm using the 3rd stable version of Carpet (and the development
> version of Cactus) and I had a problem recovering from the checkpoint
> files using a different number of nodes.
>
> I had a job that was running on 24 nodes (96 cores). I stopped it,
> saving a checkpoint file, and then restarted it on 32 nodes (128 cores). I
> wanted to see whether I could run it faster by increasing the number of
> nodes. This didn't happen since, as I found later, the job was swapping. I
> stopped this new job, creating a new checkpoint file, and restarted it on 24
> nodes using this new checkpoint file. This new run was swapping heavily
> and it was automatically killed since it had used all the memory
> (RAM+swap).
>
> I then decided to restart it on 24 nodes using the first
> checkpoint file (the one produced by the job running on 24 nodes) and now
> it runs without any problem.
>
> Has anybody seen things like this? I thought it was not a
> problem to restart a Carpet run on a different number of nodes.
>
>
> Cheers,
> Bruno
>
> --
> Dr. Bruno Giacomazzo
> Max Planck Institute for Gravitational Physics
> Albert Einstein Institute
> Am Muehlenberg 1
> D-14476 Potsdam
> Germany
>
> Tel. : +49 331 567 7183
> Fax : +49 331 567 7252
> cell. : +49 173 826 4488
> email : bgiacoma at aei.mpg.de
>
> -------------------------------------------------
> There are only 10 types of people in the world:
> Those who understand binary, and those who don't
> -------------------------------------------------
>
> _______________________________________________
> developers mailing list
> developers at lists.carpetcode.org
> http://lists.carpetcode.org/listinfo/developers