[Carpet] recovery from checkpoint on a different number of nodes
Bruno Giacomazzo
bgiacoma at aei.mpg.de
Mon Jun 4 19:08:53 CEST 2007
Christian,
On Mon, 4 Jun 2007, Christian D. Ott wrote:
> Each time hdf5 looks at a dataset (and there are many)
> in an hdf5 file, it allocates a small buffer that is not freed by
> the system (we recently saw something similar in CarpetIOASCII and
> fixed it by forcing the system to free the memory). This is a known
> problem with hdf5 when you have many small datasets to iterate over
> or read in.
Thank you for the answer; I didn't know this.
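
To get a feeling for how quickly such small per-dataset buffers could add up, here is a rough back-of-the-envelope sketch in Python. The function names and the numbers (500,000 datasets, 4 KiB retained per dataset) are made up purely for illustration; they are not measured values from hdf5 or Carpet.

```python
def leaked_bytes(n_datasets, bytes_per_dataset):
    """Bytes retained if every dataset visit leaves one small buffer unfreed."""
    return n_datasets * bytes_per_dataset


def leaked_mib(n_datasets, bytes_per_dataset):
    """Same estimate as leaked_bytes, expressed in MiB."""
    return leaked_bytes(n_datasets, bytes_per_dataset) / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical values: datasets visited by one process during recovery,
    # and the size of the buffer that is kept for each of them.
    n_datasets = 500_000
    bytes_per_dataset = 4096
    print(f"{leaked_mib(n_datasets, bytes_per_dataset):.0f} MiB retained per process")
```

With these assumed values the estimate is close to 2 GiB per process, which would be enough to push a node into swap; the real figure depends on how many datasets each process actually visits and on the hdf5 version.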
> Which version of hdf5 are you using? If you are using 1.6.x, could
> you check whether you still see the problem with 1.8.x?
I will try to see what happens using 1.8.x instead of 1.6.5.
Thank you,
Bruno
> On Mon, Jun 04, 2007 at 06:18:13PM +0200, Bruno Giacomazzo wrote:
>> Hi,
>> I'm using the 3rd stable version of Carpet (and the development
>> version of Cactus) and I had a problem recovering from checkpoint
>> files using a different number of nodes.
>>
>> I had a job running on 24 nodes (96 cores). I stopped it,
>> saving a checkpoint file, and then restarted it on 32 nodes (128 cores). I
>> wanted to see whether it would run faster with a larger number of
>> nodes. It did not since, as I found out later, the job was swapping. I
>> stopped this new job, creating a new checkpoint file, and restarted it on 24
>> nodes using this new checkpoint file. This new run was swapping heavily
>> and was automatically killed since it had used all the memory
>> (RAM + swap).
>>
>> I then decided to restart it on 24 nodes using the first
>> checkpoint file (the one produced by the job running on 24 nodes) and now
>> it runs without any problem.
>>
>> Has anybody seen something like this? I thought it was not a
>> problem to restart a Carpet run on a different number of nodes.
>>
>>
>> Cheers,
>> Bruno
>>
>> --
>> Dr. Bruno Giacomazzo
>> Max Planck Institute for Gravitational Physics
>> Albert Einstein Institute
>> Am Muehlenberg 1
>> D-14476 Potsdam
>> Germany
>>
>> Tel. : +49 331 567 7183
>> Fax : +49 331 567 7252
>> cell. : +49 173 826 4488
>> email : bgiacoma at aei.mpg.de
>>
>> -------------------------------------------------
>> There are only 10 types of people in the world:
>> Those who understand binary, and those who don't
>> -------------------------------------------------
>>
>> _______________________________________________
>> developers mailing list
>> developers at lists.carpetcode.org
>> http://lists.carpetcode.org/listinfo/developers