[Carpet] recovery from checkpoint on a different number of nodes
Bruno Giacomazzo
bgiacoma at aei.mpg.de
Mon Jun 4 18:18:13 CEST 2007
Hi,
I'm using the third stable version of Carpet (with the development
version of Cactus), and I ran into a problem recovering from checkpoint
files on a different number of nodes.
I had a job running on 24 nodes (96 cores). I stopped it, saving a
checkpoint file, and then restarted it on 32 nodes (128 cores) to see
whether it would run faster with more nodes. It did not because, as I
found out later, the job was swapping. I stopped this new job, creating
a new checkpoint file, and restarted it on 24 nodes from that new
checkpoint file. This run swapped heavily and was automatically killed
because it had used all the available memory (RAM + swap).
I then decided to restart it on 24 nodes using the first
checkpoint file (the one produced by the original 24-node job), and now
it runs without any problems.
Has anybody seen anything like this? I thought it was not a
problem to restart a Carpet run on a different number of nodes.
Cheers,
Bruno
--
Dr. Bruno Giacomazzo
Max Planck Institute for Gravitational Physics
Albert Einstein Institute
Am Muehlenberg 1
D-14476 Potsdam
Germany
Tel. : +49 331 567 7183
Fax : +49 331 567 7252
cell. : +49 173 826 4488
email : bgiacoma at aei.mpg.de
-------------------------------------------------
There are only 10 types of people in the world:
Those who understand binary, and those who don't
-------------------------------------------------