[Carpet] recovery from checkpoint on a different number of nodes

Bruno Giacomazzo bgiacoma at aei.mpg.de
Mon Jun 4 18:18:13 CEST 2007


Hi,
 	I'm using the 3rd stable version of Carpet (and the development 
version of Cactus) and I had a problem recovering from checkpoint 
files on a different number of nodes.

 	I had a job running on 24 nodes (96 cores). I stopped it, saving 
a checkpoint file, and then restarted it on 32 nodes (128 cores) to see 
whether it would run faster on more nodes. It did not because, as I 
found out later, the job was swapping. I stopped this new job, creating 
a second checkpoint file, and restarted it on 24 nodes from that second 
checkpoint file. This run swapped heavily and was automatically killed 
because it had used all the memory (RAM + swap).

 	I then decided to restart it on 24 nodes using the first 
checkpoint file (the one produced by the job running on 24 nodes) and now 
it runs without any problem.

 	Has anybody seen anything like this? I thought it was not a 
problem to restart a Carpet run on a different number of nodes.


Cheers,
Bruno

-- 
Dr. Bruno Giacomazzo
Max Planck Institute for Gravitational Physics
Albert Einstein Institute
Am Muehlenberg 1
D-14476 Potsdam
Germany

Tel.  : +49 331 567 7183
Fax   : +49 331 567 7252
cell. : +49 173 826 4488
email : bgiacoma at aei.mpg.de

-------------------------------------------------
There are only 10 types of people in the world:
Those who understand binary, and those who don't
-------------------------------------------------


