[Carpet] Bugzilla Bug 118
Erik Schnetter
schnetter at cct.lsu.edu
Sat May 3 04:29:56 CEST 2008
On May 2, 2008, at 17:18:51, Ian Hinder wrote:
> Erik Schnetter wrote:
>> On Feb 22, 2008, at 09:00:20, Luca Baiotti wrote:
>>
>>> Hallo Erik, hallo Thomas, hallo everyone.
>>>
>>> Is there any news about Carpet Bugzilla Bug 118 filed by Bela long ago?
>>>
>>> I might have encountered it again, in a simulation that ran for a few
>>> weeks and had performed > 10^5 iterations. The run had recently
>>> recovered, ~300 iterations before this fatal error:
>>>
>>> INFO (CarpetRegrid2): Centre 0 is at position [0,0,0]
>>> INFO (CarpetRegrid2): Centre 1 is at position [0,0,0]
>>> INFO (CarpetRegrid2): Centre 2 is at position [0,0,0]
>>> INFO (CarpetRegrid2): Regridding
>>> INFO (Carpet): Grid structure statistics:
>>> INFO (Carpet): GF: rhs: 371k active, 485k owned (+31%), 843k total (+74%), 46.9 steps/time
>>> INFO (Carpet): GF: vars: 85, pts: 263M active, 330M owned (+25%), 568M total (+72%), 1.0 comp/proc
>>> INFO (Carpet): GA: vars: 581, pts: 3M active, 3M total (+0%)
>>> INFO (Whisky): Setting up the atmosphere mask: all points are not_atmosphere
>>> INFO (Whisky): Setting up the atmosphere mask: all points are not_atmosphere
>>> INFO (Whisky): Setting up the atmosphere mask: all points are not_atmosphere
>>> WARNING level 0 in thorn CarpetLib processor 0 host node0052.admin
>>> (line 257 of /data1/baiotti/Cactus/configs/belladonna_NoDebug/build/CarpetLib/gdata.cc):
>>>
>>> -> Internal error: extrapolation in time. time=3153 times=[3153,3152.94,3152.87]
>>
>> This looks as if there was an accumulation of floating point error.
>> This should not happen -- the times of the individual levels should be
>> re-synchronised when the levels are aligned in time. This may not
>> happen correctly.
>>
>>> Actually this might be a different problem. Of course I am producing
>>> output with verbose=yes, but debug information will require more time,
>>> since the machines I am using have problems running with DEBUG = yes.
>>> So I first wanted to know whether there is any new info about this.
>>>
>>>
>>> While I was writing this email, the run with increased verbosity
>>> finished. You can find the 49Mb output file on belladonna in
>>> /home/baiotti/tmp
>>
>>
>> The output file says that there is a core file. Can you load the core
>> file into a debugger and output both "time" and "times" with very high
>> accuracy?
>
> I am resurrecting this thread from Feb 2008.
>
> After recovering from a checkpoint file, and regridding several times (I
> don't know if either of these is important), I get the error
>
> INFO (Carpet): Grid structure statistics:
> INFO (Carpet): GF: rhs: 582k active, 1400k owned (+141%), 4524k total (+223%), 163 steps/time
> INFO (Carpet): GF: vars: 60, pts: 114M active, 212M owned (+86%), 624M total (+194%), 1.0 comp/proc
> INFO (Carpet): GA: vars: 389, pts: 3M active, 3M total (+0%)
> WARNING level 0 in thorn CarpetLib processor 0 host i105-204.ranger.tacc.utexas.edu
> (line 263 of /share/home/00915/hinder/Cactus/configs/ian_c4_mvd/build/CarpetLib/gdata.cc):
>
> -> Internal error: extrapolation in time. time=2737.0000000000009 times=[2736.9999999999995,2735.9999999999995,2734.9999999999995]
> TACC: MPI job exited with code: 1
> TACC: Shutting down parallel environment.
> TACC: Shutdown complete. Exiting.
>
> The eps with which the times are being compared in CarpetLib is 1e-12,
> but the two times here are different by 1.36e-12, so the error is
> triggered. I am rerunning with a larger eps value to make sure nothing
> else is wrong. This happens at t = 586.500 - I don't know why Carpet is
> saying 2736.
This counts coarse grid time steps, not physical time. This time
comes from the internals, which don't know about physical coordinates.
> I'm guessing the times are not being synchronized correctly, as mentioned
> in the email quoted above?
Yes, I assume that the global time is not correctly transferred into
the individual level's times, so that the differing time stepping
there leads to accumulated floating point error. Actually, if you
take 2735 * 1e-16 you arrive at the right order of magnitude.
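The accumulation described here can be illustrated with a minimal Python sketch. It is not Carpet code: the step size dt, the step count and the helper accumulate() are hypothetical, chosen only to show that a time advanced by repeated addition can drift away from one computed directly, and to evaluate the 2735 * 1e-16 estimate.

```python
import sys


def accumulate(dt, nsteps):
    """Advance a time by adding dt once per step, as a time stepper would."""
    t = 0.0
    for _ in range(nsteps):
        t += dt
    return t


# Hypothetical values: 0.1 is not exactly representable in binary,
# so each addition may introduce a rounding error.
dt = 0.1
nsteps = 27350

t_stepped = accumulate(dt, nsteps)  # time built up step by step
t_direct = nsteps * dt              # time computed with one multiplication
drift = abs(t_stepped - t_direct)

# Estimate from the message: magnitude of the time (2735) multiplied by
# a relative rounding error of about 1e-16.
estimate = 2735 * 1e-16

print(f"stepped={t_stepped!r} direct={t_direct!r} drift={drift:.3e}")
print(f"estimate={estimate:.3e} machine epsilon={sys.float_info.epsilon:.3e}")
```

A step size that is a power of two, such as 0.5, accumulates without rounding in this sketch, so any drift depends on how each level's time is actually computed.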
I expect that increasing eps will help. 1e-10 is probably harmless,
1e-6 is probably too large an error to be acceptable.
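To see why the choice of eps matters, here is a small Python sketch of an absolute-tolerance check applied to the values from the warning quoted above. The function is_extrapolating() is a hypothetical stand-in, not the actual test in gdata.cc, and it assumes that eps is an absolute tolerance on times measured in coarse grid steps.

```python
def is_extrapolating(time, times, eps):
    """Return True if time lies outside [min(times), max(times)] by more than eps."""
    return time < min(times) - eps or time > max(times) + eps


# Values copied from the warning message.
time = 2737.0000000000009
times = [2736.9999999999995, 2735.9999999999995, 2734.9999999999995]

# Distance by which the requested time exceeds the newest stored time.
gap = time - max(times)
print(f"gap = {gap:.3e}")

# Try the original tolerance and the two larger ones discussed in the text.
for eps in (1e-12, 1e-10, 1e-6):
    print(f"eps={eps:g}: extrapolating={is_extrapolating(time, times, eps)}")
```

Under these assumptions the check fires for eps = 1e-12 and passes for 1e-10, since the gap is about 1.36e-12.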
I made some corrections to this recently, but that was more than a
month ago. I checked, and I did not push a patch after the version
you are using, although I did not actually locate the patch itself.
-erik
--
Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from www.keyserver.net.