Coral acting funny.
shott at stanford.edu
Fri Nov 14 07:00:44 PST 2008
Yes, for several days .... I forget when it started .... the event
manager periodically gets hung up. It is the event manager that sends
out all of the notifications to Coral clients every time there is an
enable, disable, shutdown, problem, reservation, etc.
What is perplexing is that there have been no changes to our code or the
version of Java for a long time .... so what is happening?
Note: we've got several things that try to detect this and to ultimately
restart the servers .... for example, each of the servers sends an "I'm
still alive" heartbeat to the Admin Manager every 5 minutes and, if it
doesn't hear from one of the servers, it initiates a restart. However,
in this case, the Event Manager is still sufficiently alive to be
sending heartbeats ....
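
For illustration only, here is a minimal sketch of the kind of heartbeat check described above. It is not the actual Admin Manager code: the class name HeartbeatWatchdog, its methods, and the server names are all hypothetical, and only the 5-minute interval comes from the description. It also shows why the check can miss this failure: a server that keeps reporting in never becomes overdue, even if its real work is stuck.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not Coral's actual Admin Manager implementation.
final class HeartbeatWatchdog {
    private final Duration timeout;
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

    HeartbeatWatchdog(Duration timeout) {
        this.timeout = timeout;
    }

    // Called whenever a server reports "I'm still alive".
    void heartbeat(String server, Instant when) {
        lastSeen.put(server, when);
    }

    // Servers whose last heartbeat is older than the timeout (restart candidates).
    List<String> overdue(Instant now) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Instant> e : lastSeen.entrySet()) {
            if (Duration.between(e.getValue(), now).compareTo(timeout) > 0) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

If the heartbeat is sent from its own thread, it only proves that thread is running, which would be consistent with a hung event dispatcher still looking healthy to this check.
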
At the operating system level, we look for things like a deadlocked
database or for serious communications errors .... but this particular
failure doesn't trigger those either.
Finally, we monitor the "thread count" of each of the servers because we
have noticed that when this happens, the thread count begins to increase
.... basically one new thread for every enable/disable, etc. (we think).
That, in fact, ultimately triggered a restart .... but because there was
limited activity late at night, it didn't restart them until early this
morning.
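
As a rough sketch of a thread-count check like the one described above (again, not our actual monitoring code; the class name ThreadCountMonitor and the limit values are made up for illustration), the JVM's own thread count can be read through the standard ThreadMXBean and compared against a limit:

```java
import java.lang.management.ManagementFactory;
import java.util.function.IntSupplier;

// Hypothetical sketch; names and limits are assumptions, not Coral's code.
final class ThreadCountMonitor {
    private final IntSupplier threadCount;
    private final int limit;

    // The count source is injectable so the check can be exercised without a real leak.
    ThreadCountMonitor(IntSupplier threadCount, int limit) {
        this.threadCount = threadCount;
        this.limit = limit;
    }

    // Uses the live thread count of the current JVM.
    static ThreadCountMonitor forThisJvm(int limit) {
        return new ThreadCountMonitor(
                () -> ManagementFactory.getThreadMXBean().getThreadCount(), limit);
    }

    // True once the thread count has climbed past the limit.
    boolean shouldRestart() {
        return threadCount.getAsInt() > limit;
    }
}
```

If each event really does leak about one thread, a check like this only fires after enough events have occurred, which would match the delayed restart during a quiet night.
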
At the moment, we are collecting data to see if there is a correlation
between the runaway mozilla processes on the Sunrays and the times when
this happens. We are also exploring whether University changes to
firewalls have somehow created a problem that is impacting us.
In any event, while we continue to track this down, sending email to
coral at snf.stanford.edu as soon as you or anyone else notices this is
usually the best way to make us aware of the problem .... and folks in
the lab often are aware before any of our automatic detection schemes do.