Coral acting funny.

John Shott shott at
Fri Nov 14 07:00:44 PST 2008


Yes, for several days .... I forget when it started .... the event 
manager periodically gets hung up.  It is the event manager that sends 
out all of the notifications to Coral clients every time there is an 
enable, disable, shutdown, problem, reservation, etc.

What is perplexing is that there have been no changes to our code or the 
version of Java for a long time .... so what is happening?

Note: we've got several things that try to detect this and to ultimately 
restart the servers .... for example, each of the servers sends an "I'm 
still alive" heartbeat to the Admin Manager every 5 minutes and, if it 
doesn't hear from one of the servers, it initiates a restart.  However, 
in this case, the Event Manager is still sufficiently alive to be 
sending heartbeats ....
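
For anyone curious how the heartbeat check works in principle, here is a rough sketch. This is not the actual Coral code: the class name, the method names, and the choice of Python are all just assumptions for illustration, and the only number taken from our setup is the 5-minute interval.

```python
import time


class HeartbeatMonitor:
    """Hypothetical stand-in for the check the Admin Manager performs."""

    def __init__(self, timeout_seconds=300, clock=time.monotonic):
        # 300 s mirrors the 5-minute heartbeat interval; a real check
        # would probably allow some extra grace time (assumption).
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def register(self, server):
        # Treat registration as the first heartbeat.
        self.last_seen[server] = self.clock()

    def heartbeat(self, server):
        # Called whenever a server reports "I'm still alive".
        self.last_seen[server] = self.clock()

    def overdue(self):
        # Servers that have been silent longer than the timeout
        # are the candidates for a restart.
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The weakness is visible in the sketch: a server stays off the overdue list as long as whatever sends its heartbeat keeps running, even if the part of the server that does the real work is stuck.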

At the operating system level, we look for things like a deadlocked 
database or for serious communications errors .... but this particular 
failure doesn't trigger those either.

Finally, we monitor the "thread count" of each of the servers because we 
have noticed that when this happens, the thread count begins to increase 
.... basically one new thread for every enable/disable etc (we think).
That, in fact, ultimately triggered a restart .... but because there was 
limited activity late at night, it didn't restart them until early this 
morning.
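
To illustrate the thread-count check, here is a small sketch. Again, this is an assumption-laden illustration rather than our real monitoring: the function names and the limit are hypothetical, and the "stuck" threads are simulated by having each one wait on an event that never fires until we release it.

```python
import threading


def needs_restart(thread_count, limit):
    # Flag a restart once the thread count exceeds the (hypothetical) limit.
    return thread_count > limit


def simulate_stuck_notifications(n, release):
    # Each simulated enable/disable spawns a thread that blocks until
    # `release` is set, imitating a notification that never completes.
    threads = []
    for _ in range(n):
        t = threading.Thread(target=release.wait, daemon=True)
        t.start()
        threads.append(t)
    return threads


if __name__ == "__main__":
    release = threading.Event()
    baseline = threading.active_count()
    stuck = simulate_stuck_notifications(5, release)
    print(needs_restart(threading.active_count(), baseline + 3))
    release.set()
    for t in stuck:
        t.join()
```

Because the count only grows when there is activity, a fixed limit like this takes a long time to trip during a quiet period, which matches what we saw overnight.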

At the moment, we are collecting data to see if there is a correlation 
between the runaway mozilla processes on the Sunrays and the times when 
this happens.  We are also exploring whether University changes to 
firewalls have somehow created a problem that is impacting us.

In any event, while we continue to track this down, sending email to 
coral at as soon as you or anyone else notices this is 
usually the best way to make us aware of the problem .... and folks in 
the lab often are aware before any of our automatic detection schemes 
kick in.



