Coral "issues" .....
shott at stanford.edu
Wed Nov 19 17:49:46 PST 2008
SNF Lab Members:
Sigh .... we've had some Coral outages this week .... and for that I
apologize. Let me explain the set of things that have happened.
Recently (well, for the past few days) a number of you have noticed that
Coral clients in the lab will occasionally fail to indicate that a tool
has been enabled or disabled. In other words, the clients occasionally
fail to update. For about a week, the part of Coral that sends out
these notifications to all of the local clients has been getting hung up
during times of high activity. Once this has gotten hung up, we need to
restart the Coral servers. As a result, the Coral servers have been
restarting 3 to 4 times per day. Note: there is an automatic means of
detecting that the servers are hung, so even if Bill or I are not
around, the servers will restart themselves after a while.
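For those curious how this kind of automatic hang detection can work in general, here is a minimal sketch of a heartbeat watchdog. It is not Coral's actual code: the class name HangWatchdog, the 60-second timeout, and the restart callback are all assumptions made for illustration. The idea is that the monitored service reports in periodically, and a separate check restarts it if it has been silent for too long.

```python
import time
from typing import Callable


class HangWatchdog:
    """Hypothetical watchdog: restarts a service that stops sending heartbeats."""

    def __init__(self, timeout_s: float, restart: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s      # silence longer than this counts as hung
        self.restart = restart          # assumed callback that restarts the service
        self.clock = clock              # injectable clock, which makes testing easy
        self.last_beat = clock()
        self.restarts = 0

    def beat(self) -> None:
        """Called by the monitored service whenever it makes progress."""
        self.last_beat = self.clock()

    def is_hung(self) -> bool:
        """True if no heartbeat has arrived within the timeout."""
        return self.clock() - self.last_beat > self.timeout_s

    def check(self) -> bool:
        """Restart the service if it looks hung; return True if a restart ran."""
        if not self.is_hung():
            return False
        self.restart()
        self.restarts += 1
        # Give the freshly restarted service a full timeout before checking again.
        self.last_beat = self.clock()
        return True
```

In a setup like this, check() would be run on a timer by a separate process, which is why a restart happens only "after a while": the watchdog has to wait out the timeout before it can conclude that the service is hung rather than just busy.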
Interestingly, the part of Coral that has been getting hung up is code
that has not been changed in probably 5 years. Also, the version of
Java that we are running has not been updated in a number of months. It
seems as if this must be related to some underlying networking problem
at the socket level .... but the cause has certainly been perplexing.
Earlier this afternoon, however, we heard reports that people couldn't
get in on Remote Coral. This turned out to be a hardware failure of our
4-channel network card. To replace the card we had to take the Coral
servers down which, of course, prevented local Coral sessions from running.
But at this point (5:35 p.m. on Wednesday) the Coral servers are back
up, the network card has been replaced, and both local and remote Coral
sessions seem to be functioning properly.
Were our problems with the event server that caused the server restarts
and the client update problems earlier this week due to some early
problems with the network card that completely failed this afternoon?
We don't yet know .... but we are monitoring closely.
If the Event Manager continues to hang this week, however, we are
prepared to move the Coral servers onto a new platform. Doing this will
require a planned outage of Coral early Saturday morning, from about
4 a.m. to 7 a.m. We will send out another announcement, however, if we
need to follow through on that this weekend.
We apologize for these inconveniences .... Team Coral tries hard to
deliver good availability and we know that access has been less reliable
than normal this week.
Thank you for your continued support,