Coral "issues" .....

John Shott shott at stanford.edu
Wed Nov 19 17:49:46 PST 2008


SNF Lab Members:

Sigh .... we've had some Coral outages this week .... and for that I 
apologize.  Let me explain what has happened and where things currently 
stand:

Recently (well, for the past few days) a number of you have noticed that 
Coral clients in the lab will occasionally fail to indicate that a tool 
has been enabled or disabled.  In other words, the clients occasionally 
fail to update.  For about a week, the part of Coral that sends out 
these notifications to all of the local clients has been getting hung up 
during times of high activity.  Once this has gotten hung up, we need to 
restart the Coral servers.  As a result, the Coral servers have been 
restarting 3 to 4 times per day.  Note: if we aren't around to restart 
the servers, there is an automatic means of determining that the servers 
are hung so that even if Bill or I are not around, the servers will 
restart themselves after a while.
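
For the curious, the general idea behind that kind of automatic restart 
can be sketched in a few lines. This is only an illustration, not the 
actual Coral mechanism: it assumes the monitored server reports a 
periodic "heartbeat" and is treated as hung once no heartbeat has been 
seen for a timeout. The names HangWatchdog, beat, check, and the restart 
callback are all hypothetical.

```python
import time
from typing import Callable


class HangWatchdog:
    """Hypothetical heartbeat watchdog: restart a service that goes quiet."""

    def __init__(self, timeout_s: float, restart: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s    # silence longer than this means "hung"
        self.restart = restart        # assumed callback that restarts the servers
        self.clock = clock            # injectable so the logic can be tested
        self.last_beat = clock()
        self.restarts = 0

    def beat(self) -> None:
        """Called by the monitored service whenever it makes progress."""
        self.last_beat = self.clock()

    def check(self) -> bool:
        """Restart the service if the heartbeat is stale; return True if restarted."""
        if self.clock() - self.last_beat <= self.timeout_s:
            return False
        self.restart()
        self.restarts += 1
        # Give the freshly restarted service a full timeout before the next check.
        self.last_beat = self.clock()
        return True
```

In a setup like this, something outside the monitored server (a cron 
job or a small companion process, say) would call check() every minute 
or so, which is why a restart can happen "after a while" rather than 
immediately.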

Interestingly, the part of Coral that has been getting hung up is code 
that has not been changed in probably 5 years.  Also, the version of 
Java that we are running has not been updated in a number of months.  It 
seems as if this must be related to some underlying networking problem 
at the socket level .... but the cause has certainly been perplexing.

Earlier this afternoon, however, we heard reports that people couldn't 
get in on Remote Coral.  This turned out to be a hardware failure of our 
4-channel network card.  To replace the card we had to take the Coral 
servers down which, of course, prevented local Coral sessions from running.

But at this point (5:35 p.m. on Wednesday) the Coral servers are back 
up, the network card has been replaced, and both local and remote Coral 
sessions seem to be functioning properly.

Were the event server hangs that caused the server restarts and the 
client update problems earlier this week an early symptom of the network 
card that failed completely this afternoon?  We don't yet know .... but 
we are monitoring closely.

If the Event Manager continues to hang this week, however, we are 
prepared to move the Coral servers onto a new platform.  Doing so will 
require a planned outage of Coral early Saturday morning, from about 
4 a.m. to 7 a.m.  If we need to follow through on that this weekend, we 
will send out another announcement.

We apologize for these inconveniences .... Team Coral tries hard to 
deliver good availability and we know that access has been less reliable 
than normal this week.

Thank you for your continued support,

John





