Coral finally returns!!!

John Shott shott at snf.stanford.edu
Sat Feb 8 10:04:37 PST 2003


SNF Lab Members:

After a painfully long outage, Coral and the file systems are back on-line and
normal lab operations can resume.  Because we had to completely re-load and
re-build things from the ground up, you may notice some other applications
that are either missing or misbehaving.  If you notice any problems, please
report them to coral at snf.stanford.edu.

Following is a bit of a discussion of what went wrong, what we learned, and
how we hope to avert similar outages in the future.  If you don't really care
about that and are simply happy that Coral is operational, you shouldn't feel
compelled to read the remaining material.

So, if you are eager to get back to work, Team Coral apologizes profusely for
the lengthy outage and will strive to avoid similar problems in the future.

Thank you for your patience and your continued support ...

John and Team Coral

Following is a somewhat more detailed discussion of the Coral Outage of '03
...

Last weekend, we suffered what turned out to be a failure of the root
partition and boot disk of the Sunray server. As we had recent and complete
tape backups of everything on this disk, we initially felt that we would be
able to put in a replacement disk, reload it with data from tape, and be back
in business.  As you well know ... it didn't turn out to be a quick process.

It actually took us and Sun field service a while to figure out that it was a
disk failure. Why? Sunray has "Fiber Channel" disks, which connect to
the rest of the system through a high-bandwidth fiber link as opposed to the
more conventional electrical connection onto the main bus.  Error messages
were all related to Fiber Channel failures ... and time was wasted changing
Fiber Channel components before it was finally determined that the real
problem was the boot disk connected to the end of the Fiber Channel.

Once we had the new disk in place ... it still took a long time to bring back
the system.  Why? This was largely due to some additional details and
idiosyncrasies of Fiber Channel disks that we were unaware of, and that were
also overlooked by Sun support and field service personnel until our
problem got elevated to the "Field Service Superstars".  Basically, more time
was consumed because not everything was properly configured ...  and recovery
steps that various hardware/software field service personnel assured us
"should work" simply didn't.  Finally, this outage didn't
drag on because of lack of effort: Bill and Mike were working on this
virtually around the clock much of the week.

So, how will we reduce the likelihood of similar problems in the future? 
While we are still working out the details and the implementation of a
comprehensive plan to eliminate future lengthy outages and to have greater
fault tolerance, we do have some of the first steps of a plan in place:

1. Sunray has been given a second boot disk that will, hopefully, allow us to
quickly recover and run from the second boot disk while we resolve
issues on the first disk.  We plan to add similar capabilities on other
critical systems.

2. Critical file systems (your home directories, the Coral database, etc.)
will be moved to a RAID (Redundant Array of Inexpensive Disks, I think ...)
disk system.  Not only does this better protect the file systems themselves
from hardware failure, but it will allow us to more quickly swap in a
replacement for a failed server.

3. Get "backup Coral servers" in place where possible.  We actually have some
hardware intended for this purpose ... the Coral development
machines.  Unfortunately, we were not able to use them in this case because
these newer machines run a newer operating system (Solaris 8 vs. 2.6) and a
newer database (Oracle 9i vs. 8), and the development version of Coral is not
quite "ready for prime time".  This combination of factors made it difficult
to move "Production Coral" to the development hardware.  However, once we
complete the transition from the old OS and database to the new ones, we
will also be in a position to use the development server hardware as
"emergency backup" Coral servers by connecting it to the "production database"
that will be on the RAID disks.  In other words, we hope to make our
development environment "look like" our production environment in terms of OS
and database, so that we can use it as a backup to our production servers.

So, let me once again apologize for the length of this recent Coral outage. 
We are working hard to learn from these experiences and to take steps both to
increase our ability to withstand similar hardware failures and to switch
quickly to backup hardware while we resolve problems with the primary hardware.

Thank you for your continued support,

John and Team Coral


