Coral outage this evening ...
shott at snf.stanford.edu
Sat May 31 23:34:22 PDT 2003
Bill and Mike:
I received a "Coral is Dead" phone call this evening. When I tried to check
it out from home, I had a hard time reaching anything, network-wise.
I came in and found that /var/log/coral/eqmgr.log reported some failures in
reaching the walker box.
And ping and traceroute couldn't get to walker from atu ....
Also, on the atu screen, there are a bunch of in.mpathd errors ...
Not knowing exactly what to do, I did 2 things:
1. I power cycled our switch.
2. I did an "/etc/inetinit stop" and "/etc/inetinit start" on atu.
The combination of these things got things going again ... although I don't
I'm worried that all of those mpathd errors on atu aren't healthy, and I'm
worried that the fact that those two actions "fixed" a significant network
outage indicates that the problem is on our side of the world. I suggest that
on Monday we look at the frequency of those errors in /var/adm/messages.* and
see if it corresponds to when we see network problems elsewhere.
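As a starting point for that comparison, here is a rough sketch of how the mpathd entries could be tallied per day. It assumes the files use the usual syslog layout, where each line begins with a month and day (for example "May 31 21:02:11 atu in.mpathd[...]: ..."). The function names and the glob pattern are my own placeholders, not anything that exists on atu today.

```python
import glob
from collections import Counter


def count_mpathd_by_day(lines):
    """Count lines mentioning in.mpathd, keyed by their "Mon DD" date prefix.

    Assumes syslog-style lines whose first two whitespace-separated
    fields are the month and the day of the month.
    """
    counts = Counter()
    for line in lines:
        if "in.mpathd" not in line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # too short to carry a date prefix
        counts[fields[0] + " " + fields[1]] += 1
    return counts


def read_logs(pattern="/var/adm/messages*"):
    """Yield every line from the files matching pattern (hypothetical default).

    The trailing "*" is meant to pick up the current file as well as
    the rotated messages.* copies.
    """
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            yield from handle


def report(counts):
    """Print one "date count" line per day, in the order first seen."""
    for day, total in counts.items():
        print(day, total)
```

Running report(count_mpathd_by_day(read_logs())) on atu should give one line per day, which we could then lay next to the dates of the network problems reported elsewhere.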
The fact that a network problem results in us getting "Coral is dead" calls
indicates that this is a serious problem, and I think that we need to
come up with a strategy to get to the bottom of this ASAP.