Coral outage this evening ...

John Shott shott at
Sat May 31 23:34:22 PDT 2003

Bill and Mike:

I received a "Coral is Dead" phone call this evening.  When I tried to check
it out from home, I had a hard time reaching anything, network-wise.

I came in and found that /var/log/coral/eqmgr.log reported some failures in
reaching the walker box.

And ping and traceroute couldn't get to walker from atu ....

Also, on the atu screen, there are a bunch of in.mpathd errors ...

Not knowing exactly what to do, I did 2 things:

1. I power cycled our switch.

2. I did an "/etc/inetinit stop" and "/etc/inetinit start" on atu.

The combination of these things got things going again ... although I don't
know why.

I'm worried that all of those in.mpathd errors on atu aren't healthy, and I'm
worried that the fact that those two actions "fixed" a significant network
outage indicates that the problem is on our side of the world.  I suggest that
on Monday we look at how often those errors appear in /var/adm/messages.* and
see whether that matches how often we see network problems elsewhere.
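
As a starting point for that comparison, here is a rough Python sketch that
counts in.mpathd lines per hour.  It assumes the messages files use the usual
syslog layout ("Mon DD HH:MM:SS host tag[pid]: text"), which I have not
checked against atu.  The function names and the glob pattern are made up for
illustration.

```python
import glob
import re
from collections import Counter

# Assumption: classic syslog layout, e.g. "Mon DD HH:MM:SS host in.mpathd[pid]: ...".
# The day of month may be space-padded, hence the \s+ between month and day.
MPATHD_LINE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}"
    r"\s+\S+\s+in\.mpathd\b"
)


def count_mpathd_by_hour(lines):
    """Return a Counter mapping 'Mon DD HH:00' to the number of in.mpathd lines."""
    counts = Counter()
    for line in lines:
        m = MPATHD_LINE.match(line)
        if m:
            key = "%s %2d %s:00" % (m.group("mon"), int(m.group("day")), m.group("hour"))
            counts[key] += 1
    return counts


def count_in_files(pattern):
    """Sum the per-hour counts over every file matching a glob pattern."""
    counts = Counter()
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps stray non-text bytes from aborting the scan.
        with open(path, errors="replace") as f:
            counts.update(count_mpathd_by_hour(f))
    return counts
```

Something like count_in_files("/var/adm/messages*") would cover the current
file plus the rotated ones.  The result is not sorted by time, and since
syslog timestamps carry no year, counts from different years would merge if
the files go back that far.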

The fact that a network problem is causing us to get the "Coral is dead"
calls indicates that this is a serious problem, and I think that we need to
come up with a strategy to get to the bottom of this ASAP.



