Event Manager ...

John Shott shott at snf.stanford.edu
Tue Jan 16 09:24:06 PST 2001


Bill and Mike:

I've spent some time looking at the event manager code and also playing around
with some modifications that, I think, will improve things somewhat.  However,
there are also a couple of things that are less easy to do but would probably
improve performance.

Virtually all of the stuff I've found of note lives in the file
labnet/event/server/EventManagerImpl.java.  The versions that I've been
playing with are either in ~shott/labnet or /home/labnet/.prod-install/labnet.

Most of the things that I've found and been looking at are in the postEvent
method ... with a couple of minor things in the unsubscribeAll method.

I've also tried to add some performance monitoring logging that I'll describe
briefly.

Basically, the core of the event server is a single vector called subscribers
that maintains a list of each subscriber and the event to which it is
subscribing.  The postEvent method basically looks through this vector for a
match between the eventID of the posted event and that of each
subscriber/event pair.  If the eventIDs match, it tries to post the event to
that subscriber.  If that fails, it assumes that the subscriber no longer exists
and tries to do an "unsubscribeAll".  (A rough sketch of this loop follows the
list below.)  At this level, this all seems logical and clean.  However, there
are a couple of deficiencies ....

1. The current code isn't doing a great job of cleaning up after itself when
clients die ... I think that I've found part of the reason, but maybe not
everything.  The end result is that the subscribers vector grows nearly
monotonically in size.  This is, I believe, the primary reason that the
event server gets slow over time.  With the improved logging that I'll
describe shortly, we should be able to quantify this performance
degradation.

2. There is a single subscribers vector that contains subscriptions for all
events for all clients.  That vector has to be searched completely for every
posting ... this is inefficient because each event has only one ID, but we
look through the subscriptions for all of the events anyhow.
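
For reference, here is a rough sketch of that loop as I understand it.  The
Subscription pairing class, its fields, and the exact postedEvent() signature
are my own shorthand rather than a quote from EventManagerImpl.java:

    // Rough sketch (not the actual code) of the postEvent() delivery loop.
    // "Subscription" is my shorthand for the subscriber/event pairs stored
    // in the subscribers vector.
    public synchronized void postEvent(Timestamp aTimestamp, long loid, Any news) {
        Vector badsubs = new Vector();
        for (int i = 0; i < subscribers.size(); i++) {
            Subscription sub = (Subscription) subscribers.elementAt(i);
            if (sub.eventID != loid) {
                continue;                              // not subscribed to this event
            }
            try {
                sub.subscriber.postedEvent(aTimestamp, loid, news);  // deliver to the client
            } catch (Exception e) {
                badsubs.addElement(sub.subscriber);    // assume the client has died
            }
        }
        // clean up every subscription belonging to a client that failed to answer
        for (int i = 0; i < badsubs.size(); i++) {
            unsubscribeAll((EventSubscriber) badsubs.elementAt(i));
        }
    }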

Before I talk about specific changes, let me talk about the performance
logging that I have added ... in fact, I've added this incrementally, and some
of the things that I've added have already helped me to understand
deficiencies in the current implementation.

Since the postEvent method is the core of the event server, I've concentrated
on the logging there.

For each postEvent, I now print the current time, the eventID, the size of the
subscribers vector, the number of successful postings, and the number of
failed postings.
I also log the difference in time between the timestamp of the incoming event
(I'm not 100% sure, but I think that this is the time that the relevant server
received a request for action from the client) and when postEvent starts. 
Finally, I log the amount of time that postEvent itself uses.

A typical entry in evtmgr.log now looks like:
2001-01-15 14:23:35 Posted event: 102 (51, 4, 0) Elapsed time: 20 + 6 msec
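
In case it helps to see exactly what the two numbers mean, the timing I added
looks roughly like the following.  Here "log" stands for the evtmgr.log writer
and toMillis() stands in for however the incoming Timestamp gets converted to
milliseconds, so the names are approximate:

    long start = System.currentTimeMillis();
    long queueDelay = start - toMillis(aTimestamp);   // request timestamp to postEvent start ("20")
    int posted = 0;
    int failed = 0;
    // ... deliver the event to each matching subscriber, bumping posted/failed ...
    long postTime = System.currentTimeMillis() - start;   // time spent inside postEvent itself ("6")
    java.text.SimpleDateFormat fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    log.println(fmt.format(new java.util.Date(start))
            + " Posted event: " + loid
            + " (" + subscribers.size() + ", " + posted + ", " + failed + ")"
            + " Elapsed time: " + queueDelay + " + " + postTime + " msec");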

From watching these logs, it is clear that the size of the subscribers vector
grows and that we aren't getting complete cleanup ... This is true even if a
user properly uses the "Exit" on the client.  This caused me to look at the
unsubscribeAll(obj) and unsubscribe(obj, eventID) methods, where obj is an
EventSubscriber.  The unsubscribe methods basically search through the
subscribers vector and compare the IOR of each subscription to the IOR of the
person wishing to unsubscribe.

Here I've found two things: one minor and one important:

The minor thing: the IOR of the person wishing to unsubscribe (objior) is
recomputed on every pass through the loop over the vector when it only needs
to be determined once, before entering the loop.

The major thing:  David uses an Enumeration to loop through the elements of
subscribers.  If he finds a match (in other words, a subscription to be
removed), he removes it using subscribers.removeElement.  I believe that this
causes an error and that we don't remove everything that we should be removing
on a single pass.  For example, if we are currently examining element 15 and
remove it, then original element 16 becomes element 15 ... the next time
through the Enumeration we examine the new element 16 (which is original
element 17) ... and original element 16 never gets examined.  Because
subscriptions from a single IOR are added sequentially, the end result is that
we only remove half of the subscriptions that we should be removing.

I believe that we should be using an Iterator instead of an Enumeration ...
the Iterator has an iterator.remove() method that removes the current element
of whatever is being iterated ... and when you then call iterator.next(), you
get the true next element instead of skipping over one as happens with an
Enumeration.  Note: apparently Iterators are not thread safe ... but, if I
understand things correctly, the fact that the unsubscribe(),
unsubscribeAll(), and postEvent() methods are synchronized solves this
problem.
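
Here is a rough sketch of what I have in mind for unsubscribeAll().  Again, the
Subscription class is my shorthand, and getting the caller's IOR via
orb.object_to_string() is an assumption about how the real code does it:

    // Sketch of unsubscribeAll() using an Iterator instead of an Enumeration.
    public synchronized void unsubscribeAll(EventSubscriber obj) {
        String objior = orb.object_to_string(obj);   // hoisted: compute the caller's IOR once
        int removed = 0;
        java.util.Iterator it = subscribers.iterator();
        while (it.hasNext()) {
            Subscription sub = (Subscription) it.next();
            if (objior.equals(sub.ior)) {
                it.remove();                         // removes the current element; next()
                removed++;                           // still returns the true next element
            }
        }
        log.println("Client close removing: " + removed + " subscriptions.");
    }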

I now have a version of the event server running that uses an Iterator (and
pulls the determination of objior out of the loop).  This has been
running since Sunday ... I'd like to shut it down at the end of the day and
rerun the original version (but with the performance logging outlined above) to
see what improvements we have made.

Even with these changes, however, performance degrades over time because we
still don't get full cleanup and (I think) because we have one giant
subscribers vector.  Actually, if people do an "Exit" we get full cleanup now
... this is seen in the log file as a sequence of 5 entries that look like:
Client close removing: 6 subscriptions.
Client close removing: 0 subscriptions.
Client close removing: 4 subscriptions.
Client close removing: 0 subscriptions.
Client close removing: 2 subscriptions.
(Note: the order doesn't have to be the same ... but the equipment client
subscribes to 6 events, the maintenance client subscribes to 4, the
reservation client subscribes to 2 and the history and staff clients don't
subscribe to anything).

There are also lots of log entries that have just a single line:
Client close removing: 6 subscriptions.
preceded by:
sub.postedEvent: org.omg.CORBA.COMM_FAILURE: minor code: 1 completed: No

This is the case where a subscriber didn't answer a postedEvent() call:
postEvent() adds that subscriber to the badsubs vector and then tries to
unsubscribe it from all events.  However, that is only one of the clients
belonging to a "client session".  (Note: usually it is the equipment client
because there are more equipment events than any other kind ...)  As a result,
the size of subscribers is larger than it should be for folks who don't exit
properly, because they have to fail to receive an event for the equipment
client, for the maintenance client, and for the reservation client before they
are fully cleaned out of the subscribers vector.  (Note: of course folks who
still have valid, but inactive, sessions also cause the subscribers vector to
be larger than it should be.)

What additional things could or should be done:

1. (Fairly easy, I think) ... currently any exception in the try/catch block
of postEvent() is assumed to mean a dead client.  To be precise, we should
treat only COMM_FAILURE exceptions with minor code 1 as a "dead client" ... and
still catch and report other errors, which might be trying to tell us something
different.  (A sketch of what I mean follows item 3 below.)

2. (Not too hard ... I think)  We should try to break subscribers into
multiple lists of subscribers ... for example, we could have a separate
subscribers vector for each event (maybe call them subscribers_101,
subscribers_102, etc. ...).
Then a call to postEvent(Timestamp aTimestamp, long loid, Any news)
would take the eventID (loid), determine the proper subscriber vector
(using maybe a small hashtable with eventID as a key) ... and then call
postEvent(Timestamp, Vector event_subscribers, Any news).  (A sketch of this
also follows item 3 below.)  On average, each of the 10 or so lists would only
be about 1/10 as long as the current subscribers list.  Plus, I think that it
might be possible to handle postings to different events (101 vs. 108, for
example) "simultaneously" if we allowed that to happen on multiple threads ...
although I'm out of my element understanding threads.

Note: unsubscribeAll() would have to go through all 10 or so lists ... but the
total number of elements is no more than it would have been the old way.

3. (Not so easy ...)  To do a better job of "cleaning up", a COMM_FAILURE from
one client should not only unsubscribe all events associated with that IOR
but all events associated with all of the IORs belonging to that particular
client ... at the moment, however, I don't know that we have that
information ... although I bet that a combination of the auth manager and the
admin manager probably could put that information together.  However, this
sounds like a lot of work, and I'm not sure that it would be worth it unless
we made the other changes and determined that the event manager performance
was still degrading.
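
For item 1, the narrower catch would look roughly like this inside the delivery
loop of postEvent().  badsubs and the posted/failed counters are the ones I
described above; the exact shape of the real code may differ:

    try {
        sub.subscriber.postedEvent(aTimestamp, loid, news);
        posted++;
    } catch (org.omg.CORBA.COMM_FAILURE e) {
        if (e.minor == 1) {
            badsubs.addElement(sub.subscriber);      // dead client: schedule for unsubscribeAll
        } else {
            log.println("sub.postedEvent: " + e);    // unexpected COMM_FAILURE: report it
        }
        failed++;
    } catch (Exception e) {
        log.println("sub.postedEvent: " + e);        // some other problem: report, don't unsubscribe
        failed++;
    }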
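
For item 2, here is a sketch of the eventID-keyed lookup.  All of the names
here (subscribersByEvent, vectorFor) are mine, not anything that exists in the
code today:

    // One subscriber vector per eventID, found through a small Hashtable.
    private java.util.Hashtable subscribersByEvent = new java.util.Hashtable();

    // find (or create) the vector holding the subscriptions for one eventID;
    // subscribe() and unsubscribe() would use the same lookup
    private Vector vectorFor(long eventID) {
        Long key = new Long(eventID);
        Vector v = (Vector) subscribersByEvent.get(key);
        if (v == null) {
            v = new Vector();
            subscribersByEvent.put(key, v);
        }
        return v;
    }

    public synchronized void postEvent(Timestamp aTimestamp, long loid, Any news) {
        // only the (much shorter) per-event vector gets scanned
        postEvent(aTimestamp, vectorFor(loid), news);
    }

    // the existing delivery loop, now working on just one per-event vector
    private void postEvent(Timestamp aTimestamp, Vector event_subscribers, Any news) {
        // ... same matching/posting/badsubs logic as today, but over event_subscribers ...
    }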

My current plan is to try to look at the evtmgr.log today ... then probably
restart the old version (with the performance monitoring outlined above)
tomorrow to see if the Iterator changes (in particular) have improved anything
significantly.

I apologize for the length of this message, but thought I'd try to write down
what I've learned and observed.  Any thoughts, feedback, and reactions would
be most welcome.

Thanks,

John

