Some time back, I answered a question on Stack Overflow about publish/subscribe systems. The direction of my answer was not exactly what the person asking was looking for (he was not interested in performance aspects or design trade-offs), but I think anyone who stumbles here might find the problem interesting. Here is the gist of my answer on Stack Overflow.
The general problem is called content-based publish/subscribe, and if you search for papers in this area, you will get a lot of results. For instance, this paper.
Here are a few things the system would need:
1) A data store for the subscriptions, which needs to hold:
a) The list of subscribers.
b) The list of subscriptions.
2) A means of authenticating the subscription requests and the nodes themselves:
a) The server and subscribers communicate over SSL. If the server is handling thousands of SSL connections, this is a CPU-intensive task, especially if lots of connections are set up in bursts.
b) If all the subscriber nodes are in the same trusted network, SSL may not be needed.
3) Whether we want a push-based or pull-based model:
a) The server can maintain the latest timestamp seen per node, per filter matched. When an event matches a filter, it sends a notification to the subscriber. The client then sends a request, and the server starts sending the matching events.
b) The server matches events against the filters and sends the matching events to the clients in one shot.
The difference between (a) and (b) is that in (a) more state is maintained on the client side, which makes it easier to add subscriber-specific logic later on. In (b) the client is dumb: it has no means of saying that it does not want to receive events for whatever reason (say, a clogged network).
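To make option (a) a bit more concrete, here is a minimal in-memory sketch of the notify-then-pull flow. Everything in it (the `PullServer` class, its method names, the use of plain integer timestamps that only ever increase) is my own assumption for illustration, not part of any real pub-sub library, and it ignores networking entirely.

```python
from collections import defaultdict


class PullServer:
    """Hypothetical sketch of model (a): notify on match, deliver on request."""

    def __init__(self):
        self.events = []                   # list of (timestamp, event dict)
        self.filters = {}                  # (subscriber, filter_name) -> predicate
        self.last_seen = defaultdict(int)  # (subscriber, filter_name) -> timestamp
        self.notifications = []            # stand-in for "send a ping to the client"

    def subscribe(self, subscriber, filter_name, predicate):
        self.filters[(subscriber, filter_name)] = predicate

    def publish(self, timestamp, event):
        # Assumes timestamps are positive and strictly increasing.
        self.events.append((timestamp, event))
        for key, predicate in self.filters.items():
            if predicate(event):
                self.notifications.append(key)  # only a notification, no payload

    def fetch(self, subscriber, filter_name):
        # The client decides when to pull; the server sends what it has not seen.
        key = (subscriber, filter_name)
        predicate = self.filters[key]
        since = self.last_seen[key]
        matched = [(ts, ev) for ts, ev in self.events
                   if ts > since and predicate(ev)]
        if matched:
            self.last_seen[key] = max(ts for ts, _ in matched)
        return [ev for _, ev in matched]
```

Because delivery only happens inside `fetch`, a client that is overloaded can simply delay its request, which is exactly the flexibility that (b) lacks.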
4) How are the events maintained in memory on the server side?
a) The logical model here is a table with columns of strings (C1..CN), where each new row added is a new event.
b) We could have a hash table per column storing a tuple of (timestamp, pointer to event structure), with each event given a unique id. With different data structures, we can come up with different schemes.
c) Events here are considered an infinite stream. If we have a 32-bit eventId, there is a chance of integer overflow.
d) If we have a timer function on the server that matches and dispatches events, what is the actual resolution of the system timer? Does that have any implications?
e) Memory allocation can be an expensive operation. If your filter-matching logic is going to do frequent allocations and frees, it will adversely affect performance. How can we manage the memory pool for this particular operation? Would we use different size buckets of page-aligned memory?
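As a rough illustration of (b), here is a sketch of a hash table per column, each mapping a column value to (timestamp, event id) pairs. The `EventStore` class and its methods are hypothetical names of mine, and it only handles equality filters. Note that Python integers do not overflow, so the concern in (c) does not show up here; in a language with fixed-width integers you would have to pick the id width deliberately.

```python
import itertools
from collections import defaultdict


class EventStore:
    """Hypothetical sketch: one hash table per column, keyed by column value."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.next_id = itertools.count(1)  # unique, monotonically increasing ids
        self.events = {}                   # event_id -> (timestamp, row dict)
        # column -> value -> list of (timestamp, event_id)
        self.index = {c: defaultdict(list) for c in self.columns}

    def add(self, timestamp, row):
        event_id = next(self.next_id)
        self.events[event_id] = (timestamp, row)
        for column in self.columns:
            self.index[column][row[column]].append((timestamp, event_id))
        return event_id

    def match(self, equals):
        """Ids of events whose columns equal the given values, in time order."""
        candidates = None
        for column, value in equals.items():
            ids = {eid for _, eid in self.index[column].get(value, [])}
            candidates = ids if candidates is None else candidates & ids
        if candidates is None:  # an empty filter matches nothing in this sketch
            return []
        return sorted(candidates, key=lambda eid: (self.events[eid][0], eid))
```

A filter such as `{"C1": "x", "C2": "y"}` is answered by intersecting two index lookups rather than scanning every row, at the cost of extra memory for the indexes.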
5) What should happen if the subscriber node loses connectivity or goes down?
(a) Is it acceptable for the client to lose events during that period, or should the server buffer everything?
(b) If the subscriber goes down, how far back in time can it request matching events?
6) More details of the messaging layer between the server and the subscribers:
(a) Is the communication between the server and subscribers synchronous or asynchronous?
(b) Do we need a binary protocol or a text-based protocol between the client and server? (There are trade-offs in both.)
7) Do we need any rate-limiting logic on the server side? What should we do if we starve some of the clients while serving data to a few others?
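One possible answer to the rate-limiting question is a token bucket per client. The sketch below is only an illustration under my own assumptions (the `TokenBucket` class, its parameters, and passing the current time in explicitly so the behaviour is easy to reason about); it does not address the starvation question on its own.

```python
class TokenBucket:
    """Hypothetical per-client limiter: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new client can burst once
        self.updated = now

    def allow(self, now, cost=1.0):
        # Refill in proportion to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller can drop, delay, or queue the event
```

The server would keep one bucket per subscriber and call `allow` before each send, so a single busy client cannot consume more than its share of the outgoing bandwidth.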
8) How would changes to subscriptions be managed? If a client wishes to change its subscription, should it be updated in memory first, before updating the permanent data store, or vice versa? What would happen if the server goes down before the data store is written to? How would we ensure consistency of the data store (the subscriptions/server list)?
9) All of this assumed a single server. What if we need a cluster of servers that the subscribers can connect to? There is a whole bunch of issues here:
a) How can network partitioning be handled? (For example, out of 5 nodes, 3 nodes are reachable from each other, and the other 2 nodes can only reach each other.)
b) How are events and workload distributed among the members of the cluster?
10) Is absolute correctness of the information sent to the subscriber a requirement? That is, can the client receive more information than what its subscription rules indicate? This can determine the choice of data structure: for example, a probabilistic data structure such as a Bloom filter could be used on the server side while doing the filtering.
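To show why a Bloom filter implies "the client may receive extra events", here is a small sketch built on `hashlib` from the standard library. The `BloomFilter` class, its default sizes, and the way the hash positions are derived are my own illustrative choices, not tuned for any real workload.

```python
import hashlib


class BloomFilter:
    """Hypothetical sketch: no false negatives, but false positives are possible."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly subscribed"; False means "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

If the server adds every value that some subscriber cares about and forwards an event whenever `might_contain` returns true, a subscribed value is never missed, but an unsubscribed value can occasionally slip through. Whether that is acceptable is exactly the correctness question above.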
11) How is the time-ordering of events maintained on the server side? (A time-ordered linked list? Timestamps?)
12) Will the predicate-logic parser for the subscriptions need Unicode support?
In conclusion, content-based pub-sub is a pretty vast area. It is a distributed system that involves the interaction of databases, networking, algorithms, and node behavior (systems go down, disks go bad, a system runs out of memory because of a memory leak, etc.), and we have to look at all of these aspects. Most importantly, we have to look at the time available for the actual implementation, and then decide how we want to go about solving this problem.
