
State Replication Protocol (SRP)


SRP serializes the state of a network of objects into a stream of data. Serialization starts with a root object and traverses all objects referenced directly or indirectly down to a user-defined level. As object state is serialized, it can be transported to an object space of another OO computer language or dialect, and the object network can be loaded (replicated) as it was saved. SRP's streaming of objects is affectionately referred to as "SluRPing"--as in slurping an object.

Why do we need yet another replication mechanism?

There is already a plethora of choices out there--to name a few: XML, CORBA GIOP and IIOP, OTI/ENVY Swapper, Java RMI, VisualWorks BOSS, Java Beans, VisualWorks Parcels, MS Word Documents, etc. These are just a few of the more common standards or tools available for replicating state of various kinds of objects from one object space to another. SRP is better than all of them in most aspects and is a standard that could replace any of them. SRP will grow to be used in place of many of the existing replication mechanisms. The growth will originate in Smalltalk circles, spread to Java, and could eventually become a CORBA standard. Why such confidence? You should get an idea of the potential of SRP by reading the rest of this document.

Why not XML?: Extensible Markup Language (XML) is fine for simple objects and when space and performance aren't issues. SRP is extremely space efficient; XML is extremely verbose. SRP loads simple data in one pass; XML parses data in a syntax that is likely to change over time. SRP allows metastate to be inlined or kept separate from object state; XML stores metastate in separate DTD files using another syntax. SRP allows for complex recursive objects; tags could be used in XML to do simple recursion. SRP has rich mapping capabilities that can be used with recursive state; XML has simple mapping capabilities. With SRP, everything is considered an object; with XML, everything converts to and from strings--the developer must define the conversion and therefore has data migration issues.

XML has somewhat of an advantage over SRP in that XML encodes in a human readable form. Having data human readable has advantages for computers too; Internet search engines can identify keywords in the data without needing to load the data. You can also edit XML from simple browsers.

Even if human readability is very important to you, XML still may not be your best choice. SRP isn't human readable, but it could easily be loaded into an object space to be viewed and manipulated as an object--and then resaved. Or, you could convert an SRP-based object into XML so you can edit it in a text editor. Think about your need for human readable data. Just how often do you think you would edit or view XML data using anything other than the program that generated the data? Do you think you would use a text editor to edit your MS Word document files or Excel spreadsheet files if the files were encoded in XML? XML has value as a common medium of exchange, but SRP would be better in that role.

How does SRP differ from other replication mechanisms?

SRP is as different from other replication mechanisms as Smalltalk is different from statically typed programming languages (like Java). SRP is like Smalltalk in that objects are strongly typed--meaning that each object is associated with and knows only one class of object. Other replication mechanisms are like statically typed languages in that there are strict rules that define the contents of objects or how to store the contents of objects. Some simple replication mechanisms would specify mapping rules something like this:

Date>>saveOn: aMarshaler
    "Write month, day, and year as three short integers, in that order."
    aMarshaler
        nextPutShortInteger: month;
        nextPutShortInteger: day;
        nextPutShortInteger: year.

Date class>>loadFrom: aMarshaler
    "Read the three short integers back in the same order."
    ^self
        month: aMarshaler nextShortInteger
        day: aMarshaler nextShortInteger
        year: aMarshaler nextShortInteger.

The problem with doing it this way is flexibility and maintainability. Any object space that wants to load dates must know that Dates are stored as three integers in the order month, day, and year. Unless that matches precisely, the date won't be loaded properly.

Two years and a billion data records later some bright developer discovers that storing a Date as three short integers (at two bytes each) is really wasteful. The integers will never be negative, so unsigned short integers would have been a better choice. Furthermore, six bytes is way more than is needed to store a month (range 1 to 12 - 4 bits), day (range 1 to 31 - 5 bits), and year (range 0 to 9999 - 14 bits). Those 23 bits could instead be enumerated and stored as a single four-octet unsigned long--with room to spare. Furthermore, it is found that the order month, day, year doesn't lend itself to being sorted as easily as the order year, month, day. By changing the order in the enumerated data, dates could be more easily sorted and indexed.
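The enumerated form described above can be sketched as follows. This is a minimal illustration in Python rather than Smalltalk; the function names and the exact bit layout (year in the high bits, then month, then day) are assumptions made for the example, not part of SRP.

```python
# Hypothetical layout: year (14 bits) | month (4 bits) | day (5 bits) = 23 bits,
# which fits comfortably in a four-octet unsigned integer.
YEAR_BITS, MONTH_BITS, DAY_BITS = 14, 4, 5

def pack_date(year, month, day):
    """Pack a date into one integer whose numeric order matches date order."""
    if not (0 <= year <= 9999 and 1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError("date out of range")
    return (year << (MONTH_BITS + DAY_BITS)) | (month << DAY_BITS) | day

def unpack_date(packed):
    """Recover (year, month, day) from a packed integer."""
    day = packed & ((1 << DAY_BITS) - 1)
    month = (packed >> DAY_BITS) & ((1 << MONTH_BITS) - 1)
    year = packed >> (MONTH_BITS + DAY_BITS)
    return year, month, day
```

Because the year occupies the most significant bits, comparing two packed integers gives the same result as comparing the dates themselves, which is the sorting benefit mentioned above.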

A great discovery for the developer, but wait a minute, what of the billion or so dates that are already stored? Database access will have to be halted while data goes through a schema migration. New load rules and save rules will also have to be installed in all object spaces that will need to marshal the dates to/from data streams. This could be a major undertaking.

The reasons this happened are that 1) the original choice of storage was flawed (face it, we rarely get it right the first try); 2) persistence rules dictated how dates must be stored and loaded; 3) the date itself was just data expected to be found in a specific position in the data stream--it was not stored as an object with its own sense of identity.

So how would SRP do in this situation? First off, SRP uses metastates so that mapping rules aren't necessary. A metastate is similar to a Smalltalk metaclass, but is only for the state of an object instead of the object itself. SRP would record the metastate of the Date class, at most, once per traversal. The metastate would specify the proper load rules to use for whatever loads the object. In other words, the rule for decoding a date is saved on the data stream along with the dates. This leaves no chance of load problems because load rules will never be out of sync with data. Second, each part of a date is stored as an object in its own right. The instance variable named "month" happened to contain an Integer when the date was saved, so an integer object was saved for the month. Third, migration of existing data is optional and can be done incrementally.
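To make the idea of a metastate written once per traversal concrete, here is a small Python sketch. The token names ("META", "OBJ"), the flat list used as a stream, and the functions are all invented for illustration; SRP's real encoding is a stream of whole numbers and is not reproduced here.

```python
class Date:
    def __init__(self, month, day, year):
        self.month, self.day, self.year = month, day, year

def save(objects):
    """Write each class's metastate (name and field names) once, then only values."""
    stream, metastates = [], {}
    for obj in objects:
        name = type(obj).__name__
        if name not in metastates:
            fields = list(vars(obj))
            metastates[name] = (len(metastates), fields)
            stream += ["META", name, len(fields), *fields]
        index, fields = metastates[name]
        stream += ["OBJ", index, *(getattr(obj, f) for f in fields)]
    return stream

def load(stream):
    """Rebuild generic records using only the metastates found in the stream."""
    metas, result, i = [], [], 0
    while i < len(stream):
        if stream[i] == "META":
            name, count = stream[i + 1], stream[i + 2]
            metas.append((name, stream[i + 3:i + 3 + count]))
            i += 3 + count
        else:  # an "OBJ" record: metastate index followed by the field values
            name, fields = metas[stream[i + 1]]
            values = stream[i + 2:i + 2 + len(fields)]
            result.append({"class": name, **dict(zip(fields, values))})
            i += 2 + len(fields)
    return result
```

Note that `load` never refers to the `Date` class: everything it needs to interpret the records travels in the stream, which is the point the paragraph above makes about load rules never being out of sync with data.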

If we later found that it would offer some advantage to represent a date as a single enumerated integer then we could simply create one mapping rule to convert dates into instances of a MyEnumeratedDate class prior to saving. The MyEnumeratedDate mapping rule class would be defined to know how to convert a date to an integer. The integer would be held in an instance variable of a MyEnumeratedDate instance that gets saved in place of the original date. The MyEnumeratedDate class would also know how to convert itself back to a regular date when loaded.

An object space can load the state of a MyEnumeratedDate even though it may not know what to do with instances once they are loaded. As you might expect, the behaviorless state of a MyEnumeratedDate is of limited use when loaded into an object space. This release of SRP requires that some mapping behavior exist in an object space when loading. A future release of SRP may allow, under constrained conditions, for behavior to be automatically retrieved for unrecognized states.
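The MyEnumeratedDate idea can be sketched roughly as below, in Python rather than Smalltalk. The rule tables, the function names, and the decimal enumeration (year * 10000 + month * 100 + day) are assumptions chosen for readability; they are not SRP's actual mapping rule API.

```python
class Date:
    def __init__(self, month, day, year):
        self.month, self.day, self.year = month, day, year

class MyEnumeratedDate:
    """Portable stand-in that is saved in place of a Date."""
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_date(cls, date):
        # Enumerate as YYYYMMDD so the integer sorts in date order.
        return cls(date.year * 10000 + date.month * 100 + date.day)

    def to_date(self):
        year, rest = divmod(self.value, 10000)
        month, day = divmod(rest, 100)
        return Date(month, day, year)

# Hypothetical rule tables: class of the object -> conversion to apply.
SAVE_RULES = {Date: MyEnumeratedDate.from_date}
LOAD_RULES = {MyEnumeratedDate: MyEnumeratedDate.to_date}

def map_for_save(obj):
    """Apply a save rule if one is registered; otherwise save the object as-is."""
    rule = SAVE_RULES.get(type(obj))
    return rule(obj) if rule else obj

def map_for_load(obj):
    """Apply a load rule if one is registered; otherwise keep the loaded object."""
    rule = LOAD_RULES.get(type(obj))
    return rule(obj) if rule else obj
```

An object space without an entry in `LOAD_RULES` would still end up holding the `MyEnumeratedDate` state, just without the conversion back to a native date.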

Mapping Rules: Mapping rules define any special processing that objects will go through when they are to be saved or loaded. They are often used to convert objects to a portable form when saving, and then used to restore those portable objects to a native form when loading. An important responsibility of mapping rules is to control the depth of an object traversal.

An object can refer to many objects which in turn refer to many other objects and even back to objects that have already been referred to. Extremely complex object networks are common. Without mapping rules to control traversal depth, an object traversal could inadvertently reference every object in your object space.
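A depth limit on traversal can be sketched as follows. This Python example only illustrates bounding the traversal and visiting each object once; the `Node` class, the `refs` attribute, and the breadth-first order are assumptions for the example and say nothing about the order SRP itself uses.

```python
from collections import deque

class Node:
    def __init__(self, name):
        self.name, self.refs = name, []

def traverse(root, max_depth):
    """Return objects within max_depth reference hops of root, each exactly once."""
    seen = {id(root)}
    order = []
    queue = deque([(root, 0)])
    while queue:
        obj, depth = queue.popleft()
        order.append(obj)
        if depth == max_depth:
            continue  # do not follow references beyond the limit
        for child in obj.refs:
            if id(child) not in seen:  # back-references are not revisited
                seen.add(id(child))
                queue.append((child, depth + 1))
    return order
```

Without the depth check, a cycle-safe traversal would still terminate, but it could pull in every object reachable from the root, which is the risk described above.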

SRP includes a set of Portable Mapping Rules (PMR) that are used by default to save many Smalltalk kernel objects in a form that is portable between Smalltalk dialects. SRP allows you to define, select, and reject the rules that are to be used depending on your needs. You can always persist exactly what you want--no more, and no less. You can set the priority for applying rules. You can also map an object multiple times.

Portable Mapping Rules (PMR): Each computer language and/or dialect of a language can choose its own way to represent instances of the same class. For example, a Dictionary class in VisualAge for Smalltalk uses the named instance variables "elementCount" and "elements" while a Dictionary class in VisualWorks uses the named instance variable "tally" and is indexable. In spite of the implementation differences, Dictionaries serve the same purpose in both Smalltalk dialects. What is needed is a portable way to represent a Smalltalk Dictionary.

Portable Mapping Rules are defined for many Smalltalk kernel classes so that objects are stored in a portable format that can be loaded on other Smalltalk dialects. PMR load mapping rules are defined for each dialect to do the conversion from a portable format back to a native class. The overhead for doing this mapping is well justified by the portability that results. It is possible to remove or override mapping rules to tune performance if necessary.

Mapping Context: When mapping rules are applied, the rule has full access to the context in which the object to be mapped is used. You know for instance that an Association being mapped is referred to by a CompiledMethod which is referred to by a MethodDictionary and so on to the root object. Not only do you know the full traversal path, but you also know both the original form of the referring objects and any form that referring objects may have been mapped to. Often knowing the context allows for simpler mapping rules to be used. The mapping context is available both when saving and loading objects.

Single-pass Serialization: This means that you can start transmitting a data stream through the internet before the entire root object has been serialized. It can only be done because SRP never goes back to an earlier part of the data stream to change data. Likewise, object replication can begin before the data stream has made it completely through a communications channel. This can be used to minimize latency in communication between object spaces.
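The single-pass property can be illustrated with a toy serializer. In this Python sketch the token format (("LIST", n) and ("INT", v) tuples) is invented for the example; the point is only that the writer never revisits earlier output and the loader can start before the stream is complete.

```python
def serialize(obj):
    """Yield tokens in one forward pass; nothing already yielded is revised."""
    if isinstance(obj, list):
        yield ("LIST", len(obj))
        for item in obj:
            yield from serialize(item)
    else:
        yield ("INT", obj)

def load(tokens):
    """Rebuild the object from an iterator while tokens are still arriving."""
    kind, value = next(tokens)
    if kind == "LIST":
        return [load(tokens) for _ in range(value)]
    return value
```

Because `serialize` is a generator, its first token is available to a consumer before the rest of the object has been traversed, so `load(serialize(data))` interleaves saving and loading.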

Object Substitution: Most persistence tools will allow you to exchange one object for another as the object is loaded. This is most often used when migrating instances from one class schema to another. Only a few tools will allow you to substitute objects in a recursive network of objects. This is a challenge for two reasons. First, the tool usually discovers that it needs to swap one object for another while it is still in the process of loading the contents of that object. Second, and more importantly, any references to the loaded object need to point to the replacement object to maintain the same relationships in the object network.

SRP even allows you to swap and migrate objects *after* loading and still maintain proper relationships between objects. It doesn't do so by using #become: (which doesn't perform well in all Smalltalk dialects or simply isn't available for other languages). Nor does it load into indexable slots and later assemble the network relationships from those slots after loading. Instead, it selectively makes use of a special value holder for some objects in the object network that is intelligently removed as loading progresses.

Suitable for Replicate Proxies: Traversal rules are flexible enough that any object can take the place of another while still maintaining relative object references between traversed objects. Using a proxy object in place of an object is one obvious use. The idea that objects can contain an OID for a remote object space opens the door to the idea that objects with OIDs need not be immediately traversed for replication to a remote object space. A proxy takes the place of an object that isn't encoded in the data stream, but can be retrieved or used on demand by the remote object space.

Traversal rules can declare when a proxy will be used, and the type of proxy that will be used. The proxy object will be replicated in an object space just like any other object in the data stream. Messages sent to the proxy that aren't intended for the proxy itself can be caught by a #doesNotUnderstand: message in Smalltalk which allows the proxy to take whatever action is appropriate to handle the message.
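SRP deliberately leaves proxy implementation open, so the following is only one possible shape, written in Python. Python's `__getattr__` hook plays a role similar to Smalltalk's #doesNotUnderstand: here; the `Proxy` class, the `fetch` callback, and the OID value are hypothetical.

```python
class Proxy:
    """Stands in for an object that lives in a remote object space."""
    def __init__(self, oid, fetch):
        self._oid = oid
        self._fetch = fetch    # callable that retrieves the real object by OID
        self._target = None

    def __getattr__(self, name):
        # Called only for attributes the proxy itself does not have.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._target is None:
            self._target = self._fetch(self._oid)  # retrieve on first use
        return getattr(self._target, name)
```

The proxy can be replicated like any other object, and the remote object is retrieved only when a message not meant for the proxy itself arrives.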

SRP provides the framework to make traversal object replacement possible, but doesn't declare how proxies are implemented or how they will respond to messages sent to them.

Data Encoding: The serialized data is stored in a way that makes it simple, portable, and limitless. Data is stored as a series of whole number integers (values 0 to positive infinity). Each integer is represented by a series of eight-bit units known as octets. The number of octets that comprise one whole number integer is determined by one of the eight bits, known as the carry bit. If the carry bit is ON then seven bits of the octet are joined with seven bits of the next octet in the series. The octet bits continue to join until the carry bit is OFF, marking the end of the integer being loaded.
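The carry-bit scheme can be sketched in Python as follows. Two details are assumptions, since the text above does not state them: that the carry bit is the high bit (0x80) of each octet, and that the most significant seven-bit group is written first. The function names are likewise illustrative.

```python
def encode_uint(value):
    """Encode a whole number as octets: seven data bits plus a carry bit each."""
    if value < 0:
        raise ValueError("only whole numbers can be encoded")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # assumed order: most significant group first
    # Carry bit ON for every octet except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def decode_uint(octets, pos=0):
    """Return (value, next_pos) for the integer that starts at octets[pos]."""
    value = 0
    while True:
        octet = octets[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if not octet & 0x80:  # carry bit OFF marks the end of the integer
            return value, pos
```

Values up to 127 take a single octet, and because Python integers are unbounded (as Smalltalk's are) the same two functions handle arbitrarily large values.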

Encoding the data as a stream of whole number integers has advantages over the typical technique of writing the byte representation of an object directly to a byte stream. Here is a list of the primary advantages:

1) Cross-platform portability. It doesn't matter if you are on UNIX, Mac, PC, etc. because data is encoded the same way. Big-endian/little-endian byte ordering is no longer a concern.
2) Unlimited data values. Since the data structure can handle integers to infinity, the only real limits are the limits of the integer representation of whatever software language will be used to encode/decode the integer stream. Fortunately, Smalltalk integers are also unlimited.
3) Implementation independence. It doesn't matter to the integer stream if a character occupies one, two, or four bytes in memory. A character is just a whole number. The host memory consumption of any object is irrelevant to the data storage representation.
4) Space savings. As you might have already guessed, an integer held in a 64-bit data structure in memory doesn't necessarily get stored in 64 bits with SRP. Only as many octets as are necessary are used. Due to carry bits, it could be stored in more than 64 bits, but the predominant case is that it is stored in fewer. An object that takes 10,028 bytes with ENVY Swapper and 4,500 bytes with BOSS requires only 1,713 bytes with SRP.
5) Simplicity. Only a series of whole number integers is written and loaded. It doesn't matter to the data stream what is represented by those integers.

Metastates: Some other persistence mechanisms require that each class define its own data encoding and decoding rules. Rules contained in classes in an object space determine how to read and write on the data stream. This technique creates an undesirable dependence between the data and the class--and possibly a particular version of that class. It means that data simply can't be read unless classes in the object space know the rules for reading each type of object.

In contrast, SRP uses metastates to describe how the data is encoded using simple general terms so that it can later be decoded. SRP is so flexible in this regard, that you could load the data of instances of a class even if the class doesn't exist in the object space. This allows you to migrate instances to some other form when loading. It would also allow you to load and later resave the state of objects even though the behavior of those objects is unknown.

Metastate Collections: By default SRP stores metastates along with the state for each type of object. This allows state to be loaded without the loading object space knowing complex load rules. While this is generally the best way to do things, the metastates do take space to store. SRP allows you to refer to a collection of metastates by name instead of encoding them in each traversal. This improves performance and reduces the number of bytes required to store objects. The caveat is that a loading object space must have the ability to fetch the metastate collection if it doesn't recognize the name.
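Referring to a metastate collection by name might look roughly like this on the loading side. In this Python sketch the stream layout, the `known` cache, and the `fetch` callback are all hypothetical; they only illustrate the caveat that an unrecognized name must be fetchable.

```python
def load(stream, known, fetch):
    """Load records whose metastates live in a named, shared collection.

    stream: [collection_name, (metastate_index, values), ...]
    known:  dict of collection name -> list of (class_name, field_names)
    fetch:  callable used when the collection name is not recognized
    """
    name, records = stream[0], stream[1:]
    if name not in known:
        known[name] = fetch(name)  # the caveat: this must be possible
    metastates = known[name]
    loaded = []
    for index, values in records:
        class_name, fields = metastates[index]
        loaded.append({"class": class_name, **dict(zip(fields, values))})
    return loaded
```

The stream carries only the collection name and the values, so the space cost of the metastates is paid once per collection rather than once per traversal.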

Data Recoverability: Because of the sequence in which objects are traversed, it is possible to load a fragment of a persistent object from corrupt data. The SRP loader doesn't currently make use of this, but it is possible to do. One technique that helps recoverability is that reference indexes to previously loaded objects are relative to the index position of the object being loaded instead of being relative to the root object. This means corruption early in the data stream has less chance of affecting data later in the stream.
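The benefit of relative reference indexes can be shown with a toy example. In this Python sketch the token format is invented, and objects are compared by equality for brevity (a real serializer would track identity); it is not SRP's actual reference encoding.

```python
def encode_refs(items):
    """Replace repeats with ("REF", distance back to the first occurrence)."""
    first_seen, out = {}, []
    for position, item in enumerate(items):
        if item in first_seen:
            out.append(("REF", position - first_seen[item]))
        else:
            first_seen[item] = position
            out.append(("OBJ", item))
    return out

def decode_refs(tokens):
    """Resolve relative references; a reference past the start yields None."""
    out = []
    for position, (kind, value) in enumerate(tokens):
        if kind == "OBJ":
            out.append(value)
        else:
            target = position - value
            out.append(out[target] if target >= 0 else None)
    return out
```

If the first token is lost, references that point back to surviving objects still resolve, because each distance is measured from the referring position rather than from the start of the stream. With absolute indexes, every later reference would be shifted.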

Documented Encoding: Some data formats are proprietary, undocumented, and tightly bound to the implementation they were designed for. SRP was meant to bridge gaps rather than create them. SRP data formats are released into the public domain with the hope that they will be widely used by all. It is hoped that SRP's whole number integer stream encoding and sequences will one day be as ubiquitous as TCP/IP.

--Paul Baumann

