Tuesday, November 3, 2009

Item 5: Remember that identity breeds contention



















Most programmers agree that an object is a unification of state (data) and behavior (methods), but there's a third element to any object-based or object-oriented language, too, and in many respects, this mysterious third element is what makes the object world go 'round. Look at the following code:










public class Person
{
    private int age = 18;
    private String name;

    public Person(String n, int a) { name = n; age = a; }
    public void happyBirthday() { ++age; }
}

Person p1 = new Person("Cathi", 29);
Person p2 = new Person("Alan", 35);
p2.happyBirthday();




When finished, how old is Alan? Or, more accurately, what is the age of the object referenced by p2? Alan is 36, of course. So how does Java know how to differentiate the object named "Cathi" from the object named "Alan" and keep the ages straight?



It's not a trick question; even most inexperienced Java programmers can usually get this one right. The trick lies in the implicit this parameter that's a part of every object and that uniquely references the object in question. Former C++ programmers even go so far as to say that this is in fact a pointer to the object in memory, and since no two objects can occupy the same place in memory, the address of the object therefore makes it unique. (Java programmers, of course, know that there's no such thing as a pointer, despite the presence of the NullPointerException class in their language, and wish C++ programmers would just stop pretending otherwise.)



In truth, an object is made up of three things: state, behavior, and identity. Identity allows the Java language and platform to differentiate between the previous code and the following, where p3 and p4 both refer to the same unique object:










Person p3 = new Person("Stephanie", 10);
Person p4 = p3;

p3.happyBirthday();
// How old is the Stephanie object at this point?
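For anyone who wants to run the question rather than reason about it, here is a minimal sketch. The getAge() accessor is a hypothetical addition (the Person class shown earlier has no getter); it exists only so the result can be observed.

```java
// Sketch only: getAge() is a hypothetical accessor added so the age is observable.
class Person {
    private int age;
    private final String name;

    Person(String n, int a) { name = n; age = a; }
    void happyBirthday() { ++age; }
    int getAge() { return age; }
}

class AliasDemo {
    public static void main(String[] args) {
        Person p3 = new Person("Stephanie", 10);
        Person p4 = p3;                      // copies the reference, not the object

        p3.happyBirthday();

        System.out.println(p4.getAge());     // 11: p4 sees the change made through p3
        System.out.println(p3 == p4);        // true: one object, two references
    }
}
```

Both references lead to the same object, so there is only one age to keep track of.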




It seems like a pretty brain-dead concept, but stick with me.



Identity is fairly easy to spot in objects, but it appears elsewhere in a standard enterprise system, too. Given the following lines, where is the identity?










INSERT INTO person VALUES ('Cathi', 29);
INSERT INTO person VALUES ('Alan', 35);
INSERT INTO person VALUES ('Stephanie', 10);




Even without seeing the schema, we know that the relational database preserves a sense of identity because even those tables that lack a defined primary key can still differentiate between one row and the next: the database preserves the actual row's identity using some implementation-dependent scheme. (In Oracle, for example, this is a pseudocolumn called the ROWID.) Regardless of mechanism, it serves much the same purpose as the this pointer in C++ or Java: to provide a sense of identity. It allows us to write SQL expressions like the following and see that Cathi is now 30:










UPDATE person SET age = age + 1 WHERE name = 'Cathi';
SELECT name, age FROM person;




So why the concern? Objects have identity, rows have identity, what's the big deal?



One basic pillar of traditional object-oriented design has always been that abstract problem domain entities should map to a single object; in other words, that if we're representing Cathi with an object in the heap, then all references to Cathi should be done through references that refer to that same, individual object. Failure to do so can break code if we're not careful:










Person p1 = new Person("Cathi", 29);
Person p2 = new Person("Cathi", 29);
p2.happyBirthday();
// So how old is Cathi?




In many respects, this is why the distinction between the == operator, which performs a test for identity, and the equals method, which performs a test for equivalence, is so crucial to Java programming, as shown here:










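// Note: the second test assumes Person overrides equals(); as written
// earlier, Person inherits Object.equals(), which behaves like ==.
if (p1 == p2)
    System.out.println("It's the same object!");
else if (p1.equals(p2))
    System.out.println("It's equivalent, but not identical.");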




So thus far, we're still OK.
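To make the distinction concrete, here is a small sketch of a Person that overrides equals. The definition of equivalence used here (same name and same age) is an assumption made for illustration; the original class doesn't define one.

```java
import java.util.Objects;

class Person {
    private int age;
    private final String name;

    Person(String n, int a) { name = n; age = a; }
    void happyBirthday() { ++age; }

    // Equivalence is assumed to mean "same name and same age" for this sketch.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person other = (Person) o;
        return age == other.age && name.equals(other.name);
    }

    @Override
    public int hashCode() { return Objects.hash(name, age); }
}

class EqualityDemo {
    public static void main(String[] args) {
        Person p1 = new Person("Cathi", 29);
        Person p2 = new Person("Cathi", 29);

        System.out.println(p1 == p2);        // false: two distinct objects
        System.out.println(p1.equals(p2));   // true: equivalent state

        p2.happyBirthday();                  // only the second object ages
        System.out.println(p1.equals(p2));   // false: the two "Cathi" objects have diverged
    }
}
```

The last line is the hazard in miniature: two objects standing in for one Cathi can drift apart, and nothing in the language will stop them.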



In fact, this concept of identity is so fundamental to our notions of how systems should be designed that we tend to mimic it in distributed object systems; fortunately, it's trivial to do given the mechanics of RMI (or CORBA, for that matter):










PersonManager pm = (PersonManager)
    Naming.lookup("rmi://localhost/PersonManager");
Person p1 = pm.findPerson("Cathi");
Person p2 = pm.findPerson("Cathi");
p1.happyBirthday();
// Thanks to distributed object identity,
// we have logical identity




When the RMI lookup occurs, we pull back some kind of "lookup" manager for Person objects, what in the EJB space is a Home, and subsequent lookups return remote stubs to the solitary object that lives on the server. So calling through p1 to increment Cathi's age also implicitly does the same for p2, since they both reference the same logical object. (Remember that p1 and p2 are technically references to separate proxy objects, but since the proxy objects (the stubs) point to the same object on the server, they're logically the same object. This is why the remote stubs are carefully written to return true if you call p1.equals(p2).)
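The stub behavior can be imitated in-process. The sketch below is not RMI: ServerPerson and PersonStub are hypothetical stand-ins, and the "server" is just another object in the same JVM. It only illustrates how two distinct proxies can compare as equal because they share one target.

```java
// Hypothetical stand-in for the single object living on the server.
class ServerPerson {
    private int age;
    ServerPerson(int age) { this.age = age; }
    synchronized void happyBirthday() { ++age; }
    synchronized int getAge() { return age; }
}

// Hypothetical stand-in for a remote stub: it forwards calls to its target
// and defines equality in terms of which target it points at.
class PersonStub {
    private final ServerPerson target;
    PersonStub(ServerPerson target) { this.target = target; }

    void happyBirthday() { target.happyBirthday(); }
    int getAge() { return target.getAge(); }

    @Override
    public boolean equals(Object o) {
        return o instanceof PersonStub && ((PersonStub) o).target == target;
    }

    @Override
    public int hashCode() { return System.identityHashCode(target); }
}

class StubDemo {
    public static void main(String[] args) {
        ServerPerson cathi = new ServerPerson(29);
        PersonStub p1 = new PersonStub(cathi);   // as if returned by one lookup
        PersonStub p2 = new PersonStub(cathi);   // as if returned by a second lookup

        p1.happyBirthday();

        System.out.println(p1 == p2);        // false: two proxy objects
        System.out.println(p1.equals(p2));   // true: same logical object
        System.out.println(p2.getAge());     // 30: the change is visible through either stub
    }
}
```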



In other words, identity in a distributed system is not just a function of memory addresses and logical mapping of behavior and state; in a distributed system, identity includes the hardware on which the object itself lives. And, to paraphrase the old Robert Frost poem, that makes all the difference in the world.



Consider the remote PersonManager/Person code again, this time with an eye toward identity as a factor of the machine on which the objects live. We presume that PersonManager can construct objects only on the local machine, so where, then, will each and every Person object reside? On the same machine as the PersonManager, of course, meaning that as the system scales up to include massive numbers of Person instances, we're intrinsically limited to the maximum work supported by that single machine. No matter how many additional machines we throw into the cluster, we can never have more Person objects in use at once than the one machine can support.



Worse, we run into the problem that we'll need to build in some kind of synchronization support to make sure that two accessors don't modify values concurrently and corrupt the data. After all, if an object has identity, it's possible that more than one client will want or need access to it simultaneously with another client. Locks mean that concurrent access by clients is impossible (that's what the locks are intended to prevent), which gets us to the lock window discussion of Item 29.
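A minimal sketch of that synchronization, using plain threads in place of remote clients (SharedPerson and ContentionDemo are names invented for this example). The lock keeps the count correct, and it does so precisely by letting only one caller into the object at a time.

```java
// One shared object with identity; every caller must go through its lock.
class SharedPerson {
    private int age;
    synchronized void happyBirthday() { ++age; }
    synchronized int getAge() { return age; }
}

class ContentionDemo {
    // Runs `clients` threads against ONE shared object and returns its final age.
    static int run(int clients, int callsPerClient) {
        SharedPerson shared = new SharedPerson();
        Thread[] threads = new Thread[clients];
        for (int i = 0; i < clients; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < callsPerClient; j++) {
                    shared.happyBirthday();      // each call waits its turn for the lock
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.getAge();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 10_000));      // 80000: correct, but every update was serialized
    }
}
```

Adding more client threads doesn't add throughput here; it only lengthens the queue in front of the one lock.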



It's a simple observation: scalable access to a single shared object is impossible. If the contention inside the object's implementation doesn't kill you, and you haven't reached the implicit limit of work the underlying server can handle, the fact that it's a round-trip to the object in the first place (see Item 17) usually will. It starts to become clear why Martin Fowler says that distributed object systems "suck like an inverted hurricane" [Fowler, 87].



Proponents of EJB are already getting ready to debunk this: "But in EJB, we have all sorts of enhancements to take care of this problem: passivation, for example." That's true, as long as some objects aren't in use by clients at the moment. Passivation works much like virtual memory at the operating system level (so much so that some have suggested that passivation should be tossed away in favor of just letting the operating system do it) and suffers from much the same flaws. If you have 100 clients, each of which is making active use of 100 objects, that's 10,000 objects that logically should remain active and unpassivated. But if the server's heuristics say that only 5,000 objects can remain active and the server passivates the other 5,000, then the server's going to spend a phenomenal amount of time in activation/passivation thrashing, just as your operating system does when you exceed physical memory. Forcing massive numbers of page faults (or activation and passivation of EJB objects) is a really quick way to kill performance.
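The thrashing arithmetic can be simulated. The sketch below assumes a least-recently-used passivation policy and a round-robin access pattern; both are assumptions chosen for illustration (real containers use their own heuristics), and BeanPool is a hypothetical name, not a container API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical pool of "active" beans with an assumed LRU passivation policy.
class BeanPool {
    private final Map<Integer, String> active;
    private int activations;

    BeanPool(final int capacity) {
        // Access-ordered LinkedHashMap: the eldest entry is the least recently used.
        active = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > capacity;    // "passivate" once over capacity
            }
        };
    }

    void invoke(int beanId) {
        if (active.get(beanId) == null) {    // not active: must be activated first
            activations++;
            active.put(beanId, "bean-" + beanId);
        }
    }

    int activations() { return activations; }
}

class ThrashDemo {
    // Touches every bean in turn, `rounds` times, and counts activations.
    static int simulate(int beans, int capacity, int rounds) {
        BeanPool pool = new BeanPool(capacity);
        for (int r = 0; r < rounds; r++) {
            for (int id = 0; id < beans; id++) {
                pool.invoke(id);
            }
        }
        return pool.activations();
    }

    public static void main(String[] args) {
        System.out.println(simulate(10_000, 5_000, 3));    // 30000: every single call activates
        System.out.println(simulate(10_000, 10_000, 3));   // 10000: only the first round activates
    }
}
```

Round-robin access is the worst case for LRU, so the first number is an upper bound rather than a prediction; the point is only that a working set larger than the active limit turns ordinary calls into activation work.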



The EJB proponents aren't finished: "But what about clustering? We can spread those objects around the network rather than host them on a single system." This presumes several things. For one, clients will want roughly evenly spread access to those objects; if any one object receives more than its fair share of attention from clients, then we're constrained once again to the maximum capacity for work of the system on which that singular object lives. (This is the danger of the Remote Singleton, by the way: regardless of how deeply you optimize your synchronization implementation in order to avoid holding locks longer than necessary (see Item 29), you'll always be constrained by the underlying remoting and/or networking plumbing.)



So where, exactly, does identity fit in the context of EJB?



In many ways, attempts to build identity-based systems in EJB will run into a brick wall. EJB's understanding of identity is to rely on external forms of identity (via the client or the external representation of the data in the database, for example) and not to rely on the commonplace identity mechanism we're so comfortable with in Java. This is why, for example, as an EJB programmer you can't assume that a request from one particular client will end up on the same actual object instance in memory. EJB explicitly throws away identity at the object layer, in order to preserve identity at the client or data layer. This is why an EJB container is so free to pool object instances. (This is also why attempts to build a Singleton in EJB are so painfully difficult: Singletons rely on object identity, and EJB takes object identity away from you.)



Stateless session beans are implicitly without identity. In fact, stateless session beans probably should have been called "anonymous session beans" because, while they can in certain circumstances maintain state, you can't be certain that you're invoking a method on the same bean instance you invoked last time; hence, they have no identity. This has led many pundits to call the stateless session bean the best horse to ride of those in the EJB stable because its implicit lack of identity allows the best scalability. (This isn't necessarily true, by the way, as some tests run by Mike Clark, of Bitter EJB fame [Tate/Clark/Lee/Linskey], prove.)



Stateful session beans are objects whose identity is known (initially) only to the client that created them. Note that a stateful session bean isn't exclusive to that client: if you pass the handle to a different client, that second client can invoke methods on the same stateful session bean instance. As long as only one client accesses the bean, however, its identity remains a nonissue because little to no contention arises out of it. Interestingly, the EJB Specification requires that the EJB server implement a synchronization mechanism that prevents concurrent access of the entire bean instance by more than one client; unless the client makes multithreaded access to the instance, such concurrent access is unlikely. In essence, the synchronization on a stateful session bean effectively adds to the latency of the bean call itself; because the specification requires the container to throw an exception back to the caller in the event of concurrent access to the stateful session bean (EJB Specification, Section 7.12.10), at a minimum the container needs to check that there's not already a call in progress.
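What might that minimum check look like? The sketch below is one guess, not how any particular container does it: GuardedBean is a hypothetical wrapper, and IllegalStateException merely stands in for whatever exception the container actually raises. A nested call is used to imitate a second request arriving while the first is still in progress.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical per-instance guard: at most one call in progress at a time.
class GuardedBean {
    private final AtomicBoolean callInProgress = new AtomicBoolean(false);
    private int completedCalls;

    void invoke(Runnable businessMethod) {
        // The check every call pays for, contended or not.
        if (!callInProgress.compareAndSet(false, true)) {
            throw new IllegalStateException("call already in progress on this instance");
        }
        try {
            businessMethod.run();
            completedCalls++;
        } finally {
            callInProgress.set(false);       // release even if the business method fails
        }
    }

    int completedCalls() { return completedCalls; }
}

class GuardDemo {
    public static void main(String[] args) {
        GuardedBean bean = new GuardedBean();

        bean.invoke(() -> { });              // a lone caller proceeds normally

        try {
            // A second call arrives while the first is still running.
            bean.invoke(() -> bean.invoke(() -> { }));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        bean.invoke(() -> { });              // the instance is usable again afterward
        System.out.println(bean.completedCalls());   // 2
    }
}
```

Even in the uncontended case, the compare-and-set runs on every call, which is the added latency the paragraph above describes.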



Transactional COM+ [Ewald] points out two interesting properties of the object-per-client model. First, if there is one object for each client accessing the system, then the number of objects in existence reflects the number of clients currently accessing the system; if objects are shared across clients, this information isn't easily available. Second, maintaining per-client state is much simpler because each object itself inherently acts as a cookie identifying the client, rather than relying on some external mechanism. So in general, it's best if stateful session beans aren't shared across clients (which further undermines the need for session beans to be synchronized in a per-instance fashion).



We really run into problems with entity beans because not only does an entity bean have identity, but that identity is well-known across the entire system, and thus each entity bean acts as a Remote Singleton. This is a necessary design; remember, the entity bean instance is trying to represent an entity in the system, and entities implicitly have identity: that of the row in the database they represent,[3] if not that of an actual, physical thing. This means that if multiple clients need access to a single entity, they're all working against the same logical object. It also implies that at any given point, two identities need to be maintained: the identity of the entity bean object and the identity of the row in the database.

[3] I should be more careful with my language here, since technically entity beans aren't required to be persistent to a relational database. (At a conference back in 1998 I heard the various OODBMS vendors raise a royal ruckus over the explicit phrase "an entity bean represents a row in your database" that appeared in the EJB 1.0 Specification.) Nevertheless, 99.5% of the world's entity beans are tied to an RDBMS, so I'll continue to use that assumption.



An entity bean (or other object-first persistence mechanism) can take two basic approaches to avoid the identity problem.





  1. Preserve the entity bean's identity, and try to cache like mad.

    Unfortunately, the EJB Specification works against the entity in this case, since the specification clearly calls for transactional semantics around the bean's access. Remember, the specification demands that should the EJB server crash, the entity's state must be preserved, and if the server tries to save on a round-trip or two by caching the data in memory, then it runs the risk that the process could die before the cache can be flushed. (This approach can't work for local entity beans, by the way, since local entities must be accessed within the same JVM; if the local bean in turn is a remote proxy to a remote entity bean, we've sort of lost the whole point of the local bean, haven't we?)

  2. Break the identity of the entity bean itself, choosing to define identity in terms of type, rather than object, thus relying instead on the underlying database to hold identity. This essentially makes the entity bean an anonymous object that passes any state-manipulation logic directly to the database. (This is essentially the approach that Ewald recommends, albeit for COM+.) This allows the server to create as many entity beans as necessary across the cluster, thus spreading the load of even the most highly accessed entity. Unfortunately, this means that caching is very unlikely, since the cache itself will have to have some kind of identity, meaning it's a singular place to which all the otherwise anonymous objects must come in order to exchange updates, and that in turn gets us back to the identity bottleneck. More interestingly, this effectively reduces the entity bean to a stateless session bean that has some SQL generated for it (in the case of Container-Managed Persistence (CMP) entities).
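The second approach can be sketched in a few lines. Everything here is a stand-in: PersonTable is an in-memory map playing the part of the database, and PersonGateway is a hypothetical anonymous object that holds no entity state of its own. Neither is an EJB API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the database table; identity lives here, keyed by name.
class PersonTable {
    private final ConcurrentMap<String, Integer> ages = new ConcurrentHashMap<>();

    void insert(String name, int age) { ages.put(name, age); }

    // Plays the role of: UPDATE person SET age = age + 1 WHERE name = ?
    void incrementAge(String name) { ages.computeIfPresent(name, (k, v) -> v + 1); }

    // Sketch only: assumes the row exists.
    int ageOf(String name) { return ages.get(name); }
}

// Anonymous, interchangeable object: no per-entity state, only a route to the data.
class PersonGateway {
    private final PersonTable table;
    PersonGateway(PersonTable table) { this.table = table; }

    void happyBirthday(String name) { table.incrementAge(name); }
    int ageOf(String name) { return table.ageOf(name); }
}

class GatewayDemo {
    public static void main(String[] args) {
        PersonTable table = new PersonTable();
        table.insert("Cathi", 29);

        // As many gateways as the cluster needs; none of them "is" Cathi.
        PersonGateway g1 = new PersonGateway(table);
        PersonGateway g2 = new PersonGateway(table);

        g1.happyBirthday("Cathi");
        System.out.println(g2.ageOf("Cathi"));   // 30: identity is held by the row, not the object
    }
}
```

Note what the gateways can't do: keep Cathi's age in a field between calls. The moment they cache, they need a shared place to reconcile updates, and the bottleneck is back.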



It's fair to ask at this point what, if anything, we gain from moving identity away from the objects themselves and into the database. It's simple: we gain the ability to move logic off the database and into a separate layer that can, if desired or necessary, run on a separate tier from the database or the client. It's not going to remove the possibility (eventuality, more likely) of a bottleneck within the database, but it will buy you a lot more room before you hit that ceiling. The process of removing just enough of the bottlenecks in a system to meet your throughput needs (which you clearly identified, per Item 8, right?) is the essence of scalable system design.



In the end, we can't eliminate identity completely from the system. Even if we could, it would be counterproductive to do so; after all, the data elements we want to work with have to be identifiable, otherwise we would be unable to tell one Customer from another. The trick is to recognize what forms of identity are acceptable, and where identity, and the necessary locking that has to accompany it, can be avoided. Get that right, and you're well on your way to a highly scalable architecture.















