Friday, October 30, 2009

Item 44: Use in-process or local storage to avoid the network



















Most often, when J2EE developers begin the design and architectural layout of a project, it's a foregone conclusion that data storage will be handled by a relational database running on a machine somewhere in the data center or operations center.



Why?



There's no mystical, magical reason, just simple accessibility. We want the data to be accessible by any of the potential clients that use the system.



In the old days, before n-tier architectures became the norm, clients connected directly to servers, so the data needed to be held directly on the server in order to give all the clients access to all the data. Networks hadn't yet become ubiquitous, and wireless access technology, which makes networking even simpler, hadn't arrived. Connecting machines to a network was a major chore, and as a result, the basic concepts of peer-to-peer communication were reserved for discussions at the lowest levels of the networking stack (the IP protocol, for example).



In time, we realized that putting all of the data on a central server had the additional advantage of shifting the major processing load onto the server and off the clients. Since the server was a single machine, it was much more cost-effective (so we believed) to upgrade or replace that single machine rather than all of the clients that connected to it. So not until well after we had established the idea of the centralized database did we start putting our databases on the heaviest iron we could find, loading them up with maximal RAM footprints and huge gigabyte (and later, terabyte) drive arrays.



The point of this little digression into history is that the centralized, remote database server exists simply to provide a single gathering point for data, not because databases need to run on servers that cost well into five, six, or seven digits. We put the data on the server because it was (a) a convenient place to put it, (b) an easy way to put processing in one central place for all clients to use without having to push updates out (zero deployment), and (c) a way to put the data close to the processing (see Item 4).



Transmitting all this data across the wire isn't cheap, however, nor is it without problems of its own. It's costly in terms of both scalability and performance (each byte of bandwidth consumed to transfer data around the network is a byte that can't be used for other purposes), and the time it takes to shuffle that data back and forth is not trivial, as Item 17 describes. So, given that we put data on the centralized database in order to make it available to other clients, and that the transfer isn't cheap, don't put data on the remote database unless you have to; that is, don't put any data on the remote database unless it's actually going to be shared across clients.



In such cases, running a relational database (or another data storage technology; here we can think about using an object database or even an XML database if we choose to) inside the same process as the client application can not only keep network round-trips to a minimum but also keep the data entirely on the local machine. While running Oracle inside our servlet container is probably not going to happen any time soon, running an all-Java RDBMS implementation is not so far-fetched; Cloudscape offers this functionality, as do PointBase and HSQLDB (an open-source implementation on SourceForge), each essentially a database masquerading as a JDBC driver. Or, if you prefer an object-based approach, Prevayler, another open-source project, stores any Java object in traditional objects-first persistence fashion. If you'd rather see the data in a hierarchical fashion, Xindice is an open-source XML database from the Apache Group.



One simple in-process data storage technique is the RowSet itself. Because the RowSet is entirely disconnected from the database, we can create one around the results of a query and keep that RowSet for the lifetime of the client process without worrying about the scalability effects on the database. Because the RowSet is Serializable, we can store it as is, without modification, into any OutputStream, such as a file or a Preferences node; in fact, if the RowSet is wrapped around configuration data, in some ways it makes more sense to store it in a Preferences node than in a local file (see Item 13). A RowSet won't scale to thousands of rows, and it won't enforce relational integrity; if you need to store that much data or to put relational constraints around the local data, you probably want a "real" database, like HSQLDB or a commercial product that supports embedding in a Java process.
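As a sketch of the technique, the following uses the standard CachedRowSet (the disconnected RowSet implementation that ships with the JDK) to build a small configuration RowSet by hand, serialize it to a byte array (which could just as easily be handed to Preferences.putByteArray() or a FileOutputStream), and read it back. The column names and values here are made up for illustration; in a real client the RowSet would be populated from a query while connected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import javax.sql.rowset.CachedRowSet;
import javax.sql.rowset.RowSetMetaDataImpl;
import javax.sql.rowset.RowSetProvider;

public class RowSetConfigDemo {
    // Build a small, disconnected RowSet by hand; the two-column "config"
    // shape (KEY/VALUE) is hypothetical, purely for illustration.
    static CachedRowSet buildConfig() throws Exception {
        CachedRowSet crs = RowSetProvider.newFactory().createCachedRowSet();
        RowSetMetaDataImpl md = new RowSetMetaDataImpl();
        md.setColumnCount(2);
        md.setColumnName(1, "KEY");
        md.setColumnType(1, java.sql.Types.VARCHAR);
        md.setColumnName(2, "VALUE");
        md.setColumnType(2, java.sql.Types.VARCHAR);
        crs.setMetaData(md);
        crs.moveToInsertRow();
        crs.updateString(1, "timeout");
        crs.updateString(2, "30");
        crs.insertRow();
        crs.moveToCurrentRow();
        return crs;
    }

    // Serialize the RowSet (it is Serializable) and read it back, just as we
    // would when parking it in a file or a Preferences node between runs.
    static String roundTrip() throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(buildConfig());
        out.close();
        ObjectInputStream in = new ObjectInputStream(
            new ByteArrayInputStream(bytes.toByteArray()));
        CachedRowSet copy = (CachedRowSet) in.readObject();
        copy.beforeFirst();
        copy.next();
        return copy.getString("KEY") + "=" + copy.getString("VALUE");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```

Note that no Connection is ever opened; the entire round-trip happens in-process, which is precisely the point.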



There's another side to this story, however: the blindingly obvious realization that a remote database requires a network to reach it. While this may seem like an unnecessary statement, think about that for a moment, and then consider the ubiquitous sales-order application tied into the inventory database for a large manufacturing company. We release the application onto the salesperson's laptop, and off the salesperson goes to pay a visit to the VP of a client company. After a little wining, dining, and 18 holes on a pretty exclusive golf course, the VP's ready to place that million-dollar order. The salesperson fires up the laptop, goes to place the order, and to his horror, gets an error: "database not found." Sitting in the golf club's posh restaurant, our plucky salesperson suddenly pales, turns to the VP, and says, "Hey, um, can we go back to your office so I can borrow your network?"



Assuming the VP actually lets our intrepid salesperson make use of her network to connect back to the salesperson's home network via VPN, assuming the salesperson knows how to set that up on his laptop, assuming the IT staff at home has opened a VPN in the corporate network, and assuming the application actually works with any speed over the VPN connection, the salesperson's credibility is taking a pretty serious beating here. On top of this, the VP has every reason to refuse the salesperson the network connection entirely; it's something of a security risk, after all, to let foreign machines onto the network behind the firewall. And the IT staff at home has every reason to keep the database as far away from the "outside world" as possible, VPN or no VPN. By the time the salesperson returns to the home office to place the order (scribbled on the napkin from the posh restaurant), the VP may have changed her mind, or the salesperson may have forgotten to get some important detail onto the napkin, and so on. (It's for reasons like these that salespeople are trained to place the order as quickly as possible as soon as the customer approves it.)



In some circles, this is called the traveling salesman problem (not to be confused with the shortest-route optimization problem commonly discussed in artificial intelligence textbooks). The core of the problem is simple: you can't always assume the network will be available. In many cases, you want to design the application to make sure it can run without access to the network, in a disconnected mode. This isn't the same as offering up a well-formatted, nicely handled SQLException when the router or hub hiccups; this is designing the application to take the salesperson's order on the standalone laptop with no network anywhere in sight anytime soon. When designing for this situation, ask yourself how the application will behave when users are 37,000 feet in the air, trying to sell widgets to somebody they just met on the airplane.



One of the easiest ways to accommodate this scenario is to keep a localized database running on the same machine as the client, holding a dump of the relevant parts of the centralized database (and the client, by the way, should probably be a rich-client application, since the network won't be there to connect to the HTTP server, either; see Item 51). When the machine is connected via the network to the remote database, the application updates the local copy with the latest-and-greatest data from the centralized database. Not necessarily the complete schema, mind you, but enough to be able to operate without requiring the remote database. For example, in the hypothetical sales application, just the order inventory, detail information, and prices would probably be enough; a complete dump of open sales, their history, and shipping information probably doesn't need to be captured locally. Then, when new orders are placed, the application can update the local tables running on the same machine.
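The refresh step might look something like the following sketch. `ReplicaRefresh`, its map-based stores, and the product data are all hypothetical stand-ins; in a real application both stores would be databases reached via JDBC, but the shape of the operation is the same: copy only what disconnected operation requires, not the whole schema.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the refresh step: while the laptop has a network
// connection, copy just the subset of central data the disconnected client
// needs (stock counts and prices) into an in-process local store.
public class ReplicaRefresh {
    // Stand-in for the centralized database: product -> {unitsInStock, priceInCents}.
    static final Map<String, int[]> central = new HashMap<>();
    // Stand-in for the local, in-process store shipped with the client.
    static final Map<String, int[]> local = new HashMap<>();

    // Refresh the local replica from the central store. Open sales, order
    // history, and shipping data are deliberately left out of the copy.
    static void refreshLocal() {
        local.clear();
        for (Map.Entry<String, int[]> e : central.entrySet())
            local.put(e.getKey(), e.getValue().clone()); // copy, don't alias
    }

    public static void main(String[] args) {
        central.put("widget", new int[] { 100, 499 });
        refreshLocal();
        // From here on, quotes and orders can be served with no network in sight.
        System.out.println("widget stock: " + local.get("widget")[0]);
    }
}
```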



Some of you are undoubtedly cringing at the entire suggestion. "Not connected to the remote database? Inconceivable! How do we avoid data integrity errors? That's why we centralized the database in the first place! After all, if there are only 100 widgets left, and both Susan and Steve sell those last 100 widgets to different clients via their laptop systems, we have an obvious concurrency issue when we sync their data up against the centralized system. Any system architect knows that!"



Take a deep breath. Ask yourself how you're going to handle this scenario anyway, because whether it happens at the time the salesperson places the order or when the order goes into the database, the same problem is still happening. Then go take a look at Item 33 as one way to solve it.



Normally, in a connected system, inventory checking occurs when the salesperson places the order; if the requested number of widgets isn't available on the shelves, we want to key the order with some kind of red flag and either not let the order be placed or somehow force the salesperson to acknowledge that the inventory isn't available at the moment. In the connected version of the application, this red flag often comes via a message box or alert window; how would this be any different from doing it later, when the salesperson is synchronizing the data against the centralized system? In fact, this may be a preferable state of affairs, because that alert window could potentially ruin the sale. If the VP sees the message, she may rethink placing the order: "Well, I'll bet your competitor has them in stock, and we need these widgets right now," even if "right now" means "we can use only 50 a week." Instead, when the red flag goes off back at the office, the salesperson can do a little research (difficult, if not impossible, to do sitting in the client VP's office) before calling the customer to tell them the news. ("Well, we had only 50 widgets here in the warehouse, but Dayton has another 50, and I'm having them express-mailed to you, no charge.")
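The sync-time check might be sketched as follows. `OrderSync`, its map-based stores, and the widget quantities are hypothetical stand-ins for the real order and inventory tables; the point is that the shortfall becomes a flagged follow-up item at the office rather than an alert window in front of the customer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: when the laptop reconnects, replay locally captured
// orders against the central inventory, red-flagging any that can no longer
// be filled instead of rejecting them outright.
public class OrderSync {
    // Stand-in for the central inventory table: product -> units in stock.
    static final Map<String, Integer> centralStock = new HashMap<>();

    // Apply the disconnected client's orders; return the "red flag" list
    // of order lines that need follow-up before shipping.
    static List<String> sync(Map<String, Integer> localOrders) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Integer> order : localOrders.entrySet()) {
            String product = order.getKey();
            int wanted = order.getValue();
            int have = centralStock.getOrDefault(product, 0);
            if (wanted > have) {
                flagged.add(product + ": wanted " + wanted + ", only " + have + " on hand");
                centralStock.put(product, 0); // ship what we have; back-order the rest
            } else {
                centralStock.put(product, have - wanted);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        centralStock.put("widget", 100);
        Map<String, Integer> susansOrders = new HashMap<>();
        susansOrders.put("widget", 100);
        Map<String, Integer> stevesOrders = new HashMap<>();
        stevesOrders.put("widget", 100);
        System.out.println(sync(susansOrders)); // no flags: stock drops to zero
        System.out.println(sync(stevesOrders)); // flagged: Susan synchronized first
    }
}
```

This is exactly the Susan-and-Steve collision from two paragraphs back: both orders are accepted in the field, and the second one to synchronize is flagged for the salesperson to resolve with the customer.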



While the remote database has obvious advantages for enterprise systems (after all, part of our working definition of an enterprise system is that it manages access to resources that must be shared, which implies centralization), it doesn't necessarily serve as the best repository for certain kinds of data (like per-user settings, configuration options, or other nonshared data) or for certain kinds of applications, particularly those that need or want to run in a disconnected fashion.















