
Understanding asymmetric Clustering – Part 3

In the last two posts we covered asymmetric clustering.

More can be found at

What is asymmetric clustering – Part 1

What is asymmetric clustering – Part 2

What is an appropriate level of granularity for a partition?

That’s very application specific. Customers today typically use fewer than 512 partitions. Partitions can be either:

Named partitions for a specific function in the application.

Here the application may need a single thread in the cluster performing a task such as database archiving or summarizing audit records. This work could be done in a partition, and the partitioning facility runtime would make sure it’s always running in exactly one cluster member. These partitions correspond to conventional singletons.

Named partitions for groups of data sets; hashing is an example. All data sets whose key hashes to X run on partition X. A stock symbol is one example: a partition could correspond to a stock symbol and be responsible for all order matching for that symbol. This partition can cache all symbol-related data, execute the incoming orders/quotes in sequence, and write any state changes through to the database to harden them. The goal here is to give each partition exclusive write rights over the subset of the data in the database corresponding to that partition.
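The hashing approach can be sketched roughly as follows; the partition count and naming scheme here are illustrative assumptions, not an actual XD API:

```java
// Hypothetical sketch of hash-based partition selection. Every key with
// the same hash lands on the same named partition, so one cluster member
// ends up owning all writes for that key's subset of the data.
public class PartitionRouter {
    private final int partitionCount; // illustrative; real counts are application specific

    public PartitionRouter(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    // Deterministically map a key (e.g. a stock symbol) to a partition name.
    public String partitionFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), partitionCount);
        return "Partition-" + bucket;
    }
}
```

Because the mapping is deterministic, all orders for a given symbol always route to the same partition, which is what lets that partition cache the symbol’s data safely.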

Can partitions be used as a conventional singleton?

Yes, a named partition is simply a conventional singleton. HTTP and EJB client requests can be routed to the JVM in the cluster that hosts a specific partition. The application programmer must write an interceptor that is called on the EJB client when a remote method on a specific stateless session bean is invoked. The interceptor examines the method name and arguments and returns the name of the partition the request should be routed to. So, for example, a client can call an EJB method named ‘acceptOrder’ that takes an Order object as a parameter. The interceptor is given the method name and argument(s) and returns the String produced by Order.getCompanyName(). The client then routes the request to the partition named after the company. If the JVM hosting that company’s partition crashes or fails, WebSphere automatically fails the partition over to another cluster member and new requests are routed there instead.
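The interceptor just described could look roughly like this; the class names, the Order bean, and the callback signature are illustrative assumptions, not the actual partitioning facility API:

```java
// Hedged sketch of a client-side routing interceptor. The method is
// assumed to be called on the EJB client before the remote invocation
// and to return the name of the target partition (or null to fall back
// to normal workload management).
public class OrderRoutingInterceptor {

    // Stand-in for the application's Order parameter object.
    public static class Order {
        private final String companyName;
        public Order(String companyName) { this.companyName = companyName; }
        public String getCompanyName() { return companyName; }
    }

    // Examine the method name and arguments, return the partition name.
    public String choosePartition(String methodName, Object[] args) {
        if ("acceptOrder".equals(methodName) && args != null && args.length == 1) {
            return ((Order) args[0]).getCompanyName();
        }
        return null; // not partition-routed: use default workload management
    }
}
```

With this in place, every `acceptOrder` call for a given company routes to that company’s partition, wherever the HAManager is currently running it.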

So the workload management component lives in the client stubs? What about work arriving using messaging?

Applications can also use JMS or non-JMS messaging providers such as TIBCO Rendezvous or Reuters SFC to receive partition-related work. When a partition is activated in a JVM, the application can subscribe, using JMS or native APIs, to the queues or topics carrying messages for that partition. When the partition is deactivated, the application unsubscribes from the topic or queue. The async beans feature of WebSphere is critical in providing this kind of flexibility; such an approach is not possible with the J2EE specification alone. Async beans provides a fully supported threading capability for J2EE applications and allows custom threading models for receiving this work. JSR 236 is intended to standardize these APIs and fill this current gap in the J2EE programming model.
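A minimal sketch of tying a subscription to the partition lifecycle, assuming hypothetical activation/deactivation callbacks and a generic subscriber interface standing in for JMS or a native messaging API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: when a partition is activated in this JVM we open
// a subscription for its messages; on deactivation we close it so the
// new owner receives the messages instead. The callback names and the
// Subscriber interface are assumptions, not the WPF API.
public class PartitionMessageBinder {

    public interface Subscriber {
        AutoCloseable subscribe(String destination);
    }

    private final Subscriber subscriber;
    private final Map<String, AutoCloseable> active = new HashMap<>();

    public PartitionMessageBinder(Subscriber subscriber) {
        this.subscriber = subscriber;
    }

    // Assumed to be called when the partition is activated in this JVM.
    public void partitionActivated(String partitionName) {
        active.put(partitionName, subscriber.subscribe("orders." + partitionName));
    }

    // Assumed to be called when the partition is deactivated (e.g. moved).
    public void partitionDeactivated(String partitionName) {
        AutoCloseable sub = active.remove(partitionName);
        if (sub != null) {
            try { sub.close(); } catch (Exception e) { throw new RuntimeException(e); }
        }
    }

    public boolean isActive(String partitionName) {
        return active.containsKey(partitionName);
    }
}
```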

How are the partitions made highly available?

WebSphere has a new component called the High Availability Manager, or HAManager, which is included in both WebSphere XD 5.1 and WebSphere 6.0 and provides high availability services in both products. XD uses the HAManager to manage partitions and ensure that a partition runs on exactly one cluster member. Administrators can use policies to specify preferred server lists and fail-back options for partitions. For example, the administrator could say that the partition for trading IBM stock should primarily run on server A and fail over to server B, with manual fail-back to A if A restarts after a failover. We can demonstrate failover times on the order of seconds using this technology. Policies can also be updated in real time to control/move partitions around a cluster without JVM restarts.

How do you detect server failure?

We use heartbeating and we also watch the health of connections between peer JVMs. If a connection to a JVM closes, or the JVM stops responding to heartbeats, we mark it as dead and initiate recovery.
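In outline, heartbeat-based detection amounts to something like the following sketch; the timeout value and API are illustrative, not the HAManager’s internals:

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of heartbeat-based failure detection: each peer records
// the timestamp of the last heartbeat received from every other member,
// and a member that misses heartbeats past the timeout is suspected dead,
// at which point recovery (failover) would be initiated.
public class HeartbeatMonitor {
    private final long timeoutMillis; // illustrative threshold
    private final ConcurrentHashMap<String, Long> lastSeen = new ConcurrentHashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void heartbeatReceived(String member, long nowMillis) {
        lastSeen.put(member, nowMillis);
    }

    // True if the member has never been seen or has been silent too long.
    public boolean isSuspect(String member, long nowMillis) {
        Long seen = lastSeen.get(member);
        return seen == null || nowMillis - seen > timeoutMillis;
    }
}
```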

What if the server isn’t really dead?

This is possible when a JVM pauses temporarily and doesn’t send heartbeats, or because of network flooding or swapping on a machine. The application can be coded to detect and tolerate split clusters using database locking, or a file-based database on a shared file system using leased file locks. We do this in WebSphere for transaction logging as well as for our built-in messaging engine. Or, if a customer desires it, WebSphere can integrate directly with the hardware to make a very cost-effective highly available system. When WebSphere detects a failed server, it can be configured to run a script whose purpose is to guarantee the server is stopped. When the script returns, WebSphere starts recovery. We have scripts that communicate with the service processor on xSeries servers, as well as the xSeries blade chassis service processor, and tell it to power down the blade with the suspicious JVM. Scripts can also be written to control intelligent power strips (power strips with Ethernet ports whose individual outlets can be switched on or off, typically via SNMP) and to take advantage of non-IBM server hardware with similar capabilities. This guarantees the JVM has failed before recovery starts. If the script cannot verify the server is powered down, WebSphere doesn’t fail over, and we notify the administrator.
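The shared-storage locking idea can be sketched with a plain `java.nio` exclusive file lock: only the member that holds the lock may recover or own the resource. Real deployments use leased locks or database locks as described above, so this is only an illustration of the exclusive-ownership guard, and the helper name is made up:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hedged sketch: try to take exclusive ownership of a lock file on shared
// storage. If another member (possibly one that is not really dead) still
// holds the lock, we get null and must not start recovery.
public class OwnershipGuard {
    public static FileLock tryAcquire(Path lockFile) throws IOException {
        FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = channel.tryLock(); // null if held by another process
        if (lock == null) {
            channel.close();
        }
        return lock; // caller keeps the channel open while it owns the lock
    }
}
```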

Why is it best for a partition to run on one server vs. across multiple ones?

The key to horizontal scalability is eliminating cross talk and contention between servers. Stock trading, airline/hotel reservations, and batch applications are examples of applications that can be partitioned this way. The ideal situation is that partitions do not need to interact with each other at all; if that’s the case, we get linear scalability. All applications that use a database will experience some cross talk within the database, due to index locks or latches (such as those surrounding the transaction log), until the application uses a partitioned database. Application architects should strive to stay as close to this ideal as possible for best performance.

Won’t the database become a bottleneck?

Of course, but this database can do more work than normal because it’s just processing updates; it should experience very little contention and practically no queries at all. We have seen even small 4-way boxes perform several thousand update transactions per second in internal benchmarking for customers in the financial sector. XD includes new support for a proxy data source, which enables developers to use multiple databases for a single CMP per deployment. This means the data in the database itself can be partitioned. We could use two independent DB2 databases and put the read/write data from partitions A-M in database A and partitions N-Z in database B. The application can tell XD which DB2 database to use for a given transaction. This also improves availability: losing a database doesn’t stop all work, it’s a partial outage. If database A goes down, the data from N-Z can still be processed. This provides a linearly scalable pattern for the database that’s superior from an availability and scalability viewpoint when compared with competitive database products.
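The A-M / N-Z split above can be sketched as a simple range-based router; the JNDI-style names are made up for illustration and this is not the proxy data source API itself:

```java
// Illustrative sketch of routing a partition's transactions to one of two
// databases by key range, as in the A-M / N-Z example. The data source
// names are hypothetical.
public class DatabaseRouter {
    public String databaseFor(String partitionName) {
        char first = Character.toUpperCase(partitionName.charAt(0));
        // Partitions A-M go to database A, N-Z to database B.
        return (first <= 'M') ? "jdbc/databaseA" : "jdbc/databaseB";
    }
}
```

Because each partition writes to exactly one database, losing database A leaves all N-Z partitions fully operational, which is the partial-outage behavior described above.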

You mentioned XD supporting server consolidation, how does that work?

A lot of customers tell us that they want to consolidate multiple independent server farms into a single, smaller server farm. They want to do this because most server farms are underutilized or over-provisioned; the boxes are typically running at only 20%-30% capacity, if that. This is costly and inflexible. For example, an application in one server farm experiences a surge in requests and maxes out the CPU capacity in that farm, while the server farm in the next room sits basically idle at 10% CPU.

XD allows administrators to define a single cluster (a node group) and then deploy multiple applications to that node group. The administrator does not tell XD how many servers in the node group should run each application. Instead, the administrator tells XD that application A should have a response time of one second, application B 500ms, and application C two seconds. The administrator can also tell us that A is more important than B and C, and that C is more important than B. XD will then monitor the workload and dynamically decide how many instances of each application to run on the node group in order to meet these goals. If application A currently has a response time of 1.5 seconds, XD will add more application A server instances, and potentially remove application B and C server instances, to bring more resources to application A so it can meet its goal. XD can also predict that A will likely exceed its response time in 10 minutes based on a trend and react in anticipation of the event. This greatly simplifies the life of an administrator and allows the machines to be used more efficiently than the usual independent-server-set approach. XD also offers options to generate various email alerts when conditions are exceeded; for example, it can restart servers when they appear to have a memory leak or after X requests.

Can you describe how this automatic provisioning works within the context of J2EE applications?

Normally, an administrator creates a cluster and then adds specific machines to that cluster. The application is then deployed to this cluster and runs on all those machines. The number of servers in the cluster is static unless the administrator manually adds/removes servers to/from the cluster. XD is different. The administrator creates a node group with all available servers. The administrator then creates a dynamic cluster for each application to be run on the shared environment represented by the node group. The administrator then specifies the goals and which applications/application URIs have which goals. XD’s job is then to:

Measure the performance of the applications in the node group

Analyze it

Predict and compare the response times with the goals for the requests

Execute changes to the placement of applications to the node group servers to better meet the goals

This is called a MAPE loop, from the initials of those steps. As an example, say there are four machines in the node group. The administrator adds two applications to the node group by making a dynamic cluster for each. Application A has a service level agreement of 1 second and high priority; application B has a service level agreement of .5 seconds and medium priority. A is initially running on just one server. If the workload for A is small and B gets a spike that breaks its .5 second SLA, XD will automatically start application B on all 4 servers. If application A then breaches its SLA because of the load from B, then, because A is more important, XD will stop B running on the server hosting A, and over time XD may start A on all 4 servers and stop B on 3 of them in order to allow A to meet its goal.
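The loop can be caricatured in a few lines of code; the thresholds and placement logic below are drastically simplified, illustrative stand-ins for XD’s actual controller:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the Measure/Analyze/Predict/Execute placement loop: each
// iteration compares measured response times against goals and adjusts
// how many server instances each application gets. Priorities, prediction,
// and per-server placement are omitted for brevity.
public class MapeLoop {
    private final Map<String, Double> goals = new HashMap<>();      // app -> target response time (s)
    private final Map<String, Integer> instances = new HashMap<>(); // app -> running instances
    private final int maxServers;

    public MapeLoop(int maxServers) {
        this.maxServers = maxServers;
    }

    public void setGoal(String app, double targetSeconds, int initialInstances) {
        goals.put(app, targetSeconds);
        instances.put(app, initialInstances);
    }

    // One iteration: measured response times in, placement changes out.
    public void iterate(Map<String, Double> measuredSeconds) {
        for (Map.Entry<String, Double> entry : measuredSeconds.entrySet()) {
            String app = entry.getKey();
            Double goal = goals.get(app);
            if (goal == null) continue;
            int running = instances.get(app);
            if (entry.getValue() > goal && running < maxServers) {
                instances.put(app, running + 1);  // breaching SLA: add an instance
            } else if (entry.getValue() < goal / 2 && running > 1) {
                instances.put(app, running - 1);  // well under goal: release capacity
            }
        }
    }

    public int instancesOf(String app) {
        return instances.get(app);
    }
}
```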

Any parting thoughts?

I think it’s an exciting time for J2EE and WebSphere. Products like XD, especially with the partitioning facility and async beans (a WBI 5.1 or WebSphere 6.0 feature) and the additional features that we’re planning for XD 6.0, will redefine the types of applications that Java application servers can be used for. This technology applies to both the high end and the low end.

Some customers want 100 transactions per second while others want 100,000 transactions per second. Some of the largest customers want less than ten transactions per second but they are very high value, mission critical transactions. Almost everybody wants very high availability levels, fast recovery times (seconds) and to be able to leverage standards based skill-sets that their programmers already mostly know rather than have to hire a team of rocket scientists and build a complete custom platform from scratch.

A small amount of custom code can make a huge performance difference for an application, and WebSphere’s support for micro-containers finally allows customers to do this without having to build a complete custom application server from scratch. Use what we provide and enhance it with a micro-container if you need to. Most of your application is vanilla J2EE code, and only a small fraction of it will likely interact with a micro-container, if you need one at all.

The trick is making an off-the-shelf platform that can meet these kinds of expectations using economical technology, from both a development and a deployment point of view. The technology we’re adding to the WebSphere J2EE platform will allow customers to do this, get the best of both worlds, and hopefully redefine the OLTP transaction monitor space.
