Wednesday, April 3, 2013

Load Balancing T3 InitialContext Retrieval for WebLogic using Oracle Traffic Director

 

Introduction


The T3 protocol is WebLogic's fast native binary protocol that is used for most inter-server communication, and by default, communication to a WebLogic server from client applications using RMI, EJB or JMS, for example. If a JMS distributed queue or an RMI/EJB based application is deployed to a WebLogic cluster, for high availability, the client application needs a way of addressing the "virtual endpoint" of the remote clustered service. Subsequent client messages sent to the service then need to be load-balanced and failed-over, when necessary. For WebLogic, the client application achieves this in two phases:

  1. Addressing the virtual endpoint of the clustered service. The client Java code populates a "Provider URL" property with the address of the WebLogic cluster, to enable the "InitialContext" to be bootstrapped. As per the Weblogic JNDI documentation, "this address may be either a DNS host name that maps to multiple IP addresses or a comma separated list of single address host names or IP addresses". Essentially this virtual endpoint URL is only ever used by the client, to connect to a random server in the cluster, for creating the InitialContext. The URL is not used for any further interaction between client and clustered servers and does not influence how any of the subsequent T3 messages are routed.
  2. Load-balancing JNDI/RMI/EJB/JMS messaging to the clustered service. Once the InitialContext is obtained by the client application, all JNDI lookups and subsequent RMI/EJB/JMS remote invocations, use WebLogic generated "cluster-aware" stubs, under the covers. These stubs are populated with the list of clustered managed server direct connection URLs and the stub directly manages load-balancing and failover of T3 requests across the cluster. When a particular server is down, the stub will route new requests to another server, from the cluster list it maintains. Under certain circumstances, such as cases where no servers in the stub's current cluster list are reachable any more, the stub will be dynamically refreshed with a new version of the cluster list, to try. The stub's list of cluster membership URLs is unrelated to the fixed URL that is used for phase 1 above.

For bootstrapping the InitialContext (i.e. phase 1 above), the WebLogic Clustering documentation recommends that, for production environments, a DNS entry containing a list of clustered server addresses is used. This avoids the client application needing to "hard-code" a static set of cluster member addresses. However, sometimes it may be more convenient to use an external load balancer to virtualise this clustered T3 endpoint. In this blog entry, I will examine how the Oracle Traffic Director product can be used to virtualise such a cluster endpoint address, when bootstrapping the InitialContext.

It is not required or even supported to use an External Load Balancer for load-balancing the subsequent JNDI/RMI/EJB/JMS interaction over T3 (i.e phase 2 above). The WebLogic generated stubs already perform this role and are far more adept at it. Additionally, many of these T3 based interactions are stateful, which the stubs are aware of and help manage. So attempting to also introduce an External Load Balancer in the message path will break this stateful interaction. These rules are explicitly stated in the WebLogic EJB/RMI documentation which concludes "when using the t3 protocol with external load balancers, you must ensure that only the initial context request is routed through the load balancers, and that subsequent requests are routed and controlled using WebLogic Server load balancing".


Oracle Traffic Director


Oracle Traffic Director (OTD) is a layer-7 software load-balancer for use on Exalogic. It includes many performance related characteristics, plus high availability features to avoid the load-balancer from being a "single point of failure". OTD has been around for approximately 2 years and has typically been used to virtualise HTTP endpoints for WebLogic deployed Web Applications and Web Services. A new version of OTD has recently been released (see download site + online documentation), that adds some interesting new features, including the ability to load-balance generic TCP based requests. An example use case for this, is to provide a single virtual endpoint to client applications, to access a set of replicated LDAP servers as a single logical LDAP instance. Another example use case, is to virtualise the T3 endpoint for WebLogic client applications to bootstrap the InitialContext, which is precisely what I will demonstrate in the rest of this blog entry.


Configuring OTD to Load Balance the T3 Bootstrap of the InitialContext 


I used my Linux x86-64 laptop to test OTD based TCP load-balancing. I generated and started up a simple Weblogic version 10.3.6 domain containing a cluster of two managed servers with default listen addresses set to "locahost:7001" and "localhost:7011" respectively. I also deployed a simple EJB application targeted to the cluster (see next section).

I then installed OTD version 11.1.1.7 and created a simple OTD admin-server and launched its browser based admin console. In the admin console, I could see a new section ready to list "TCP Proxies", in addition to the existing section to list "Virtual Servers" for HTTP(S) end-points, as shown in the screenshot below.


(click image for larger view)
To create the TCP Proxy configuration I required, to represent a virtual endpoint for my WebLogic cluster, I chose the 'New Configuration' option and was presented with Step 1 of a wizard, as shown in the screenshot below.


(click image for larger view)
In this Step 1, I provided a configuration name and an OS user name to run the listener instance under, and I selected the TCP radio button, rather than the usual HTTP(S) ones. In Step 2 (shown below), I defined the new listener host and port for this TCP Proxy, which in this case I nominated as port 9001 using the loopback address of my laptop.


(click image for larger view)
In Step 3 (shown below), I then provided the details of the listen addresses and ports for my two running WebLogic managed servers that needed to be proxied to, which is listen on localhost:7001 and localhost:7011 respectively.


(click image for larger view)
In the remaining two wizard steps (not shown), I selected the local OTD node to deploy to and reviewed the summary of the proposed configuration before hitting the 'Create Configuration' button. Once created, I went to the "Instances" view in the OTD admin console and hit the "Start/Restart" button to start my configured TCP Proxy up, listening on port 9001 of localhost.


(click image for larger view)
At this point, I assumed that I had correctly configured an OTD TCP Proxy to virtualize a Weblogic clustered T3 InitialContext endpoint, so I then needed to prove it worked....


Example Deployed Stateless Session Bean for the Test


I created a simple stateless session bean (including home interface) and deployed it as an EJB-JAR to Weblogic, targeted to my running cluster. The EJB interface for this is shown below and the implementation simply receives a text message from a client application and prints this message to the system-out, before sending an acknowledgement text message back to the client.

  public interface Example extends EJBObject {
    public String sendReceiveMessage(String msg) throws RemoteException;
  }


Example EJB Client Code for the Test


I then coded a simple standalone Java test client application in a single main class, as shown below, to invoke the remote EJB's sole business method (I've not listed the surrounding class & method definition and try-catch-finally exception handling code, for the sake of brevity).


  Hashtable env = new Hashtable();
  env.put(Context.INITIAL_CONTEXT_FACTORY,
        "weblogic.jndi.WLInitialContextFactory");
  env.put(Context.PROVIDER_URL, "t3://localhost:9001");
  System.out.println("1: Obtaining initial context");
  ctx = new InitialContext(env);
  System.out.println("2: Sleeping for 10 secs having got initial context");
  Thread.sleep(10 * 1000);
  System.out.println("3: Obtaining EJB home interface using JNDI lookup");
  ExampleHome home = (ExampleHome) PortableRemoteObject.narrow(
                 ctx.lookup("ejb.ExampleEJB"), ExampleHome.class);
  System.out.println("4: Sleeping for 10 secs having retrieved EJB home interface");
  Thread.sleep(10 * 1000);
  System.out.println("5: Creating EJB home");
  Example exampleEJB = home.create();
  System.out.println("6: Sleeping for 10 secs having got EJB home");
  Thread.sleep(10 * 1000);
  System.out.println("7: Start of indefinite loop");

  while (true) {
    System.out.println("7a: Calling EJB remote method");
    String msg = exampleEJB.sendReceiveMessage("Hello from client");
    System.out.println("7b: Received response from EJB call: " + msg);
    System.out.println("7c: Sleeping for 10 secs before looping again");
    Thread.sleep(10 * 1000);
  }

I included Java code to bootstrap the InitialContext, perform the JNDI lookup of the EJB home, create the EJB remote client instance and invoke the EJB's sendReceiveMessage() method, which are all shown in bold. Additionally, I included various debug print statements, a few 10 second pauses and a while loop (shown in non-bold) to space out the important client-to-server interactions, for reasons that I'll allude to later.

As can be seen in the code, the Provider URL address I used for the InitialContext was "localhost:9001", which is the OTD TCP Proxy endpoint I'd configured, rather than the direct address of the clustered managed servers (which would have been hard-coded to "localhost:7001,localhost:7011", for example). I compiled this client application using a JDK 1.6 "javac" compiler, and ensured that whenever I ran the client application with a Java 1.6 JRE, I included the relevant WebLogic JAR on the classpath.


Test Runs and WebLogic Configuration Change


With my WebLogic clustered servers running (ports 7001 & 7011), and my OTD TCP Proxy listener running (port 9001), I ran the test client application and hit a problem immediately  In the system-out for one of the managed servers, the following error was shown:

  <02-Apr-2013 15:53:00 o'clock GMT> <Error> <RJVM> <BEA-000572> <The server rejected a connection attempt JVMMessage from:
 '-561770856250762161C:127.0.1.1R:-5037874633160149671S:localhost:localhost:7001,localhost:7011:MyDemoSystemDomain:MyDemoSystemServer2' to: '0B:127.0.0.1:[9001,-1,-1,-1,-1,-1,-1]' cmd: 'CMD_IDENTIFY_REQUEST', QOS: '101', responseId: '-1', invokableId: '-1', flags: 'JVMIDs Sent, TX Context Not Sent, 0x1', abbrev offset: '105' probably due to an incorrect firewall configuration or admin command.> 

In the client application's terminal, the following error was subsequently shown:

  javax.naming.CommunicationException [Root exception is java.net.ConnectException: t3://localhost:9001: Bootstrap to: localhost/127.0.0.1:9001' over: 't3' got an error or timed out]
  Caused by: java.net.ConnectException: t3://localhost:9001: Bootstrap to: localhost/127.0.0.1:9001' over: 't3' got an error or timed out

After some investigation, I found that the My Oracle Support knowledge base contained a document (ID 860340.1) which defines this expected WebLogic behaviour and provides the solution. By default, WebLogic expects any remote T3 access to have been initiated by the client, using the same port number that the proxied request actually hits a port of the WebLogic server on. In my case, because I was running this all on one machine, the port referenced by the client was port 9001 but the port hit on the Weblogic Server was 7001. As recommended in the knowledge base document, I was able to avoid this benign server-side check from being enforced, by including a JVM-level parameter (see below) in the start-up command line for my WebLogic servers. I then re-started both managed servers in the cluster to pick up this new parameter.

  -Dweblogic.rjvm.enableprotocolswitch=true

This time, when I ran the client application, it successfully invoked the remote EJB. No errors were shown in the output of the clustered managed servers, and instead the system-out of one the managed servers showed the following, confirming that the EJB had been called.

  class example.ExampleEJB_ivy2sm_Impl EJB received message: Hello from client
    (output repeats indefinitely)

On the client side, the expected output was also logged, as shown below.


  1: Obtaining initial context
  2: Sleeping for 10 secs having got initial context
  3: Obtaining EJB home interface using JNDI lookup
  4: Sleeping for 10 secs having retrieved EJB home interface
  5: Creating EJB home
  6: Sleeping for 10 secs having got EJB home
  7: Start of indefinite loop
  7a: Calling EJB remote method
  7b: Received response from EJB call: Request successfully received by EJB
  7c: Sleeping for 10 secs before looping again
    (output 7a to 7c repeats indefinitely)


Proving OTD is not in the Network Path for T3 Stub Load-Balancing


The reason why my test client code includes 10 second pauses and debug system-out logging, is to enable me to observe what happens when I kill various WebLogic and/or OTD servers, part way through client application test runs.

One of the main scenarios I was interested in was to allow the client to bootstrap the InitalContext via the OTD configured TCP Proxy listening on localhost:9001, then terminate the OTD instance/listener process. OTD would be stopped whilst the client application was still in the first 10 second pause, and before the client application had a chance to perform any JNDI lookup or EJB operations. By doing this I wanted to prove that once a client application bootstraps an InitialContext, the Provider URL (eg. "localhost:9001") would subsequently be ignored and the cluster-aware T3 stub, dynamically generated for the client application, would take over the role of routing messages, directly to the clustered managed servers.

Upon re-running the client and stopping the OTD listener, after the client application logged the message "2: Sleeping for 10 secs having got initial context", it continued to work normally. After 10 seconds, it woke up and correctly performed the JNDI lookup of the EJB home, called the remote EJB home create() method and then called the remote EJB business method, logging each of these steps to system-out on the way. This proved that the InitialContext Provider URL is not used for the JNDI lookups or EJB invocations.

To further explore this, I also re-ran the scenario, after first starting Wireshark to sniff all local loopback network interface related traffic. I inspected the content of the T3 messages in the Wireshark captured network traffic logs. I observed that in some of the T3 messages, the responding managed server was indicating the current cluster address as being "localhost:7001,localhost:7011". Upon terminating the first of the two managed servers in the cluster (whilst the client application was still running in its continuous loop), I observed that subsequent cluster address metadata in the T3 responses from the second managed server, back to the client, showed the current cluster address as just "localhost:7011".

I ran one final test. With the WebLogic cluster of two managed servers running, but the OTD TCP Proxy Listener instance not running, I re-ran the client application again from start to end. As soon as the application started and attempted to bootstrap the InitialContext it failed, as expected, with the following logged to system-out.


  1: Obtaining initial context
  javax.naming.CommunicationException [Root exception is java.net.ConnectException: t3://localhost:9001: Destination unreachable; nested exception is: 
java.net.ConnectException: Connection refused; No available router to destination]
at weblogic.jndi.internal.ExceptionTranslator.toNamingException(ExceptionTranslator.java:40)
at weblogic.jndi.WLInitialContextFactoryDelegate.toNamingException(WLInitialContextFactoryDelegate.java:792)
at weblogic.jndi.WLInitialContextFactoryDelegate.getInitialContext(WLInitialContextFactoryDelegate.java:368)
at weblogic.jndi.Environment.getContext(Environment.java:315)
at weblogic.jndi.Environment.getContext(Environment.java:285)
at weblogic.jndi.WLInitialContextFactory.getInitialContext(WLInitialContextFactory.java:117)
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:667)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:288)
at javax.naming.InitialContext.init(InitialContext.java:223)
at javax.naming.InitialContext.(InitialContext.java:197)
at client.RunEJBClient.main(RunEJBClient.java:19)


This showed that the client application was indeed attempting to bootstrap the InitialContext using the Provider URL of "localhost:9001". In this instance, the TCP proxy endpoint, that I'd configured in OTD, was not available, so the URL was not reachable. This proved that the client application was not magically discovering the cluster address from a cold-start and does indeed rely on a suitable Provider URL being provided, to bootstrap the InitialContext.


Summary


In this blog entry, I have shown how an external load balancer can be used to provide a virtualised endpoint for WebLogic client applications to reference, to bootstrap the InitialContext from a WebLogic cluster. In these particular tests, the load balancer product used was the latest version of OTD, with its new "TCP Proxy" capability. However, most of the findings are applicable to environments that employ other types of load balancers, including hardware load balancers.

I have also shown that, other than for InitialContext bootstrapping, an external load balancer will not be used for subsequent T3 load-balancing. Instead, the cluster-aware T3 stubs, that are dynamically loaded into client applications, automatically take on this role.

For production environments, system administrators still have the choice of mapping a DNS hostname to multiple IP addresses, to provide a single logical hostname address representing a WebLogic cluster. However, in some data-centres, it may be more convenient for a system administrator to re-use an existing load balancer technology, that is already in place, to virtualise the endpoint and provide a single logical address for a cluster. This may be the case if it is much quicker for a system administrator to make frequent and on-demand configuration changes to a load balancer, rather than continuously raising tickets with the network team, to update DNS entries.



Song for today: Pace by Sophia

8 comments:

Tejas said...

Paul excellent work. It is very insightful.

Augustin said...

Very nice entry and clarifying my questions related to OTD and balancing t3.
I still don't see why would need OTD if t3 is already balanced, and also if subsequent request goes directly to managed servers then you would need an opened route from client to those manage servers ...

Regards,
Augustin

Paul Done said...

Hi Augustin

Great questions. I'll attempt to answer....

1) Need some-way of allowing client to initially address and locate the cluster. Granted, multiple DNS entries is often easier than using LB VIP, but I've often seen resistance from network guys to keep changing DNS each time cluster size increases/decreases. If LB is already virtualizing HTTP endpoint, for the cluster, why not use same place to virtualize and managed T3 IC bootstrap endpoint for same cluster? However, it's just a matter of choice. Worst practice would be to hard code into client app code, the comma separated list of individual address/ports of each server in cluster, because then it is hard to change cluster without breaking things. Ideally want one place (DNS or LB) to change cluster addresss metadata, when cluster changes, that then means all existing client apps continue to work as normal, without requiring modification.

2) For subsequent T3 load balancing, you absolutely need addressable connections for each individual server in the cluster. So one-to-one address-to-server mappings, from the client, to each server in cluster has to be achievable. The stub, that then load balances and fails over across these. However, these addresses could be different to the server listen addresses, and be on a different client network using same DNS hostname entries but resolving to different, NAT'd (Network Address Translated), IP Addresses thru a firewall. The trick here is to ensure that any translation maps a "client-side" IP address one-to-one to an individual "server-side" IP address of specific server - its the one-to-many address mapping of LB VIPs that would break this. Also see "External Listen Address" at http://docs.oracle.com/cd/E29597_01/apirefs.1111/e13952/pagehelp/Corecoreserverserverconfiggeneraltitle.html

Does this answer your question? :-)

Paul

冷眼 said...

Hi Paul,

We had tested similar approach (under WLS 10.3.4) with some HW LB and found (confirmed with network capture) that the connection for bootstrapping is still in use after bootstrap (effectively replacing one of the client-server direct connection). So the LB is one of the network path that will be used after bootstrap. Would you mind to check if it the same in your case?

Paul Done said...

Hi

For my environment I already proved its wasn't in the path when I say I stopped OTD before when following appeared but before EJB home was created: " System.out.println("2: Sleeping for 10 secs having got initial context");". So in my case, the LB was not in the path, post bootstrap. I can only imagine you had one of two factors going on: 1) The external address of you cluster servers was referencing the VIP hostname/port and hence the T3 response metadata updating the client with the VIP address and not direct address, or 2) you still had code periodically performing InitialContext creation.

Paul

Simon Haslam said...

Thanks for this excellent post Paul!

I was thinking about the multiple DNS entries approach as compared to using a HB for the initial call, I wonder if the response time for the initial request would be worse under failure conditions with DNS?

For example, say there are 4 managed servers in the DNS entry and one is down (or worse, running very slowly) - if the first call hit the bad server (25% chance), how long would it take to retry another server in the DNS list? Conversely using the health monitors on a LB the bad server should already be marked as down and no connection attempts would be made.

Paul Done said...

Good point Simon and I agree, that would be another reason, which I hadn't thought of, why one may favour a LB VIP for initial context over DNS round robin. In reality I'd assume that most clients (often apps in other weblogic clusters for example) would probably cache EJB homes, queues handles, connection factories, etc, so probably would not be creating initial context for every request. Paul

Paul said...

Hi Paul .

First of all thank you very much for the post , this could be the solution to my current problem.
I currently have a WL domain , with 4 managed servers ( to form a cluster) .
Host1 : 7003
Host1 : 7005
Host2 : 7007
Host2 : 7009
Currently there deployed about Web Services in the domain that consume EJB and its properties file have defined all nodes separated by commas . This is not scalable at all. That's why it caught my attention OTD .
Unfortunately WS and EJB are not developed by me , so I do not know how they work completely.
As I said the developer , consuming the EJB by T3, consult the database and then from the database server needs to reach each of the nodes to give some response .
There are also other EJBs that are consumed from other infrastructure services in my domain WebLogic T3 protocol and also do not know the logic of such services.

Perform configuration of a TCP proxy for T3 in OTD and I'll make some tests.

Thank you very much for the information.

Atte .

Paul Gallegos.