[vworld-tech] Java scaling

Sat Apr 30 19:25:08 PDT 2005

Not intended as any pro-java evangelism post; just trying to give Alex 
some advice and pointers for further exploration.

Alex Chacha wrote:
 > ceo wrote:
 >
 >> ...but now I *know* you're doing something very, very wrong, or you're
 >> distributed raytracer), because java scales fairly effortlessly to 500
 >> clients these days (i.e. use the sun tutorials aimed at newcomers to
 >
 >
 > Let me give you two scenarios where I have direct experience with java.
 >
 > 1. The MMORPG I am working on part time (labor of love more than
...
 > magnitude difference).  This was done using Sun's JVM 1.4.1_06

Right. 1.4.1 was never a usable release for networking and was retired 
?several years ago (2003)? IIRC. So, right from the start, you're using 
an unsupported version of java - seriously, there are massive bugs in 
that version that make it practically impossible to do various things, 
and there are no workarounds (FYI IIRC it was mistakes in the 
interfacing to low-level platform-specific networking libs on two of the 
major platforms; for instance, a confused approach to using IOCP on 
windows. But only IIRC...it was a long time ago).

 > out which would work best).  There are 10 worker threads and 1
 > listener/queue thread.  The design works great, but the quest and

Off the top of my head...I'm sure you're not doing this (it would be 
crazy), but you're not running a single I/O thread using old (blocking) 
I/O, are you?

Assuming you're using NIO, I could make some stab-in-the-dark guesses. 
If upgrading to the current version (1.4.2_XX - currently _07 IIRC) 
doesn't fix it, then...I'd take a look at your synchronization design. 
Sun's API comes with very sparse docs, but it does explicitly tell you 
that "high quality implementations will block for a very small amount of 
time, if at all, in this scenario, but simpler implementations may block 
for a long time or indefinitely". They wrote an API spec with holes you 
could drive a bus through - and then shipped an ultra-low-quality 
implementation of it in their own JVM.
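
For reference, a minimal sketch of the single-threaded NIO arrangement 
assumed above: one thread, one Selector, non-blocking channels. This is 
not Alex's code and not the Game Programming Gems code; the class name 
NioEchoSketch, the 1024-byte buffer and the echo behaviour are all 
made up for illustration, and the write loop is deliberately naive.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

/** Hypothetical echo server: a single thread multiplexes every client. */
final class NioEchoSketch implements Runnable {
    private final Selector selector;
    private final ServerSocketChannel server;

    NioEchoSketch(int port) throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(port)); // port 0 = any free port
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    int port() {
        return server.socket().getLocalPort();
    }

    public void run() {
        try {
            while (true) {
                selector.select(); // blocks until some channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove(); // the selector never clears this set itself
                    if (!key.isValid()) continue;
                    if (key.isAcceptable()) accept();
                    else if (key.isReadable()) echo(key);
                }
            }
        } catch (ClosedSelectorException | IOException e) {
            // close() was called, or the selector failed: leave the loop
        }
    }

    private void accept() throws IOException {
        SocketChannel client = server.accept();
        if (client == null) return; // nothing pending after all
        client.configureBlocking(false);
        // One buffer per connection, carried around as the key's attachment.
        client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(1024));
    }

    private void echo(SelectionKey key) {
        SocketChannel client = (SocketChannel) key.channel();
        ByteBuffer buf = (ByteBuffer) key.attachment();
        try {
            if (client.read(buf) < 0) { // peer closed the connection
                key.cancel();
                client.close();
                return;
            }
            buf.flip();
            // Naive: spins if the socket's send buffer is full. A real
            // server would register OP_WRITE and finish the write later.
            while (buf.hasRemaining()) client.write(buf);
            buf.clear();
        } catch (IOException e) {
            key.cancel();
            try { client.close(); } catch (IOException ignored) { }
        }
    }

    /** Stops the loop; already-connected clients are not closed here. */
    void close() throws IOException {
        selector.close();
        server.close();
    }
}
```

To try it, run `new Thread(new NioEchoSketch(0)).start()` and connect 
to the port reported by port(). The point is only that no thread is 
parked per client, so whatever is causing the context switches, it 
shouldn't be the I/O layer itself.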

(I believe this is known as making the free version of your product crap 
so that more people buy the expensive one? Tongue firmly in cheek)

 > tradeskill engine is still using simple data for testing, so I am not
 > yet CPU bound.  I tried loading up the machine in this scenario and hit
 > a wall when I had 50 remote clients.  CPU utilization was about 70% but

First suggestion: go read my section in Game Programming Gems 4 
(Thousands of clients per server...although the copy of the source on 
the CD isn't usable). In some ways, I regret publishing it there, 
because it means I can't (legally) post it anywhere on the web :(. I'm 
not banging my drum here, but it highlights a load of separate issues, 
and you may just glance at one and go "ah - that gives me an idea of 
something I didn't mean to be doing".

 > the JVM was spending a lot of time in context switches.  I didn't really

Shrug. With your invalid JVM I'm not even sure it's worth commenting on 
this, other than to say "this doesn't happen in normal usage". If you 
reproduce that on a real JVM then you'll need to give a rundown of the 
pseudo-code algo you're using - there may be something pathologically 
bad you're doing with your arrangement of code within your method to 
cause real badness here.

Note: not necessarily "wrong", just unfortunately bad as far as the JVM 
goes.

 > 2. Where I am currently doing time of hard labor, we get about a billion
 > hits a day.  With C++ backend the whole thing ran with 40 machines, we

We're building a billion-hit-per-day-at-peak system at the moment. It's 
specced to handle a little more than a billion, but we're only expecting 
it to run at that rate for a few days a week, and only 6 hours at a 
time. So whilst I'm not in the same position at the moment, I'm working 
within similar parameters.

 > ported to java and pushing 1500 machines to keep the same response times

Off the top of my head, knowing practically nothing about what you do 
:P, 40 to 1500 means one of two things:
   1. You need to fire your System Architects for gross incompetence
   2. Someone decided that it was cheaper for you to buy and maintain 
1500, given the context, than to go with 40

Whilst it sounds drastic, point number 2 is fairly common these days. 
Why waste manpower making a system fast and efficient if it's cheaper to 
be slow and inefficient and throw cheap hardware and low-end sysadmins 
at the problem? Not for me to judge...
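
To put rough numbers on the 40-versus-1500 comparison, here is a 
back-of-envelope sketch. It assumes the billion hits are spread evenly 
over 24 hours and evenly across machines, which real traffic never is; 
the class and method names are made up for illustration.

```java
/** Hypothetical back-of-envelope helper: average request rate per machine. */
final class LoadEstimate {
    private static final double SECONDS_PER_DAY = 86_400.0;

    /** Assumes perfectly uniform load over the day and across machines. */
    static double perMachinePerSecond(long hitsPerDay, int machines) {
        return hitsPerDay / SECONDS_PER_DAY / machines;
    }

    public static void main(String[] args) {
        long hitsPerDay = 1_000_000_000L; // "about a billion hits a day"
        System.out.println("40 machines:   " + perMachinePerSecond(hitsPerDay, 40) + " hits/s each");
        System.out.println("1500 machines: " + perMachinePerSecond(hitsPerDay, 1500) + " hits/s each");
    }
}
```

Under those assumptions the average works out to roughly 290 hits per 
second per machine with 40 machines, and under 8 per second per machine 
with 1500. Peak rates will be higher, but the ratio is what matters.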

 > (and this is after EJB was dumped due to dismal performance).  Now
 > before you say something is really wrong, we have almost 1000 people
 > working full time at optimizing, tuning, debugging and coding (not

Shrug. With a large staff, I wouldn't be scared of having that many 
machines if it was going to save me some hassle in some other way not 
mentioned so far. So...I wouldn't say it's "wrong" per se, but I would 
be very surprised if you couldn't get it down to 100 fairly easily 
(conceptually; I'm guessing that in practice you've got that classic 
problem that your data is in the "wrong" structure and just transforming 
it to the right structure to fit the easy efficient solution would be 
very costly).

 > a way of life, but with it we accepted that the performance will be
 > awful give the complexity of the system (lots of database interaction,
 > messaging, logging, and tons of external services, etc).  I can see a

To put it another way, I could walk into a 50k-staff corporation as CTO 
and quickly design and roll out a system that replaced their office 
systems and service systems and was horribly slow and inefficient but on 
paper looked fine. It's easy to do stuff like buy in to CORBA and assume 
that "it works" means "it works as fast as we want it to", or buy in to 
Sun's J2EE because "Sun has lots of evidence of other bigger rollouts 
working fast".

I'm not criticising, just pointing out how easy it is to do a complex 
middleware system with Java that is awfully slow - largely because the 
DEFAULT setup *wasn't aimed at people who want raw speed*. This isn't a 
secret, BUT an awful lot of people whose training or experience in J2EE 
isn't quite sufficient assume it was built for speed (or, at least, for 
their expectation of speed), and get burnt. Badly.

 > Now for even more issues to note with Java:
 > 1. Client implementation (unless you mean telnet like text emulation) is
 > going to be very very tough.  The UI parts change between versions and

No, they don't. Which means I must be completely misunderstanding you :).

 > something written for v1.3 will not look right with v1.4 and v1.5 (and

It will, in fact, look identical !?! I'm sure I'm just being stupid in 
misunderstanding you, so please expand on what you mean and I should 
spot it.

 > vice versa).  To add to this, I have ran into more versions of java
 > installed than I cared to note.  Trying to enforce a version on the
 > client side is also a messy endeavor.  After trying a ver revisions on

There are extremely effective (and free) solutions to this problem that 
work very well. Webstart (part of core java), for instance, is a very 
very good way of handling *all* this automatically. The only downside is 
that Sun's corporate arm couldn't find its arse with both hands 
sometimes, and fails to promote the best of its own new technology, 
such as JWS.

 > 2. Not all JVMs are implemented the same.  Recently I ran into a nasty

Which is why nearly everyone uses Sun only, unless they choose to commit 
to supporting IBM too because it has in the past, for a year or two at a 
time, been much more cutting-edge in terms of performance.

Getting the same codebase across the three major client platforms 
(windows, linux, OS X) tends to push people towards Sun.

 > slightly different results, a non-deterministic JVM was a surprise.  I

Sounds horrible. This isn't C, and non-deterministic JVM activity is 
extremely rare (has something to do with the arduous certification 
process, I believe).

That said, I've found a couple of instances of it. In all cases, 
however, it turned out not really to be the JVM, but the OS. For 
instance, buggy graphics card drivers running under Microsoft's 
not-quite-as-robust-as-it-should-be DirectX failing to perform an op 
(like allocating RAM!) yet returning a success code, leading to 
catastrophic or bizarre behaviour much later in execution.

Obviously, it's up to the JVM vendor to implement workarounds for such 
bugs, so they're still responsible. But it's nothing like the 
gcc/msvc/etc weirdnesses that used to go on.

 > What type of work do you use java for on the back end?

The central part that's serving those 1 billion hits per day will be 
java (it's not live yet. Ha. Famous last words!). Obviously, it's a lot 
more complex than that, and deep at the bottom sit MySQL DBs (which are 
getting close to bearable performance these days ;)). Things are going 
OK so far, and we know what we're doing - e.g. one of my staff used to 
run a billion-hits-a-day site entirely in perl. There are quite a few 
such sites around, if you know where to look ;).

Previously, I did a lot of work on the GrexEngine, which is (in gross 
simplification) "J2EE re-designed and written from scratch as a 
high-performance system, specifically aimed at online games".

At grex, I would compare performance of a game-server to the latest, 
fastest, apache running largely static pages, and be happy when I was 
level-pegging. Day-to-day testing loads were typically in the range of 
200-750 simulated clients per server, over a couple of 100Mbit switches 
(with the servers running very old hardware - 0.5-Gigahertz processors, 
for instance).

J2EE makes a lot of assumptions to the tune of "your requests are 
business traffic, hence infrequent and either very very light or very 
very heavy". The GE assumes all requests are very frequent and 
moderately heavy.

Adam M