Java_openVZ_Futex_issue, version 1

Anthony Callegaro 2012-06-01 09:07:26 +00:00

Java and openVZ 2.6.32 - Futex issue
====================================
1. The context
--------------
We are currently in the process of upgrading all our servers.
Thanks to our [HAVEN High Availability
architecture](http://www.personalized-software.ie/Services#Hosting) we
can migrate virtual servers from one host to its slave with virtually no
downtime. So once the new servers were installed, we started testing the
migration of containers.

Since HAVEN is based on openVZ, we were using kernels from Proxmox, as
they backport all fixes from RedHat 6. The only difference between the
hosts was that the new servers had to use the 2.6.32 kernel, whereas the
old ones were deployed with 2.6.24.
2. The issue
------------
During the migration testing we noticed that one of our Java applications
(BigBlueButton) was failing. It looked as if it was running fine, but it
wasn't opening its ports and you had to do a `kill -9` to stop it.

A quick investigation with strace showed that one of BigBlueButton's
requirements (ActiveMQ) was waiting on one of its children, which was
stuck in an infinite loop:

- strace of the parent process:

      strace -p PID
      Process PID attached - interrupt to quit
      futex(0xb77dfbd8, FUTEX_WAIT, FIRST_CHILD_PID, NULL^C <unfinished ...>
      Process PID detached

- Get all child processes:

      ps -efL | grep PID

- strace-ing through the list of children, I found one that was looping
  infinitely:

      strace -p CHILD_PID
      futex(0x998e028, FUTEX_WAKE_PRIVATE, 1) = 0
      gettimeofday({1338478277, 139155}, NULL) = 0
      clock_gettime(CLOCK_REALTIME, {1338478277, 139659879}) = 0
      futex(0x998e044, FUTEX_WAIT_PRIVATE, 1, {0, 999495121}) = -1 ETIMEDOUT (Connection timed out)

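
The manual walk through the children can be approximated with a short loop over the threads of a process. This is only a minimal sketch, assuming a Linux `/proc` filesystem; the `list_threads` helper name is made up for illustration and is not part of any tool:

```shell
#!/bin/sh
# list_threads PID: print the thread IDs (TIDs) of a process, one per line,
# by reading /proc/PID/task (each sub-directory there is one thread).
# Hypothetical helper, not part of strace or ps.
list_threads() {
    for task in /proc/"$1"/task/*; do
        # If the process does not exist the glob stays unexpanded; skip it.
        [ -d "$task" ] && basename "$task"
    done
    return 0
}

# Example: list the threads of the current shell.
list_threads "$$"
```

Each TID printed this way can then be passed to `strace -p TID` to look for the thread that keeps timing out on `FUTEX_WAIT_PRIVATE`.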
A quick search on the openVZ bugzilla showed that it was indeed a [known
issue](http://bugzilla.openvz.org/show_bug.cgi?id=2206) affecting Java
applications on openVZ 2.6.32 kernels, but there was no bugfix or
workaround for the problem. Comments from other bug-posters were also
less than encouraging.

Having received no answer from the openVZ devs, I decided to investigate
the issue further.
3. Investigation and workaround
-------------------------------
I first tried to reproduce the problem on a local virtual machine by
installing Proxmox 1.9 and creating a container with
[BigBlueButton](opensource:BigBlueButton\_Debian\_Squeeze) in it.

At first I couldn't reproduce the issue locally, even though it occurred
reliably in a test container on a remote server, where any threaded Java
program would fail.

Because it is a futex-related problem, I wondered whether it was due to
SMP, so I added an additional processor to my virtual machine, and this
time I reproduced it.

All our containers in production have the CPUS= parameter set, but for
some reason containers running on 2.6.24 were still seeing all the
host's CPUs even if only 1 was in the configuration file. This seems to
have been corrected in 2.6.32, and this is probably the reason why Java
is now crashing.
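
One way to check how many CPUs a guest actually sees is to count the `processor` entries in `/proc/cpuinfo` from inside it. This is a minimal sketch; the `count_cpus` helper name is an assumption for illustration:

```shell
#!/bin/sh
# count_cpus [FILE]: count the "processor" entries in a cpuinfo-style file.
# FILE defaults to /proc/cpuinfo. Hypothetical helper for illustration.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

echo "CPUs visible here: $(count_cpus)"
```

Comparing the result inside the container with the CPUS= value in its configuration file shows whether the limit is really visible to the guest.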

Java already suffers from very annoying memory issues when running
inside containers, which oblige us to run everything with the `-Xmx`,
`-Xms`, `-XX:MaxPermSize`, etc. parameters, and it seems that even if the
container has only 1 CPU, it tries to use more.

Java does not provide any CPU affinity options, as the process scheduler
is part of the OS. Fortunately openVZ has a very handy setting called
CPUMASK that allows you to force a container to run on only one specific
CPU.

After running `vzctl set XXX --cpumask 0 --save` on my test environment,
the issue disappeared!

A quick test shows that it also works for containers that require
multiple CPUs, like this:

    vzctl set XXX --cpus 2 --cpumask 0 --save

Assigning more than 1 CPU to a container also works around the problem.

I cannot guarantee that it will work for you, but we, at least, have had
no more problems since implementing either of these tricks. We can now
continue our migration testing.
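
To apply the affinity workaround to several containers at once, the command can be wrapped in a loop. This is a minimal sketch, assuming the container IDs are passed as arguments; the `cpumask_commands` helper is hypothetical and only prints the `vzctl` commands rather than running them:

```shell
#!/bin/sh
# cpumask_commands MASK CTID...: print one vzctl command per container ID.
# Nothing is executed here; review the output before running it.
# Hypothetical helper for illustration.
cpumask_commands() {
    mask=$1
    shift
    for ctid in "$@"; do
        echo "vzctl set $ctid --cpumask $mask --save"
    done
}

# Example with two made-up container IDs.
cpumask_commands 0 101 102
```

On a real host the IDs could come from something like `vzlist -H -o ctid`, and the printed commands can be checked by eye before being executed.
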
4. Summary and Conclusion
-------------------------
The issue apparently affects only:

- hosts with several cores/CPUs running any version of the
  2.6.32-openvz kernel (tested with Debian Squeeze, Proxmox 1.9,
  Proxmox 2.0 and a vanilla patched kernel).
- (Debian) guests with only one CPU

Solution/workaround:

- Assign more than 1 CPU to the guest
- Give CPU affinity (`--cpumask`) to the guest
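
The guest-side conditions listed above can be checked against a container configuration file. This is a rough sketch, assuming the file uses `NAME="value"` lines such as `CPUS="1"` and `CPUMASK="0"`; the `conf_value` and `needs_workaround` helpers are made up for illustration, and the host side (several cores, 2.6.32 kernel) is not checked:

```shell
#!/bin/sh
# conf_value FILE NAME: print the value of the last NAME=... line in FILE,
# with double quotes removed (empty if the parameter is not set).
conf_value() {
    grep "^$2=" "$1" | tail -n 1 | cut -d= -f2- | tr -d '"'
}

# needs_workaround FILE: succeed if the container is limited to exactly
# one CPU and has no CPUMASK parameter set.
needs_workaround() {
    [ "$(conf_value "$1" CPUS)" = "1" ] && [ -z "$(conf_value "$1" CPUMASK)" ]
}

# Example with a made-up configuration fragment.
conf=$(mktemp)
printf 'CPUS="1"\n' > "$conf"
if needs_workaround "$conf"; then
    echo "single CPU and no CPUMASK: candidate for the workaround"
fi
rm -f "$conf"
```
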

This was quite tricky to debug, so I hope this might help other people
stuck with the same problem. Unfortunately, once you know what the
solution is, you always find people who [found the
same](http://forum.openvz.org/index.php?t=msg&th=10025&goto=43571&#msg_43571).

In any case it cannot hurt to have more documentation about this :o)