Java and openVZ 2.6.32 - Futex issue
====================================

1. The context
--------------

We are currently in the process of upgrading all our servers.

Thanks to our [HAVEN High Availability
architecture](http://www.personalized-software.ie/Services#Hosting) we
can migrate virtual servers from one host to its slave with virtually no
downtime. So once the new servers were installed, we started testing the
migration of containers.

HAVEN being based on openVZ, we were using kernels from Proxmox, as they
backport all fixes from RedHat 6. The only difference between the hosts
was that the new servers had to use the 2.6.32 kernel, whereas the old
ones were deployed with 2.6.24.

2. The issue
------------

During the migration testing we noticed that one of our Java
applications (BigBlueButton) was crashing. It looked as if it was
running fine, but it wasn’t opening its ports and you’d have to do a
`kill -9` to stop it.

Some quick investigation with strace showed that one of BigBlueButton’s
requirements (ActiveMQ) was waiting on one of its children, which was
stuck in an infinite loop:

- strace of the parent process:

        strace -p PID
        Process PID attached - interrupt to quit
        futex(0xb77dfbd8, FUTEX_WAIT, FIRST_CHILD_PID, NULL^C <unfinished ...>
        Process PID detached

- Get all child processes:

        ps -efL | grep PID

- strace-ing through the children list, I found one that was looping
  infinitely:

        strace -p CHILD_PID
        futex(0x998e028, FUTEX_WAKE_PRIVATE, 1) = 0
        gettimeofday({1338478277, 139155}, NULL) = 0
        clock_gettime(CLOCK_REALTIME, {1338478277, 139659879}) = 0
        futex(0x998e044, FUTEX_WAIT_PRIVATE, 1, {0, 999495121}) = -1 ETIMEDOUT (Connection timed out)
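
For reference, this syscall pattern is what a timed `Object.wait()`
typically produces in the JVM on Linux: a `clock_gettime()` to compute
the deadline, then a `futex(..., FUTEX_WAIT_PRIVATE, ...)` with the
remaining timeout. A minimal sketch of such a loop (an illustration
only, not the actual ActiveMQ code):

    // TimedWaitLoop.java: a timed wait that re-checks a flag every
    // second. On Linux, each iteration of this loop typically shows up
    // in strace as clock_gettime() followed by
    // futex(addr, FUTEX_WAIT_PRIVATE, val, {0, ...}).
    public class TimedWaitLoop {
        private final Object lock = new Object();
        private volatile boolean done = false;

        public void waitUntilDone() throws InterruptedException {
            synchronized (lock) {
                while (!done) {
                    lock.wait(1000); // wakes up every second to re-check
                }
            }
        }
    }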

A quick search on the openVZ bugzilla showed that it was indeed a [known
issue](http://bugzilla.openvz.org/show_bug.cgi?id=2206) affecting Java
applications with openVZ 2.6.32 kernels, BUT there was no bugfix nor
workaround for the problem. Comments from other bug-posters were also
less than encouraging.

Receiving no answer from the openVZ devs, I decided to investigate the
issue further.

3. Investigation and workaround
-------------------------------

I first tried to reproduce the problem on a local virtual machine by
installing Proxmox 1.9 and creating a container with
[BigBlueButton](opensource:BigBlueButton\_Debian\_Squeeze) in it.

At first I couldn’t reproduce the issue, even though it showed up
reliably in a test container on a remote server, where any threaded
Java program would fail.
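
To illustrate what “any threaded Java program” means here, something as
small as the following (a hypothetical reproducer, not taken from the
bug report) would hang on the affected setup instead of exiting:

    // MinimalThreadTest.java: about the simplest threaded Java program.
    // On an affected container, a program like this would reportedly
    // hang on exit rather than terminate cleanly.
    public class MinimalThreadTest {
        public static void main(String[] args) throws InterruptedException {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    System.out.println("child thread ran");
                }
            });
            t.start();
            t.join(); // wait for the child thread to finish
            System.out.println("main thread done");
        }
    }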

Because it is a futex-related problem, I wondered if it was due to SMP,
so I added a second processor to my virtual machine, and this time I
reproduced it.

All our containers in production have the CPUS= parameter set, but for
some reason containers running on 2.6.24 were still seeing all the
host’s CPUs even if only 1 was in the configuration file. This seems to
have been corrected in 2.6.32, and it is probably the reason why Java
now crashes.

Java already suffers from very annoying memory issues when running
inside containers, which oblige us to run everything with the -Xmx,
-Xms, -XX:MaxPermSize etc. parameters, and it seems that even if the
container has only 1 CPU it tries to use more.
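
A quick way to see what the JVM thinks it has (a small sketch; the
exact numbers will of course depend on your container and JVM):

    // CheckResources.java: print the resources the JVM believes it can
    // use. Inside an openVZ container these values may reflect the host
    // rather than the container's actual limits.
    public class CheckResources {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            System.out.println("CPUs visible to the JVM: "
                    + rt.availableProcessors());
            System.out.println("Max heap in bytes:       "
                    + rt.maxMemory());
        }
    }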

Java does not provide any CPU affinity options, as the process scheduler
is part of the OS. Fortunately openVZ has a very handy setting called
CPUMASK that allows you to force a container to run on only one specific
CPU.

After trying a `vzctl set XXX --cpumask 0 --save` on my test
environment, the issue disappeared!

A quick test showed that it also works for containers that require
multiple CPUs, like this:

    vzctl set XXX --cpus 2 --cpumask 0 --save

Alternatively, assigning more than 1 CPU to a container also works
around the problem.

I cannot guarantee that it will work for you, but at least for us there
have been no more problems since implementing either of these tricks. We
can now continue our migration testing.

4. Summary and Conclusion
-------------------------

The issue apparently affects only:

- hosts with several cores/CPUs running any version of the
  2.6.32-openvz kernel (tested with Debian Squeeze, Proxmox 1.9,
  Proxmox 2.0 and a vanilla patched kernel)
- (Debian) guests with only one CPU

Solution/workaround:

- assign more than 1 CPU to the guest
- give CPU affinity (`--cpumask`) to the guest

This was quite tricky to debug, so I hope this might help other people
stuck with the same problem. Unfortunately, once you know what the
solution is, you always find people who [found the
same](http://forum.openvz.org/index.php?t=msg&th=10025&goto=43571&#msg_43571).
In any case it cannot hurt to have more documentation about this :o)