Java and openVZ 2.6.32 - Futex issue
====================================

1. The context
--------------

We are currently in the process of upgrading all our servers.

Thanks to our [HAVEN High Availability
architecture](http://www.personalized-software.ie/Services#Hosting) we
can migrate virtual servers from one host to its slave with virtually no
downtime. So once the new servers were installed, we started testing the
migration of containers.

HAVEN being based on openVZ, we were using kernels from Proxmox as they
backport all fixes from RedHat 6. The only difference between the hosts
was that the new servers had to use the 2.6.32 kernel, whereas the old
ones were deployed with 2.6.24.

2. The issue
------------

During the migration testing we noticed that one of our Java applications
(BigBlueButton) was hanging. It looked as if it was running fine, but it
wasn’t opening its ports and you had to do a `kill -9` to stop it.

Some quick investigation with strace showed that one of BigBlueButton’s
requirements (ActiveMQ) was waiting on one of its children, which was
stuck in an infinite loop:

- strace of the parent process

      strace -p PID
      Process PID attached - interrupt to quit
      futex(0xb77dfbd8, FUTEX_WAIT, FIRST_CHILD_PID, NULL^C <unfinished ...>
      Process PID detached

- Get all child threads (`-L` makes ps print one line per thread)

      ps -efL | grep PID

- strace-ing through the list of children, I found one that was looping
  infinitely:

      strace -p CHILD_PID
      futex(0x998e028, FUTEX_WAKE_PRIVATE, 1) = 0
      gettimeofday({1338478277, 139155}, NULL) = 0
      clock_gettime(CLOCK_REALTIME, {1338478277, 139659879}) = 0
      futex(0x998e044, FUTEX_WAIT_PRIVATE, 1, {0, 999495121}) = -1 ETIMEDOUT (Connection timed out)
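
The manual walk through the children can be scripted. The following is only a minimal sketch: `list_threads` is a hypothetical helper name, and its optional second argument (an alternative proc root) is an assumption added so the function can be tried outside a live system. It relies on Linux exposing each thread of a process as a directory under `/proc/<pid>/task`.

```shell
#!/bin/sh
# Print the thread IDs (TIDs) of a process, one per line.
# Linux exposes every thread of <pid> as a directory under /proc/<pid>/task.
# The optional second argument (an assumption, for testing) overrides /proc.
list_threads() {
    pid="$1"
    proc_root="${2:-/proc}"
    for task in "$proc_root/$pid"/task/*; do
        # If the process does not exist the glob stays unexpanded, so check.
        if [ -d "$task" ]; then
            basename "$task"
        fi
    done
}
```

Each TID can then be passed to `strace -p` in turn, for example `for tid in $(list_threads "$PID"); do timeout 5 strace -p "$tid"; done`, so that a thread spinning on `futex()` stands out from the ones that are simply blocked.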

A quick search on the openVZ bugzilla showed that it was indeed a [known
issue](http://bugzilla.openvz.org/show_bug.cgi?id=2206) affecting Java
applications with openVZ 2.6.32 kernels, BUT there was no bugfix, nor any
workaround for the problem. Comments from other bug-posters were also
less than encouraging.

Receiving no answer from the openVZ devs, I decided to investigate the
issue further.

3. Investigation and workaround
-------------------------------

I first tried to reproduce the problem on a local virtual machine by
installing Proxmox 1.9 and creating a container with
[BigBlueButton](opensource:BigBlueButton\_Debian\_Squeeze) in it.

At first I couldn’t reproduce the issue locally, even though I could
reproduce it on a test container on a remote server, where any threaded
Java program would fail.

Because it is a futex-related problem, I wondered if it was due to SMP, so
I added another processor to my virtual machine, and this time I
reproduced it.

All our containers in production have the CPUS= parameter set, but for
some reason containers running on 2.6.24 were still seeing all the
host’s CPUs even if only 1 was set in the configuration file. This seems
to have been corrected in 2.6.32, and this is probably the reason why
Java is now failing.
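
One way to compare what containers see on each kernel is to count the processor entries in `/proc/cpuinfo` from inside the container. This is a minimal sketch, not the check I used at the time; `count_cpus` is a hypothetical helper name, and it assumes the x86 layout where each logical CPU has a line starting with `processor`.

```shell
# Count the logical CPUs listed in a cpuinfo-style file.
# Defaults to /proc/cpuinfo; a different file can be passed for testing.
# Assumes one line starting with "processor" per logical CPU (x86 layout).
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}
```

From the host, the same count can be obtained without entering the container, for instance with `vzctl exec XXX grep -c '^processor' /proc/cpuinfo`.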

Java already suffers from very annoying memory issues when running
inside containers, which oblige us to run everything with the `-Xmx`,
`-Xms`, `-XX:MaxPermSize` etc. parameters, and it seems that even if the
container has only 1 CPU it tries to use more.
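
For readers who have not met these memory flags, here is a small sketch of how they can be assembled. `jvm_memory_opts` is a hypothetical helper and the sizes are placeholders, not the values we use in production.

```shell
# Build explicit JVM memory options from sizes given in megabytes.
# The sizes are placeholders; -XX:MaxPermSize only applies to JVMs that
# still have a permanent generation (Java 7 and earlier).
jvm_memory_opts() {
    heap_mb="$1"
    permgen_mb="$2"
    echo "-Xms${heap_mb}m -Xmx${heap_mb}m -XX:MaxPermSize=${permgen_mb}m"
}
```

It would be used as `java $(jvm_memory_opts 512 128) -jar app.jar`, where `app.jar` stands for whatever application is being started.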

Java does not provide any CPU affinity options, as the process scheduler
is part of the OS. Fortunately openVZ has a very handy setting called
CPUMASK that allows you to force a container to run on only one specific
CPU.

After trying a `vzctl set XXX --cpumask 0 --save` on my test environment,
the issue disappeared!

A quick test shows that it also works for containers that require
multiple CPUs, like this:

    vzctl set XXX --cpus 2 --cpumask 0 --save

Assigning more than 1 CPU to a container also works around the problem.
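
If many containers need the mask, the command can be generated rather than typed. This is a minimal dry-run sketch: `print_cpumask_cmds` is a hypothetical helper that only prints the `vzctl` commands so they can be reviewed first, and the container IDs in the usage example are made up.

```shell
# Print (do not run) the vzctl command that pins each given container to
# CPU 0, using the same options as the single-container command.
print_cpumask_cmds() {
    for ctid in "$@"; do
        echo "vzctl set $ctid --cpumask 0 --save"
    done
}
```

For example, `print_cpumask_cmds 101 102` prints one command per container, and the output can be piped to `sh` once it looks right. The ID list could come from something like `vzlist -H -o ctid`, assuming that output format on your vzctl version.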

I cannot guarantee that it will work for you, but at least for us we have
had no more problems since implementing either of these tricks. We can
now continue our migration testing.

4. Summary and Conclusion
-------------------------

The issue apparently affects only:

- hosts with several cores/CPUs running any version of the
  2.6.32-openvz kernel (tested with Debian Squeeze, Proxmox 1.9,
  Proxmox 2.0 and a vanilla patched kernel).
- (Debian) guests with only one CPU

Solution/workaround:

- Assign more than 1 CPU to the guest
- Give CPU affinity (`--cpumask`) to the guest

This was quite tricky to debug, so I hope this might help other people
stuck with the same problem. Unfortunately, once you know what the
solution is, you always find people who [found the
same](http://forum.openvz.org/index.php?t=msg&th=10025&goto=43571&#msg_43571).
In any case it cannot hurt to have more documentation about this :o)