TL;DR

  • With a bit of kernel tuning, I was able to get up to 523k connections open simultaneously from 4 client boxes to 1 MegaComet server.
  • Memory and CPU usage were minimal (128M across the server’s processes, maybe 24% CPU at 4000 connections/sec).
  • Next time I’ll check /var/log/kern.log to see how the kernel tuning can be improved to reach 1M.
  • Libev basically runs on the smell of an oily rag.

Setup

I started 5 EC2 Large 64-bit servers using the Amazon Linux image ‘ami-221fec4b’ (aka amzn-ami-2011.02.1.x86_64). One of these was the MegaComet server, and the other 4 were clients, each trying to open 250k connections. These are vanilla EC2 Large instances, with the following kernel tuning (credit to the metabrew article):

Tuning

The following increases the user limit for number of open file descriptors (TCP connections are file descriptors):

echo "* soft nofile 1048576" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 1048576" | sudo tee -a /etc/security/limits.conf

After the above is done, you have to log out and back in again.
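A quick way to confirm the new limit took effect after logging back in is sketched below. The helper name check_nofile is my own and not part of MegaComet; it just compares the shell's soft limit against the 1048576 value set above.

```shell
# check_nofile WANT [HAVE]: succeed if the limit HAVE (default: this shell's
# soft open-file limit) is at least WANT. "unlimited" always passes.
# (check_nofile is a hypothetical helper name, not part of MegaComet.)
check_nofile() {
    want=$1
    have=${2:-$(ulimit -Sn)}
    [ "$have" = "unlimited" ] || [ "$have" -ge "$want" ]
}

if check_nofile 1048576; then
    echo "nofile limit OK: $(ulimit -Sn)"
else
    echo "nofile limit too low: $(ulimit -Sn) (did you log out and back in?)"
fi
```

If it reports a low value such as 1024, the limits.conf change has probably not been picked up by the current login session yet.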

To tune the kernel to allow 1M connections, the following was appended to the /etc/sysctl.conf:

# Settings from http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3
# Config needed to have enough tcp stack memory:
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 16384 33554432
net.ipv4.tcp_wmem = 4096 16384 33554432
net.ipv4.tcp_mem = 786432 1048576 26777216
net.ipv4.tcp_max_tw_buckets = 360000
net.core.netdev_max_backlog = 2500
vm.min_free_kbytes = 65536
vm.swappiness = 0
# This is for the outgoing connections max:
net.ipv4.ip_local_port_range = 1024 65535
# I added this to set the system wide file max:
fs.file-max = 1100000  
# Reduce the time sockets stay in time_wait: http://forums.theplanet.com/lofiversion/index.php/t62399.html
net.ipv4.tcp_fin_timeout = 12

To apply it, run: sudo sysctl -p

I believe this tuning still needs work. Next time I run the tests I’ll check the kernel log to see if anything in the TCP stack has maxed out.
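One thing worth keeping in mind when revisiting these numbers: the three net.ipv4.tcp_mem values are counted in memory pages, not bytes. The sketch below converts them to MiB; pages_to_mib is a helper name I made up for illustration, and the page size is read from getconf unless passed explicitly.

```shell
# pages_to_mib PAGES [PAGE_SIZE]: convert a page count to whole MiB.
# (pages_to_mib is a hypothetical helper; PAGE_SIZE defaults to this
# machine's page size as reported by getconf.)
pages_to_mib() {
    pages=$1
    page_size=${2:-$(getconf PAGESIZE)}
    echo $(( pages * page_size / 1048576 ))
}

# The min / pressure / max values from the sysctl.conf above.
for pages in 786432 1048576 26777216; do
    echo "$pages pages = $(pages_to_mib "$pages") MiB"
done
```

Assuming 4 KiB pages, that works out to 3072 MiB, 4096 MiB and roughly 104598 MiB, so the top value is far more than the RAM on these instances. Running sysctl net.ipv4.tcp_mem shows the values currently in effect.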

Steps

To reproduce my tests, you can follow the steps used to configure the vanilla instances:

# Install compiler / tools
sudo yum -y install 'gcc*' 'git*' make

# Install libev
wget http://dist.schmorp.de/libev/libev-4.04.tar.gz
tar -zxvf libev-4.04.tar.gz
cd libev-4.04
./configure && make && sudo make install
sudo sh -c "echo /usr/local/lib > /etc/ld.so.conf.d/usr-local-lib.conf"
sudo ldconfig

# Install MC
cd ~
git clone git://github.com/chrishulbert/MegaComet.git
cd MegaComet

# Now do the kernel tuning as mentioned above

# To run the server:
cd ~/MegaComet
make
./start

# To run the clients:
cd ~/MegaComet/testing
make
./megatest X Y
# X is a, b, c or d, depending on which testing server this is
# Y is the IP address of the comet server

Results

The clients got up to 142k, 144k, 105k, and 132k connections respectively before attempts to open new connections timed out. That is a total of 523k connections, just over half a million! The RAM and CPU usage on the server was minimal throughout the test. Here’s a screenshot of top while the tests were running at approx 4000 new connections/second, to give an idea of CPU and memory usage:

top - 11:03:28 up  1:12,  2 users,  load average: 0.25, 0.58, 0.48
Tasks:  77 total,   2 running,  75 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us,  2.7%sy,  0.0%ni, 95.1%id,  0.0%wa,  0.1%hi,  1.1%si,  0.2%st
Mem:   7652552k total,  1441076k used,  6211476k free,    22144k buffers
Swap:        0k total,        0k used,        0k free,   823848k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                        
22612 ec2-user  20   0  8664  340  264 S  0.0  0.0   0:00.00 megamanager                                                             
22614 ec2-user  20   0 25552  17m  460 S  4.7  0.2   0:08.22 megacomet                                                               
22615 ec2-user  20   0 25552  17m  460 S  5.0  0.2   0:08.08 megacomet                                                               
22616 ec2-user  20   0 25556  17m  460 S  5.0  0.2   0:08.16 megacomet                                                               
22617 ec2-user  20   0 25556  17m  460 S  5.0  0.2   0:08.28 megacomet                                                               
22618 ec2-user  20   0 25580  17m  460 R  5.0  0.2   0:08.01 megacomet                                                               
22619 ec2-user  20   0 25556  17m  460 S  5.0  0.2   0:08.38 megacomet                                                               
22620 ec2-user  20   0 25556  17m  460 S  4.7  0.2   0:08.23 megacomet                                                               
22621 ec2-user  20   0 25552  17m  460 S  4.7  0.2   0:08.16 megacomet

I forgot to grab a top screenshot once all the connections were open, but the memory usage was no different, and CPU was zero.
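For a future run, one way to record the socket count on the server alongside top would be to read /proc/net/sockstat. This is only a sketch and was not part of the test above; tcp_inuse is a name I picked, and the "inuse" figure covers all TCP sockets on the box (including listening ones), so it is an approximation of the open connection count.

```shell
# tcp_inuse [FILE]: print the "inuse" TCP socket count from a sockstat file.
# (tcp_inuse is a hypothetical helper; FILE defaults to /proc/net/sockstat.)
tcp_inuse() {
    awk '$1 == "TCP:" {
        for (i = 2; i < NF; i += 2)
            if ($i == "inuse") print $(i + 1)
    }' "${1:-/proc/net/sockstat}"
}

# Only query the live file where it exists (Linux).
if [ -r /proc/net/sockstat ]; then
    echo "TCP sockets in use: $(tcp_inuse)"
fi
```

Running it in a loop with a sleep would give a rough connections-over-time log to compare against what the clients report.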

Conclusions

I really can’t believe the CPU and RAM usage are so small when the half a million connections are live and idle! At this stage, I’m not really testing for performance when passing messages around. I hope to get to 1M (static) open connections, and then start testing messaging. I’m optimistic: it looks promising. Next time I try this, I’ll keep a close eye on the kernel log (/var/log/kern.log) and see if I can find any bottlenecks.
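As a starting point for that kernel log check, something like the following could filter for messages that usually indicate a limit being hit. The function name and the list of patterns are my own guesses at what to look for (TCP stack warnings, socket memory exhaustion, the file-max limit, a full conntrack table, SYN flood notices), not a complete list.

```shell
# kernel_tcp_warnings [FILE]: print kernel log lines that hint at a network
# or file-descriptor limit being reached. Pass "-" to read from stdin.
# (Hypothetical helper; the patterns are an assumed, incomplete list.)
kernel_tcp_warnings() {
    grep -i -E 'TCP:|out of socket memory|file-max limit|nf_conntrack: table full|SYN flooding' \
        "${1:-/var/log/kern.log}"
}

log=/var/log/kern.log
if [ -r "$log" ]; then
    # Show the most recent matches only.
    kernel_tcp_warnings "$log" | tail -n 20
fi
```

If the image does not have /var/log/kern.log, piping the kernel ring buffer in should work as well: dmesg | kernel_tcp_warnings -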

References

http://www.metabrew.com/article/a-million-user-comet-application-with-mochiwe...
http://www.cs.wisc.edu/condor/condorg/linux_scalability.html

Thanks for reading! And if you want to get in touch, I'd love to hear from you: chris.hulbert at gmail.