How to find out bottlenecks in HTTP server and MySQL

I participated in ISUCON for the first time.

I acquired some knowledge to find out bottleneck in HTTP server and MySQL. Our team used alp for HTTP response analysis and pt-query-digest for MySQL slow query analysis.

I try to introduce these tools that were useful in our situation or might also other teams(or situations).

How to find the bottleneck of web APIs on web application

We need to analyze access log of API generated with Nginx to decrease API’s response time.

So, I used alp to analyze access log.

Download alp binary proper your architecture, usually amd64 if you use Linux, from here

$ wget https://github.com/tkuchiki/alp/releases/download/v0.3.1/alp_linux_amd64.zip
$ unzip alp_linux_amd64.zip

Change Nginx log formart to LTSV format

$ sudo vim /etc/nginx/nginx.conf
log_format ltsv "time:$time_local"
                "\thost:$remote_addr"
                "\tforwardedfor:$http_x_forwarded_for"
                "\treq:$request"
                "\tstatus:$status"
                "\tmethod:$request_method"
                "\turi:$request_uri"
                "\tsize:$body_bytes_sent"
                "\treferer:$http_referer"
                "\tua:$http_user_agent"
                "\treqtime:$request_time"
                "\tcache:$upstream_http_x_cache"
                "\truntime:$upstream_http_x_runtime"
                "\tapptime:$upstream_response_time"
                "\tvhost:$host";

access.log ltsv;

Analyze Nginx log

$ ./alp -f /var/log/nginx/access.log

The list analyzed is like this;

+-------+-------+-------+---------+-------+-------+-------+-------+--------+------------+------------+-------------+------------+--------+-----------------------------------------------------+
| COUNT |  MIN  |  MAX  |   SUM   |  AVG  |  P1   |  P50  |  P99  | STDDEV | MIN(BODY)  | MAX(BODY)  |  SUM(BODY)  | AVG(BODY)  | METHOD |                         URI                         |
+-------+-------+-------+---------+-------+-------+-------+-------+--------+------------+------------+-------------+------------+--------+-----------------------------------------------------+
|    12 | 0.000 | 0.000 |   0.000 | 0.000 | 0.000 | 0.000 | 0.000 |  0.000 |    318.000 |    318.000 |    3816.000 |    318.000 | GET    | /favicon.ico                                        |
|     1 | 0.000 | 0.000 |   0.000 | 0.000 | 0.000 | 0.000 | 0.000 |  0.000 |   4162.000 |   4162.000 |    4162.000 |   4162.000 | GET    | /js/chat.js                                         |
|     6 | 0.770 | 1.932 |   7.167 | 1.194 | 0.770 | 0.988 | 1.628 |  0.432 | 327999.000 | 327999.000 | 1967994.000 | 327999.000 | GET    | /icons/b1c8c5bc9b026507ae62e6bf62cf55f70f4ac3d8.png |
|     7 | 0.212 | 1.967 |   7.447 | 1.064 | 0.212 | 0.998 | 1.501 |  0.536 | 105173.000 | 105173.000 |  736211.000 | 105173.000 | GET    | /icons/e58ed7194ed4a99d9d6ada9e3383983edcbd1edc.png |
|   301 | 0.099 | 3.752 | 473.578 | 1.573 | 0.549 | 1.507 | 2.921 |  0.407 |      0.000 |   3005.000 |  885578.000 |   2942.120 | GET    | /fetch                                              |
+-------+-------+-------+---------+-------+-------+-------+-------+--------+------------+------------+-------------+------------+--------+-----------------------------------------------------+

This table shows the API response times that are listed in ascending order of time.

We started to improve the response time on worst. This case, we should start at /fetch and /icon.

How to find bottleneck queries on MySQL

To find bottleneck of SQL queries, we used pt-query-digest.

Install percona toolkit

See this document.

$ wget https://repo.percona.com/apt/percona-release_0.1-4.$(lsb_release -sc)_all.deb

$ sudo dpkg -i percona-release_0.1-4.$(lsb_release -sc)_all.deb

$ sudo apt-get update

$ sudo apt-cache search percona

$ sudo apt-get install percona-toolkit
$ sudo vim /etc/mysql/my.cnf
# Enable slow query log
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow_query.log

Analyze slow query log

$ pt-query-digest /var/log/mysql/slow_query.log

The list analyzed is like this;

# 660ms user time, 30ms system time, 39.37M rss, 121.42M vsz
# Current date: Sat Oct 21 14:54:35 2017
# Hostname: app1053
# Files: /var/log/mysql/slow-query.log
# Overall: 1.50k total, 23 unique, 2.73 QPS, 0.01x concurrency ___________
# Time range: 2017-10-21T05:45:19 to 2017-10-21T05:54:29
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time             3s     1us   608ms     2ms     5ms    20ms   103us
# Lock time           78ms       0     7ms    51us   131us   174us    36us
# Rows sent          1.96k       0     100    1.33    0.99    8.72       0
# Rows examine       2.91M       0   9.77k   1.98k   9.33k   3.74k       0
# Query size        90.88k      20     202   61.88  136.99   25.68   62.76

# Profile
# Rank Query ID           Response time Calls R/Call V/M   Item
# ==== ================== ============= ===== ====== ===== ==============
#    1 0x16A13978F3B7FCFA  1.7328 55.1%     6 0.2888  0.06 SELECT image
#    2 0x4866C96BA295C56E  0.7782 24.7%   301 0.0026  0.00 SELECT message
#    3 0xEBDB62606B914A34  0.4282 13.6%    90 0.0048  0.00 INSERT channel
#    4 0x91283C307570F439  0.0334  1.1%     4 0.0083  0.00 SELECT message
#    5 0x41883793B68D128E  0.0302  1.0%   100 0.0003  0.00 SELECT user
# MISC 0xMISC              0.1427  4.5%  1003 0.0001   0.0 <18 ITEMS>

# Query 1: 6 QPS, 1.73x concurrency, ID 0x16A13978F3B7FCFA at byte 190178
# Scores: V/M = 0.06
# Time range: 2017-10-21T05:46:17 to 2017-10-21T05:46:18
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0       6
# Exec time     55      2s   203ms   608ms   289ms   580ms   136ms   242ms
# Lock time      1   840us   109us   193us   140us   185us    29us   152us
# Rows sent      5     103      10      23   17.17   22.53    4.06   17.65
# Rows examine   0   5.87k    1001    1001    1001    1001       0    1001
# Query size     0     474      79      79      79      79       0      79
# String:
# Databases    isubata
# Hosts        app1051
# Users        isucon
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms  ################################################################
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `isubata` LIKE 'image'\G
#    SHOW CREATE TABLE `isubata`.`image`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT * FROM image WHERE name = '78a9228a393eb2621f346fc6a5e099d5bc373f76.png'\G

# Query 2: 1.08 QPS, 0.00x concurrency, ID 0x4866C96BA295C56E at byte 177211
# Scores: V/M = 0.00
# Time range: 2017-10-21T05:46:16 to 2017-10-21T05:50:54
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         20     301
# Exec time     24   778ms     2ms     9ms     3ms     5ms   966us     2ms
# Lock time     15    12ms    30us   208us    39us    66us    15us    33us
# Rows sent     15     301       1       1       1       1       0       1
# Rows examine  98   2.87M   9.77k   9.77k   9.77k   9.77k       0   9.77k
# Query size    19  17.58k      58      60   59.81   59.77    0.85   59.77
# String:
# Databases    isubata
# Hosts        app1052 (298/99%), app1051 (3/0%)
# Users        isucon
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms  ################################################################
#  10ms
# 100ms
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `isubata` LIKE 'message'\G
#    SHOW CREATE TABLE `isubata`.`message`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT COUNT(*) as cnt FROM message WHERE channel_id = '733'\G

This table shows the queries time are listed in desacending order of time.

In this case, we should start to improve the queryt time that is SELECT * FROM image WHERE name = '78a9228a393eb2621f346fc6a5e099d5bc373f76.png'\G. For example to improve it, use caching on Nginx.