Full-text search for Japanese with ngram full-text parser

TL;DR

Create an index with the ngram full-text parser:

CREATE FULLTEXT INDEX idx_message_log ON message_log (message) WITH PARSER ngram;

ngram Full-Text Parser

We want to run a full-text search over all of our texts, about 150k rows, which are written in Japanese.

Our table looks like this:

create table message_log
(
    id int(11) unsigned not null,
    message varchar(255) default '' not null,
    primary key (id)
);

If the message column contained English or another space-separated language, you could simply create a full-text index:

CREATE FULLTEXT INDEX idx_message_log ON message_log (message);

However, our messages are written in Japanese. In this case, we cannot find the messages we want, because Japanese text is not separated into words by spaces.

For instance, Japanese text looks like this:

好きなメンバーとその理由を教えて下さい!

and not space-separated like this:

好きな メンバー と その 理由 を 教えて 下さい!

Assume we have these records:

id message
1 好きなメンバーとその理由を教えて下さい!
2 好きな メンバー と その 理由 を 教えて 下さい!

We search for messages with the full-text search function:

SELECT * FROM message_log WHERE MATCH (message) AGAINST ('メンバー');

And then we get this result:

id message
2 好きな メンバー と その 理由 を 教えて 下さい!

We expected to get both records. However, the result does not include the text 好きなメンバーとその理由を教えて下さい!.

So we create an index with the ngram full-text parser:

CREATE FULLTEXT INDEX idx_message_log ON message_log (message) WITH PARSER ngram;

Again, we search for messages:

SELECT * FROM message_log WHERE MATCH (message) AGAINST ('メンバー');

And then we get IDs 1 and 2, as we expected:

id message
1 好きなメンバーとその理由を教えて下さい!
2 好きな メンバー と その 理由 を 教えて 下さい!
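
For reference, the ngram parser (available in MySQL 5.7.6 and later) splits text into contiguous sequences of n characters, and the token size is controlled by the ngram_token_size system variable, whose default is 2. With bigrams, メンバー is indexed as メン, ンバ, バー, which is why the search term now matches the non-separated text as well. A quick way to check the token size:

SHOW VARIABLES LIKE 'ngram_token_size';
-- ngram_token_size is read-only at runtime; set it at server startup
-- (e.g. in my.cnf) if you need a different size.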

How to find bottlenecks in an HTTP server and MySQL

I participated in ISUCON for the first time.

I learned how to find bottlenecks in an HTTP server and MySQL. Our team used alp for HTTP response analysis and pt-query-digest for MySQL slow query analysis.

I would like to introduce these tools, which were useful in our situation and might also help other teams (or in other situations).

How to find the bottlenecks of web APIs in a web application

We need to analyze the API access log generated by Nginx in order to decrease the APIs' response times.

So I used alp to analyze the access log.

Download the alp binary for your architecture (usually amd64 if you use Linux) from the GitHub releases page:

$ wget https://github.com/tkuchiki/alp/releases/download/v0.3.1/alp_linux_amd64.zip
$ unzip alp_linux_amd64.zip

Change the Nginx log format to LTSV

$ sudo vim /etc/nginx/nginx.conf
log_format ltsv "time:$time_local"
                "\thost:$remote_addr"
                "\tforwardedfor:$http_x_forwarded_for"
                "\treq:$request"
                "\tstatus:$status"
                "\tmethod:$request_method"
                "\turi:$request_uri"
                "\tsize:$body_bytes_sent"
                "\treferer:$http_referer"
                "\tua:$http_user_agent"
                "\treqtime:$request_time"
                "\tcache:$upstream_http_x_cache"
                "\truntime:$upstream_http_x_runtime"
                "\tapptime:$upstream_response_time"
                "\tvhost:$host";

access_log /var/log/nginx/access.log ltsv;

Analyze Nginx log

$ ./alp -f /var/log/nginx/access.log

The analyzed result looks like this:

+-------+-------+-------+---------+-------+-------+-------+-------+--------+------------+------------+-------------+------------+--------+-----------------------------------------------------+
| COUNT |  MIN  |  MAX  |   SUM   |  AVG  |  P1   |  P50  |  P99  | STDDEV | MIN(BODY)  | MAX(BODY)  |  SUM(BODY)  | AVG(BODY)  | METHOD |                         URI                         |
+-------+-------+-------+---------+-------+-------+-------+-------+--------+------------+------------+-------------+------------+--------+-----------------------------------------------------+
|    12 | 0.000 | 0.000 |   0.000 | 0.000 | 0.000 | 0.000 | 0.000 |  0.000 |    318.000 |    318.000 |    3816.000 |    318.000 | GET    | /favicon.ico                                        |
|     1 | 0.000 | 0.000 |   0.000 | 0.000 | 0.000 | 0.000 | 0.000 |  0.000 |   4162.000 |   4162.000 |    4162.000 |   4162.000 | GET    | /js/chat.js                                         |
|     6 | 0.770 | 1.932 |   7.167 | 1.194 | 0.770 | 0.988 | 1.628 |  0.432 | 327999.000 | 327999.000 | 1967994.000 | 327999.000 | GET    | /icons/b1c8c5bc9b026507ae62e6bf62cf55f70f4ac3d8.png |
|     7 | 0.212 | 1.967 |   7.447 | 1.064 | 0.212 | 0.998 | 1.501 |  0.536 | 105173.000 | 105173.000 |  736211.000 | 105173.000 | GET    | /icons/e58ed7194ed4a99d9d6ada9e3383983edcbd1edc.png |
|   301 | 0.099 | 3.752 | 473.578 | 1.573 | 0.549 | 1.507 | 2.921 |  0.407 |      0.000 |   3005.000 |  885578.000 |   2942.120 | GET    | /fetch                                              |
+-------+-------+-------+---------+-------+-------+-------+-------+--------+------------+------------+-------------+------------+--------+-----------------------------------------------------+

This table lists the API response times in ascending order.

We started improving from the worst response times. In this case, we should start with /fetch and /icons.

How to find bottleneck queries in MySQL

To find bottleneck SQL queries, we used pt-query-digest.

Install Percona Toolkit

See this document.

$ wget https://repo.percona.com/apt/percona-release_0.1-4.$(lsb_release -sc)_all.deb

$ sudo dpkg -i percona-release_0.1-4.$(lsb_release -sc)_all.deb

$ sudo apt-get update

$ sudo apt-cache search percona

$ sudo apt-get install percona-toolkit
$ sudo vim /etc/mysql/my.cnf
# Enable slow query log
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow_query.log
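
If you would rather not restart mysqld, the same settings can also be changed at runtime. This is a sketch; it needs a user with the SUPER privilege, and setting long_query_time to 0 is an extra step, not part of the my.cnf above, to capture every query while the benchmark runs:

SET GLOBAL slow_query_log = 1;
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow_query.log';
-- Optional: log every query, not only slow ones, while benchmarking.
-- The new value applies to connections opened after this point.
SET GLOBAL long_query_time = 0;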

Analyze slow query log

$ pt-query-digest /var/log/mysql/slow_query.log

The analyzed result looks like this:

# 660ms user time, 30ms system time, 39.37M rss, 121.42M vsz
# Current date: Sat Oct 21 14:54:35 2017
# Hostname: app1053
# Files: /var/log/mysql/slow-query.log
# Overall: 1.50k total, 23 unique, 2.73 QPS, 0.01x concurrency ___________
# Time range: 2017-10-21T05:45:19 to 2017-10-21T05:54:29
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time             3s     1us   608ms     2ms     5ms    20ms   103us
# Lock time           78ms       0     7ms    51us   131us   174us    36us
# Rows sent          1.96k       0     100    1.33    0.99    8.72       0
# Rows examine       2.91M       0   9.77k   1.98k   9.33k   3.74k       0
# Query size        90.88k      20     202   61.88  136.99   25.68   62.76

# Profile
# Rank Query ID           Response time Calls R/Call V/M   Item
# ==== ================== ============= ===== ====== ===== ==============
#    1 0x16A13978F3B7FCFA  1.7328 55.1%     6 0.2888  0.06 SELECT image
#    2 0x4866C96BA295C56E  0.7782 24.7%   301 0.0026  0.00 SELECT message
#    3 0xEBDB62606B914A34  0.4282 13.6%    90 0.0048  0.00 INSERT channel
#    4 0x91283C307570F439  0.0334  1.1%     4 0.0083  0.00 SELECT message
#    5 0x41883793B68D128E  0.0302  1.0%   100 0.0003  0.00 SELECT user
# MISC 0xMISC              0.1427  4.5%  1003 0.0001   0.0 <18 ITEMS>

# Query 1: 6 QPS, 1.73x concurrency, ID 0x16A13978F3B7FCFA at byte 190178
# Scores: V/M = 0.06
# Time range: 2017-10-21T05:46:17 to 2017-10-21T05:46:18
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0       6
# Exec time     55      2s   203ms   608ms   289ms   580ms   136ms   242ms
# Lock time      1   840us   109us   193us   140us   185us    29us   152us
# Rows sent      5     103      10      23   17.17   22.53    4.06   17.65
# Rows examine   0   5.87k    1001    1001    1001    1001       0    1001
# Query size     0     474      79      79      79      79       0      79
# String:
# Databases    isubata
# Hosts        app1051
# Users        isucon
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms  ################################################################
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `isubata` LIKE 'image'\G
#    SHOW CREATE TABLE `isubata`.`image`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT * FROM image WHERE name = '78a9228a393eb2621f346fc6a5e099d5bc373f76.png'\G

# Query 2: 1.08 QPS, 0.00x concurrency, ID 0x4866C96BA295C56E at byte 177211
# Scores: V/M = 0.00
# Time range: 2017-10-21T05:46:16 to 2017-10-21T05:50:54
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         20     301
# Exec time     24   778ms     2ms     9ms     3ms     5ms   966us     2ms
# Lock time     15    12ms    30us   208us    39us    66us    15us    33us
# Rows sent     15     301       1       1       1       1       0       1
# Rows examine  98   2.87M   9.77k   9.77k   9.77k   9.77k       0   9.77k
# Query size    19  17.58k      58      60   59.81   59.77    0.85   59.77
# String:
# Databases    isubata
# Hosts        app1052 (298/99%), app1051 (3/0%)
# Users        isucon
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms  ################################################################
#  10ms
# 100ms
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `isubata` LIKE 'message'\G
#    SHOW CREATE TABLE `isubata`.`message`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT COUNT(*) as cnt FROM message WHERE channel_id = '733'\G

This table lists the query times in descending order.

In this case, we should start by improving the slowest query, SELECT * FROM image WHERE name = '78a9228a393eb2621f346fc6a5e099d5bc373f76.png'. One way to improve it is to cache the images on Nginx.
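
Another option, if the schema allows it, is to add indexes for the columns used in the WHERE clauses above. A sketch, assuming image.name and message.channel_id are not indexed yet (the index names here are made up):

-- Look up images by name without scanning the whole table.
ALTER TABLE image ADD INDEX idx_image_name (name);

-- Count messages per channel without examining ~9.77k rows per query.
ALTER TABLE message ADD INDEX idx_message_channel_id (channel_id);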

Index Key Prefix Length Limitation on InnoDB

I encountered the index key prefix length limitation on InnoDB when I was migrating a database. The error was like this:

Specified key was too long; max key length is 767 bytes

The index key prefix length limit is 767 bytes if you use MySQL 5.5 or 5.6. If you use 5.7, the default limit is 3072 bytes.
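
As a sketch of how the error happens, assume a hypothetical table that stores utf8mb4 text: each utf8mb4 character can take up to 4 bytes, so a full index on VARCHAR(255) needs up to 255 * 4 = 1020 bytes, which exceeds the 767-byte limit.

create table user_email
(
    id int(11) unsigned not null,
    email varchar(255) not null,
    primary key (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Depending on the SQL mode, this fails with the error above
-- or the key is silently truncated to a 767-byte prefix:
CREATE INDEX idx_email ON user_email (email);

-- One workaround is a prefix index that fits within the limit
-- (191 characters * 4 bytes = 764 bytes <= 767 bytes):
CREATE INDEX idx_email ON user_email (email(191));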

innodb_large_prefix is deprecated and will be removed in a future release. Therefore, we need to consider carefully whether we should enable innodb_large_prefix.

Also, index key prefixes larger than 767 bytes are silently truncated if you change innodb_large_prefix from ON to OFF.
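
If you do decide to raise the limit to 3072 bytes on MySQL 5.6, this is roughly what is involved (a sketch; on 5.7 these settings are already the defaults, and the hypothetical user_email table from the example above would also need to be rebuilt):

-- innodb_file_per_table must also be enabled (it is the default in 5.6).
SET GLOBAL innodb_file_format = 'Barracuda';
SET GLOBAL innodb_large_prefix = ON;

-- Large index key prefixes also require the DYNAMIC or COMPRESSED row format.
ALTER TABLE user_email ROW_FORMAT=DYNAMIC;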