Full-text search for Japanese with ngram full-text parser

TL;DR

Create an index with ngram full-text parser.

CREATE FULLTEXT INDEX idx_message_log ON message_log (message) WITH PARSER ngram;

ngram Full-Text Parser

We want to do a full-text search for searching our entire texts that exist about 150k rows. Its text is written in Japanese.

Our table like this;

create table message_log
(
    id int(11) unsigned not null,
    message varchar(255) default '' not null,
    primary key (id)
);

If message column would fill out in English or other space-separated languages, you can create a full-text index.

CREATE FULLTEXT INDEX idx_message_log on message_log.message (message);

However, we treat the message as Japanese. In this case, we cannot get any message, because Japanese is not the text space-separated words.

For instance, Japanese like this;

好きなメンバーとその理由を教えて下さい!

not space-separated;

好きな メンバー と その 理由 を 教えて 下さい!

We assume we have this record.

id message
1 好きなメンバーとその理由を教えて下さい!
2 好きな メンバー と その 理由 を 教えて 下さい!

We find messages with the full-text search function.

SELECT * FROM message_log WHERE MATCH (message) AGAINST ('メンバー');

And then get this result.

id message
2 好きな メンバー と その 理由 を 教えて 下さい!

We expect to can get all records. However, it does not include the text 好きなメンバーとその理由を教えて下さい!.

So we create an index with ngram full-text parser.

CREATE FULLTEXT INDEX idx_message_log ON message_log (message) WITH PARSER ngram;

Again, find messages.

SELECT * FROM message_log WHERE MATCH (message) AGAINST ('メンバー');

And then we get ID 1 and 2 that we expected.

id message
1 好きなメンバーとその理由を教えて下さい!
2 好きな メンバー と その 理由 を 教えて 下さい!

Index Key Prefix Length Limitation on InnoDB

Index Key Prefix Length Limitation on InnoDB

I encountered the problem index key prefix length limitation on InnoDB when I migrate a database. The problem was like this:

Specified key was too long; max key length is 767 bytes

The index key prefix length limit is 767 bytes if you use the version of MySQL either 5.5 or 5.6. Or you use 5.7 the default is 3076 bytes.

innodb_large_prefix is deprecated and will be removed in a future release. Therefore, We need to consider whether we should enable innodb_large_prefix.

And Index key prefixes larger than 767 bytes are silently truncated if you change innodb_large_prefix to OFF from ON.