これは

PostgreSQLで文字列型のカラムに 2バイト文字 が含まれるかを調査する方法です。まさかこんな調査が必要になるとは思っていなかったのですが、必要になったのでやりました。

これは

TL;DR

以下をやるだけです。

charの場合

以下のようなクエリで 0なら２バイト無し です。

select count(*) from 対象テーブル
where octet_length(対象カラム) > 対象カラムの桁数;

varcharの場合

以下のようなクエリで 0なら２バイト無し です。

select count(*) from 対象テーブル
where octet_length(対象カラム) <> char_length(対象カラム);

検証

テストテーブルで試して問題なかったので、100万件を超える本番テーブルでやってもきっちり2バイトあり/なしを切り分けられました。

以下は検証でやったテストテーブル作成とクエリでの参照手順です。

テストテーブルの作成＆テストデータのInsert

難しいことは考えずに、適当なテーブルを作ります。環境構築がめんどいのでDockerでやりましょう。

docker pullで公式イメージをDL（バージョン指定がしたければお好みで）

# docker pull postgres:10

10: Pulling from library/postgres
0a4690c5d889: Already exists
723861590717: Already exists
db019468bdf4: Already exists
91cb81a60371: Already exists
a2a4ab07588d: Already exists
a7ccdc2a5f31: Already exists
93687df2bb93: Already exists
f00839cf3313: Already exists
722253fa84b2: Pull complete
5fc834e81b3c: Pull complete
b5bfe36fd72d: Pull complete
b2733b8e24de: Pull complete
9f032a1876d3: Pull complete
5f1457b9fa28: Pull complete
Digest: sha256:70aadcd012d4d59d2ffd4fadcce4aaf1308b0eaee618536565158caf58d33f6a
Status: Downloaded newer image for postgres:10

pullしたらrunします。Docker HubのコマンドそのままでOKです。

docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres

docker psするとpostgresの起動を確認できます。簡単！

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
1cec48407fbf        postgres            "docker-entrypoint.s…"   7 minutes ago       Up 7 minutes        5432/tcp            some-postgres
$

起動したpostgresさんにログインします。

docker exec -it some-postgres /bin/bash

root@1cec48407fbf:/# <-コンテナの中でプロンプトがでる

コンテナ内でpsql打っていきましょう。postgresユーザから始めるのを忘れすに。

su - postgres

psql

postgres=# が出たら準備OK

あとはテスト用のクエリを打っていきます。

create table morihaya (hoge char(10) , fuga varchar(10));
insert into morihaya (hoge , fuga) values ('a','a');
insert into morihaya (hoge , fuga) values ('aa','aa');
insert into morihaya (hoge , fuga) values ('あ','あ');
insert into morihaya (hoge , fuga) values ('ああ','ああ');
insert into morihaya (hoge , fuga) values ('あa','あa');
insert into morihaya (hoge , fuga) values ('漢字','漢字');


問題なくスイスイINSERTされるはずです。
postgres=# create table morihaya (hoge char(10) , fuga varchar(10));
CREATE TABLE
postgres=#
postgres=# insert into morihaya (hoge , fuga) values ('a','a');
INSERT 0 1
postgres=# insert into morihaya (hoge , fuga) values ('aa','aa');
INSERT 0 1
postgres=# insert into morihaya (hoge , fuga) values ('あ','あ');
INSERT 0 1
postgres=# insert into morihaya (hoge , fuga) values ('ああ','ああ');
INSERT 0 1
postgres=# insert into morihaya (hoge , fuga) values ('あa','あa');
INSERT 0 1
postgres=# insert into morihaya (hoge , fuga) values ('漢字','漢字');
INSERT 0 1
postgres=#

ここまでで、検証の準備は整ったはずです。

char

char型の検証をしていきます。

select length(hoge) , char_length(hoge) , octet_length(hoge) from morihaya;

以下のような結果になりました。

hoge列はchar(10)で作成している
3行目以降から2バイト文字を挿入している

これらの条件からもうお分かりですね？ char(x)で宣言したx以上のoctet_lengthが戻る列が2バイト文字 ということが分かります。

postgres=# select length(hoge) , char_length(hoge) , octet_length(hoge) from morihaya;
 length | char_length | octet_length
--------+-------------+--------------
      1 |           1 |           10
      2 |           2 |           10
      1 |           1 |           12
      2 |           2 |           14
      2 |           2 |           12
      2 |           2 |           14
(6 rows)

つまり、char(x)より上のoctet_lengthが1件でも含まれていれば、それは2バイト文字を含む列であると言えるのです。よって以下のSQLでchar列の2バイト文字の有無を確認できます。

select count(*) from 対象テーブル
where octet_length(対象カラム) > 対象カラムの桁数;

varchar

一方でvarcharはどうでしょうか。

select length(fuga) , char_length(fuga) , octet_length(fuga) from morihaya;

以下のような結果になりました。

postgres=# select length(fuga) , char_length(fuga) , octet_length(fuga) from morihaya;
 length | char_length | octet_length
--------+-------------+--------------
      1 |           1 |            1
      2 |           2 |            2
      1 |           1 |            3
      2 |           2 |            6
      2 |           2 |            4
      2 |           2 |            6
(6 rows)

一見すると法則性が無いように見えますが、そうではありません。

2行目までは英数字のみ(2バイト文字はない)
3行目以降は2バイト文字を含む
char_lengthとoctet_lengthが一致したりしなかったり？

そう、2バイト文字がない場合は char_length と octet_length が一致 します。これを利用して以下のクエリで2バイト文字の有無を調査することができます。

select count(*) from 対象テーブル
where octet_length(対象カラム) <> char_length(対象カラム);

余談

dockerあるあるですが、使い終わったコンテナは潔くGood Byしましょう。Let's immutableですし、Not Pets But Cattleです。

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
1cec48407fbf        postgres            "docker-entrypoint.s…"   32 minutes ago      Up 32 minutes       5432/tcp            some-postgres
$
$
$ docker stop some-postgres
docker stop some-postgres
$ docker ps

参考

ひしだま's ホームページ - PostgreSQL 文字列メモ
- このブログの日付 2003-07-19 何ですけど普通にめっちゃ助けられたので、基礎技術はずっと役立つっていうのを体感させて頂きました、感謝
postgresql.org/docs/9.6/functions-string
- 公式にも一覧がある（ただし2バイト文字の記述がない）
日本語のpostgresql.org/docs/9.6/functions-string
- postgresの日本語ドキュメントは精度が高いことで有名（個人の観測）
Slack: postgres-jp
- 心底震えるような強いエンジニアもいるし、ゆるふわなエンジニアもいるしで怖くないですよ。気軽にJoin!!

もりはやメモφ(・ω・ )

ITとか読書感想文とか

PostgreSQLで2バイト文字が含まれるカラムかを調査する方法

これは