Implementing classification and regression with Gradient Tree Boosting in Rumale

Introduction

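In Rumale, I have been speeding up the decision-tree-based algorithms and adding new ones. I have finally implemented the popular Gradient Tree Boosting (also called Gradient Boosting Machine or Gradient Boosted Regression Tree) and released it as ver. 0.9.2.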

rumale | RubyGems.org | your community gem host

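Gradient Tree Boosting (GTB) is an algorithm well known in Python through XGBoost, LightGBM, CatBoost, and others. I decided to implement GTB partly because Scikit-Learn also implements it, and partly because I learned that ver. 0.21 would add HistGradientBoostingClassifier/Regressor, an implementation modeled on LightGBM, which put me in the mood to try implementing it in Ruby.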

Usage

Rumale can be installed with the gem command. It depends on Numo::NArray.

$ gem install rumale

I want to use red-datasets to load the dataset, so install that as well.

$ gem install red-datasets-numo-narray

Let's try regression on the cpusmall dataset from LIBSVM Data (8,192 samples, 12 dimensions).

require 'rumale'
require 'datasets'
require 'datasets-numo-narray'

# Load the dataset as a Numo::NArray.
datasets = Datasets::LIBSVM.new('cpusmall').to_narray
values = Numo::DFloat.cast(datasets[true, 0])
samples = Numo::DFloat.cast(datasets[true, 1..-1])

# Randomly split into training and test sets.
ss = Rumale::ModelSelection::ShuffleSplit.new(n_splits: 1, test_size: 0.1, random_seed: 1)
train_ids, test_ids = ss.split(samples, values).first
train_s = samples[train_ids, true]
train_v = values[train_ids]
test_s = samples[test_ids, true]
test_v = values[test_ids]

# Train a regressor with Gradient Tree Boosting.
# Note: the hyperparameters were chosen by gut feeling.
est = Rumale::Ensemble::GradientBoostingRegressor.new(
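  n_estimators: 100,   # number of decision trees to build, which is also the number of boosting iterations
  learning_rate: 0.1,  # learning rate
  reg_lambda: 0.001,   # L2 regularization coefficient
  subsample: 0.8,      # fraction of the data used when random sampling
  max_depth: 4,        # depth of each decision tree
  max_features: 8,     # number of features to use, also called column sampling
  random_seed: 1       # random seed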
)
est.fit(train_s, train_v)

# Check the coefficient of determination on the test set (closer to 1 is better).
puts("GTB R2-Score: %.4f" % est.score(test_s, test_v))

# Do the same with Random Forest for comparison.
est = Rumale::Ensemble::RandomForestRegressor.new(
  n_estimators: 100,
  max_depth: 4,
  max_features: 8,
  random_seed: 1
)
est.fit(train_s, train_v)
puts("RF R2-Score: %.4f" % est.score(test_s, test_v))

Running this gives the following. GTB achieves the better score.

GTB R2-Score: 0.9703
RF R2-Score: 0.9433
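
For reference, the R2 score (coefficient of determination) printed above is commonly defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal plain-Ruby sketch of that definition. The helper name r2_score and the toy values are made up for illustration, and this is not Rumale's own code.

```ruby
# Coefficient of determination: 1 - SS_res / SS_tot.
# r2_score is a hypothetical helper for illustration, not part of Rumale.
def r2_score(y_true, y_pred)
  mean = y_true.sum / y_true.size.to_f
  # Residual sum of squares: how far the predictions are from the targets.
  ss_res = y_true.zip(y_pred).sum { |t, y| (t - y)**2 }
  # Total sum of squares: how far the targets are from their own mean.
  ss_tot = y_true.sum { |t| (t - mean)**2 }
  1.0 - ss_res / ss_tot
end

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
puts format('R2-Score: %.4f', r2_score(y_true, y_pred)) # prints "R2-Score: 0.9486"
```

A perfect prediction gives 1.0, and always predicting the mean of the targets gives 0.0, which is why a score closer to 1 is better.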

The classifier, Rumale::Ensemble::GradientBoostingClassifier, can be used with the same parameters and procedure.
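
To give a feel for what n_estimators, learning_rate, and reg_lambda control, here is a rough plain-Ruby sketch of boosting for squared error on a single feature, using depth-1 trees (stumps). It is a simplified illustration under my own assumptions and not Rumale's actual implementation: it leaves out subsample, max_features, and deeper trees, and the names fit_stump, boost, and mse are hypothetical.

```ruby
# Fit a depth-1 regression tree (stump) to the residuals of one feature.
# Returns [threshold, left_value, right_value].
def fit_stump(xs, residuals, reg_lambda)
  best = nil
  # Candidate thresholds are midpoints between consecutive distinct values.
  xs.uniq.sort.each_cons(2) do |lo, hi|
    threshold = (lo + hi) / 2.0
    left, right = residuals.each_index.partition { |i| xs[i] <= threshold }
    # Leaf value with L2 regularization: sum(residual) / (count + lambda).
    lv = left.sum { |i| residuals[i] } / (left.size + reg_lambda)
    rv = right.sum { |i| residuals[i] } / (right.size + reg_lambda)
    err = residuals.each_index.sum { |i| (residuals[i] - (xs[i] <= threshold ? lv : rv))**2 }
    best = [err, threshold, lv, rv] if best.nil? || err < best[0]
  end
  best.drop(1)
end

# Build n_estimators stumps, each fitted to the current residuals,
# and return a lambda that predicts a value for a single x.
def boost(xs, ys, n_estimators:, learning_rate:, reg_lambda: 0.0)
  base = ys.sum / ys.size.to_f # initial prediction: mean of the targets
  preds = Array.new(ys.size, base)
  stumps = []
  n_estimators.times do
    # For squared error, the negative gradient is simply the residual.
    residuals = ys.zip(preds).map { |y, pred| y - pred }
    threshold, lv, rv = fit_stump(xs, residuals, reg_lambda)
    stumps << [threshold, lv, rv]
    # Each tree's contribution is shrunk by the learning rate.
    preds = xs.zip(preds).map { |x, pred| pred + learning_rate * (x <= threshold ? lv : rv) }
  end
  ->(x) { base + stumps.sum { |t, lv, rv| learning_rate * (x <= t ? lv : rv) } }
end

# Mean squared error of a model on the given data.
def mse(model, xs, ys)
  xs.zip(ys).sum { |x, y| (model.call(x) - y)**2 } / xs.size
end

xs = (0...20).map { |i| i / 2.0 }
ys = xs.map { |x| Math.sin(x) }
[1, 10, 100].each do |n|
  model = boost(xs, ys, n_estimators: n, learning_rate: 0.1, reg_lambda: 0.001)
  puts format('n_estimators=%3d  training MSE=%.4f', n, mse(model, xs, ys))
end
```

Each added tree corrects part of the error left by the previous ones, so the training error in this sketch shrinks as n_estimators grows, while a smaller learning_rate makes each correction more cautious.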

Feature discretization

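LightGBM and similar libraries achieve fast computation by converting features to discrete values (discretization). In a decision tree, the distinct values of a feature (what remains after applying uniq to the feature vector) are the candidate thresholds for splitting a node. When discretization reduces the number of candidates, fewer split evaluations are needed, so training gets faster accordingly. Some implementations design the whole algorithm around discretization to speed everything up, but I figured there are cases where you do not want to discretize, so Rumale provides a BinDiscretizer class that discretizes features separately from the algorithm. An example run follows. It discretizes real values in [-0.5, 0.5) into 4 levels.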

irb(main):001:0> require 'rumale'
=> true
irb(main):002:0> t = Rumale::Preprocessing::BinDiscretizer.new(n_bins: 4)
=> #<Rumale::Preprocessing::BinDiscretizer... (omitted)
irb(main):003:0> x=Numo::DFloat.new(5, 3).rand - 0.5
=> Numo::DFloat#shape=[5,3]
[[-0.438246, -0.126933, 0.294815],
 [-0.298958, -0.383959, -0.155968],
 [0.039948, 0.237815, -0.334911],
 [-0.449117, -0.391935, -0.431292],
 [0.404121, -0.0213559, -0.157031]]
irb(main):004:0> t.fit_transform(x)
=> Numo::DFloat#shape=[5,3]
[[0, 1, 3],
 [0, 0, 1],
 [2, 3, 0],
 [0, 0, 0],
 [3, 2, 1]]
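
To make the link between discretization and the number of threshold candidates concrete, here is a small plain-Ruby sketch of equal-width binning applied to the values from the session above. Equal-width binning between each column's minimum and maximum is my assumption about the scheme. It is not taken from Rumale's source, although it does reproduce the integer codes shown above. The helper name discretize is hypothetical.

```ruby
# Equal-width binning of one feature column into integer codes 0...n_bins.
# discretize is a hypothetical helper for illustration, not Rumale's code.
def discretize(column, n_bins)
  min, max = column.minmax
  width = (max - min) / n_bins.to_f
  # A constant column has nothing to split on, so map everything to bin 0.
  return Array.new(column.size, 0) if width.zero?

  # Clamp so that the maximum value falls into the last bin.
  column.map { |v| [((v - min) / width).floor, n_bins - 1].min }
end

rows = [
  [-0.438246, -0.126933, 0.294815],
  [-0.298958, -0.383959, -0.155968],
  [0.039948, 0.237815, -0.334911],
  [-0.449117, -0.391935, -0.431292],
  [0.404121, -0.0213559, -0.157031]
]

# Discretize column by column, then go back to the row layout.
columns = rows.transpose
binned_columns = columns.map { |col| discretize(col, 4) }
binned_columns.transpose.each { |row| p row }

# The distinct values of a feature are its threshold candidates.
columns.zip(binned_columns).each_with_index do |(col, binned), j|
  puts "feature #{j}: #{col.uniq.size} -> #{binned.uniq.size} distinct values"
end
```

With only five samples the reduction is small, but on a column with thousands of distinct real values the candidate count drops to at most n_bins, which is where the speedup comes from.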

I checked whether discretization makes GTB faster with the following code.

require 'rumale'
require 'datasets'
require 'datasets-numo-narray'
require 'benchmark'

datasets = Datasets::LIBSVM.new('cpusmall').to_narray
values = Numo::DFloat.cast(datasets[true, 0])
samples = Numo::DFloat.cast(datasets[true, 1..-1])

ss = Rumale::ModelSelection::ShuffleSplit.new(n_splits: 1, test_size: 0.1, random_seed: 1)
train_ids, test_ids = ss.split(samples, values).first
train_s = samples[train_ids, true]
train_v = values[train_ids]
test_s = samples[test_ids, true]
test_v = values[test_ids]

est = Rumale::Ensemble::GradientBoostingRegressor.new(
  n_estimators: 100,
  learning_rate: 0.1,
  reg_lambda: 0.001,
  subsample: 0.8,
  max_depth: 4,
  max_features: 8,
  random_seed: 1
)

Benchmark.bm 10 do |r|
  r.report 'non-transform' do
    est.fit(train_s, train_v)
    puts(" (R2-Score: %.4f)" % est.score(test_s, test_v))
  end

  r.report 'discretized' do
    # Convert to 4 discrete levels (a somewhat extreme example; in practice 32 or more levels is probably better)
    t = Rumale::Preprocessing::BinDiscretizer.new(n_bins: 4)
    dis_train_s = t.fit_transform(train_s)
    dis_test_s = t.transform(test_s)
    est.fit(dis_train_s, train_v)
    puts(" (R2-Score: %.4f)" % est.score(dis_test_s, test_v))
  end
end

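The results are as follows. The run gets faster (about 7.3 s versus 8.9 s of wall-clock time), but prediction accuracy drops (an R2 score of 0.9203 versus 0.9703). This likely depends on the balance between dataset size, feature dimensionality, and parameters.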

                 user     system      total        real
non-transform (R2-Score: 0.9703)
  8.030000   0.810000   8.840000 (  8.866683)
discretized (R2-Score: 0.9203)
  6.330000   0.920000   7.250000 (  7.275490)