It has been about a month since I attended PyData.Tokyo Meetup #21 LightGBM / Optuna, but I have finally gotten started with Optuna. I ran the LightGBM examples in pfnet/optuna, so I am leaving the installation steps and usage here as a memo.
The environment is macOS Mojave (10.14.2) with Python 3.7.4.
Install Optuna
Optuna is a hyperparameter optimization framework whose features have already been covered in many articles. It has three main characteristics:
- Define-by-Run
- Parallel distributed optimization
- Pruning of unpromising trials
For technical details such as the sampling methods and ASHA (Asynchronous Successive Halving Algorithm), and for comparisons with existing optimization tools, see [1]. As an application example, the paper describes tuning FFmpeg encoding parameters, which I found interesting.
Install Optuna from PyPI with pip.
$ pip install optuna
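As a quick taste of the Define-by-Run characteristic listed above, here is a minimal sketch based on the quadratic toy objective from the Optuna README (not part of the LightGBM examples below): the search space is declared imperatively inside the objective function itself.

import optuna

# Define-by-Run: parameters are suggested inside the objective, so the
# search space can depend on control flow and earlier suggestions.
def objective(trial):
    x = trial.suggest_uniform('x', -10, 10)  # sample x from [-10, 10]
    return (x - 2) ** 2  # minimized at x = 2

study = optuna.create_study()  # direction='minimize' by default
study.optimize(objective, n_trials=100)
print(study.best_params)  # should be close to {'x': 2}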
Install LightGBM
Install LightGBM, a GBDT (Gradient Boosting Decision Trees) library, from source. Specifically, follow the steps for macOS in the Installation Guide.
$ brew update
$ brew install cmake
$ brew install libomp
$ git clone --recursive https://github.com/microsoft/LightGBM ; cd LightGBM
$ mkdir build ; cd build
$ cmake \
-DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix libomp)/include" \
-DOpenMP_C_LIB_NAMES="omp" \
-DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp -I$(brew --prefix libomp)/include" \
-DOpenMP_CXX_LIB_NAMES="omp" \
-DOpenMP_omp_LIBRARY=$(brew --prefix libomp)/lib/libomp.dylib \
..
-- The C compiler identification is AppleClang 10.0.0.10001044
-- The CXX compiler identification is AppleClang 10.0.0.10001044
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -Xpreprocessor -fopenmp -I/usr/local/opt/libomp/include (found version "3.1")
-- Found OpenMP_CXX: -Xpreprocessor -fopenmp -I/usr/local/opt/libomp/include (found version "3.1")
-- Found OpenMP: TRUE (found version "3.1")
-- Configuring done
-- Generating done
-- Build files have been written to: /your/env/LightGBM/build
$ cd ../python-package/
$ python setup.py install
Verify the installed Optuna and LightGBM.
$ ipython
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import lightgbm as lgb
In [2]: import optuna
In [3]: lgb.__version__
Out[3]: '2.3.1'
In [4]: optuna.__version__
Out[4]: '0.17.0'
In [5]: import numpy as np
In [6]: import sklearn.datasets
In [7]: import sklearn.metrics
In [8]: from sklearn.model_selection import train_test_split
Run Examples
This time, I run the two examples included in pfnet/optuna: examples/lightgbm_simple.py and examples/pruning/lightgbm_integration.py.
Breast Cancer Wisconsin (Diagnostic) dataset
The Breast Cancer Wisconsin (Diagnostic) dataset used in the examples is a breast cancer diagnosis dataset published in 1995 by Dr. William H. Wolberg and colleagues; today it is used mainly for learning and tutorials as a binary classification dataset.
The target variable is binary, Malignant or Benign; of the 569 samples, 212 are malignant and 357 are benign.
Each sample has 30 features in total: for each of the following 10 attributes of the mass, the mean (columns 1-10), standard error (columns 11-20), and worst or largest (mean of the three largest values) (columns 21-30) are computed.
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
Load the dataset from sklearn.datasets.
In [9]: data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)
In [10]: print(data.shape, target.shape)
(569, 30) (569,)
In [11]: target.sum()
Out[11]: 357
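As a quick sanity check on the column layout described above, the feature names can be inspected directly (a sketch using the same sklearn loader):

# Sketch: confirm the mean / error / worst column blocks and the class encoding.
ds = sklearn.datasets.load_breast_cancer()
print(ds.feature_names[:10])    # mean features (columns 1-10)
print(ds.feature_names[10:20])  # standard error features (columns 11-20)
print(ds.feature_names[20:])    # "worst" features (columns 21-30)
print(ds.target_names)          # ['malignant' 'benign']; benign is encoded as 1

Since benign is encoded as 1, target.sum() above matches the benign sample count of 357.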
Simple tuning
Run examples/lightgbm_simple.py from pfnet/optuna. The LightGBM parameters to tune are as follows:
- lambda_l1 (reg_alpha): L1 regularization
- lambda_l2 (reg_lambda): L2 regularization
- num_leaves: number of leaves in one tree
- feature_fraction (sub_feature, colsample_bytree): fraction of the features randomly selected for each tree
- bagging_fraction (bagging): fraction of the data randomly selected, without resampling
- bagging_freq (subsample_freq): interval, in iterations, at which bagging is performed (0 disables bagging)
- min_data_in_leaf (min_child_samples): minimum number of samples in one leaf
In [12]: def objective(trial):
...: data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)
...: train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.25)
...: dtrain = lgb.Dataset(train_x, label=train_y)
...:
...: param = {
...: 'objective': 'binary',
...: 'metric': 'binary_logloss',
...: 'verbosity': -1,
...: 'boosting_type': 'gbdt',
...: 'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
...: 'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
...: 'num_leaves': trial.suggest_int('num_leaves', 2, 256),
...: 'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
...: 'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
...: 'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
...: 'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
...: }
...:
...: gbm = lgb.train(param, dtrain)
...: preds = gbm.predict(test_x)
...: pred_labels = np.rint(preds)
...: accuracy = sklearn.metrics.accuracy_score(test_y, pred_labels)
...: return accuracy
...:
In [13]: study = optuna.create_study(direction='maximize')
...: study.optimize(objective, n_trials=100)
[I 2019-10-14 20:56:47,529] Finished trial#0 resulted in value: 0.951048951048951. Current best value is 0.951048951048951 with parameters: {'lambda_l1': 6.962619943277916e-08, 'lambda_l2': 2.4560252117247406e-05, 'num_leaves': 249, 'feature_fraction': 0.7647391158856618, 'bagging_fraction': 0.759593045720237, 'bagging_freq': 1, 'min_child_samples': 94}.
[I 2019-10-14 20:56:47,681] Finished trial#1 resulted in value: 0.9230769230769231. Current best value is 0.951048951048951 with parameters: {'lambda_l1': 6.962619943277916e-08, 'lambda_l2': 2.4560252117247406e-05, 'num_leaves': 249, 'feature_fraction': 0.7647391158856618, 'bagging_fraction': 0.759593045720237, 'bagging_freq': 1, 'min_child_samples': 94}.
[I 2019-10-14 20:56:47,815] Finished trial#2 resulted in value: 0.9790209790209791. Current best value is 0.9790209790209791 with parameters: {'lambda_l1': 0.009376054218409027, 'lambda_l2': 1.892546246456281, 'num_leaves': 208, 'feature_fraction': 0.40592732682242627, 'bagging_fraction': 0.9614938668620573, 'bagging_freq': 5, 'min_child_samples': 65}.
...
[I 2019-10-14 20:57:11,652] Finished trial#97 resulted in value: 0.965034965034965. Current best value is 1.0 with parameters: {'lambda_l1': 1.8269284172146948e-07, 'lambda_l2': 0.20632817551331567, 'num_leaves': 225, 'feature_fraction': 0.5131845782464036, 'bagging_fraction': 0.99436585113325, 'bagging_freq': 6, 'min_child_samples': 64}.
[I 2019-10-14 20:57:11,955] Finished trial#98 resulted in value: 0.986013986013986. Current best value is 1.0 with parameters: {'lambda_l1': 1.8269284172146948e-07, 'lambda_l2': 0.20632817551331567, 'num_leaves': 225, 'feature_fraction': 0.5131845782464036, 'bagging_fraction': 0.99436585113325, 'bagging_freq': 6, 'min_child_samples': 64}.
[I 2019-10-14 20:57:12,273] Finished trial#99 resulted in value: 0.972027972027972. Current best value is 1.0 with parameters: {'lambda_l1': 1.8269284172146948e-07, 'lambda_l2': 0.20632817551331567, 'num_leaves': 225, 'feature_fraction': 0.5131845782464036, 'bagging_fraction': 0.99436585113325, 'bagging_freq': 6, 'min_child_samples': 64}.
In [14]: print('Best trial:')
...: trial = study.best_trial
...:
...: print(' Value: {}'.format(trial.value))
...:
...: print(' Params: ')
...: for key, value in trial.params.items():
...: print(' {}: {}'.format(key, value))
...:
Best trial:
Value: 1.0
Params:
lambda_l1: 1.8269284172146948e-07
lambda_l2: 0.20632817551331567
num_leaves: 225
feature_fraction: 0.5131845782464036
bagging_fraction: 0.99436585113325
bagging_freq: 6
min_child_samples: 64
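Beyond study.best_trial, the full history of all 100 trials can be pulled into a pandas DataFrame for inspection (a sketch; study.trials_dataframe() is part of the Study API):

# Sketch: one row per trial, with columns for the value, parameters, and state.
df = study.trials_dataframe()
print(df.head())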
The optimization history can be visualized with visualization.plot_optimization_history().
In [15]: from optuna.visualization import plot_optimization_history
...: plot_optimization_history(study)
Pruning
Pruning (a.k.a. automated early stopping) makes the parameter search more efficient by terminating unpromising trials early. Under the hood, the objective reports an intermediate value at each step and asks the pruner whether to stop, as in the sketch below.
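A minimal sketch of that report/should_prune loop, with a toy learning curve standing in for a real training metric (in this Optuna version the pruned exception lives in optuna.structs; newer releases expose it as optuna.TrialPruned):

import optuna

def objective(trial):
    lr = trial.suggest_loguniform('lr', 1e-4, 1e-1)
    score = 0.5
    for step in range(50):
        score += lr * (1.0 - score)  # toy stand-in for one training iteration
        trial.report(score, step)    # hand the pruner an intermediate value
        if trial.should_prune():     # the pruner compares against other trials
            raise optuna.structs.TrialPruned()
    return score

study = optuna.create_study(pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
                            direction='maximize')
study.optimize(objective, n_trials=30)

The example examples/pruning/lightgbm_integration.py from pfnet/optuna, run next, delegates this loop to LightGBMPruningCallback, which reports LightGBM's validation metric after each boosting round.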
In [16]: def objective(trial):
...: data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)
...: train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.25)
...: dtrain = lgb.Dataset(train_x, label=train_y)
...: dtest = lgb.Dataset(test_x, label=test_y)
...:
...: param = {
...: 'objective': 'binary',
...: 'metric': 'auc',
...: 'verbosity': -1,
...: 'boosting_type': 'gbdt',
...: 'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
...: 'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
...: 'num_leaves': trial.suggest_int('num_leaves', 2, 256),
...: 'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
...: 'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
...: 'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
...: 'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
...: }
...:
...: # Add a callback for pruning.
...: pruning_callback = optuna.integration.LightGBMPruningCallback(trial, 'auc')
...: gbm = lgb.train(
...: param, dtrain, valid_sets=[dtest], verbose_eval=False, callbacks=[pruning_callback])
...:
...: preds = gbm.predict(test_x)
...: pred_labels = np.rint(preds)
...: accuracy = sklearn.metrics.accuracy_score(test_y, pred_labels)
...: return accuracy
In [17]: # optuna.logging.set_verbosity(optuna.logging.WARNING)
...: study = optuna.create_study(pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
...: direction='maximize')
In [18]: study.optimize(objective, n_trials=100)
...: trial = study.best_trial
[I 2019-10-22 11:55:29,817] Finished trial#0 resulted in value: 0.958041958041958. Current best value is 0.958041958041958 with parameters: {'lambda_l1': 1.9570352469571285e-07, 'lambda_l2': 1.1459013975087378, 'num_leaves': 34, 'feature_fraction': 0.8997956033930001, 'bagging_fraction': 0.7003740217574765, 'bagging_freq': 1, 'min_child_samples': 33}.
[I 2019-10-22 11:55:30,905] Finished trial#1 resulted in value: 0.972027972027972. Current best value is 0.972027972027972 with parameters: {'lambda_l1': 2.8042253461628537e-05, 'lambda_l2': 3.4975210119502136, 'num_leaves': 195, 'feature_fraction': 0.8203075154545054, 'bagging_fraction': 0.9364548026623444, 'bagging_freq': 2, 'min_child_samples': 7}.
[I 2019-10-22 11:55:31,443] Finished trial#2 resulted in value: 0.9370629370629371. Current best value is 0.972027972027972 with parameters: {'lambda_l1': 2.8042253461628537e-05, 'lambda_l2': 3.4975210119502136, 'num_leaves': 195, 'feature_fraction': 0.8203075154545054, 'bagging_fraction': 0.9364548026623444, 'bagging_freq': 2, 'min_child_samples': 7}.
...
[I 2019-10-22 11:55:51,815] Setting status of trial#22 as TrialState.PRUNED. Trial was pruned at iteration 11.
[I 2019-10-22 11:55:52,392] Setting status of trial#23 as TrialState.PRUNED. Trial was pruned at iteration 11.
[I 2019-10-22 11:55:52,952] Setting status of trial#24 as TrialState.PRUNED. Trial was pruned at iteration 11.
...
[I 2019-10-22 11:57:01,874] Finished trial#92 resulted in value: 0.965034965034965. Current best value is 1.0 with parameters: {'lambda_l1': 0.2876610835417842, 'lambda_l2': 0.000175639598247342, 'num_leaves': 56, 'feature_fraction': 0.9231284444098429, 'bagging_fraction': 0.718131583563206, 'bagging_freq': 4, 'min_child_samples': 84}.
...
[I 2019-10-22 11:57:05,871] Setting status of trial#97 as TrialState.PRUNED. Trial was pruned at iteration 11.
[I 2019-10-22 11:57:06,830] Setting status of trial#98 as TrialState.PRUNED. Trial was pruned at iteration 11.
[I 2019-10-22 11:57:07,610] Setting status of trial#99 as TrialState.PRUNED. Trial was pruned at iteration 11.
In [19]: print(' Value: {}'.format(trial.value))
...:
...: print(' Params: ')
...: for key, value in trial.params.items():
...: print(' {}: {}'.format(key, value))
Value: 1.0
Params:
lambda_l1: 0.2876610835417842
lambda_l2: 0.000175639598247342
num_leaves: 56
feature_fraction: 0.9231284444098429
bagging_fraction: 0.718131583563206
bagging_freq: 4
min_child_samples: 84
As before, visualize the optimization history with visualization.plot_optimization_history().
In [20]: from optuna.visualization import plot_optimization_history
...: plot_optimization_history(study)
The intermediate values of all trials can be visualized with visualization.plot_intermediate_values().
In [21]: from optuna.visualization import plot_intermediate_values
...: plot_intermediate_values(study)
We can confirm that pruning starts right after the 10 warmup steps specified by n_warmup_steps (the logs above show trials being pruned at iteration 11).
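To quantify this, the trial states can be counted directly (a sketch; TrialState lives in optuna.structs in this version and in optuna.trial in newer releases):

from optuna.structs import TrialState

# Sketch: count how many of the 100 trials were pruned vs. ran to completion.
pruned = [t for t in study.trials if t.state == TrialState.PRUNED]
complete = [t for t in study.trials if t.state == TrialState.COMPLETE]
print('pruned: {}, complete: {}'.format(len(pruned), len(complete)))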
Conclusion
In this post, as a memo, I covered installing Optuna and LightGBM on macOS and running the LightGBM examples included in pfnet/optuna.
[1] Optuna: A Next-generation Hyperparameter Optimization Framework
[2] The Current State of LightGBM Compared with Well-known Libraries (in Japanese)
[3] Understanding GBDT Hands-on (in Japanese)
[4] 6.2.7. Breast cancer wisconsin (diagnostic) dataset, scikit-learn documentation
[5] Shotaro Sano, "Machine Learning and Automated Hyperparameter Optimization", lecture slides at Meiji University (in Japanese)
[6] Performance Analysis of Successive Halving (in Japanese)