LEE_BOMB 2021. 12. 7. 16:26
์•™์ƒ๋ธ”(Ensemble) ๋ชจ๋ธ

์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ถ„๋ฅ˜๊ธฐ(Classifier)๋ฅผ ์ƒ์„ฑํ•˜๊ณ , ๊ฐ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ์˜ˆ์ธก๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ชจ๋ธ

์žฅ์  : ๋‹จ์ผ ๋ชจ๋ธ(Decision Tree)์— ๋น„ํ•ด์„œ ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ ์šฐ์ˆ˜

 

 

๋‹จ์ผ Tree vs ์•™์ƒ๋ธ” ๋ชจ๋ธ

Decision tree Random Forest
์ƒ์„ฑ๋œ ๋ถ„๋ฅ˜๋ชจ๋ธ์„ ๊ฒ€์ •๋ฐ์ดํ„ฐ์— ์ ์šฉํ•˜์—ฌ ๋ชฉํ‘œ๋ณ€์ˆ˜ ์˜ˆ์ธก ๋™์ผํ•œ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์—์„œ ์ž„์˜๋ณต์› ์ƒ˜ํ”Œ๋ง์„ ํ†ตํ•ด ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํ›ˆ๋ จ์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑ
์ƒ์„ฑ๋œ ๋ถ„๋ฅ˜๋ชจ๋ธ์„ ๊ฒ€์ •๋ฐ์ดํ„ฐ์— ์ ์šฉํ•˜์—ฌ ๋ชฉํ‘œ๋ณ€์ˆ˜ ์˜ˆ์ธก ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ํ•™์Šต์„ ํ†ตํ•ด ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํŠธ๋ฆฌ ์ƒ์„ฑํ•˜๊ณ , ์ด๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ์ตœ์ข…์ ์œผ๋กœ ๋ชฉํ‘œ๋ณ€์ˆ˜ ์˜ˆ์ธก
์ƒ์„ฑ๋œ ๋ถ„๋ฅ˜๋ชจ๋ธ์„ ๊ฒ€์ •๋ฐ์ดํ„ฐ์— ์ ์šฉํ•˜์—ฌ ๋ชฉํ‘œ๋ณ€์ˆ˜ ์˜ˆ์ธก ๋ถ„๋ฅ˜๋ชจ๋ธ์„ ๊ฒ€์ •๋ฐ์ดํ„ฐ์— ์ ์šฉ

 

 

์•™์ƒ๋ธ”(Ensemble) ์•Œ๊ณ ๋ฆฌ์ฆ˜

์ข…๋ฅ˜ : ๋ฐฐ๊น…(Bagging), ๋ถ€์ŠคํŒ…(Boosting)

์•™์ƒ๋ธ” ํ•™์Šต ๋ชจ๋ธ ์ƒ์„ฑ ์ ˆ์ฐจ : 

1. ์ „์ฒด ๋ฐ์ดํ„ฐ์—์„œ ํ›ˆ๋ จ ์ง‘ํ•ฉ ์ƒ์„ฑ

2. ๊ฐ ํ›ˆ๋ จ์ง‘ํ•ฉ ๋ชจ๋ธ ํ•™์Šต

3. ํ•™์Šต๋œ ๋ชจ๋ธ ์•™์ƒ๋ธ” ๋„์ถœ

* ๋‹จ์  : ๋ชจ๋ธ ๊ฒฐ๊ณผ์˜ ํ•ด์„์ด ์–ด๋ ต๊ณ , ์˜ˆ์ธก ์‹œ๊ฐ„ ๋งŽ์ด ์†Œ์š”

 

 

์•™์ƒ๋ธ” ๋ชจ๋ธ ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋น„๊ต

  ๋ฐฐ๊น…(Bagging) ๋ถ€์ŠคํŒ…(Boosting)
๊ณตํ†ต์  ์ „์ฒด ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์œผ๋กœ๋ถ€ํ„ฐ ๋ณต์› ๋žœ๋ค ์ƒ˜ํ”Œ๋ง(bootstrap) ์œผ๋กœ ํ›ˆ๋ จ ์ง‘ํ•ฉ ์ƒ์„ฑ
์ฐจ์ด์  ๋ณ‘๋ ฌํ•™์Šต : ๊ฐ ๋ชจ๋ธ์˜ ๊ฒฐ๊ณผ ๋ฅผ ์กฐํ•ฉํ•˜์—ฌ ํˆฌํ‘œ ๊ฒฐ์ • ์ˆœ์ฐจํ•™์Šต : ํ˜„์žฌ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜ -> ๋‹ค์Œ ๋ชจ๋ธ ์ „๋‹ฌ
ํŠน์ง• ๊ท ์ผํ•œ ํ™•๋ฅ ๋ถ„ํฌ์— ์˜ํ•ด์„œ ํ›ˆ๋ จ ์ง‘ํ•ฉ ์ƒ์„ฑ ๋ถ„๋ฅ˜ํ•˜๊ธฐ ์–ด๋ ค์šด ํ›ˆ๋ จ ์ง‘ํ•ฉ ์ƒ์„ฑ
๊ฐ•์  ๊ณผ๋Œ€์ ํ•ฉ์— ๊ฐ•ํ•จ ๋†’์€ ์ •ํ™•๋„
์•ฝ์  ํŠน์ • ์˜์—ญ ์ •ํ™•๋„ ๋‚ฎ์Œ Outlier ์ทจ์•ฝ
R ํŒจํ‚ค์ง€ randomForest XGboost

 

 

์•™์ƒ๋ธ” ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜

* Bagging์€ ์ผ๋ฐ˜์ ์ธ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š”๋ฐ ์ง‘์ค‘, Boosting์€ ๋งž์ถ”๊ธฐ ์–ด๋ ค์šด ๋ฌธ์ œ๋ฅผ ๋งž์ถ”๋Š”๋ฐ ์ดˆ์ ์„ ๋งž์ถค ์•™์ƒ๋ธ” ํ•™์Šต ๋ฐฉ๋ฒ•

 

 

 

1) ๋ถ€ํŠธ์ŠคํŠธ๋žฉ(Bootstrap)

์•™์ƒ๋ธ” ๋ชจ๋ธ์—์„œ ์›๋ž˜์˜ ๋ฐ์ดํ„ฐ ์…‹์œผ๋กœ๋ถ€ํ„ฐ ๋ณต์› ์ถ”์ถœ๋ฐฉ์‹์œผ๋กœ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ์…‹์„ ์ƒ์„ฑํ•˜๋Š” ๊ธฐ๋ฒ•

๋ฐ์ดํ„ฐ์˜ ์–‘์„ ์ž„์˜์ ์œผ๋กœ ๋Š˜๋ฆฌ๊ณ , ๋ฐ์ดํ„ฐ์…‹์˜ ๋ถ„ํฌ๊ฐ€ ๊ณ ๋ฅด์ง€ ์•Š์„ ๋•Œ ๊ณ ๋ฅด๊ฒŒ ๋งŒ๋“œ๋Š” ํšจ๊ณผ

 

2) ๋ฐฐ๊น… ์•Œ๊ณ ๋ฆฌ์ฆ˜

Bagging : Bootstrap Aggregating(“์ฃผ๋จธ๋‹ˆ ํ†ตํ•ฉ”)

๋ถ€ํŠธ์ŠคํŠธ๋žฉ์„ ํ†ตํ•ด์„œ ์กฐ๊ธˆ์”ฉ ์„œ๋กœ ๋‹ค๋ฅธ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜์—ฌ ๋ชจ๋ธ (ํ›ˆ๋ จ ๋œ ํŠธ๋ฆฌ)์„ ์ƒ์„ฑํ•˜๊ณ , ๊ฒฐ๊ณผ๋ฅผ ๊ฒฐํ•ฉ(aggregating) ์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•

 

3) ๋ถ€์ŠคํŒ…(boosting) ์•Œ๊ณ ๋ฆฌ์ฆ˜

์ž˜๋ชป ๋ถ„๋ฅ˜๋œ ๊ฐ์ฒด๋“ค์— ์ง‘์ค‘ํ•˜์—ฌ ์ƒˆ๋กœ์šด ๋ถ„๋ฅ˜๊ทœ์น™์„ ์ƒ์„ฑํ•˜๋Š” ๋‹จ๊ณ„๋ฅผ ๋ฐ˜๋ณตํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜(์ˆœ์ฐจ์  ํ•™์Šต)

์•ฝํ•œ ์˜ˆ์ธก๋ชจํ˜•๋“ค์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๊ฐ•ํ•œ ์˜ˆ์ธก๋ชจํ˜•์„ ๋งŒ๋“œ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜

์˜ค ๋ถ„๋ฅ˜๋œ ๊ฐœ์ฒด๋Š” ๋†’์€ ๊ฐ€์ค‘์น˜, ์ • ๋ถ„๋ฅ˜๋œ ๊ฐœ์ฒด๋Š” ๋‚ฎ์€ ๊ฐ€์ค‘์น˜๋ฅผ ์ ์šฉํ•˜์—ฌ ์˜ˆ์ธก๋ชจํ˜•์˜ ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•

 

4) RandomForest

Similar to the bagging algorithm (an ensemble model)

Supports both regression and classification

No separate tuning (feature scaling) step is required

Compensates for the weaknesses of a single tree model (performance, overfitting)

Processing time increases on large data sets (drawback)

Can be parallelized across multiple CPU cores

 

์ฐจ์ด์ 

๋ฐฐ๊น… : ์ƒ˜ํ”Œ ๋ณต์› ์ถ”์ถœ ์‹œ ๋ชจ๋“  ๋…๋ฆฝ๋ณ€์ˆ˜(์„ค๋ช…๋ณ€์ˆ˜) ์‚ฌ์šฉ

๋žœ๋คํฌ๋ ˆ์ŠคํŠธ : a๊ฐœ ๋…๋ฆฝ๋ณ€์ˆ˜(์„ค๋ช…๋ณ€์ˆ˜)๋งŒ์œผ๋กœ ๋ณต์› ์ถ”์ถœ

์„ค๋ช…๋ณ€์ˆ˜ ๊ฐœ์ˆ˜ : sqrt(์„ค๋ช…๋ณ€์ˆ˜ ๊ฐœ์ˆ˜) (์˜ˆ: 15๊ฐœ ๋ณ€์ˆ˜๋ผ๋ฉด 4๊ฐœ ์ •๋„)

์„ค๋ช…๋ณ€์ˆ˜๊ฐ€ ๋งŽ์„ ๊ฒฝ์šฐ ๋ณ€์ˆ˜๊ฐ„ ์ƒ๊ด€์„ฑ์ด ๋†’์€ ๋ณ€์ˆ˜๊ฐ€ ์„ž์ผ ํ™•๋ฅ  ์ œ๊ฑฐ

 

 

5) XGBoost algorithm

Trains multiple decision trees and combines them; an ensemble learning method of the boosting type (see the XGBoost practice code below).

Sequential learning: weak classifiers are combined into a strong classifier. High classification accuracy, but vulnerable to outliers.

 

 

 

 

 

[์‹ค์Šต]

RandomForest
from sklearn.ensemble import RandomForestClassifier #classifier: qualitative (categorical) y
from sklearn.datasets import load_wine #3 classes
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report #model evaluation tools


1. dataset load

wine = load_wine()


x๋ณ€์ˆ˜ ์ด๋ฆ„

x_names = wine.feature_names
x_names
len(x_names) #13

X, y = load_wine(return_X_y = True)




2. model ์ƒ์„ฑ

model = RandomForestClassifier().fit(X = X, y = y) #๊ธฐ๋ณธ ํŒŒ๋ผ๋ฏธํ„ฐ




3. test set ์ƒ์„ฑ

import numpy as np

idx = np.random.choice(a = len(X), size = 100, replace = False)
idx

X_test, y_test = X[idx], y[idx]
X_test.shape #(100, 13)
y_test.shape #(100,)




4. model ํ‰๊ฐ€

y_pred = model.predict(X = X_test)

con_mat = confusion_matrix(y_test, y_pred)
print(con_mat)

[[33  0  0]
 [ 0 43  0]
 [ 0  0 24]]

acc = accuracy_score(y_test, y_pred)
print(acc) #1.0

 



5. ์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™” (y์— ์˜ํ–ฅ๋ ฅ์ด ๊ฐ€์žฅ ํฐ ์ค‘์š”๋ณ€์ˆ˜ ์„ ์ • )

dir(model) #check which methods/attributes the model exposes

model.feature_importances_ #proline

array([0.11749764, 0.02167489, 0.0126863 , 0.02941608, 0.02915404,
       0.05885703, 0.17041459, 0.01035685, 0.01737543, 0.15155433,
       0.08574004, 0.1037657 , 0.19150708])

x_names[-1] #'proline'
x_names[6] #'flavanoids'


๊ฐ€๋กœ๋ง‰๋Œ€ ์ฐจํŠธ

import matplotlib.pyplot as plt
plt.barh(range(13), width = model.feature_importances_)


13๊ฐœ์˜ ๋ˆˆ๊ธˆ๋งˆ๋‹ค x๋ณ€์ˆ˜ ์ด๋ฆ„์„ ๋ถ™์ด๊ธฐ

plt.yticks(range(13), x_names) 
plt.xlabel('feature_importances')
plt.show()

 

 

 

 

 

RandomForest GridSearch

1. RandomForest 
2. GridSearch : best parameters


from sklearn.ensemble import RandomForestClassifier #model 
from sklearn.datasets import load_digits #10 classes
from sklearn.model_selection import GridSearchCV #best params



1. dataset load 

X, y = load_digits(return_X_y=True)




2. model ์ƒ์„ฑ : default params 

model = RandomForestClassifier().fit(X = X, y = y)

n_estimators : int, default=100 : number of decision trees
criterion : {"gini", "entropy"}, default="gini" : node split (impurity) criterion
max_depth : int, default=None : maximum tree depth
min_samples_split : int or float, default=2 : minimum samples required to split a node
min_samples_leaf : int or float, default=1 : minimum samples required at a leaf node
max_features : {"auto", "sqrt", "log2"} : maximum number of x variables considered per split

model.score(X = X, y = y)

 



3. GridSearch model 

params = {'n_estimators' : [100, 150, 200],
          'criterion' : ["gini", "entropy"],
          'max_depth' : [None, 3, 5, 7],
          'min_samples_leaf' : [1, 2, 3],
          'max_features' : ["auto", "sqrt"]} #dict

grid_model = GridSearchCV(model, param_grid=params,
             scoring='accuracy', cv = 5, n_jobs= -1)


model2 = grid_model.fit(X = X, y = y)

dir(model2)
'''
'best_params_',
'best_score_',
'''
model2.best_score_ #0.9438130609718354

print('best params : ', model2.best_params_)

best params :  {'criterion': 'gini',
                'max_depth': None, 
                'max_features': 'sqrt', 
                'min_samples_leaf': 1, 
                'n_estimators': 150}

new model 

new_model = RandomForestClassifier(
    max_features='sqrt',n_estimators=150).fit(X = X, y = y)

new_model.score(X = X, y = y) #1.0

 

 

 

 

 

XGBoost_test

pip install xgboost


from xgboost import XGBClassifier #classification tree
from xgboost import plot_importance #feature importance plot
from sklearn.datasets import make_blobs #dataset: Gaussian blobs (scatter-style data)
from sklearn.model_selection import train_test_split #split utility
from sklearn.metrics import accuracy_score, classification_report




1. dataset load

X, y = make_blobs(n_samples=2000, n_features=4, centers=2,
           cluster_std=2.5, random_state=123)

n_samples : number of samples (= points) to generate (default: 100)
n_features : number of x variables
centers : number of y classes (= blob types)
cluster_std : standard deviation of the clusters (adds noise); controls the data's complexity (default: 1)



2. train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)




3. model ์ƒ์„ฑ : ์ดํ•ญ๋ถ„๋ฅ˜๊ธฐ(centers=2)

model = XGBClassifier(objective = "binary:logistic").fit(X = X_train, y = y_train, eval_metric = 'error') #fit on the training split from step 2

ํ™œ์„ฑํ•จ์ˆ˜(activeation function) : model์˜ˆ์ธก๊ฐ’ -> ์ถœ๋ ฅ y๋กœ ํ™œ์„ฑํ™”
Objective = "binary:logistic" : ์ดํ•ญ๋ถ„๋ฅ˜๊ธฐ - sigmoidํ•จ์ˆ˜
Objective = "multi:softprob" : ๋‹คํ–ฅ๋ถ„๋ฅ˜๊ธฐ - softmaxํ•จ์ˆ˜



4. model ํ‰๊ฐ€ (๋ถ„๋ฅ˜์ •ํ™•๋„)

y_pred = model.predict(X = X_test)
len(y_pred)
len(y_test)

acc = accuracy_score(y_test, y_pred)
print(acc) #1.0




5. ์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™” 

fscore = model.get_booster().get_fscore()
print(fscore)
plot_importance(model)

 

 

 

 

 

XGBoost_boston

ํšŒ๊ท€ํŠธ๋ฆฌ : y๋ณ€์ˆ˜ ์–‘์ ๋ณ€์ˆ˜(๋น„์œจ์ฒ™๋„) 
๋ถ„๋ฅ˜ํŠธ๋ฆฌ : y๋ณ€์ˆ˜ ์งˆ์ ๋ณ€์ˆ˜(๋ช…๋ชฉ์ฒ™๋„) 


from xgboost import XGBRegressor #regression tree model
from xgboost import plot_importance #feature importance plot

from sklearn.datasets import load_boston #dataset
from sklearn.model_selection import train_test_split #split 
from sklearn.metrics import mean_squared_error, r2_score #evaluation metrics



1. dataset load 

boston = load_boston()
x_names = boston.feature_names #x variable names
print(x_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

X = boston.data
y = boston.target 
X.shape #(506, 13)
type(X) #numpy.ndarray
y #์—ฐ์†ํ˜•(๋น„์œจ์ฒ™๋„)




2. dataset split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3)




3. model ์ƒ์„ฑ 

model = XGBRegressor().fit(X=X_train, y=y_train)




4. ์ค‘์š”๋ณ€์ˆ˜ ํ™•์ธ 

fscore = model.get_booster().get_fscore()
print(fscore)

{'f0': 698.0, 'f1': 43.0, 'f2': 107.0, 'f3': 23.0, 'f4': 140.0, 'f5': 410.0, 'f6': 361.0, 'f7': 337.0, 'f8': 36.0, 'f9': 60.0, 'f10': 77.0, 'f11': 313.0, 'f12': 311.0}

plot_importance(model) #top features: f0 > f5 > f6
x_names[0] #'CRIM'
x_names[5] #'RM'
x_names[6] #'AGE'




5. model ํ‰๊ฐ€ 

y_pred = model.predict(X = X_test)


1) MSE : sensitive to the scale of y (log transform / normalization of y applies)

mse = mean_squared_error(y_test, y_pred)
print(mse) #11.779651544177534


2) ๊ฒฐ์ •๊ณ„์ˆ˜(r^2) : y๋ณ€์ˆ˜ ๋กœ๊ทธ๋ณ€ํ™˜ or ์ •๊ทœํ™”(x)

score = r2_score(y_test, y_pred)
print(score) #0.8240142993822802

 

 

 

์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™”์—์„œ X๋ณ€์ˆ˜๋ช… ํ‘œ์‹œ 

import pandas as pd

 


1. Convert numpy array -> DataFrame

X = boston.data
X.shape #(506, 13)
y = boston.target 

df = pd.DataFrame(X, columns=x_names)
df.info()


์นผ๋Ÿผ ์ถ”๊ฐ€ 

df['target'] = y
df.info()


x,y๋ณ€์ˆ˜ ์„ ์ • 

cols = list(df.columns)
X = df[cols[:-1]]
y = df['target']




2. dataset split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3)




3. model ์ƒ์„ฑ 

model = XGBRegressor().fit(X=X_train, y=y_train)




4. ์ค‘์š”๋ณ€์ˆ˜ ํ™•์ธ 

fscore = model.get_booster().get_fscore()
print(fscore)

plot_importance(model)




5. model ํ‰๊ฐ€ 

y_pred = model.predict(X = X_test)


1) MSE : sensitive to the scale of y (log transform / normalization of y applies)

mse = mean_squared_error(y_test, y_pred)
print(mse) # 11.779651544177534


2) ๊ฒฐ์ •๊ณ„์ˆ˜(r^2) : y๋ณ€์ˆ˜ ๋กœ๊ทธ๋ณ€ํ™˜ or ์ •๊ทœํ™”(x)

score = r2_score(y_test, y_pred)
print(score) # 0.8240142993822802

 



6. model save & load 

model #object 

import pickle #binary file


model save 

pickle.dump(model, open('xgb_model.pkl', mode='wb'))


model load 

load_model = pickle.load(open('xgb_model.pkl', mode='rb'))

y_pred = load_model.predict(X = X_test) #new_data

score = r2_score(y_test, y_pred)
print(score) #0.8919146007745038

 

 

 

 

 

XGBoost GridSearch

1. XGBoost hyperparameters : ppt.19
2. Early stopping of model training
3. Finding the best hyperparameters


from xgboost import XGBClassifier #model 
from xgboost import plot_importance #feature importance plot
from sklearn.datasets import load_breast_cancer #binary classification dataset
from sklearn.model_selection import train_test_split #split 
from sklearn.metrics import accuracy_score, classification_report #evaluation metrics



1. XGBoost Hyper parameter
1) dataset load 

cancer = load_breast_cancer()

x_names = cancer.feature_names #x variable names
print(x_names, len(x_names)) #30

class_names = cancer.target_names #y class names
print(class_names) #['malignant' 'benign']

X = cancer.data
y = cancer.target


2) dataset split 

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3)


3) model ์ƒ์„ฑ 

model = XGBClassifier().fit(X_train, y_train) #default ํŒŒ๋ผ๋ฏธํ„ฐ 

print(model) #default ํŒŒ๋ผ๋ฏธํ„ฐ

1. colsample_bytree=1 : fraction of columns (features) sampled when building each tree (typically 0.6 ~ 1)
2. learning_rate=0.3 : learning rate (typically 0.01 ~ 0.1) = how fast the loss converges toward 0
3. max_depth=6 : tree depth (larger values fit the training data better but risk overfitting)
4. min_child_weight=1 : minimum sum of instance weights required in a child node to allow a split
* any value from 0 to n : smaller values fit the training data better but risk overfitting
5. n_estimators=100 : number of decision trees (default=100); more trees generally means higher performance
6. objective='binary:logistic' (binary classifier), 'multi:softprob' (multi-class classifier)
Overfitting control : set max_depth smaller and min_child_weight higher (a short sketch follows below)

 

 


2. model ํ•™์Šต ์กฐ๊ธฐ ์ข…๋ฃŒ 

xgb = XGBClassifier(n_estimators=500) #object 

model = xgb.fit(X = X_train, y = y_train, #training set
        eval_set = [ (X_test, y_test) ], #evaluation set
        eval_metric = 'error', #evaluation metric (error rate)
        early_stopping_rounds = 80, #stop after 80 rounds without improvement
        verbose = True) #print training progress

ํ›ˆ๋ จ์…‹ : X = X_train, y = y_train
ํ‰๊ฐ€์…‹ : (X_test, y_test)
ํ‰๊ฐ€๋ฐฉ๋ฒ• : 'error'
์กฐ๊ธฐ์ข…๋ฃŒ ๋ผ์šด๋“œ ์ˆ˜ : early_stopping_rounds
ํ•™์Šต๊ณผ์ • ์ฝ˜์†” ์ถœ๋ ฅ : verbose = True

[91] validation_0-error:0.04094

4) model ํ‰๊ฐ€ 

y_pred = model.predict(X = X_test)

acc = accuracy_score(y_test, y_pred)
print(acc) #0.9649122807017544




3. Finding the best hyperparameters

from sklearn.model_selection import GridSearchCV #class


1) ๊ธฐ๋ณธ model 

model = XGBClassifier()

params = {'colsample_bytree' : [0.5, 0.7, 1],
          'learning_rate' : [0.01, 0.3, 0.5],
          'max_depth' : [5,6,7],
          'min_child_weight' : [0.5,1,3],
          'n_estimators' : [90, 100, 200]} #dict


2) grid search model 

gs = GridSearchCV(estimator=model, param_grid=params, cv = 5)
grid_model = gs.fit(X_train, y_train)


3) best score & parameter

dir(grid_model)

grid_model.best_score_ #0.9649367088607596
grid_model.best_params_

{'colsample_bytree': 0.7,
 'learning_rate': 0.3,
 'max_depth': 5,
 'min_child_weight': 1,
 'n_estimators': 200}

4) best model ์ƒ์„ฑ 

model = XGBClassifier(colsample_bytree=0.7,
              learning_rate=0.3,
              max_depth=5,
              min_child_weight=1,
              n_estimators=200).fit(X_train, y_train)

y_pred = model.predict(X = X_test)

acc = accuracy_score(y_test, y_pred)
print(acc) #0.9532163742690059

report = classification_report(y_test, y_pred)
print(report)


์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™” 

plot_importance(model)