LEE_BOMB 2021. 12. 6. 16:01
decisionTree (Decision Tree)

Easy to visualize and highly readable (easy to interpret)

No feature scaling (normalization or standardization) required

Works well even when the independent variables mix binary and continuous features

Poorly suited to datasets with very many features (input variables)

Training a single decision tree risks overfitting (poor generalization performance)

Remedy for overfitting: pruning (in rpart, controlled by the complexity parameter cp)

* Deep tree (complex model): overfitting (↑), training misclassification (↓)


Decision Tree algorithms

Algorithm                                    Importance measure  Notes
CART (Classification And Regression Trees)   GINI index          categorical and numeric targets; R package: rpart
C5.0 (C4.5)                                  Information gain    categorical and numeric targets; R package: C50



Entropy & GINI

A measure of the uncertainty of a random variable

Used when selecting important variables (x) in a tree model

1. C5.0 (C4.5) uses information gain: the larger the information gain from a split, the more important the variable

2. CART uses the GINI index: the larger the decrease in Gini impurity from a split, the more important the variable
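Both measures can be computed directly from the class proportions of a node; a quick sketch:

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; terms with p_i = 0 contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def gini(p):
    # Gini impurity: 1 - sum(p_i^2)
    p = np.asarray(p, dtype=float)
    return float(1.0 - (p ** 2).sum())

# pure node: both measures are 0 (no uncertainty)
print(entropy([1.0]), gini([1.0]))            # 0.0 0.0
# 50/50 node: maximum impurity for two classes
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5
```

A split's quality is then the parent impurity minus the weighted impurity of the children (information gain for entropy, Gini decrease for CART).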


Practice
import pandas as pd #csv file read
from sklearn.tree import DecisionTreeClassifier #Tree model 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score



Tree ์‹œ๊ฐํ™”

from sklearn.tree import export_graphviz #export dot file
from graphviz import Source #render dot file (pip install graphviz)




1. dataset load

dataset = pd.read_csv(r'C:\ITWILL\4_Python-2\data\tree_data.csv')
dataset.info()

 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   iq         6 non-null      int64
 1   age        6 non-null      int64
 2   income     6 non-null      int64
 3   owner      6 non-null      int64
 4   unidegree  6 non-null      int64
 5   smoking    6 non-null      int64 -> y variable (target)

 

 dataset

     iq  age  income  owner  unidegree  smoking
0   90   42      40      0          0        1
1  110   20      20      1          1        0
2  100   50      46      0          0        0
3  140   40      28      1          1        1
4  110   70     100      0          0        1
5  100   50      20      0          0        0
* advantage of decision trees: a classification model for y can be built even when x variables are binary (0/1)

cols = list(dataset.columns)
cols #['iq', 'age', 'income', 'owner', 'unidegree', 'smoking'] -> last column is the y variable

X = dataset[cols[:-1]] #5 features: iq, age, income, owner, unidegree
y = dataset[cols[-1]]
X.shape #(6, 5)
y.shape #(6, )




2. create tree model

model = DecisionTreeClassifier(random_state=123).fit(X, y)
dir(model)


tree์˜ ๊นŠ์ด

model.get_depth() #3 -> ์ตœ์ƒ๋‹จ X์ œ์™ธํ•˜๊ณ  ๊ฐ€์ง€๊ฐ€ 3๊ฐœ


Model predictions

y_pred = model.predict(X=X)
print(y_pred) #[1 0 0 1 1 0]
print(y)

acc = accuracy_score(y, y_pred)
print('accuracy =', acc) #accuracy = 1.0
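With only six rows the tree memorizes the data, so accuracy = 1.0 on the training set says nothing about generalization. The learned split rules can also be printed as plain text (no graphviz needed) via scikit-learn's `export_text`; a self-contained sketch that rebuilds the six rows shown above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# the six rows printed above, rebuilt inline so the sketch is self-contained
dataset = pd.DataFrame({
    'iq':        [90, 110, 100, 140, 110, 100],
    'age':       [42, 20, 50, 40, 70, 50],
    'income':    [40, 20, 46, 28, 100, 20],
    'owner':     [0, 1, 0, 1, 0, 0],
    'unidegree': [0, 1, 0, 1, 0, 0],
    'smoking':   [1, 0, 0, 1, 1, 0],
})
X = dataset.drop(columns='smoking')
y = dataset['smoking']

model = DecisionTreeClassifier(random_state=123).fit(X, y)

# export_text prints the split rules with indentation, one line per node
print(export_text(model, feature_names=list(X.columns)))
```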




3. tree model visualization

feature_names = cols[:-1] #['iq', 'age', 'income', 'owner', 'unidegree']
class_names = ['No', 'Yes'] #class labels of the y variable

export_graphviz signature (from the docstring):

sklearn.tree.export_graphviz(decision_tree, out_file=None, *, max_depth=None,
 feature_names=None, class_names=None, label='all', filled=False,
 leaves_parallel=False, impurity=True, node_ids=False, proportion=False,
 rotate=False, rounded=False, special_characters=False, precision=3)

export_graphviz(model, out_file = "tree_graph.dot", #returns None when out_file is set; file saved to the working directory unless a path is given
                feature_names = feature_names,
                class_names = class_names,
                filled = False,
                impurity = True,
                rounded = False)

filled = False : do not color the nodes
impurity = True : show the impurity (GINI) value in each node
rounded = False : do not round the node corners

dot file read

with open("tree_graph.dot") as file:
    dot_graph = file.read()


dot file ์‹œ๊ฐํ™”

Source(dot_graph)
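If installing Graphviz is inconvenient, scikit-learn can also draw the tree directly with matplotlib via `sklearn.tree.plot_tree`. A sketch using the iris data as a stand-in (tree_data.csv is a local file):

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# iris stands in for tree_data.csv so the sketch is self-contained
iris = load_iris()
model = DecisionTreeClassifier(max_depth=3,
                               random_state=123).fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(10, 6))
plot_tree(model, feature_names=iris.feature_names,
          class_names=list(iris.target_names),
          filled=True, rounded=True, ax=ax)
fig.savefig('tree_plot.png')  # same node contents as the dot rendering
```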


decisionTree_parameter

Decision Tree hyperparameters
parameters related to overfitting
parameters related to the variable-importance criterion


from sklearn.datasets import load_iris #dataset
from sklearn.tree import DecisionTreeClassifier #Tree model 
from sklearn.model_selection import train_test_split #train/test split
from sklearn.metrics import accuracy_score #model evaluation
from sklearn.tree import export_graphviz #export dot file
from graphviz import Source #render dot file (pip install graphviz)



1. dataset load

iris = load_iris()
feature_names = iris.feature_names #x variable names as a list
#['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
class_names = iris.target_names

X = iris.data
y = iris.target

 



2. train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state = 123)




3. create tree model
criterion : default = "gini", criterion for selecting important variables
splitter : default = "best", strategy used to split each node
max_depth : maximum tree depth (e.g. max_depth = 3) -> controls overfitting
min_samples_split : default = 2, minimum number of samples required to split an internal node
min_samples_leaf : default = 1, minimum number of samples required at a leaf node

tree = DecisionTreeClassifier (criterion = 'gini',
                       splitter = 'best',
                       max_depth = None,
                       min_samples_split = 2,
                       random_state = 123)

model = tree.fit(X = X_train, y = y_train)


Feature importances

model.feature_importances_ #array([0.01364196, 0.01435996, 0.5461181 , 0.42587999])
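The raw array is easier to read when paired with the feature names; a small sketch that reproduces the fit above and sorts the importances:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=123)
model = DecisionTreeClassifier(random_state=123).fit(X_train, y_train)

# pair each importance with its feature name, largest first
imp = pd.Series(model.feature_importances_, index=iris.feature_names)
print(imp.sort_values(ascending=False))
```

The importances sum to 1, so each value is the share of total impurity reduction attributable to that feature.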


tree ๊นŠ์ด

model.get_depth() #5




4. model evaluation

train_score = model.score(X = X_train, y = y_train)
test_score = model.score(X = X_test, y = y_test)

print('train score :', train_score) #train score : 1.0
print('test score :', test_score) #test score : 0.9555555555555556
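The gap between those two scores is the overfitting signal. A quick sketch scanning `max_depth` on the same split to watch how the gap behaves as the tree is allowed to grow deeper:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# train/test scores per depth; a growing gap suggests overfitting
for depth in (1, 2, 3, None):
    m = DecisionTreeClassifier(max_depth=depth,
                               random_state=123).fit(X_train, y_train)
    tr, te = m.score(X_train, y_train), m.score(X_test, y_test)
    print(f'max_depth={depth}: train={tr:.3f} test={te:.3f} gap={tr - te:.3f}')
```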




5. tree model visualization

export_graphviz(model, out_file = "tree_graph.dot", #returns None when out_file is set; file saved to the working directory unless a path is given
                feature_names = feature_names,
                class_names = class_names,
                filled = True,
                impurity = True,
                rounded = True)


dot file read

with open("tree_graph.dot") as file:
    dot_graph = file.read()


dot file ์‹œ๊ฐํ™”

Source(dot_graph)


์ƒˆ๋กœ์šด ๋ชจ๋ธ ๋งŒ๋“ค์–ด๋ณด๊ธฐ

์กฐ๊ฑด : criterion = 'entropy', max_depth = 3
์ค‘์š”๋ณ€์ˆ˜ ์„ ์ • ๊ธฐ์ค€ ์ง€๋‹ˆ๊ณ„์ˆ˜ -> ์—”ํŠธ๋กœํ”ผ
๊ณผ์ ํ•ฉ ๋ฐœ์ƒ์„ ๊ฐ€์ •ํ•˜๊ณ , ๋‘ ๊ฐœ ๊ฐ€์ง€์น˜๊ธฐ


1. new model

tree2 = DecisionTreeClassifier (criterion = 'entropy',
                       splitter = 'best',
                       max_depth = 3,
                       min_samples_split = 2,
                       random_state = 123)

model2 = tree2.fit(X = X_train, y = y_train)




2. new model evaluation

train_score = model2.score(X = X_train, y = y_train)
test_score = model2.score(X = X_test, y = y_test)

print('train score :', train_score) #train score : 0.9809523809523809
print('test score :', test_score) #test score : 0.9333333333333333

* ๋‘ ์ ์ˆ˜์˜ ์ฐจ๊ฐ€ ์ž‘์•„์ง€๋ฉด ๊ณผ์ ํ•ฉ์ด ํ•ด์†Œ๋˜์—ˆ๋‹ค๊ณ  ๋ด„ -> ์˜ค๋ถ„๋ฅ˜๊ฐ€ ์ ์–ด์กŒ๊ณ , ์ •ํ™•๋„๊ฐ€ ๋‚ฎ์•„์ง„๋‹ค



3. new tree model visualization

export_graphviz(model2, out_file = "tree_graph.dot", #export model2 (not model); file saved to the working directory unless a path is given
                feature_names = feature_names,
                class_names = class_names,
                filled = True,
                impurity = True,
                rounded = True)


dot file read

with open("tree_graph.dot") as file:
    dot_graph = file.read()


dot file ์‹œ๊ฐํ™”

Source(dot_graph)

[ํ•ด์„] modelํŠœ๋‹ : ์ •ํ™•๋„๋Š” ๋–จ์–ด์กŒ์œผ๋‚˜, ๊ณผ์ ํ•ฉ์„ ํ•ด๊ฒฐ

์ค‘์š”๋ณ€์ˆ˜ ์„ ์ •๊ธฐ์ค€ : ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ณ€์ˆ˜๋Š” petal length. ์ง€๋‹ˆ๊ณ„์ˆ˜์™€ ์—”ํŠธ๋กœํ”ผ๊ฐ€ ์œ ์‚ฌํ•จ.