๊ฐœ์ธ๊ณต๋ถ€/Python

72. Python TreeModel Practice Problems

LEE_BOMB 2021. 12. 7. 20:11
๋ฌธ1) load_breast_cancer ๋ฐ์ดํ„ฐ ์…‹์„ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด Decision Tree ๋ชจ๋ธ์„ ์ƒ์„ฑํ•˜์‹œ์˜ค.
<์กฐ๊ฑด1> 75:25๋น„์œจ train/test ๋ฐ์ดํ„ฐ ์…‹ ๊ตฌ์„ฑ 
<์กฐ๊ฑด2> y๋ณ€์ˆ˜ : cancer.target, x๋ณ€์ˆ˜ : cancer.data
<์กฐ๊ฑด3> tree ์ตœ๋Œ€ ๊นŠ์ด : 5 
<์กฐ๊ฑด4> decision tree ์‹œ๊ฐํ™” & ์ค‘์š”๋ณ€์ˆ˜ ํ™•์ธ


from sklearn import model_selection
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
# tree visualization
from sklearn.tree import export_graphviz
from graphviz import Source #pip install graphviz


๋ฐ์ดํ„ฐ ์…‹ load 

cancer = load_breast_cancer()


<๋‹จ๊ณ„1> y๋ณ€์ˆ˜ : cancer.target, x๋ณ€์ˆ˜ : cancer.data 

feature_names = cancer.feature_names
class_names = cancer.target_names
X = cancer.data
y = cancer.target


<๋‹จ๊ณ„2> 75:25๋น„์œจ train/test ๋ฐ์ดํ„ฐ ์…‹ ๊ตฌ์„ฑ

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=123)


<๋‹จ๊ณ„3> tree ์ตœ๋Œ€ ๊นŠ์ด : 5

tree = DecisionTreeClassifier(max_depth=5, random_state=123)
model = tree.fit(X=X_train, y=y_train)

test_score = model.score(X=X_test, y=y_test)
print('accuracy =', test_score) #accuracy = 0.972027972027972
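Since the exercise fixes random_state=123, the 0.972 figure is reproducible, but a single 75:25 split can still flatter or punish the model. As an optional sketch (not part of the exercise), 5-fold cross-validation gives a steadier accuracy estimate:

from sklearn.model_selection import cross_val_score

# optional sanity check: 5-fold CV accuracy over the full dataset
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=123), X, y,
    cv=5, scoring='accuracy')
print('CV accuracy = %.3f (+/- %.3f)' % (cv_scores.mean(), cv_scores.std()))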


<๋‹จ๊ณ„4> decision tree ์‹œ๊ฐํ™” & ์ค‘์š”๋ณ€์ˆ˜ ํ™•์ธ 

export_graphviz(model, out_file="tree_exam.dot",
                    feature_names=feature_names, 
                    class_names=class_names)

with open("tree_exam.dot") as file: # read the exported dot source
    graph = file.read()

Source(graph) # renders the tree; most important variable (root split): worst radius
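The same ranking can also be read off numerically instead of from the graphviz rendering. A minimal sketch, assuming numpy is installed:

import numpy as np

# rank features by impurity-based importance, largest first
order = np.argsort(model.feature_importances_)[::-1]
for i in order[:5]:
    print(feature_names[i], round(model.feature_importances_[i], 3))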


๋ฌธ2) ๋‹น๋ฃŒ๋ณ‘(diabetes.csv) ๋ฐ์ดํ„ฐ ์…‹์„ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹จ๊ณ„๋กœ RandomForest ๋ชจ๋ธ์„ ์ƒ์„ฑํ•˜์‹œ์˜ค.

<๋‹จ๊ณ„1> ๋ฐ์ดํ„ฐ์…‹ ๋กœ๋“œ & ์นผ๋Ÿผ๋ช… ์ ์šฉ 
<๋‹จ๊ณ„2> x, y ๋ณ€์ˆ˜ ์„ ํƒ : x๋ณ€์ˆ˜ : 1 ~ 8๋ฒˆ์งธ ์นผ๋Ÿผ, y๋ณ€์ˆ˜ : 9๋ฒˆ์งธ ์นผ๋Ÿผ
<๋‹จ๊ณ„3> 500๊ฐœ์˜ ํŠธ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ชจ๋ธ ์ƒ์„ฑ   
<๋‹จ๊ณ„4> ์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™” : feature names ์ ์šฉ 


from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import matplotlib.pyplot as plt # feature importance plot


๋‹จ๊ณ„1. ํ…Œ์ดํ„ฐ์…‹ ๋กœ๋“œ  

dia = pd.read_csv('C:/ITWILL/4_Python-2/data/diabetes.csv', 
                  header=None) # no header row
print(dia.info())


์นผ๋Ÿผ๋ช… ์ถ”๊ฐ€ 

dia.columns = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
               'Insulin','BMI','DiabetesPedigree','Age','Outcome']
print(dia.info())

 #   Column            Non-Null Count  Dtype
 0   Pregnancies       759 non-null    float64
 1   Glucose           759 non-null    float64
 2   BloodPressure     759 non-null    float64
 3   SkinThickness     759 non-null    float64
 4   Insulin           759 non-null    float64
 5   BMI               759 non-null    float64
 6   DiabetesPedigree  759 non-null    float64
 7   Age               759 non-null    float64
 8   Outcome           759 non-null    int64   <- y variable

type(dia) # pandas.core.frame.DataFrame


๋‹จ๊ณ„2. x,y ๋ณ€์ˆ˜ ์ƒ์„ฑ 

cols = list(dia.columns)
cols

X = dia[cols[:-1]] # list of columns -> DataFrame
X.shape  #(759, 8)

y = dia['Outcome'] # single column -> Series


๋‹จ๊ณ„3. model ์ƒ์„ฑ

model = RandomForestClassifier(n_estimators=500).fit(X, y)
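Note that this model is fit on the full dataset with no hold-out split, so model.score(X, y) would be overly optimistic. One hedged alternative is the out-of-bag estimate that bagging gives for free (random_state=123 below is an added assumption, not from the exercise):

# refit with OOB scoring enabled (bootstrap=True is the default)
model_oob = RandomForestClassifier(n_estimators=500, oob_score=True,
                                   random_state=123).fit(X, y)
print('OOB accuracy =', model_oob.oob_score_)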


๋‹จ๊ณ„4. ์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™” 

model.feature_importances_
# array([0.0786396 , 0.26605613, 0.0838773 , 0.0800118 , 0.08649011,
#       0.16625653, 0.12740268, 0.11126585])

x_names = cols[:-1]
size = len(x_names) #8

plt.barh(y=range(size), width=model.feature_importances_)
#y์ถ• ๋ˆˆ๊ธˆ : x๋ณ€์ˆ˜ ์ด๋ฆ„  
plt.yticks(range(size), x_names)
plt.xlabel("feature_importances")
plt.show()

#์ค‘์š”๋ณ€์ˆ˜ : Glucose(ํ˜ˆ๋‹น) > BMI(๋น„๋งŒ๋„์ง€์ˆ˜)


๋ฌธ3) iris dataset์„ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹จ๊ณ„๋กœ XGBoost model์„ ์ƒ์„ฑํ•˜์‹œ์˜ค.


import pandas as pd # file read
from xgboost import XGBClassifier # build the model
from xgboost import plot_importance # feature importance plot
from sklearn.model_selection import train_test_split # dataset split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report # model evaluation


๋‹จ๊ณ„1 : data set load 

iris = pd.read_csv("C:/ITWILL/4_Python-2/data/iris.csv")


๋ณ€์ˆ˜๋ช… ์ถ”์ถœ 

cols=list(iris.columns)
cols #['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']


col_x = cols[:4] # x variable names: ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
col_y = cols[-1] # y variable name: 'Species'

X = iris[col_x]
y = iris[col_y] 
y.value_counts()

setosa        50
versicolor    50
virginica     50


๋‹จ๊ณ„2 : ํ›ˆ๋ จ/๊ฒ€์ • ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25)
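This split is unseeded, so the accuracy in step 6 changes from run to run. A hedged variant that fixes the seed and stratifies by class (both added here, not in the original):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y)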

๋‹จ๊ณ„3 : model ์ƒ์„ฑ : train data ์ด์šฉ

model = XGBClassifier(objective="multi:softprob").fit(
    X=X_train, y=y_train, eval_metric='merror')

UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release.
- The use_label_encoder=True default automatically label-encodes the y variable.
- It is scheduled for removal in a future release, hence the warning.
- To silence the warning: 1) label-encode the y variable yourself, then 2) pass use_label_encoder=False (a minimal sketch follows).

๋‹จ๊ณ„4 :์˜ˆ์ธก์น˜ ์ƒ์„ฑ : test data ์ด์šฉ  

y_pred = model.predict(X = X_test)

๋‹จ๊ณ„5 : ์ค‘์š”๋ณ€์ˆ˜ ํ™•์ธ & ์‹œ๊ฐํ™”  

fscore = model.get_booster().get_score()
print('fscore =', fscore)

fscore = {'Sepal.Length': 39.0, 'Sepal.Width': 71.0, 'Petal.Length': 115.0, 'Petal.Width': 104.0}

plot_importance(model) # top variable by split count in the fscore above: 'Petal.Length'
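plot_importance() ranks by importance_type='weight' (how often a feature is used to split) by default, which is what the fscore dict above reflects. Gain-based ranking can put a different feature on top; a hedged comparison sketch:

# 'gain' = average loss reduction per split; the order may differ from 'weight'
gain = model.get_booster().get_score(importance_type='gain')
print('gain =', gain)
plot_importance(model, importance_type='gain')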



๋‹จ๊ณ„6 : model ํ‰๊ฐ€ : confusion matrix, accuracy, report

con_mat = confusion_matrix(y_test, y_pred)
print(con_mat)

acc = accuracy_score(y_test, y_pred)
print(acc) 

report = classification_report(y_test, y_pred)
print(report)


๋ฌธ4) food๋ฅผ ๋Œ€์ƒ์œผ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด xgboost ๋ชจ๋ธ์„ ์ƒ์„ฑํ•˜์‹œ์˜ค.

<์กฐ๊ฑด1> 6:4 ๋น„์œจ train/test set ์ƒ์„ฑ 
<์กฐ๊ฑด2> y๋ณ€์ˆ˜ ; ํ์—…_2๋…„, x๋ณ€์ˆ˜ ; ๋‚˜๋จธ์ง€ 20๊ฐœ 
<์กฐ๊ฑด3> ์ค‘์š”๋ณ€์ˆ˜์— ๋Œ€ํ•œ  f1 score ์ถœ๋ ฅ
<์กฐ๊ฑด4> ์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™”  
<์กฐ๊ฑด5> accuracy์™€ model report ์ถœ๋ ฅ 


import pandas as pd # csv file read
from sklearn import model_selection, metrics # split & evaluation tools
from xgboost import XGBClassifier # build the xgboost model
from xgboost import plot_importance # feature importance plot


์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™” 

from matplotlib import font_manager, rc #ํ•œ๊ธ€ ์ง€์›
font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
rc('font', family=font_name)


์™ธ์‹์—…์ข… ๊ด€๋ จ data set

food = pd.read_csv("C:/ITWILL/4_Python-2/data/food_dataset.csv",
                   encoding="utf-8", thousands=',')


๊ฒฐ์ธก์น˜ ์ œ๊ฑฐ

food=food.dropna()  
print(food.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68796 entries, 0 to 70170
Data columns (total 21 columns):

food['ํ์—…_2๋…„'].value_counts()

0    54284 : not closed
1    14512 : closed

<์กฐ๊ฑด2> X, y๋ณ€์ˆ˜ ์„ ํƒ 

cols = list(food.columns)
cols # 21 variables
y = food[cols[-1]] # food['폐업_2년']
X = food[cols[:-1]] # 20 variables
X.shape # (68796, 20)


<์กฐ๊ฑด1> 6:4 ๋น„์œจ train/test set ์ƒ์„ฑ 

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.4, random_state=123)


<์กฐ๊ฑด3> ์ค‘์š”๋ณ€์ˆ˜์— ๋Œ€ํ•œ f1 score ์ถœ๋ ฅ

model = XGBClassifier(objective = "binary:logistic").fit(X_train, y_train)

score = model.get_booster().get_fscore() 
print(score)

{'์†Œ์žฌ์ง€๋ฉด์ ': 472.0, '์œ„์ƒ์—…ํƒœ๋ช…': 109.0, '์ฃผ๋ณ€': 341.0, '์ฃผ๋ณ€๋™์ข…': 244.0, '๊ธฐ๊ฐ„ํ‰๊ท ': 485.0, 'pop': 236.0, 'bank': 265.0, 'nonbank': 240.0, 'tax_sum': 210.0, '์œ ๋™์ธ๊ตฌ_์ฃผ์ค‘_์˜ค์ „': 287.0, '์œ ๋™์ธ๊ตฌ_์ฃผ์ค‘_์˜คํ›„': 304.0, '์œ ๋™์ธ๊ตฌ_์ฃผ๋ง_์˜ค์ „': 299.0, '์œ ๋™์ธ๊ตฌ_์ฃผ๋ง_์˜คํ›„': 295.0, 'X1km_๋ณ‘์›๊ฐฏ์ˆ˜': 127.0, 'X1km_์ดˆ๋“ฑํ•™๊ต๊ฐฏ์ˆ˜': 98.0, 'X3km_๋Œ€ํ•™๊ต๊ฐฏ์ˆ˜': 124.0, 'X1km_๊ณ ๋“ฑํ•™๊ต๊ฐฏ์ˆ˜': 98.0, 'X1km_์˜ํ™”๊ด€๊ฐฏ์ˆ˜': 90.0, 'X1km_์ง€ํ•˜์ฒ ์—ญ๊ฐฏ์ˆ˜': 96.0}

<์กฐ๊ฑด4> ์ค‘์š”๋ณ€์ˆ˜ ์‹œ๊ฐํ™”

plot_importance(model)


<์กฐ๊ฑด5> accuracy์™€ model report ์ถœ๋ ฅ 

y_pred = model.predict(X = X_test)

acc = metrics.accuracy_score(y_test, y_pred)
print(acc)  #0.7898542824957302  

y_test.value_counts()

0    21730
1     5789

report = metrics.classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.80      0.98      0.88     21730
           1       0.50      0.08      0.14      5789

    accuracy                           0.79     27519
   macro avg       0.65      0.53      0.51     27519
weighted avg       0.74      0.79      0.73     27519
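The report makes the imbalance problem explicit: recall for the closed class (1) is only 0.08, so the model predicts "open" almost everywhere. A common remedy, sketched here as an assumption rather than part of the exercise, is to weight the positive class with XGBoost's scale_pos_weight:

# weight positives by the negative/positive ratio (about 3.7 for this data)
ratio = (y_train == 0).sum() / (y_train == 1).sum()
model2 = XGBClassifier(objective="binary:logistic",
                       scale_pos_weight=ratio).fit(X_train, y_train)
print(metrics.classification_report(y_test, model2.predict(X_test)))

This usually lifts minority-class recall at some cost to overall accuracy.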


๋ฌธ5) wine dataset์„ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‹คํ•ญ๋ถ„๋ฅ˜ ๋ชจ๋ธ์„ ์ƒ์„ฑํ•˜์‹œ์˜ค. 
<์กฐ๊ฑด1> tree model 200๊ฐœ ํ•™์Šต
<์กฐ๊ฑด2> tree model ํ•™์Šต๊ณผ์ •์—์„œ ์กฐ๊ธฐ ์ข…๋ฃŒ 100ํšŒ ์ง€์ •
<์กฐ๊ฑด3> model์˜ ๋ถ„๋ฅ˜์ •ํ™•๋„์™€ ๋ฆฌํฌํŠธ ์ถœ๋ ฅ   

 

from xgboost import XGBClassifier # model
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine # multiclass dataset
from sklearn.metrics import accuracy_score, classification_report


XGBoost hyperparameter practice

1. Load the dataset

wine = load_wine()
print(wine.feature_names) # 13 features
print(wine.target_names) # ['class_0' 'class_1' 'class_2']

X, y = load_wine(return_X_y=True)

2. Create train/test sets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3)



3. Build the model: multiclass classification

xgb = XGBClassifier(objective='multi:softprob',
                    n_estimators=200) # softmax-based multiclass objective



4. Early stopping during model training

eval_set = [(X_test, y_test)] # evaluation set

model = xgb.fit(X_train, y_train, 
                eval_set = eval_set,
                eval_metric='merror',
                early_stopping_rounds=100)
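With early_stopping_rounds=100, training stops once merror on eval_set has not improved for 100 consecutive rounds; since only 200 trees are grown here, it may or may not trigger. When fit() is given early_stopping_rounds, the sklearn wrapper records the best round, which can be inspected:

# populated because early_stopping_rounds was passed to fit()
print('best iteration =', model.best_iteration)
print('best score (merror) =', model.best_score)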



5. Evaluate the model

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print('accuracy =', acc) #accuracy = 0.9814814814814815

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        23
           1       0.96      1.00      0.98        23
           2       1.00      0.88      0.93         8

    accuracy                           0.98        54
   macro avg       0.99      0.96      0.97        54
weighted avg       0.98      0.98      0.98        54