๋ฐ์ดํ„ฐ๋ถ„์„๊ฐ€ ๊ณผ์ •/Python

DAY51. Python Classification (1) - Classification (Document Classification)

LEE_BOMB 2021. 12. 2. 16:02
kNN

Classifies unlabeled cases into already-known categories (easy to interpret)

Existing categories must be available in advance - e.g. foods (fruit, vegetable, protein, ...)

No explicit training step : lazy learning

Preprocessing of missing values (NA) and outliers is important

Datasets with many features are a poor fit
Uses the Euclidean distance formula : assigns the category of the most similar (nearest-distance) neighbors - see the distance sketch below

 

from sklearn.neighbors import KNeighborsClassifier #kNN classifier



1. Create dataset

grape = [8, 5] #fruit - 0
fish = [2, 3] #protein - 1
carrot = [7, 10] #vegetable - 2
orange = [7, 3] #fruit - 0
celery = [3, 8] #vegetable - 2
cheese = [1, 1] #protein - 1
know_group = [grape,fish,carrot,orange,celery,cheese] #x variables

y_class = [0, 1, 2, 0, 2, 1] #class labels (y variable)
class_label = ['fruit', 'protein', 'vegetable'] #label names


         


2. kNN classifier

knn = KNeighborsClassifier(n_neighbors=3) #k=3 : number of nearest neighbors
model = knn.fit(X = know_group, y=y_class)
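Because kNN is a lazy learner, fit() mostly just stores the training data; scikit-learn's kneighbors() method lets you inspect which stored points sit closest to a query (a small sketch, query point chosen arbitrarily):

dist, idx = model.kneighbors([[5, 9]]) # distances and row indices of the k=3 nearest points
print(dist) # distances to the 3 closest rows of know_group
print(idx)  # their indices, e.g. the carrot and celery rows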

 



3. Classify an unknown case

x1 = int(input('sweetness(1~10) : '))
x2 = int(input('crunchiness(1~10) : '))
              
test_X = [[x1, x2]] # unknown case

          
Class prediction

y_pred = model.predict(X = test_X)         
print(y_pred) # [2]


print('Classification result : ', class_label[y_pred[0]])

5, 9 -> vegetable
8, 5 -> fruit
2, 3 -> protein
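To reproduce the three mappings above without retyping input(), the same model can be run over the points directly (a small sketch reusing the variables defined earlier):

for point in [[5, 9], [8, 5], [2, 3]]:
    pred = model.predict([point])[0]
    print(point, '->', class_label[pred])
# [5, 9] -> vegetable, [8, 5] -> fruit, [2, 3] -> protein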

NB (Naive Bayes)

ํ†ต๊ณ„์  ๋ถ„๋ฅ˜๊ธฐ : ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ . ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ๊ฐ€ ํŠน์ • ํด๋ž˜์Šค์— ์†ํ•˜๋Š”์ง€๋ฅผ ํ™•๋ฅ ์„ ํ†ตํ•ด์„œ ์˜ˆ์ธก

* ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  : ์‚ฌ๊ฑด A๊ฐ€ ๋ฐœ์ƒํ–ˆ๋‹ค๋Š” ์ „์ œ ํ•˜์—์„œ ๋‹ค๋ฅธ ์‚ฌ๊ฑด B๊ฐ€ ๋ฐœ์ƒํ•  ํ™•๋ฅ 

๋ฒ ์ด์ฆˆ ํ™•๋ฅ  ์ด๋ก (Bayes’ theorem)์„ ์ ์šฉํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๋ฐฉ๋ฒ•

* ๋ฒ ์ด์ฆˆ ํ™•๋ฅ  ์ด๋ก  : ๊ณผ๊ฑฐ์˜ ๊ฒฝํ—˜(์‚ฌ๊ฑด B)๊ณผ ํ˜„์žฌ์˜ ์ฆ๊ฑฐ(์‚ฌ๊ฑด A)๋ฅผ ํ† ๋Œ€๋กœ ์–ด ๋–ค ์‚ฌ๊ฑด์˜ ํ™•๋ฅ ์„ ์˜ˆ์ธกํ•˜๋Š” ์ด๋ก 

ํŠน์ • ์˜์—ญ์—์„œ๋Š” DT๋‚˜ kNN ๋ถ„๋ฅ˜๊ธฐ ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜

ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ฒ˜๋Ÿผ ํฌ์†Œํ•œ ๊ณ ์ฐจ์›์ธ ๊ฒฝ์šฐ ๋†’์€ ์ •ํ™•๋„์™€ ์†๋„ ์ œ๊ณต


GaussianNB : x๋ณ€์ˆ˜๊ฐ€ ์—ฐ์†ํ˜•, ์ •๊ทœ๋ถ„ํฌ์ธ ๊ฒฝ์šฐ ์ ์šฉ
MultinomialNB : ๊ณ ์ฐจ์›์˜ ํ…์ŠคํŠธ ๋ถ„๋ฅ˜(tf-idf)์— ์ ์šฉ(Sparse matrix)
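Since the day's topic is document classification, a minimal MultinomialNB sketch on tf-idf features may help; the toy sentences and the spam/normal labels are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['free lottery win money', 'meeting at noon today',
        'win free prize now', 'project schedule for today']
labels = [1, 0, 1, 0] # 1 = spam, 0 = normal (hypothetical)

tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(docs) # sparse tf-idf matrix

nb = MultinomialNB().fit(X_sparse, labels)
print(nb.predict(tfidf.transform(['free money now']))) # [1] - spam-like words dominate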

<Conditional probability>
Given that event A has occurred, the probability of event B :
P(B|A) = P(A∩B)/P(A) = P(A|B)*P(B)/P(A) : multiplication rule of probability
Prior probabilities : P(A), P(B)
Joint probability : P(A∩B), likelihood : P(A|B)

ex) relation between weather and rain
             rain(yes)  rain(no)  total
clear day        2          8       10
cloudy day       6          4       10
total            8         12       20

1. Prior probability : a probability known before the experiment
1) probability that it rained : 8/20 = 0.4

p_yes = 8/20 #0.4


2) probability that it did not rain
p_no = 12/20 #0.6
-> "rained" and "did not rain" are complementary (mutually exclusive) events, so the two probabilities sum to 1



2. Conditional probability : P(B|A) = P(A∩B)/P(A) = P(A|B)*P(B)/P(A)
ex) probability of rain (B) on a clear day (A)
P(yes|clear) = P(clear|yes) * P(yes) / P(clear)

p = (2/8) * (8/20) / (10/20)
p #0.2 -> 20%


ex2) probability of rain (B) on a cloudy day (A)
P(yes|cloudy) = P(cloudy|yes) * P(yes) / P(cloudy)

p2 = (6/8) * (8/20) / (10/20)
p2 #0.60 -> 60%
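The same numbers can be read straight off the contingency table; a small numpy sketch (values copied from the table above):

import numpy as np

table = np.array([[2, 8],   # clear day : rain / no rain
                  [6, 4]])  # cloudy day : rain / no rain
p_rain_clear = table[0, 0] / table[0].sum()  # 2/10 = 0.2, same as the Bayes result
p_rain_cloudy = table[1, 0] / table[1].sum() # 6/10 = 0.6
print(p_rain_clear, p_rain_cloudy)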




 

from sklearn.naive_bayes import GaussianNB #model
from sklearn.datasets import load_iris #dataset
from sklearn.model_selection import train_test_split #split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report #confusion matrix, accuracy, and a per-class report that adds f1-score (useful when y is imbalanced)


Multiclass classifier for the y variable
1. dataset load

X, y = load_iris(return_X_y = True)
X #numpy array, 2-D, continuous variables
y




2. train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state=123)




3. Create the NB model

model = GaussianNB().fit(X=X_train, y=y_train)
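Because NB is a probabilistic classifier, the fitted model also exposes per-class posterior probabilities; predict() simply returns the class with the highest one (a quick sketch on a few test rows):

proba = model.predict_proba(X_test[:3]) # posterior P(class|x), one row per sample
print(proba.round(3))                   # each row sums to 1
print(proba.argmax(axis=1))             # identical to model.predict(X_test[:3])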




4. model ํ‰๊ฐ€

y_pred = model.predict(X = X_test) #predicted y values (class labels)
y_true = y_test


1) confusion_matrix

con_mat = confusion_matrix(y_true, y_pred)
con_mat

array([[18,  0,  0], -> true class 0 : no misclassifications
       [ 0, 10,  0], -> true class 1 : no misclassifications
       [ 0,  2, 15]], dtype=int64) -> true class 2 : 2 samples misclassified (as class 1)

2) accuracy_score

acc = accuracy_score(y_true, y_pred)
acc #0.9555555555555556 accuracy
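The same figure can be read off the confusion matrix, since correct predictions sit on its diagonal (a one-line check):

print(con_mat.trace() / con_mat.sum()) # (18+10+15)/45 = 0.9555...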


3) classification_report

report = classification_report(y_true, y_pred) #provides the classification results as a report
print(report)

              precision    recall  f1-score   support

-> f1-score is the harmonic mean of precision and recall : 2 * precision * recall / (precision + recall); support : number of true samples per class (e.g. 18 of the 45 test samples)
           0       1.00      1.00      1.00        18 -> all 18 class-0 samples classified perfectly (100%)
           1       0.83      1.00      0.91        10
           2       1.00      0.88      0.94        17
-> when the class frequencies are imbalanced, evaluate mainly with the f1-scores (macro avg, weighted avg)


    accuracy                           0.96        45 -> same as accuracy_score above (rounded)
   macro avg       0.94      0.96      0.95        45 -> per-class scores aggregated as a plain average
weighted avg       0.96      0.96      0.96        45 -> per-class scores averaged with support as weights

* macro avg : arithmetic mean. (1 + 0.91 + 0.94) / 3 = 0.95
* weighted avg : per-class f1-scores weighted by each class's sample count
ex) (1.0 * 18 + 0.91 * 10 + 0.94 * 17) / (18 + 10 + 17) = 0.9573333333333333
18, 10, 17 = weights = support (number of samples per class)
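A tiny sketch to verify the report's arithmetic by hand (precision/recall/support values copied from the table above):

prec, rec = 0.83, 1.00             # class 1 row
f1 = 2 * prec * rec / (prec + rec) # harmonic mean
print(round(f1, 2))                # 0.91

f1s, support = [1.00, 0.91, 0.94], [18, 10, 17]
print(sum(f1s) / len(f1s))                                     # macro avg = 0.95
print(sum(f * s for f, s in zip(f1s, support)) / sum(support)) # weighted avg = 0.9573...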

SVM

1. Linear SVM vs. nonlinear SVM
2. Hyperparameters : kernel, C, gamma
3. Grid search : finding the best parameters

from sklearn.svm import SVC #SVM model
from sklearn.datasets import load_breast_cancer #dataset
from sklearn.model_selection import train_test_split #dataset split
from sklearn.metrics import accuracy_score #model evaluation



1. dataset load 

X, y = load_breast_cancer(return_X_y=True)
X.shape # (569, 30)
y #0 or 1

 



2. train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

 

 


3. Nonlinear SVM model

help(SVC)

C=1.0 : float, default=1.0 - regularization parameter; adjusts the position of the decision boundary
kernel='rbf' : kernel trick function ('linear' or 'rbf')
 -> {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}
gamma='scale' : adjusts the shape of the decision boundary
 -> {'scale', 'auto'}
 'scale' : 1 / (n_features * X.var())
 'auto' : 1 / n_features
random_state=None : seed value

svc = SVC(C=1.0, kernel='rbf', gamma='scale',random_state=123)
model = svc.fit(X=X_train, y=y_train)


model ํ‰๊ฐ€ 

y_pred = model.predict(X = X_test)
y_true = y_test

acc = accuracy_score(y_true, y_pred)
print('accuracy =', acc)

accuracy = 0.9005847953216374

 

 

4. Linear SVM : for linearly separable data

svc = SVC(C=1.0, kernel='linear', gamma='scale',random_state=123)
model2 = svc.fit(X=X_train, y=y_train)


model ํ‰๊ฐ€ 

y_pred = model2.predict(X = X_test)
y_true = y_test

acc2 = accuracy_score(y_true, y_pred)
print('accuracy =', acc2)

accuracy = 0.9707602339181286
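One likely reason the RBF model trails the linear one here is feature scale: the breast cancer variables differ by orders of magnitude, and the RBF kernel is distance-based. A hedged sketch of the usual fix (StandardScaler in a pipeline; this step is not part of the original lesson):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_svc = make_pipeline(StandardScaler(),
                           SVC(C=1.0, kernel='rbf', gamma='scale', random_state=123))
scaled_svc.fit(X_train, y_train)
print(scaled_svc.score(X_test, y_test)) # typically around 0.97+ once features are standardized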

 


Grid Search
A method for finding the optimal hyperparameters -> model tuning

from sklearn.model_selection import GridSearchCV # best parameter 

params = {'kernel' : ['rbf', 'linear'],
          'C' : [0.01, 0.1, 1.0, 10.0, 100.0],
          'gamma' : ['scale', 'auto'] } # dict of candidate values


cv=5 : 5-fold cross-validation -> run on the entire dataset

grid_model = GridSearchCV(model, param_grid=params,
                          scoring='accuracy', cv=5).fit(X=X, y=y)

dir(grid_model)

print('best score :', grid_model.best_score_)

best score : 0.9631268436578171

print('best params : ', grid_model.best_params_)

best params :  {'C': 100.0, 'gamma': 'scale', 'kernel': 'linear'}

svc = SVC(C=100.0, kernel='linear', gamma='scale',random_state=123)
model = svc.fit(X=X_train, y=y_train)

test_score = model.score(X=X_test, y=y_test)
print(test_score) # 0.9766081871345029
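Instead of refitting by hand from the printed parameters, GridSearchCV (with the default refit=True) also keeps the refitted winner as best_estimator_; a short equivalent sketch:

best = grid_model.best_estimator_     # already refit on the data passed to fit()
print(best.score(X=X_test, y=y_test)) # note: X_test was part of the search data, so this score is optimistic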