๋ฐ์ดํ„ฐ๋ถ„์„๊ฐ€ ๊ณผ์ •/Python

DAY51. Python Classification (1) - Classification (Document Classification)

LEE_BOMB 2021. 12. 2. 16:02
kNN

Classifies unlabeled cases into already-known categories (easy to interpret)

Existing categories must be available in advance - e.g. foods (fruit, vegetable, protein, ...)

No explicit training step : lazy learning

Preprocessing of missing values (NA) and outliers is important

Datasets with many features are a poor fit
Uses the Euclidean distance formula : assigns the category of the most similar (nearest-distance) neighbors - see the distance sketch below

 

from sklearn.neighbors import KNeighborsClassifier #kNN classifier



1. Create dataset

grape = [8, 5] #fruit - 0
fish = [2, 3] #protein - 1
carrot = [7, 10] #vegetable - 2
orange = [7, 3] #fruit - 0
celery = [3, 8] #vegetable - 2
cheese = [1, 1] #protein - 1
know_group = [grape,fish,carrot,orange,celery,cheese] #x variables

y_class = [0, 1, 2, 0, 2, 1] #class labels (y variable)
class_label = ['fruit', 'protein', 'vegetable'] #label names


         


2. kNN classifier

knn = KNeighborsClassifier(n_neighbors=3) #k=3 : number of nearest neighbors
model = knn.fit(X = know_group, y=y_class)
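Because kNN is a lazy learner, fit() mostly just stores the training data; scikit-learn's kneighbors() method lets you inspect which stored points sit closest to a query (a small sketch, query point chosen arbitrarily):

dist, idx = model.kneighbors([[5, 9]]) # distances and row indices of the k=3 nearest points
print(dist) # distances to the 3 closest rows of know_group
print(idx)  # their indices, e.g. the carrot and celery rows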

 



3. Classify an unknown case

x1 = int(input('sweetness(1~10) : '))
x2 = int(input('crunchiness(1~10) : '))
              
test_X = [[x1, x2]] # unknown case

          
Class prediction

y_pred = model.predict(X = test_X)         
print(y_pred) # [2]


print('Classification result : ', class_label[y_pred[0]])

5, 9 -> vegetable
8, 5 -> fruit
2, 3 -> protein
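To reproduce the three mappings above without retyping input(), the same model can be run over the points directly (a small sketch reusing the variables defined earlier):

for point in [[5, 9], [8, 5], [2, 3]]:
    pred = model.predict([point])[0]
    print(point, '->', class_label[pred])
# [5, 9] -> vegetable, [8, 5] -> fruit, [2, 3] -> protein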

NB (Naive Bayes)

ํ†ต๊ณ„์  ๋ถ„๋ฅ˜๊ธฐ : ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ . ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ๊ฐ€ ํŠน์ • ํด๋ž˜์Šค์— ์†ํ•˜๋Š”์ง€๋ฅผ ํ™•๋ฅ ์„ ํ†ตํ•ด์„œ ์˜ˆ์ธก

* ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  : ์‚ฌ๊ฑด A๊ฐ€ ๋ฐœ์ƒํ–ˆ๋‹ค๋Š” ์ „์ œ ํ•˜์—์„œ ๋‹ค๋ฅธ ์‚ฌ๊ฑด B๊ฐ€ ๋ฐœ์ƒํ•  ํ™•๋ฅ 

๋ฒ ์ด์ฆˆ ํ™•๋ฅ  ์ด๋ก (Bayes’ theorem)์„ ์ ์šฉํ•œ ๊ธฐ๊ณ„ํ•™์Šต ๋ฐฉ๋ฒ•

* ๋ฒ ์ด์ฆˆ ํ™•๋ฅ  ์ด๋ก  : ๊ณผ๊ฑฐ์˜ ๊ฒฝํ—˜(์‚ฌ๊ฑด B)๊ณผ ํ˜„์žฌ์˜ ์ฆ๊ฑฐ(์‚ฌ๊ฑด A)๋ฅผ ํ† ๋Œ€๋กœ ์–ด ๋–ค ์‚ฌ๊ฑด์˜ ํ™•๋ฅ ์„ ์˜ˆ์ธกํ•˜๋Š” ์ด๋ก 

ํŠน์ • ์˜์—ญ์—์„œ๋Š” DT๋‚˜ kNN ๋ถ„๋ฅ˜๊ธฐ ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜

ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ฒ˜๋Ÿผ ํฌ์†Œํ•œ ๊ณ ์ฐจ์›์ธ ๊ฒฝ์šฐ ๋†’์€ ์ •ํ™•๋„์™€ ์†๋„ ์ œ๊ณต


GaussianNB : x๋ณ€์ˆ˜๊ฐ€ ์—ฐ์†ํ˜•, ์ •๊ทœ๋ถ„ํฌ์ธ ๊ฒฝ์šฐ ์ ์šฉ
MultinomialNB : ๊ณ ์ฐจ์›์˜ ํ…์ŠคํŠธ ๋ถ„๋ฅ˜(tf-idf)์— ์ ์šฉ(Sparse matrix)
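Since the day's topic is document classification, a minimal MultinomialNB sketch on tf-idf features may help; the toy sentences and the spam/normal labels are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['free lottery win money', 'meeting at noon today',
        'win free prize now', 'project schedule for today']
labels = [1, 0, 1, 0] # 1 = spam, 0 = normal (hypothetical)

tfidf = TfidfVectorizer()
X_sparse = tfidf.fit_transform(docs) # sparse tf-idf matrix

nb = MultinomialNB().fit(X_sparse, labels)
print(nb.predict(tfidf.transform(['free money now']))) # [1] - spam-like words dominate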

<Conditional probability>
Given that event A has occurred, the probability of event B :
P(B|A) = P(A∩B)/P(A) = P(A|B)*P(B)/P(A) : multiplication rule of probability
Prior probabilities : P(A), P(B)
Joint probability : P(A∩B), likelihood : P(A|B)

ex) relation between weather and rain
             rain(yes)  rain(no)  total
clear day        2          8       10
cloudy day       6          4       10
total            8         12       20

1. Prior probability : a probability known before the experiment
1) probability that it rained : 8/20 = 0.4

p_yes = 8/20 #0.4


2) probability that it did not rain
p_no = 12/20 #0.6
-> "rained" and "did not rain" are complementary (mutually exclusive) events, so the two probabilities sum to 1



2. Conditional probability : P(B|A) = P(A∩B)/P(A) = P(A|B)*P(B)/P(A)
ex) probability of rain (B) on a clear day (A)
P(yes|clear) = P(clear|yes) * P(yes) / P(clear)

p = (2/8) * (8/20) / (10/20)
p #0.2 -> 20%


ex2) probability of rain (B) on a cloudy day (A)
P(yes|cloudy) = P(cloudy|yes) * P(yes) / P(cloudy)

p2 = (6/8) * (8/20) / (10/20)
p2 #0.60 -> 60%
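The same numbers can be read straight off the contingency table; a small numpy sketch (values copied from the table above):

import numpy as np

table = np.array([[2, 8],   # clear day : rain / no rain
                  [6, 4]])  # cloudy day : rain / no rain
p_rain_clear = table[0, 0] / table[0].sum()  # 2/10 = 0.2, same as the Bayes result
p_rain_cloudy = table[1, 0] / table[1].sum() # 6/10 = 0.6
print(p_rain_clear, p_rain_cloudy)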




 

from sklearn.naive_bayes import GaussianNB #model
from sklearn.datasets import load_iris #dataset
from sklearn.model_selection import train_test_split #split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report #confusion matrix, accuracy, and a per-class report that adds f1-score (useful when y is imbalanced)


Multiclass classifier for the y variable
1. dataset load

X, y = load_iris(return_X_y = True)
X #numpy array, 2-D, continuous variables
y




2. train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state=123)




3. Create the NB model

model = GaussianNB().fit(X=X_train, y=y_train)
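Because NB is a probabilistic classifier, the fitted model also exposes per-class posterior probabilities; predict() simply returns the class with the highest one (a quick sketch on a few test rows):

proba = model.predict_proba(X_test[:3]) # posterior P(class|x), one row per sample
print(proba.round(3))                   # each row sums to 1
print(proba.argmax(axis=1))             # identical to model.predict(X_test[:3])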




4. model ํ‰๊ฐ€

y_pred = model.predict(X = X_test) #predicted y values (class labels)
y_true = y_test


1) confusion_matrix

con_mat = confusion_matrix(y_true, y_pred)
con_mat

array([[18,  0,  0], -> true class 0 : no misclassifications
       [ 0, 10,  0], -> true class 1 : no misclassifications
       [ 0,  2, 15]], dtype=int64) -> true class 2 : 2 samples misclassified (as class 1)

2) accuracy_score

acc = accuracy_score(y_true, y_pred)
acc #0.9555555555555556 accuracy
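The same figure can be read off the confusion matrix, since correct predictions sit on its diagonal (a one-line check):

print(con_mat.trace() / con_mat.sum()) # (18+10+15)/45 = 0.9555...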


3) classification_report

report = classification_report(y_true, y_pred) #provides the classification results as a report
print(report)

              precision    recall  f1-score   support

-> f1-score is the harmonic mean of precision and recall : 2 * precision * recall / (precision + recall); support : number of true samples per class (e.g. 18 of the 45 test samples)
           0       1.00      1.00      1.00        18 -> all 18 class-0 samples classified perfectly (100%)
           1       0.83      1.00      0.91        10
           2       1.00      0.88      0.94        17
-> when the class frequencies are imbalanced, evaluate mainly with the f1-scores (macro avg, weighted avg)


    accuracy                           0.96        45 -> same as accuracy_score above (rounded)
   macro avg       0.94      0.96      0.95        45 -> per-class scores aggregated as a plain average
weighted avg       0.96      0.96      0.96        45 -> per-class scores averaged with support as weights

* macro avg : arithmetic mean. (1 + 0.91 + 0.94) / 3 = 0.95
* weighted avg : per-class f1-scores weighted by each class's sample count
ex) (1.0 * 18 + 0.91 * 10 + 0.94 * 17) / (18 + 10 + 17) = 0.9573333333333333
18, 10, 17 = weights = support (number of samples per class)
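A tiny sketch to verify the report's arithmetic by hand (precision/recall/support values copied from the table above):

prec, rec = 0.83, 1.00             # class 1 row
f1 = 2 * prec * rec / (prec + rec) # harmonic mean
print(round(f1, 2))                # 0.91

f1s, support = [1.00, 0.91, 0.94], [18, 10, 17]
print(sum(f1s) / len(f1s))                                     # macro avg = 0.95
print(sum(f * s for f, s in zip(f1s, support)) / sum(support)) # weighted avg = 0.9573...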

SVM

1. Linear SVM vs. nonlinear SVM
2. Hyperparameters : kernel, C, gamma
3. Grid search : finding the best parameters

from sklearn.svm import SVC #SVM model
from sklearn.datasets import load_breast_cancer #dataset
from sklearn.model_selection import train_test_split #dataset split
from sklearn.metrics import accuracy_score #model evaluation



1. dataset load 

X, y = load_breast_cancer(return_X_y=True)
X.shape # (569, 30)
y #0 or 1

 



2. train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

 

 


3. Nonlinear SVM model

help(SVC)

C=1.0 : float, default=1.0 - regularization parameter; adjusts the position of the decision boundary
kernel='rbf' : kernel trick function ('linear' or 'rbf')
 -> {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}
gamma='scale' : adjusts the shape of the decision boundary
 -> {'scale', 'auto'}
 'scale' : 1 / (n_features * X.var())
 'auto' : 1 / n_features
random_state=None : seed value

svc = SVC(C=1.0, kernel='rbf', gamma='scale',random_state=123)
model = svc.fit(X=X_train, y=y_train)


model ํ‰๊ฐ€ 

y_pred = model.predict(X = X_test)
y_true = y_test

acc = accuracy_score(y_true, y_pred)
print('accuracy =', acc)

accuracy = 0.9005847953216374

 

 

4. Linear SVM : for linearly separable data

svc = SVC(C=1.0, kernel='linear', gamma='scale',random_state=123)
model2 = svc.fit(X=X_train, y=y_train)


model ํ‰๊ฐ€ 

y_pred = model2.predict(X = X_test)
y_true = y_test

acc2 = accuracy_score(y_true, y_pred)
print('accuracy =', acc2)

accuracy = 0.9707602339181286
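One likely reason the RBF model trails the linear one here is feature scale: the breast cancer variables differ by orders of magnitude, and the RBF kernel is distance-based. A hedged sketch of the usual fix (StandardScaler in a pipeline; this step is not part of the original lesson):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_svc = make_pipeline(StandardScaler(),
                           SVC(C=1.0, kernel='rbf', gamma='scale', random_state=123))
scaled_svc.fit(X_train, y_train)
print(scaled_svc.score(X_test, y_test)) # typically around 0.97+ once features are standardized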

 


Grid Search
A method for finding the optimal hyperparameters -> model tuning

from sklearn.model_selection import GridSearchCV # best parameter 

params = {'kernel' : ['rbf', 'linear'],
          'C' : [0.01, 0.1, 1.0, 10.0, 100.0],
          'gamma' : ['scale', 'auto'] } # dict of candidate values


cv=5 : 5-fold cross-validation -> run on the entire dataset

grid_model = GridSearchCV(model, param_grid=params,
                          scoring='accuracy', cv=5).fit(X=X, y=y)

dir(grid_model)

print('best score :', grid_model.best_score_)

best score : 0.9631268436578171

print('best params : ', grid_model.best_params_)

best params :  {'C': 100.0, 'gamma': 'scale', 'kernel': 'linear'}

svc = SVC(C=100.0, kernel='linear', gamma='scale',random_state=123)
model = svc.fit(X=X_train, y=y_train)

test_score = model.score(X=X_test, y=y_test)
print(test_score) # 0.9766081871345029
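Instead of refitting by hand from the printed parameters, GridSearchCV (with the default refit=True) also keeps the refitted winner as best_estimator_; a short equivalent sketch:

best = grid_model.best_estimator_     # already refit on the data passed to fit()
print(best.score(X=X_test, y=y_test)) # note: X_test was part of the search data, so this score is optimistic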