76. Python Cluster & Recommend Exercises

LEE_BOMB 2021. 12. 10. 22:44

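Problem 1) Using the new-hire interview dataset (interview.csv), build a clustering model as follows.
<Condition 1> Target columns: 가치관 (values), 전문지식 (expertise), 발표력 (presentation), 인성 (character), 창의력 (creativity), 자격증 (certificate), 종합점수 (total score)
<Condition 2> Apply hierarchical clustering with complete linkage
<Condition 3> Visualize the dendrogram
<Condition 4> Based on the dendrogram, create subsets for 3 clusters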

import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram  # hierarchical clustering model
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale

Data loading: new-hire interview dataset

interview = pd.read_csv("c:/ITWILL/4_Python-2/data/interview.csv", encoding='ms949')
print(interview.info())

RangeIndex: 15 entries, 0 to 14
Data columns (total 9 columns):

<Condition 1> Create the subset: drop the first column (no) and the last column (합격여부, pass/fail), keeping the seven score columns

cols = list(interview.columns)
interviewDF = interview[cols[1:-1]]
print(interviewDF)

interviewDF = minmax_scale(interviewDF)  # rescale every column to the [0, 1] range; returns a NumPy array
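
As a rough illustration of what the scaling step does, the sketch below compares minmax_scale with the column-wise formula (x - min) / (max - min). The small `scores` array is made up for the example and is not taken from interview.csv.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical scores: column 0 is on a 10-20 scale, column 1 is a 0/1 flag
scores = np.array([[10.0, 0.0],
                   [15.0, 1.0],
                   [20.0, 0.0]])

# minmax_scale works column by column (axis=0 by default)
scaled = minmax_scale(scores)

# The same result written out by hand
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
manual = (scores - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

Scaling matters here because the distance used by linkage would otherwise be dominated by the columns with the largest numeric range.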

<Condition 2> Hierarchical clustering with complete linkage: the distance between two clusters is the distance between their two farthest points

clusters = linkage(interviewDF, method='complete')
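
To make the complete-linkage rule concrete, here is a minimal sketch on four made-up one-dimensional points (hypothetical data, not the interview set). It checks that the height of the final merge equals the largest pairwise distance between the two groups being merged.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist

# Hypothetical toy data: two loose pairs of points on a line
points = np.array([[0.0], [1.0], [5.0], [7.0]])

# Each row of Z is one merge: [cluster_a, cluster_b, distance, size]
Z = linkage(points, method='complete')
print(Z)

# The last merge joins {0, 1} with {2, 3}; under complete linkage its
# height is the maximum distance over all cross-group pairs
final_height = Z[-1, 2]
farthest_pair = cdist(points[:2], points[2:]).max()
print(final_height, farthest_pair)
```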

<Condition 3> Dendrogram visualization: the number of clusters is decided by the user

plt.figure(figsize=(15, 5))
dendrogram(clusters, 
           leaf_rotation=90,
           leaf_font_size=20,)
plt.show()

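Three groups read off the dendrogram (row indices). The numbering is by hand, so the labels returned by fcluster can differ:

cluster1 = 8,2,10,4,13
cluster2 = 12,1,5,0,3
cluster3 = 14,7,9,6,11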
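
Instead of reading the leaf order off the picture, it can also be obtained programmatically. The sketch below uses made-up points (an assumption for illustration) and calls dendrogram with no_plot=True, which returns the layout information without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data standing in for the scaled interview scores
points = np.array([[0.0], [1.0], [5.0], [7.0]])
Z = linkage(points, method='complete')

# no_plot=True skips drawing and only returns the layout dictionary
info = dendrogram(Z, no_plot=True)

# 'leaves' holds the row indices in left-to-right order on the x axis
leaf_order = info['leaves']
print(leaf_order)
```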

<Condition 4> Cut the tree and evaluate: split into 3 clusters (the cluster label acts as the y variable)

from scipy.cluster.hierarchy import fcluster  # cuts the tree into flat clusters

1) Cut the tree into flat cluster labels, using the cluster count chosen from the dendrogram

re_clusters = fcluster(clusters, t=3, criterion='maxclust') 
print(re_clusters)
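
A small sketch of how criterion='maxclust' behaves, using six made-up points that form three obvious pairs (hypothetical data). With t=3 the tree is cut so that at most three flat clusters remain, and labels start at 1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three well-separated pairs
points = np.array([[0.0], [0.5], [5.0], [5.6], [10.0], [10.7]])
Z = linkage(points, method='complete')

# Ask for at most 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

# Points in the same pair share a label; which integer each pair gets
# is arbitrary, so compare memberships rather than label values
print(labels[0] == labels[1], labels[2] == labels[3], labels[4] == labels[5])
```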

2) Add the labels as a column

interview['cluster'] = re_clusters
interview.head()
interview.tail()

no  가치관  전문지식  발표력  인성  창의력  자격증  종합점수 합격여부  cluster
10  111   13    14   19  12    8    0    66  불합격        1
11  112   14    20   11   9   16    0    70  불합격        2
12  113   18    14   16  12   10    1    70   합격        3
13  114   10    13   18  10    6    0    57  불합격        1
14  115   13    17   10   8   17    0    65  불합격        2
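
Condition 4 asks for one subset per cluster. A minimal sketch of that last step follows, assuming a small made-up frame in place of interview.csv; the `score` column and all values are hypothetical, and only the `cluster` column name matches the code above.

```python
import pandas as pd

# Hypothetical stand-in for the interview frame after the cluster column is added
df = pd.DataFrame({
    'no': [101, 102, 103, 104, 105],
    'score': [66, 70, 70, 57, 65],
    'cluster': [1, 2, 3, 1, 2],
})

# One sub-DataFrame per cluster label
subsets = {label: group for label, group in df.groupby('cluster')}

for label, group in subsets.items():
    print(label, len(group))
    print(group)
```

A boolean filter such as df[df['cluster'] == 1] gives the same rows for a single cluster; the dictionary is just convenient when all subsets are needed at once.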

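Problem 2) Perform a confirmatory cluster analysis with the kMeans algorithm, following the steps below.

<Condition> Variables: tot_price: total purchase amount, buy_count: number of purchases, visit_count: number of store visits, avg_price: average purchase amount
Step 1: Cluster the data into 3 clusters
Step 2: Add the predicted cluster to the original data
Step 3: Draw a scatter plot of tot_price against the variable most highly correlated with it (color: cluster result)
Step 4: Mark the cluster centers on the scatter plot
Step 5: Analyze the characteristics of each cluster: identify the best-customer cluster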

import pandas as pd
from sklearn.cluster import KMeans # kMeans model
import matplotlib.pyplot as plt

sales = pd.read_csv(r"C:\ITWILL\4_Python-2/data/product_sales.csv")
print(sales.info())

RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
tot_price      150 non-null float64 -> total purchase amount
visit_count    150 non-null float64 -> number of store visits
buy_count      150 non-null float64 -> number of purchases
avg_price      150 non-null float64 -> average purchase amount

model = KMeans(n_clusters=3, random_state=0)  # k=3; algorithm is left at its default because recent scikit-learn versions no longer accept 'auto'
model.fit(sales)

kMeans model predictions

pred = model.predict(sales)
print(pred)
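
The fit and predict steps can be tried on a tiny made-up dataset first. The sketch below uses six hypothetical 2-D points in three tight groups; n_init=10 is set explicitly here as an assumption, to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three tight groups of two points each
X = np.array([[1.0, 1.0], [1.2, 0.8],
              [5.0, 5.0], [5.2, 4.8],
              [9.0, 1.0], [9.2, 0.8]])

model = KMeans(n_clusters=3, random_state=0, n_init=10)
model.fit(X)

# predict assigns each row to its nearest cluster center
pred = model.predict(X)
print(pred)

# On the training data this matches the labels stored during fit
print(np.array_equal(pred, model.labels_))
```

The integer assigned to each group is arbitrary, so results are best compared by membership rather than by label value.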

Add the predictions

sales['predict'] = pred  # add a column; a NumPy vector can be assigned directly
print(sales)

Correlation coefficients

print(sales.corr())  # tot_price is most strongly correlated with avg_price
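
Rather than reading the correlation matrix by eye, the strongest partner of tot_price can be picked out in code. This is a minimal sketch on a made-up frame; the values are hypothetical and only the column names follow the exercise.

```python
import pandas as pd

# Hypothetical values: avg_price rises with tot_price, visit_count does not
df = pd.DataFrame({
    'tot_price': [5.0, 6.0, 7.0, 8.0],
    'visit_count': [3.0, 1.0, 4.0, 2.0],
    'avg_price': [1.0, 2.1, 2.9, 4.2],
})

# Correlation of every other column with tot_price (drop the self-correlation of 1.0)
corr_with_target = df.corr()['tot_price'].drop('tot_price')
print(corr_with_target)

# Column with the largest absolute correlation
best = corr_with_target.abs().idxmax()
print(best)
```

On the real sales frame the predict column would also need to be dropped before taking idxmax, since it is a cluster label rather than a measured variable.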

Scatter plot of tot_price vs avg_price

plt.scatter(sales['tot_price'], sales['avg_price'], c=sales['predict'])  # color by cluster; same column as sales.iloc[:, 4]

Cluster centers

centers = model.cluster_centers_
print(centers)

[[6.83902439 5.67804878]
[5.00784314 1.49215686]
[5.87413793 4.39310345]]

Visualize the centers

plt.scatter(centers[:,0], centers[:, 3], marker='D', c='r')  # columns 0 and 3 of the centers are tot_price and avg_price
plt.show()
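
Picking center columns by position (0 and 3) only works if the column order at fit time is remembered. One alternative, sketched below on made-up data, wraps cluster_centers_ in a DataFrame so the centers can be selected by name. The values are hypothetical, and the Agg backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data with the same four column names as the exercise
df = pd.DataFrame({
    'tot_price': [5.0, 5.1, 4.9, 6.8, 6.9, 7.0],
    'visit_count': [0.2, 0.3, 0.2, 2.0, 2.1, 2.2],
    'buy_count': [3.3, 3.2, 3.4, 3.0, 3.1, 3.0],
    'avg_price': [1.4, 1.5, 1.4, 5.7, 5.8, 5.6],
})
features = list(df.columns)

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(df)

# Center coordinates come back in the same column order used for fitting
centers = pd.DataFrame(model.cluster_centers_, columns=features)

fig, ax = plt.subplots()
ax.scatter(df['tot_price'], df['avg_price'], c=model.labels_)
ax.scatter(centers['tot_price'], centers['avg_price'], marker='D', c='r')
ax.set_xlabel('tot_price')
ax.set_ylabel('avg_price')
```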

[Extra] Per-cluster characteristics (summary statistics)

sales_g = sales.groupby('predict')
sales_g.size()
'''
0    50
1    62
2    38
'''

print(sales_g.mean())

tot_price  visit_count  buy_count  avg_price
predict
0         5.006000     0.244000   3.284000   1.464000
1         5.901613     1.433871   2.754839   4.393548
2         6.850000     2.071053   3.071053   5.742105 -> best customers
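
The best-customer cluster can also be identified in code instead of by inspection. The sketch below assumes that "best" means the highest mean tot_price, which is one reasonable reading of the table above; the frame is made up for illustration.

```python
import pandas as pd

# Hypothetical rows with a predict column already attached
df = pd.DataFrame({
    'tot_price': [5.0, 5.1, 6.8, 6.9, 5.9],
    'avg_price': [1.4, 1.5, 5.7, 5.8, 4.4],
    'predict': [0, 0, 2, 2, 1],
})

# Mean of every variable per cluster
means = df.groupby('predict').mean()
print(means)

# Cluster label with the highest average total purchase amount
best_cluster = means['tot_price'].idxmax()
print(best_cluster)
```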