
76. Python Cluster & Recommend practice problems

LEE_BOMB 2021. 12. 10. 22:44
Problem 1) Using the new-hire interview data set (interview.csv), build a clustering model as follows.
<Condition 1> Target columns: 가치관 (values), 전문지식 (expertise), 발표력 (presentation), 인성 (character), 창의력 (creativity), 자격증 (certification), 종합점수 (total score)
<Condition 2> Apply hierarchical clustering with complete linkage
<Condition 3> Visualize the dendrogram
<Condition 4> Using the dendrogram, create subsets for 3 clusters
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram # hierarchical clustering model
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale


data loading - new-hire interview data set

interview = pd.read_csv("c:/ITWILL/4_Python-2/data/interview.csv", encoding='ms949')
print(interview.info())

RangeIndex: 15 entries, 0 to 14
Data columns (total 9 columns):

 



<Condition 1> Create the subset: use every column except no and 합격여부 (the first and last columns), then scale each column to [0, 1]

cols = list(interview.columns)
interviewDF = interview[cols[1:-1]]
print(interviewDF)

interviewDF = minmax_scale(interviewDF)
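
The scaling step matters because the columns are on different scales (종합점수 is in the tens while 자격증 is 0 or 1), and linkage works on Euclidean distances by default. A minimal sketch with made-up numbers (not the interview data) of what minmax_scale computes, assuming the usual column-wise formula (x - min) / (max - min):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Hypothetical scores, not taken from interview.csv
toy = pd.DataFrame({"score": [50, 60, 70], "cert": [0, 1, 0]})

scaled = minmax_scale(toy)  # column-wise scaling to [0, 1]

# The same result computed by hand: (x - min) / (max - min) per column
manual = ((toy - toy.min()) / (toy.max() - toy.min())).to_numpy()

print(scaled)
print(np.allclose(scaled, manual))
```

Note that the scaled result is a plain array, so the column names of the original DataFrame are no longer attached to it.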

 



<Condition 2> Hierarchical clustering with complete linkage: the distance between two clusters is the distance between their two farthest members

clusters = linkage(interviewDF, method='complete')
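
To make the complete-linkage rule concrete, here is a small sketch on four made-up 1-D points (not the interview data). It compares the height of the final merge under method='complete' and method='single'; the point values are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four made-up 1-D observations (indices 0..3)
points = np.array([[0.0], [1.0], [5.0], [7.0]])

Z_complete = linkage(points, method="complete")
Z_single = linkage(points, method="single")

# Each row of Z: [cluster a, cluster b, merge distance, size of new cluster]
# In both cases the last row merges {0, 1} with {2, 3}.
print(Z_complete[-1, 2])  # farthest pair: |7 - 0| = 7.0
print(Z_single[-1, 2])    # nearest pair:  |5 - 1| = 4.0
```

The merge heights in the third column are what the dendrogram draws on its vertical axis.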

 



<Condition 3> Dendrogram visualization: the number of clusters is decided by the user

plt.figure(figsize=(15, 5))
dendrogram(clusters, 
           leaf_rotation=90,
           leaf_font_size=20,)
plt.show()

cluster1 = 8,2,10,4,13
cluster2 = 12,1,5,0,3
cluster3 = 14,7,9,6,11


 

4. Cutting and evaluating the clustering: y variable -> 3 clusters

from scipy.cluster.hierarchy import fcluster # cut the tree into flat clusters


1) Cut the tree into 3 clusters: check against the dendrogram result

re_clusters = fcluster(clusters, t=3, criterion='maxclust') 
print(re_clusters)


2) Add as a column

interview['cluster'] = re_clusters
interview.head()
interview.tail()

     no  가치관  전문지식  발표력  인성  창의력  자격증  종합점수 합격여부  cluster
10  111   13    14   19  12    8    0    66  불합격        1
11  112   14    20   11   9   16    0    70  불합격        2
12  113   18    14   16  12   10    1    70   합격        3
13  114   10    13   18  10    6    0    57  불합격        1
14  115   13    17   10   8   17    0    65  불합격        2
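
Condition 4 asks for one subset per cluster, which follows directly from the cluster column. A minimal sketch on made-up points (not the interview data), assuming a dictionary keyed by cluster label is an acceptable way to hold the subsets:

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D points forming three obvious pairs (not the interview data)
toy = pd.DataFrame({"x": [0, 0, 10, 10, 20, 20],
                    "y": [0, 1, 10, 11, 0, 1]})

Z = linkage(toy[["x", "y"]].to_numpy(), method="complete")
toy["cluster"] = fcluster(Z, t=3, criterion="maxclust")  # labels start at 1

# One DataFrame per cluster label
subsets = {label: group for label, group in toy.groupby("cluster")}

for label, group in subsets.items():
    print(label, list(group.index))
```

The same pattern applied to interview would give three DataFrames that can be summarized separately, for example with describe().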

 

 

 

 

 

Problem 2) Perform a confirmatory cluster analysis by applying the kMeans algorithm in the steps below.

<Condition> Variables: tot_price: total purchase amount, buy_count: number of purchases, visit_count: number of store visits, avg_price: average purchase amount
Step 1: Cluster the data into 3 clusters
Step 2: Add the predicted cluster to the original data
Step 3: Scatter plot of tot_price against the variable most highly correlated with it (color: cluster result)
Step 4: Show the cluster centers on the scatter plot
Step 5: Per-cluster characteristics: identify the best-customer cluster
import pandas as pd
from sklearn.cluster import KMeans # kMeans model
import matplotlib.pyplot as plt

 

sales = pd.read_csv(r"C:\ITWILL\4_Python-2/data/product_sales.csv")
print(sales.info())

RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
tot_price      150 non-null float64 -> total purchase amount
visit_count    150 non-null float64 -> number of store visits
buy_count      150 non-null float64 -> number of purchases
avg_price      150 non-null float64 -> average purchase amount

model = KMeans(n_clusters=3, random_state=0) # k=3; algorithm left at its default ('auto' was removed in scikit-learn 1.3)
model.fit(sales)

 

kMeans model predictions

pred = model.predict(sales)
print(pred)


Add the predictions

sales['predict'] = pred # add a column; a NumPy vector can be assigned directly
print(sales)


Correlation coefficients

print(sales.corr()) #tot_price vs avg_price
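
Instead of reading the matrix by eye, the variable most correlated with tot_price can be picked programmatically. The sketch below uses made-up values with the post's column names (not product_sales.csv). It also drops the predict column first, since it was added before corr() is called and cluster labels are not a feature.

```python
import pandas as pd

# Made-up values using the post's column names (not product_sales.csv)
toy = pd.DataFrame({
    "tot_price":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "visit_count": [3.0, 1.0, 4.0, 1.0, 5.0],
    "buy_count":   [5.0, 3.0, 4.0, 2.0, 1.0],
    "avg_price":   [1.1, 2.0, 3.2, 3.9, 5.1],
    "predict":     [0, 0, 1, 1, 2],   # cluster labels, excluded below
})

# Correlation of every feature with tot_price, excluding itself and the labels
corr_with_tot = toy.drop(columns="predict").corr()["tot_price"].drop("tot_price")

best = corr_with_tot.idxmax()  # name of the feature with the highest coefficient
print(best)
```

If a strong negative correlation should also count, corr_with_tot.abs().idxmax() could be used instead.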


tot_price vs avg_price scatter plot (colored by cluster)

plt.scatter(sales['tot_price'], sales['avg_price'], c=sales['predict'])


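Cluster centers (the model was fit on four columns, so cluster_centers_ is a 3 x 4 array; the two-column output shown below does not match this fit)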

centers = model.cluster_centers_
print(centers)

[[6.83902439 5.67804878]
 [5.00784314 1.49215686]
 [5.87413793 4.39310345]]

Visualizing the centers

plt.scatter(centers[:,0], centers[:, 3], marker='D', c='r')
plt.show()
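
The positional indices 0 and 3 used for the centers depend on the column order of the data passed to fit. One way to avoid that dependency is to wrap the centers in a DataFrame with the feature names. This is a sketch on made-up data with the post's column names; n_init=10 is an assumption added here to avoid version-dependent defaults.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data with the post's column names (not product_sales.csv)
toy = pd.DataFrame({
    "tot_price":   [5.0, 5.1, 6.0, 6.1, 7.0, 7.1],
    "visit_count": [0.2, 0.3, 1.4, 1.5, 2.0, 2.1],
    "buy_count":   [3.3, 3.2, 2.8, 2.7, 3.0, 3.1],
    "avg_price":   [1.4, 1.5, 4.4, 4.5, 5.7, 5.8],
})

# n_init is set explicitly (an assumption, not from the post)
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(toy)

# One row per cluster, one column per feature, in the order used for fitting
centers = pd.DataFrame(model.cluster_centers_, columns=toy.columns)
print(centers[["tot_price", "avg_price"]])
```

With this layout, centers['tot_price'] and centers['avg_price'] can be passed to plt.scatter in place of the positional slices.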


[Extra] Per-cluster characteristics (statistics)

sales_g = sales.groupby('predict')
sales_g.size()
'''
0    50
1    62
2    38
'''

print(sales_g.mean())

         tot_price  visit_count  buy_count  avg_price
predict                                              
0         5.006000     0.244000   3.284000   1.464000
1         5.901613     1.433871   2.754839   4.393548
2         6.850000     2.071053   3.071053   5.742105 -> best customers
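
The best-customer cluster is identified above by inspecting the table of means. The same choice can be sketched in code, assuming that "best" means the cluster with the highest mean tot_price (one possible criterion among several). The rows below are made up, not product_sales.csv.

```python
import pandas as pd

# Made-up rows with the post's column names (not product_sales.csv)
toy = pd.DataFrame({
    "tot_price": [5.0, 5.2, 5.9, 6.1, 6.8, 7.0],
    "avg_price": [1.4, 1.6, 4.3, 4.5, 5.6, 5.8],
    "predict":   [0, 0, 1, 1, 2, 2],
})

group_means = toy.groupby("predict").mean()

# Assumption: "best customers" = cluster with the highest mean tot_price
best_cluster = group_means["tot_price"].idxmax()
print(best_cluster)  # 2
```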