๋ฐ์ดํ„ฐ๋ถ„์„๊ฐ€ ๊ณผ์ •/Python

DAY55. Python Cluster&Recommand (๊ณ„์ธต๊ตฐ์ง‘๋ถ„์„, KMeans, ์ถ”์ฒœ์‹œ์Šคํ…œ๋ชจํ˜•)

LEE_BOMB 2021. 12. 8. 15:57
๊ตฐ์ง‘ ๋ถ„์„

์œ ์‚ฌ๋„๊ฐ€ ๋†’์€ ๋ฐ์ดํ„ฐ ๊ตฐ์ง‘ํ™”

์œ ์‚ฌ๋„(์œ ํด๋ฆฌ๋“œ์•ˆ ๊ฑฐ๋ฆฌ์‹)๊ฐ€ ๋†’์€ ๋ฐ์ดํ„ฐ๋ผ๋ฆฌ ๊ทธ๋ฃนํ™”

๊ณ„์ธตํ˜• ํด๋Ÿฌ์Šคํ„ฐ๋ง๊ณผ ๋น„๊ณ„์ธตํ˜• ํด๋Ÿฌ์Šคํ„ฐ๋ง์œผ๋กœ ๋ถ„๋ฅ˜

- K-means : ๋น„๊ณ„์ธต์  ๊ตฐ์ง‘๋ถ„์„

- Hierarchical : ๊ณ„์ธต์  ๊ตฐ์ง‘๋ถ„์„

 

 

๊ตฐ์ง‘๋ถ„์„ ํŠน์ง•

์ข…์†๋ณ€์ˆ˜(y๋ณ€์ˆ˜)๊ฐ€ ์—†๋Š” ๋ฐ์ดํ„ฐ ๋งˆ์ด๋‹ ๊ธฐ๋ฒ•

์ „์ฒด์ ์ธ ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ๋ฅผ ํŒŒ์•…ํ•˜๋Š”๋ฐ ์ด์šฉ

๊ด€์ธก๋Œ€์ƒ ๊ฐ„ ์œ ์‚ฌ์„ฑ์„ ๊ธฐ์ดˆ๋กœ ๋น„์Šทํ•œ ๊ฒƒ ๋ผ๋ฆฌ ๊ทธ๋ฃนํ™”(Clustering)

 

์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ ๊ธฐ๋ฐ˜ ์œ ์‚ฌ ๊ฐ์ฒด ๋ฌถ์Œ (์œ ์‚ฌ์„ฑ = ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ)

* ๊ด€์ธก๋Œ€์ƒ p์™€ q์˜ ๋Œ€์‘ํ•˜๋Š” ๋ณ€๋Ÿ‰๊ฐ’์˜ ์ฐจ๊ฐ€ ์ž‘์œผ๋ฉด, ๋‘ ๊ด€์ธก๋Œ€์ƒ์€ ์œ ์‚ฌํ•˜๋‹ค๊ณ  ์ •์˜ํ•˜๋Š” ์‹

ex. ๊ณ ๊ฐ DB -> ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ ์šฉ -> ํŒจํ„ด ์ถ”์ถœ(rule) -> ๊ทผ๊ฑฐ๋ฆฌ ๋ชจ ํ˜•์œผ๋กœ ๊ตฐ์ง‘ํ˜•์„ฑ

๊ณ„์ธต์  ๊ตฐ์ง‘๋ถ„์„(ํƒ์ƒ‰์ ), ๋น„๊ณ„์ธต์  ๊ตฐ์ง‘๋ถ„์„(ํ™•์ธ์ )

์ฃผ์š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ : k-means, hierarchical
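The Euclidean distance used as the similarity measure above can be computed directly; a minimal sketch (the two observations are made-up numbers):

```python
import numpy as np

# two observations p and q measured on the same variables
p = np.array([5.1, 3.5, 1.4, 0.2])
q = np.array([4.9, 3.0, 1.4, 0.2])

# Euclidean distance: square root of the sum of squared differences
dist = np.sqrt(np.sum((p - q) ** 2))
print(dist)  # the smaller the distance, the more similar p and q
```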

 

๋ถ„์„๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ๊ฐ€์„ค ๊ฒ€์ • ์—†์Œ(ํƒ€๋‹น์„ฑ ๊ฒ€์ฆ ๋ฐฉ๋ฒ• ์—†์Œ)

๋ถ„์•ผ : ์‚ฌํšŒ๊ณผํ•™, ์ž์—ฐ๊ณผํ•™, ๊ณตํ•™ ๋ถ„์•ผ

์ฒ™๋„ : ๋“ฑ๊ฐ„, ๋น„์œจ์ฒ™๋„(์—ฐ์†์ ์ธ ์–‘)

 

 

๊ณ„์ธต์  ๊ตฐ์ง‘ ๋ถ„์„ ๋น„ ๊ณ„์ธต์  ๊ตฐ์ง‘ ๋ถ„์„
์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ๋ฅผ ์ด์šฉํ•œ ๊ตฐ์ง‘๋ถ„์„ ๋ฐฉ๋ฒ•
๊ณ„์ธต์ (hierarchical)์œผ๋กœ ๊ตฐ์ง‘ ๊ฒฐ๊ณผ ๋„์ถœ
ํƒ์ƒ‰์  ๊ตฐ์ง‘๋ถ„์„
ํ™•์ธ์  ๊ตฐ์ง‘๋ถ„์„ ๋ฐฉ๋ฒ•
๊ณ„์ธต์  ๊ตฐ์ง‘๋ถ„์„๋ฒ• ๋ณด๋‹ค ์†๋„ ๋น ๋ฆ„
๊ตฐ์ง‘์˜ ์ˆ˜๋ฅผ ์•Œ๊ณ  ์žˆ๋Š” ๊ฒฝ์šฐ ์ด์šฉ
K๋Š” ๋ฏธ๋ฆฌ ์ •ํ•˜๋Š” ๊ตฐ์ง‘ ์ˆ˜
๊ณ„์ธต์  ๊ตฐ์ง‘ํ™”์˜ ๊ฒฐ๊ณผ์— ์˜๊ฑฐํ•˜์—ฌ ๊ตฐ์ง‘ ์ˆ˜ ๊ฒฐ์ •
๋ณ€์ˆ˜๋ณด๋‹ค ๊ด€์ธก๋Œ€์ƒ ๊ตฐ์ง‘ํ™”์— ๋งŽ์ด ์ด์šฉ
๊ตฐ์ง‘์˜ ์ค‘์‹ฌ(Cluster Center) ์‚ฌ์šฉ์ž๊ฐ€ ์ •ํ•จ
๊ตฐ์ง‘ํ™” ๋ฐฉ์‹
1. ๋‹จ์ผ๊ธฐ์ค€๊ฒฐํ•ฉ๋ฐฉ์‹
2. ์™„์ „๊ธฐ์ค€๊ฒฐํ•ฉ๋ฐฉ์‹
3. ํ‰๊ท ๊ธฐ์ค€๊ฒฐํ•ฉ๋ฐฉ์‹
k-ํ‰๊ท  ๊ตฐ์ง‘๋ถ„์„ ์•Œ๊ณ ๋ฆฌ์ฆ˜
๊ฒฐ๊ณผ : ๋ฒค๋“œ๋กœ๊ทธ๋žจ  

 

 

 

 

 

 

๊ณ„์ธต์  ๊ตฐ์ง‘๋ถ„์„(Hierarchical Clustering) 

์ƒํ–ฅ์‹(Bottom-up)์œผ๋กœ ๊ณ„์ธต์  ๊ตฐ์ง‘ ํ˜•์„ฑ 
์œ ํด๋ฆฌ๋“œ์•ˆ ๊ฑฐ๋ฆฌ๊ณ„์‚ฐ์‹ ์ด์šฉ 
์ˆซ์žํ˜• ๋ณ€์ˆ˜๋งŒ ์‚ฌ์šฉ


from sklearn.datasets import load_iris #dataset
import pandas as pd #DataFrame
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster #clustering tools
import matplotlib.pyplot as plt #visualization



1. Dataset loading

iris = load_iris() #load the data

X = iris.data #x variables
y = iris.target #y variable (target) - numeric : used in the distance calculation


numpy -> DataFrame 

iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df['species'] = y #add target 

iris_df.info()

 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    int32   -> int, not object



2. Hierarchical clustering

clusters = linkage(iris_df, method='complete')

method = 'single' : single linkage (default)
method = 'complete' : complete linkage
method = 'average' : average linkage

print(clusters)
clusters.shape #(149, 4)

 

 


3. Dendrogram visualization : the analyst decides the number of clusters

plt.figure(figsize = (25, 10))
dendrogram(clusters)
plt.show()

[Interpretation] The y-axis shows the Euclidean distance; the clusters merged at the top are the farthest apart. The number of clusters is up to the analyst.

 

 

 

4. Cutting the tree into clusters -> 3 clusters
1) cut the tree

cluster = fcluster(clusters, t=3, criterion='maxclust') #criterion='maxclust' : cut so that at most t clusters are formed
cluster #labels 1~3
len(cluster) #150


2) Add the cluster labels as a column

iris_df['cluster'] = cluster
print(iris_df)

iris_df.columns


3) ์‚ฐ์ ๋„ ์‹œ๊ฐํ™”(x=1 ์นผ๋Ÿผ, y=3 ์นผ๋Ÿผ, c=cluster)

plt.scatter(x = iris_df['sepal length (cm)'],
            y = iris_df['petal length (cm)'],
            c = iris_df['cluster'])
plt.show()

[ํ•ด์„] ์ขŒ์ธก์—์„œ๋ถ€ํ„ฐ 1, 3, 2๋ฒˆ ๊ตฐ์ง‘ ์ˆœ์„œ

 

 

 

5. Analyzing characteristics by cluster
1) group object

cluster_g = iris_df.groupby('cluster')
cluster_g.size()

cluster
1    50
2    34
3    66
dtype: int64

2) ๊ตฐ์ง‘์˜ ํ‰๊ท 

cluster_g.mean() #species(๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜) : 1๋ฒˆ๊ตฐ์ง‘(0), 2๋ฒˆ๊ตฐ์ง‘(2), 3๋ฒˆ๊ตฐ์ง‘(1+2)

         sepal length (cm)  sepal width (cm)  ...  petal width (cm)   species
cluster                                       ...                            
1                 5.006000          3.428000  ...          0.246000  0.000000
2                 6.888235          3.100000  ...          2.123529  2.000000
3                 5.939394          2.754545  ...          1.445455  1.242424
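To see how well the clusters match the actual species, a cross-tabulation can be added. A self-contained sketch that repeats the pipeline above (complete linkage on the iris DataFrame with the species column included, cut into 3 clusters):

```python
import pandas as pd
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target  # included in the distance, as in the steps above

# same pipeline: complete linkage, cut into at most 3 clusters
clusters = linkage(iris_df, method='complete')
iris_df['cluster'] = fcluster(clusters, t=3, criterion='maxclust')

# rows = cluster label, columns = true species, cells = counts
print(pd.crosstab(iris_df['cluster'], iris_df['species']))
```

The counts make the group-mean interpretation explicit: one cluster is pure setosa (species 0), one is pure virginica (species 2), and the third mixes versicolor and virginica.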

 

 

 

 

 

kMeans

The kMeans algorithm
Confirmatory cluster analysis
Used when the number of clusters k is known


import pandas as pd #DataFrame
from sklearn.cluster import KMeans #kMeans model
import numpy as np #array
import matplotlib.pyplot as plt #visualization



1. text file -> dataset

file = open(r'C:\ITWILL\4_Python-2\data\testSet.txt')
lines = file.readlines() #returns a list
print(lines, len(lines)) #80 lines

dataset = []
for line in lines : #'1.658985\t4.285136\n'
    cols = line.split('\t') #'1.658985', '4.285136\n'
    rows = [] #holds one row
    for col in cols : #'1.658985', '4.285136\n'
        rows.append(float(col)) #cast to float [1.658985, 4.285136]
        
    dataset.append(rows) #[[1.658985, 4.285136],... [-4.905566, -2.911070]]


list -> numpy

dataset_arr = np.array(dataset)
dataset_arr.shape #(80, 2)

plt.scatter(x = dataset_arr[:,0], y = dataset_arr[:,1])

 

 

 

2. DF์ƒ์„ฑ

data_df = pd.DataFrame(dataset_arr, columns = ['x', 'y'])
data_df




3. KMeans model

model = KMeans(n_clusters = 4, algorithm = 'auto')
model.fit(data_df) #fit to the dataset


kMeans model predictions

pred = model.predict(data_df)
print(pred) #each of the 80 points is assigned to one of the 4 clusters (labels 0~3)

dir(model)

[3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0
 1 2 3 0 1 2]

๊ฐ ๊ตฐ์ง‘ ์ค‘์•™๊ฐ’

centers = model.cluster_centers_

array([[-2.46154315,  2.78737555],
       [ 2.80293085, -2.7315146 ],
       [-3.38237045, -2.9473363 ],
       [ 2.6265299 ,  3.10868015]])



4. kMeans model visualization

data_df['predict'] = pred
data_df.info()


์‚ฐ์ ๋„

plt.scatter(x = data_df['x'], y = data_df['y'],
            c = data_df['predict'])


Add the cluster centers

plt.scatter(x = centers[:, 0], y = centers[:, 1],
            c = 'r', marker = 'D')
#marker : 'D' = diamond marker

plt.show()

* k๊ฐ’์ด 4์ธ ๊ฒฝ์šฐ. k๊ฐ’์„ ์กฐ์ ˆํ•˜๊ณ  ์‹ถ์œผ๋ฉด KMeans model ๋‹จ๊ณ„์—์„œ ์กฐ์ ˆ

 

 

 

 

 

How to find the best number of clusters

 

from sklearn.cluster import KMeans #model 
from sklearn.datasets import load_iris #dataset 
import matplotlib.pyplot as plt #visualization



1. dataset load 

X, y = load_iris(return_X_y=True)
print(X.shape) #(150, 4)
print(X)




2. kMeans model 

obj = KMeans(n_clusters=3)
model = obj.fit(X)


Model predictions

pred = model.predict(X)



3. Best cluster

size = range(1, 11) # range of k values (1~10)
inertia = [] # cluster cohesion

* inertia value
 - a measure of cluster cohesion (the smaller the value, the tighter the clusters) 
 - the sum of squared distances between each centroid and the points in its cluster 
 - the fewer the centers, the larger the value

for k in size :
    obj = KMeans(n_clusters = k) # 1 ~ 10
    model = obj.fit(X)
    inertia.append(model.inertia_) 

print(inertia)

plt.plot(size, inertia, '-o')
plt.xticks(size)
plt.show()

[ํ•ด์„] ๊ธ‰๊ฒฉํžˆ ํ•˜๊ฐ•ํ•œ ํ›„ 3-5 ์‚ฌ์ด์—์„œ ๋ณ€ํ™”์˜ ํญ์ด ์™„๋งŒํ•˜๋‹ค.
3-5์‚ฌ์ด์˜ ํด๋Ÿฌ์Šคํ„ฐ๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ๊ฒƒ์ด ์ ์ ˆํ•˜๋‹ค๊ณ  ํŒ๋‹จํ•œ๋‹ค.

 

 

 

 

 

Recommendation system models

What is a recommendation system?

A type of information filtering (IF) technology that recommends information a particular user is likely to be interested in (movies, music, books, news, images, web pages, etc.)

 

 

Recommendation algorithms

1. Collaborative Filtering (CF)

Treats users with similar purchase/consumption patterns as one group and makes recommendations from the tastes of that group

- UBCF (User-Based CF) : recommends items based on users with similar patterns

Measures the similarity (correlation match) between the target user (active user) and the others, and derives recommendations from the averaged ratings of the most similar users

* Drawback: reliability can drop when there are many missing values.

 

- IBCF (Item-Based CF) : recommends related items based on the items themselves

Coffee and caffe latte have the most similar vector structure, so when someone buys coffee, caffe latte can be recommended.

Green tea differs from coffee and caffe latte, so its similarity is low and it is not recommended.
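The coffee/latte/green-tea intuition can be sketched with item vectors of purchase counts (the numbers are made up for illustration):

```python
import numpy as np

def cosine(a, b):
    # cosine similarity: cosine of the angle between two item vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# purchase counts of 4 users for each item (hypothetical data)
coffee    = np.array([5, 3, 0, 4])
latte     = np.array([4, 3, 0, 5])
green_tea = np.array([0, 0, 5, 1])

print(cosine(coffee, latte))      # high -> recommend latte with coffee
print(cosine(coffee, green_tea))  # low  -> do not recommend green tea
```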

 

2. Content-Based Filtering (CB)

Targets products with rich text information among the products a consumer uses

Recommends by analyzing the content of text such as news and books

The core technique is extracting key keywords by analyzing morphemes (nouns, verbs, etc.) in the text

 

3. ์ง€์‹๊ธฐ๋ฐ˜ ํ•„ํ„ฐ๋ง(Knowledge-Based Filtering : KB)

ํŠน์ • ๋ถ„์•ผ์— ๋Œ€ํ•œ ์ „๋ฌธ๊ฐ€์˜ ๋„์›€์„ ๋ฐ›์•„์„œ ๊ทธ ๋ถ„์•ผ์— ๋Œ€ํ•œ ์ „์ฒด์  ์ธ ์ง€์‹๊ตฌ์กฐ๋ฅผ ๋งŒ๋“ค๊ณ  ์ด๋ฅผ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•

 

 

Recommendation system use cases

Netflix : recommends movies to a given customer based on the customer's movie ratings

Amazon : builds its recommendation system from product page visit history, the shopping cart, purchase preferences, and other signals

 

 

Similarity measures for collaborative filtering

Correlation coefficient similarity : uses the Pearson correlation coefficient

Cosine similarity : the angle between two vectors

Euclidean similarity : distance-based similarity measure

Jaccard similarity : similarity measure for binary data
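The four measures can be tried on toy vectors (a minimal illustration; scipy's distance helpers are used where they exist, and scipy returns distances, so similarities are 1 minus the distance):

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean, jaccard
from scipy.stats import pearsonr

a = np.array([4.0, 5.0, 1.0, 3.0])
b = np.array([5.0, 4.0, 2.0, 3.0])

print(pearsonr(a, b)[0])   # Pearson correlation similarity
print(1 - cosine(a, b))    # cosine similarity
print(euclidean(a, b))     # Euclidean distance (smaller = more similar)

# Jaccard works on binary data, e.g. bought / not bought
u = np.array([1, 1, 0, 1])
v = np.array([1, 0, 0, 1])
print(1 - jaccard(u, v))   # Jaccard similarity = |intersection| / |union|
```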

 

 

ํŠน์ด๊ฐ’ ๋ถ„ํ•ด (SVD)

ํŠน์ด๊ฐ’ ํ–‰๋ ฌ ๋ถ„ํ•ด(Singular Value Decomposition)

์ฐจ์›์ถ•์†Œ(dimension reduction) ๊ธฐ๋ฒ• : ํŠน์ด๊ฐ’ ์ƒ์„ฑ

ex. ํ–‰๋ ฌ A(m x n) ๋ถ„ํ•ด

A(m์‚ฌ์šฉ์ž X n์•„์ดํ…œ)์˜ NULL๊ฐ’ ์˜ˆ์ธกํ•˜๋Š” ๊ณผ์ •์ด ํŠน์ด๊ฐ’ ๋ถ„ํ•ด.

P(M์‚ฌ์šฉ์ž x kํŠน์ด๊ฐ’) * Q.T(kํŠน์ด๊ฐ’ x N์•„์ดํ…œ)

* ํŠน์ด๊ฐ’ : ํ–‰๋ ฌ์˜ ํŠน์ง•์„ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฐ’. ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ด์šฉ (์ฐจ์›์ถ•์†Œ์—์„œ ์‚ฌ์šฉ๋จ ex. 5๊ฐœ ์ฐจ์› -> 2๊ฐœ ์ฐจ์› ์ถ•์†Œ)

 

 

[Practice] movie recommendation system algorithm
Target user : Toby   
similarity score = rating of an unseen movie * similarity with Toby
predicted rating per title = sum of similarity scores / sum of similarities with Toby

 

import pandas as pd



1. ๋ฐ์ดํ„ฐ ๊ฐ€์ ธ์˜ค๊ธฐ 

ratings = pd.read_csv('D:/ITWILL/4_Python-2/data/movie_rating.csv')
print(ratings) #ํ‰๊ฐ€์ž[critic]   ์˜ํ™”[title]  ํ‰์ [rating]




2. Build a pivot table : rows (title), columns (critic), cells (rating)

print('movie_ratings')
movie_ratings = pd.pivot_table(ratings,
               index = 'title',
               columns = 'critic',
               values = 'rating').reset_index()

print(movie_ratings) #default index added




3. Compute user similarity (correlation coefficient R) : also adds a numbered row index 

sim_users = movie_ratings.corr().reset_index() #corr(method='pearson')
print(sim_users) #default index added

critic   critic   Claudia      Gene      Jack      Lisa      Mick      Toby
0       Claudia  1.000000  0.314970  0.028571  0.566947  0.566947  0.893405
1          Gene  0.314970  1.000000  0.963796  0.396059  0.411765  0.381246
2          Jack  0.028571  0.963796  1.000000  0.747018  0.211289  0.662849
3          Lisa  0.566947  0.396059  0.747018  1.000000  0.594089  0.991241
4          Mick  0.566947  0.411765  0.211289  0.594089  1.000000  0.924473
5          Toby  0.893405  0.381246  0.662849  0.991241  0.924473  1.000000

 

 

4. Extract the movies Toby has not seen  
1) subset of the movie_ratings table with the title and Toby columns 

toby_rating = movie_ratings[['title', 'Toby']] #title and Toby columns 
print(toby_rating)
print(toby_rating)


Rename the columns 

toby_rating.columns=['title', 'rating']
print(toby_rating)

critic title  rating
0    Just My     NaN
1       Lady     NaN
2     Snakes     4.5
3   Superman     4.0
4  The Night     NaN
5     You Me     1.0

2) Extract the titles Toby has not seen : select title where rating is null 

toby_not_see = toby_rating.title[toby_rating.rating.isnull()] 
print(toby_not_see)

0      Just My
1         Lady
4    The Night

Series -> list

toby_not_see = list(toby_not_see)


3) Subset of the raw data with the movies Toby has not seen 

rating_t = ratings[ratings.title.isin(toby_not_see)]
print(rating_t)

    critic      title  rating
0      Jack       Lady     3.0
4      Jack  The Night     3.0
5      Mick       Lady     3.0
:
30     Gene  The Night     3.0

 

 


5. Join Toby's unseen movies with the Toby similarities
1) extract the Toby similarities 

toby_sim = sim_users[['critic','Toby']] #similarity of each critic vs Toby


2) Merge on the critic column  

rating_t = pd.merge(rating_t, toby_sim, on='critic')
print(rating_t)

    critic      title  rating      Toby
0      Jack       Lady     3.0  0.662849
1      Jack  The Night     3.0  0.662849
2      Mick       Lady     3.0  0.924473

 


6. Predict the movies to recommend
1) similarity score = rating of a movie Toby has not seen * similarity with Toby 

rating_t['sim_rating'] = rating_t.rating * rating_t.Toby
print(rating_t)
print(rating_t)

     critic      title  rating      Toby  sim_rating
0      Jack       Lady     3.0    0.662849    1.988547
1      Jack  The Night     3.0    0.662849    1.988547
2      Mick       Lady     3.0    0.924473    2.773420

2) Sums per movie title

gsum = rating_t.groupby(['title']).sum() #sum per title


3) Predicted rating per title = sum of similarity scores / sum of similarities with Toby

print('\n*** movie recommendation results ***')
gsum['predict'] = gsum.sim_rating / gsum.Toby
print(gsum)

           rating      Toby  sim_rating   predict
title                                            
Just My       9.5  3.190366    8.074754  2.530981
Lady         11.5  2.959810    8.383808  2.832550
The Night    16.5  3.853215   12.899752  3.347790

[Interpretation] 'The Night' has the highest predicted rating (3.35), so it is the first movie to recommend to Toby.