Unsupervised Learning Procedure

Dataset > apply algorithm > pattern analysis > rule discovery > model building > evaluation > YES : predict future data / NO : start over

01. Overview of Cluster Analysis

What is cluster analysis?

Grouping objects into a small number of clusters by putting similar or related items together.

* Biggest difference from classification : cluster analysis groups observations that have no known answer (label) so that the most similar ones end up together. When the data already contains the answer and observations are assigned to those known classes, it is classification.

Characteristics of cluster analysis

Used to understand the overall structure of the data

An unsupervised learning (data mining) technique with no dependent variable (y variable)

Groups similar observations together based on the similarity between them (clustering)

Similarity = measured with the Euclidean distance formula

No hypothesis test on the analysis result (there is no formal statistical test of validity)

Scale : interval and ratio scales (continuous quantities)

Hierarchical cluster analysis (exploratory) / non-hierarchical cluster analysis (confirmatory)

Main algorithms : hierarchical, k-means

 

 

Euclidean distance formula

Observations are grouped by similarity (clustering), and similarity is measured with Euclidean distance.

* The formula defines two observations p and q as similar when the difference between their values is small.
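For reference, the standard Euclidean distance between two observations can be written as below. The symbols are assumptions for illustration: p and q are observations with n numeric variables, and p_i, q_i are their values on variable i.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```

A smaller d(p, q) means the two observations are more similar.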

 

 

Linkage methods for hierarchical cluster analysis

* Frequently appears in data analysis exam questions

Single linkage : joins clusters using the distance between their two closest observations
Complete linkage : joins clusters using the distance between their two farthest observations
Average linkage : joins clusters using the average of all pairwise distances between the two clusters
Centroid linkage : joins clusters using the distance between the cluster centroids
Ward's linkage : joins clusters using the increase in the error sum of squares (ESS) caused by merging them
* Ward's method measures the distance between two clusters by the increase in ESS; it is similar to centroid linkage

* Methods that use the Euclidean distance formula : single, complete, average, centroid linkage
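The centroid and Ward methods listed above are also available through the `method` argument of R's `hclust()`. A minimal sketch, assuming a made-up one-column dataset (`pts` is hypothetical and not part of the original exercise). Per the `hclust()` documentation, centroid linkage is intended for squared Euclidean distances, while `"ward.D2"` takes ordinary distances.

```r
# Hypothetical toy data: four observations measured on one variable
pts <- matrix(c(0, 1, 3, 7), ncol = 1)
d <- dist(pts)                                   # Euclidean distances

hc_centroid <- hclust(d^2, method = "centroid")  # centroid linkage on squared distances
hc_ward     <- hclust(d,   method = "ward.D2")   # Ward's criterion on plain distances

hc_centroid$height   # merge heights under centroid linkage
hc_ward$height       # merge heights under Ward's method
```

Comparing the two `height` vectors shows how the choice of linkage changes the level at which the same points are merged.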

 

Single linkage (single-criterion method)

The distance between two clusters is the distance between their closest pair of observations, one from each cluster (objects 2, 3, 6 in the original figure); the two clusters with the smallest such distance are merged.

 

Complete linkage (complete-criterion method) : the default in hclust(). The distance between two clusters is the distance between their farthest pair of observations (objects 1, 4, 5 in the original figure); the two clusters with the smallest such distance are merged.

 

Average linkage (average-criterion method) : the distance between two clusters is the average of the distances over every pair made of one object from each cluster; the two clusters with the smallest average are merged (1 -> average of 5, 6; 2 -> average of 5, 6).
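The three rules above can be compared on a tiny example. This is a sketch under assumed data: `pts` is a hypothetical set of four points on a line, chosen so the merge heights are easy to check by hand.

```r
# Hypothetical toy data: four points on a line at 0, 1, 3 and 7
pts <- matrix(c(0, 1, 3, 7), ncol = 1)
d <- dist(pts)

# Merge heights produced by each linkage rule (one column per method)
heights <- sapply(c("single", "complete", "average"),
                  function(m) hclust(d, method = m)$height)
heights
# single:   1, 2,   4      (closest pair between clusters)
# complete: 1, 3,   7      (farthest pair between clusters)
# average:  1, 2.5, 5.67   (mean of all between-cluster pairs)
```

All three methods merge 0 and 1 first; they differ only in how far the growing cluster is considered to be from the remaining points.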

 

 

[Practice] Prediction through grouping (analysing differences in group characteristics to understand customer segments)
1. Euclidean distance : sqrt(sum((p-q)^2))
Euclidean distance is a way of computing the distance between two points,
and it is the distance that defines Euclidean space.

(1) Create a matrix

x <- matrix(1:9, nrow=3, byrow=TRUE) # fill the matrix row by row
x

       [,1] [,2] [,3]
[1,]    1    2    3 -> p
[2,]    4    5    6 -> q
[3,]    7    8    9
q - p : 3    3    3 -> 3^2 + 3^2 + 3^2 = 27 -> √27

sqrt(27) # distance between rows 1 and 2 : 5.196152
sqrt(36+36+36) # distance between rows 1 and 3 : 10.3923



(2) Function that computes Euclidean distances for a matrix
Format) dist(x, method="euclidean") -> x : numeric matrix or data frame

dist <- dist(x, method="euclidean") # Euclidean distance between the rows of matrix x (method can be omitted; "euclidean" is the default)
dist

          1         2
2  5.196152
3 10.392305  5.196152

Row 1 (p) vs row 2 (q)

sqrt(sum((x[1, ] - x[2,])^2)) # 5.196152


Row 1 (p) vs row 3 (q)

sqrt(sum((x[1, ] - x[3,])^2)) # 10.3923 (exactly twice the row 1-2 distance here, because the rows are evenly spaced)
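The manual calculation can be checked against `dist()` directly. A small sketch: `euclid()` is a hypothetical helper written for this check, not a function from any package.

```r
x <- matrix(1:9, nrow = 3, byrow = TRUE)

# Hypothetical helper: Euclidean distance between two numeric vectors
euclid <- function(p, q) sqrt(sum((p - q)^2))

# as.matrix() turns the dist object into a full square matrix,
# so a single pair can be looked up by row and column index
d <- as.matrix(dist(x))

c(manual = euclid(x[1, ], x[2, ]), from_dist = d[1, 2])
c(manual = euclid(x[1, ], x[3, ]), from_dist = d[1, 3])
```

Both pairs should agree, which confirms that `dist()` computes row-to-row Euclidean distances.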


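* Where the Euclidean distance formula is used
1. Classification models : kNN
2. Clustering models : hierarchical / non-hierarchical
3. Recommendation models : similarity computation
4. Coordinate distance : latitude and longitude (an approximation that only holds over short distances)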
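As an illustration of the first use case, here is a minimal nearest-neighbour lookup (the k = 1 case of kNN). All names and values (`train`, `labels`, `query`) are made up for this sketch.

```r
# Hypothetical labelled training points (two variables per row)
train <- matrix(c(1, 1,
                  2, 1,
                  8, 9,
                  9, 8), ncol = 2, byrow = TRUE)
labels <- c("A", "A", "B", "B")

query <- c(7, 8)   # new observation to classify

# Euclidean distance from the query to every training row:
# sweep() subtracts the query from each row, column by column
dists <- sqrt(rowSums(sweep(train, 2, query)^2))

nearest <- which.min(dists)   # index of the closest training point
labels[nearest]               # predicted label for the query
```

A full kNN classifier would take a majority vote over the k closest points instead of using only the single nearest one.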



2. Hierarchical cluster analysis (exploratory analysis)
Hierarchical clustering is the representative method for building a tree structure, e.g. a company organisation chart.
Starting from the pair of objects that are closest to each other, it merges objects step by step and builds a tree-shaped hierarchy bottom-up to form clusters.

(1) Install a package for cluster analysis

install.packages("cluster") # optional: extra clustering tools; hclust() itself is in the stats package that ships with R
library(cluster) # 3 to 10 groups is usually a reasonable number


(2) Create the dataset

r <- runif(15, min = 1, max = 50) # 15 uniform random numbers between 1 and 50 (different on every run)
x <- matrix(r, nrow=5, byrow=TRUE) # 5 x 3 matrix
x


(3) Compute Euclidean distances for the matrix

dist <- dist(x, method="euclidean") # method can be omitted
dist

      1         2         3         4
2  4.114234                              
3 35.401012 31.626887                    
4 37.947768 34.295621 18.655243          
5 25.066933 27.803212 49.155692 56.803235
* Of the 5 observations, rows 1 and 2 are the closest (= the most similar)

* Is that really so?
        [,1]      [,2]     [,3]
[1,] 34.185929 17.089664 35.37096
[2,] 30.240518 18.241765 35.55348 -> the values in rows 1 and 2 are visibly similar to each other
[3,]  1.831156 25.291404 23.57555
[4,]  5.630017 41.771791 31.44851
[5,] 45.757080  4.141226 17.29342
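Instead of reading the distance table by eye, the closest pair can be located in code. A sketch that re-enters the matrix printed above (rounded to the digits shown), since `runif()` gives different values on each run.

```r
# The 5 x 3 matrix from the output above, typed in by hand
x <- matrix(c(34.185929, 17.089664, 35.37096,
              30.240518, 18.241765, 35.55348,
               1.831156, 25.291404, 23.57555,
               5.630017, 41.771791, 31.44851,
              45.757080,  4.141226, 17.29342),
            nrow = 5, byrow = TRUE)

d <- dist(x)
m <- as.matrix(d)
m[upper.tri(m, diag = TRUE)] <- NA   # keep each pair once (lower triangle only)

# Row/column position of the smallest distance = the most similar pair
closest <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)
closest
min(d)
```

For this matrix the result points at rows 2 and 1, matching the observation made from the printed table.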


(4) Clustering with the Euclidean distance matrix

hc =  hclust(dist, method="complete") # complete linkage

 

Cluster methods
method = "complete" : complete linkage (uses the maximum distance) <- default (when omitted)
method = "single" : single linkage (uses the minimum distance)
method = "average" : average linkage (uses the average distance)

help(hclust)
plot(hc) # plot the dendrogram -> rows 1 and 2 form the first cluster
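The dendrogram can also be read numerically from the object returned by `hclust()`. A sketch that re-enters the matrix printed earlier (rounded to the digits shown) so the result is reproducible.

```r
# The 5 x 3 matrix from the earlier output, typed in by hand
x <- matrix(c(34.185929, 17.089664, 35.37096,
              30.240518, 18.241765, 35.55348,
               1.831156, 25.291404, 23.57555,
               5.630017, 41.771791, 31.44851,
              45.757080,  4.141226, 17.29342),
            nrow = 5, byrow = TRUE)

hc <- hclust(dist(x), method = "complete")

hc$merge    # one row per merge: negative = single observation, positive = cluster from an earlier step
hc$height   # distance at which each merge happened
hc$order    # left-to-right order of the leaves in the dendrogram
```

The first row of `hc$merge` joins observations 1 and 2, and the first entry of `hc$height` is their distance from the table above.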

 

[Practice] Cluster analysis of physical examination results for first-year middle school students
Step 1 : Load the dataset

body <- read.csv("c:/ITWILL/2_Rwork/data/bodycheck.csv")
names(body)


Step 2 : Compute distances

idist <- dist(body)
idist


Step 3 : Hierarchical cluster analysis

hc <- hclust(idist)
plot(hc, hang=-1) # a negative hang lines up all leaf labels at the bottom of the plot

 

Select 3 groups and set the border colour

rect.hclust(hc, k=3, border="red") # draw red boxes around the 3 clusters


Step 4 : Create a subset for each group (small dataset)

g1<- body[c(10,4,8,1,15), ]
g2<- body[c(11,3,5,6,14), ]
g3<- body[c(2,9,13,7,12), ]


Step 5 : Analyse the characteristics of each cluster
(columns : 번호 = ID, 악력 = grip strength, 신장 = height, 체중 = weight, 안경유무 = wears glasses)

summary(g1)

        번호           악력           신장            체중         안경유무
Mean   : 7.6   Mean   :25.6   Mean   :149.8   Mean   :36.6   Mean   :1
[Interpretation] Small-framed group

summary(g2)

Mean   : 7.8   Mean   :33.8   Mean   :161.2   Mean   :48.8   Mean   :1.4
[Interpretation] Medium group

summary(g3)

Mean   : 8.6   Mean   :40.6   Mean   :158.8   Mean   :56.8   Mean   :2
[Interpretation] Large-build group (highest grip strength and weight)

 

 

 


cutree() function
Creates the per-group subsets for a large dataset
Format) cutree(object, k=number of clusters)

g_num = cutree(hc, k=3) # returns the cluster number (1 to 3) of each observation
g_num # 1 2 3 1 3 3 2 1 2 1 3 2 2 3 1
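To see how the `k` argument changes the result, here is a sketch on made-up data (`pts` is hypothetical) where the clusters are obvious by eye.

```r
# Hypothetical toy data: five points on a line
pts <- matrix(c(0, 1, 3, 7, 8.5), ncol = 1)
hc_toy <- hclust(dist(pts), method = "complete")

groups2 <- cutree(hc_toy, k = 2)   # cut the tree into 2 clusters
groups3 <- cutree(hc_toy, k = 3)   # cut the same tree into 3 clusters

groups2
groups3
table(groups3)                       # number of observations per cluster
split(seq_along(groups3), groups3)   # row indices belonging to each cluster
```

`split()` gives the same row-index lists that were typed by hand in step 4, which is what makes `cutree()` practical for larger datasets.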


Add a column

body$cluster = g_num
head(body)

 

   번호 악력 신장 체중 안경유무 cluster
1    1   28  146   34        1       1 -> belongs to cluster 1
2    2   46  169   57        2       2
3    3   39  160   48        2       3
4    4   25  156   38        1       1
5    5   34  161   47        1       3
6    6   29  168   50        1       3

g1 = subset(body, cluster==1)
g1$'번호'
g2 = subset(body, cluster==2)
g2$'번호'
g3 = subset(body, cluster==3)
g3$'번호'
summary(g1) # same rows as g1 in step 4
summary(g2) # same rows as g3 in step 4 (per g_num, cutree's labels 2 and 3 are swapped relative to step 4)
summary(g3) # same rows as g2 in step 4
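Once the cluster column exists, the per-cluster means can be computed in one call instead of running `summary()` on each subset. A sketch that assumes only the six rows printed by `head(body)` above; the English column names (`id`, `grip`, `height`, `weight`) are stand-ins chosen for this example, not the names in the CSV.

```r
# Six rows copied from the head(body) output, with hypothetical English column names
body_sample <- data.frame(
  id      = 1:6,
  grip    = c(28, 46, 39, 25, 34, 29),
  height  = c(146, 169, 160, 156, 161, 168),
  weight  = c(34, 57, 48, 38, 47, 50),
  cluster = c(1, 2, 3, 1, 3, 3)
)

# Mean of each measurement within each cluster
means <- aggregate(cbind(grip, height, weight) ~ cluster,
                   data = body_sample, FUN = mean)
means
```

With the full dataset the same call would reproduce the `Mean` rows of the three summaries in a single table.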

 
