01. Correlation Analysis

A method for analyzing how variables are related to one another

Measures how strongly one variable is associated with another

ex. Analyzing the relationship between advertising spend and sales revenue

 

Key points of correlation analysis

Examines the relationships among variables before regression analysis (performed before hypothesis testing)

Uses a correlation coefficient (Pearson's r) to judge whether a relationship exists

Measure used in correlation analysis : Pearson correlation coefficient = r

 

Examples of correlation

Positive correlation : the variables move in the same direction. IQ and grades, height and weight

Negative correlation : the variables move in opposite directions. Crop yield and price, altitude and temperature

No correlation : neither an increasing nor a decreasing trend. Smartphone usage time and grades, crime rate and ice cream sales

 

 

 

02. Pearson correlation coefficient r

Pearson correlation coefficient r    Degree of correlation
±0.9 or higher                       Very high correlation
±0.7 ~ ±0.9                          High correlation
±0.4 ~ ±0.7                          Moderately high correlation
±0.2 ~ ±0.4                          Low correlation
Below ±0.2                           No correlation

* The correlation coefficient r takes values from -1 to +1. A perfect correlation has r = +1 (positive) or r = -1 (negative), and r = 0 when the two variables have no linear correlation at all.

In other words, the stronger the linear relationship between two variables, the closer |r| is to 1.

The weaker the relationship, the closer r is to 0.
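The table above can be turned into a small lookup. The sketch below is only an illustration: `cor_strength` is a hypothetical helper name, and the cut-offs are simply the ones listed in the table, with each boundary value counted in the higher band.

```r
# Hypothetical helper (not from the original post): label |r| using the table's cut-offs.
cor_strength <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1, na.rm = TRUE))
  labels <- c("none", "low", "moderately high", "high", "very high")
  # findInterval() counts how many cut-offs |r| has reached (0 to 4)
  labels[findInterval(abs(r), c(0.2, 0.4, 0.7, 0.9)) + 1]
}

cor_strength(c(0.4992086, -0.8676594, 0.1))
# "moderately high" "high" "none"
```

The sign of r is dropped before the lookup, since the table grades strength only; direction is read from the sign separately.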

 

Correlation graph

Practice

product <- read.csv("C:/ITWILL/2_Rwork/data/product.csv")
head(product) # familiarity (친밀도), adequacy (적절성), satisfaction (만족도): interval scale, 5-point


Descriptive statistics

summary(product) # summary statistics
sd(product$제품_친밀도); sd(product$제품_적절성); sd(product$제품_만족도)


Correlation analysis between variables
Format) cor(x, y, method) # x variable, y variable, method: "pearson" (default), "kendall", or "spearman"

1) Coefficient of correlation : a number (coefficient) expressing the degree of correlation between two variables X and Y

cor(product$제품_친밀도, product$제품_적절성) # 0.4992086 -> moderately high positive correlation
cor(product$제품_친밀도, product$제품_만족도) # 0.467145 -> moderately high positive correlation


Viewing correlation coefficients for all variables

COR = cor(product, method="pearson") # Pearson correlation coefficient (the default)
str(COR) # num [1:3, 1:3] : matrix


Indexing the correlation matrix by a specific variable

COR['제품_만족도',]


Showing direction with color - same-sign correlations share a color, and the shade indicates strength

install.packages("corrgram")
library(corrgram)
corrgram(product) # colored display - same-sign correlations share a color
corrgram(product, upper.panel=panel.conf) # add values (correlation coefficients) to the upper panel
corrgram(product, lower.panel=panel.conf) # add values (correlation coefficients) to the lower panel


Adding curves and significance stars to the chart

install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)


Visualizing correlations, p-values (*), and distributions - to check the conditions for a parametric test

chart.Correlation(product, histogram=TRUE)

* spearman : correlation coefficient for ordinal-scale data
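To see what the spearman method changes, it may help to compare it with pearson on data that rises steadily but not in a straight line. The vectors below are made up for illustration and are not taken from product.csv.

```r
# Illustrative data (an assumption, not from product.csv): y rises with x, but not linearly.
x <- 1:10
y <- x^3

pearson  <- cor(x, y, method = "pearson")   # measures linear association; below 1 here
spearman <- cor(x, y, method = "spearman")  # computed on ranks; 1 for any strictly increasing relation

print(c(pearson = pearson, spearman = spearman))
```

Because spearman works on ranks only, it suits ordinal data such as survey rankings, where the distances between levels are not meaningful.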

install.packages('corrplot')
library(corrplot)

 

Automobile fuel economy dataset

data("mtcars")
str(mtcars)


Correlation matrix

COR = cor(mtcars)
COR['mpg',] # mpg (fuel economy) vs cyl (number of cylinders), disp (engine displacement), hp (horsepower)

corrplot(COR)
cor(mtcars$mpg, mtcars$wt) # -0.8676594
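A coefficient alone does not say whether the relationship could be due to chance. As a rough sketch, base R's `cor.test()` reports the same estimate together with a p-value and a confidence interval; it is applied here to the mpg and wt pair from above.

```r
# Test H0: the true correlation between mpg and wt is zero (Pearson by default).
res <- cor.test(mtcars$mpg, mtcars$wt)

res$estimate   # sample correlation, the same value cor() returns
res$p.value    # a small p-value is evidence against H0
res$conf.int   # 95% confidence interval for the correlation
```

The stars drawn by chart.Correlation summarize this kind of p-value for every pair at once.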

 

Scatter plot

Format) qplot(x, y, data)
library(ggplot2)

qplot(wt, mpg, data=mtcars, color=factor(cyl)) # treats cyl as a factor (categorical)

qplot(wt, mpg, data=mtcars, color=cyl) # treats cyl as continuous (numeric)

[Interpretation] Cars that are lighter and have fewer cylinders get the best fuel economy.
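qplot() has been deprecated since ggplot2 3.4.0, so on a recent installation the same plot may be better written with ggplot(). The following is one possible equivalent of the factor(cyl) version, assuming ggplot2 is installed; the axis labels are additions for readability.

```r
library(ggplot2)

# Same scatter plot as the qplot() call, written with ggplot();
# factor(cyl) gives one discrete color per cylinder count.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (wt)", y = "Fuel economy (mpg)", color = "cyl")

print(p)
```

Using `color = cyl` without factor() would give a continuous color gradient instead, matching the second qplot() call.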

 

 

 

 

 

03. Covariance

Covariance? A statistic that describes how two random variables vary together around their means

Random variables : X, Y -> sample mean of X (x̄), sample mean of Y (ȳ)

Cov(X,Y) = sum( (X-x̄) * (Y-ȳ) ) / n

 

 

 

 

 

04. Covariance VS correlation coefficient

Cov(X, Y) > 0 : Y tends to increase as X increases

Cov(X, Y) < 0 : Y tends to decrease as X increases

Cov(X, Y) = 0 : no linear relationship between the two variables (independent variables have zero covariance, but zero covariance alone does not guarantee independence)

Drawback : the value depends on the scale of the variables (a variable with larger values appears more strongly related)


Correlation coefficient? Covariance normalized by dividing by the standard deviation of each variable

Solves the drawback of covariance

Same sign as the covariance, and the absolute value never exceeds 1 (-1 ~ 1)

Corr(X, Y) = Cov(X,Y) / (std(X) * std(Y))
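Written out with sums, the normalization looks as follows. This is the standard textbook form, using the same x̄ and ȳ as the covariance formula in section 03; note that the divisor (n or n - 1) cancels as long as the covariance and both standard deviations use the same one.

```latex
\[
r = \mathrm{Corr}(X, Y)
  = \frac{\mathrm{Cov}(X, Y)}{\mathrm{std}(X)\,\mathrm{std}(Y)}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```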

 

 

(1) Covariance

Covariance formula : Cov(X,Y) = sum((X-xbar) * (Y-ybar)) / n


0.5463331 : 제품_적절성:X, 제품_만족도:Y (the value cov(X, Y) reports)
X = product$'제품_적절성'
Y = product$'제품_만족도'

Xbar = mean(X)
Ybar = mean(Y)

Cov_xy = mean((X-Xbar) * (Y-Ybar)) # divides by n
Cov_xy = sum((X-Xbar) * (Y-Ybar)) / length(X) # same value, written out
Cov_xy # 0.5442637 (R's cov() divides by n - 1 instead, hence 0.5463331)


(2) Correlation coefficient
0.7668527 : X:제품_적절성, Y:제품_만족도 (the value cor(X, Y) reports)

Cor_xy = Cov_xy / (sd(X) * sd(Y)) # -1 ~ +1
Cor_xy # 0.763948 (below 0.7668527 because Cov_xy divides by n while sd() divides by n - 1)
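The small gap between the hand-computed values and R's built-in results comes from the divisor. The sketch below reproduces the effect on a tiny made-up data set (the vectors are an assumption for illustration, not the product.csv columns).

```r
# Small made-up vectors (an assumption for illustration; not the product.csv columns).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

cov_n  <- sum((x - mean(x)) * (y - mean(y))) / n        # divisor n:     1.2
cov_n1 <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # divisor n - 1: 1.5, same as cov(x, y)

# sd() also uses n - 1, so pair it with the n - 1 covariance to reproduce cor()
r_matched    <- cov_n1 / (sd(x) * sd(y))
r_mismatched <- cov_n  / (sd(x) * sd(y))  # smaller by the factor (n - 1) / n

print(c(matched = r_matched, mismatched = r_mismatched, cor = cor(x, y)))
```

With a large n the factor (n - 1) / n is close to 1, which is why the two results above differ only in the third decimal place.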




05. Variables with different scales

getwd()
setwd("C:/ITWILL/2_Rwork/data")

score_iq = read.csv('score_iq.csv')
head(score_iq)

score VS iq (values in the 100s), academy (single-digit values)
cov(score_iq[-1])

           score         iq    academy
score   42.968412 51.3375391  7.1199105
[Problem] The variable with larger values is read as being more strongly related

cor(score_iq[-1])

           score          iq    academy
score    1.0000000  0.88222034  0.8962647
[Solution] This is why cor is the usual choice for correlation analysis
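The scale dependence can be checked directly by changing the units of one variable. The sketch below uses the built-in mtcars data as a stand-in (an assumption, since score_iq.csv is not reproduced here) and re-expresses weight in lb instead of 1000 lb.

```r
# Built-in mtcars is used as a stand-in (an assumption; score_iq.csv is not reproduced here).
wt_1000lb <- mtcars$wt          # weight in units of 1000 lb
wt_lb     <- mtcars$wt * 1000   # the same weights expressed in lb

cov(mtcars$mpg, wt_1000lb)  # covariance in the original units
cov(mtcars$mpg, wt_lb)      # 1000 times larger, although the relationship is unchanged

cor(mtcars$mpg, wt_1000lb)  # correlation is identical in both cases
cor(mtcars$mpg, wt_lb)
```

The covariance grows with the unit change while the correlation stays put, which is the behavior seen in the cov and cor outputs above.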

 

 


<Solution> Data transformation (standardization, normalization)
Purpose : when scales differ, bring the variables onto a common range
Standardization converts each variable to mean 0 and standard deviation 1

1. Standardization

Centers the data on 0 so that values are distributed on both sides of it

Converts each value into how far it lies from the mean, in units of standard deviation
Standardization formula : z = (x - mu) / sigma
z_score = scale(x=score_iq, center=T, scale=T)

summary(z_score) # mean of every column: 0.0000
sd(z_score) # 0.9972153 : sd of all values pooled together; each column's own sd is exactly 1 (apply(z_score, 2, sd))

str(z_score) # num [1:150, 1:6] : returned as a matrix

cov(z_score[,-1]) # sid column excluded

            score          iq    academy
score    1.0000000  0.88222034  0.8962647
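The covariance row above matches the earlier cor() output, and that is no accident: the covariance of standardized variables is their correlation. A minimal check, using built-in mtcars columns as a stand-in for score_iq (an assumption made so the example is self-contained):

```r
# Built-in mtcars columns stand in for score_iq (an assumption for a self-contained example).
df <- mtcars[, c("mpg", "wt", "hp")]
z  <- scale(df, center = TRUE, scale = TRUE)  # matrix of z-scores

round(colMeans(z), 10)  # every column mean is 0 (up to rounding error)
apply(z, 2, sd)         # every column sd is 1

# Covariance of the standardized columns equals the correlation of the raw columns
all.equal(cov(z), cor(df))
```

Checking the standard deviation column by column with apply() avoids pooling all values of the matrix into one figure, which is what sd() does when given the whole matrix.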


2. Normalization = min/max scaling

A scaling method that maps the data onto a fixed interval (here 0 to 1)

Used to see where a particular value sits within its group of data
Normalization formula : (x - min(x)) / (max(x) - min(x))

head(score_iq) # before scaling
    sid score  iq academy game tv
1 10001    90 140       2    1  0

nor_scale = function(x){
  nor = (x-min(x)) / (max(x) - min(x))
  return(nor)
} 

nor_score = apply(score_iq, 2, nor_scale) # data to transform, 2: column-wise, function to apply
head(nor_score)

             sid score        iq academy
[1,] 0.000000000  1.00 1.0000000    0.50

summary(nor_score) #num[1:150, 1:6]


Viewing the covariance matrix

cov(nor_score[,-1])

              score           iq     academy
score    0.06874946  0.058671473  0.07119911
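Unlike standardization, min/max scaling does not turn covariances into correlations; the covariances still depend on how each column is spread within 0 to 1. The sketch below illustrates this with the post's nor_scale function, redefined here so the block runs on its own, and built-in mtcars columns as a stand-in for score_iq (an assumption).

```r
# nor_scale is the function defined above in the post; mtcars columns are a stand-in (assumption).
nor_scale <- function(x) (x - min(x)) / (max(x) - min(x))

df  <- mtcars[, c("mpg", "wt", "hp")]
nor <- apply(df, 2, nor_scale)

apply(nor, 2, range)            # each column now spans exactly 0 to 1
cov(nor)                        # covariances shrink with each column's range
all.equal(cor(nor), cor(df))    # correlations are unchanged by min-max scaling
```

Correlation is unaffected because min/max scaling only shifts each column and divides it by a positive constant.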
