33. R Logistic Regression Analysis Exercises

LEE_BOMB 2021. 10. 18. 22:17

01. Using the Titanic (titanic) dataset, sample training and test data at a 7:3 ratio, then carry out a classification analysis step by step.

titanic = read.csv("c:/ITWILL/2_Rwork/data/titanic3.csv")
str(titanic)

'data.frame': 1309 obs. of 14 variables:
pclass : passenger class; 1st, 2nd, and 3rd class are stored as 1, 2, 3
survived : survival status. survived = 1, dead = 0
name : name (excluded)
sex : sex. female, male
age : age
sibsp : number of siblings or spouses aboard
parch : number of parents or children aboard
ticket : ticket number (excluded)
fare : ticket fare
cabin : cabin number (excluded)
embarked : port of embarkation (excluded). C(Cherbourg), Q(Queenstown), S(Southampton)
boat     : (excluded) Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
body     : (excluded) int  NA NA NA 135 NA NA NA NA NA 22 ...
home.dest: (excluded)


Convert survival status to a factor (so that it is treated as a categorical variable)

titanic$survived <- factor(titanic$survived, levels = c(0, 1))

as.factor vs factor 
as.factor : character/numeric -> factor, levels taken from the data
factor : character/numeric -> factor, levels (and labels) can be set explicitly
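
A small sketch of that difference, using a made-up vector `x` rather than the Titanic data: `as.factor()` infers the levels from the values, while `factor()` lets you fix the level set and, optionally, attach labels.

```r
# Hypothetical toy vector, for illustration only
x <- c(1, 0, 0, 1, 1)

f1 <- as.factor(x)                             # levels inferred from the data
f2 <- factor(x, levels = c(0, 1),
             labels = c("dead", "survived"))   # explicit levels and labels

levels(f1)   # "0" "1"
levels(f2)   # "dead" "survived"
table(f2)
```

Fixing `levels` explicitly also guarantees the level order, which decides which class ends up in row/column 2 of a confusion matrix.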


Survival frequency counts

table(titanic$survived)

  0   1  -> 0: dead, 1: survived
809 500

Survival proportions

prop.table(table(titanic$survived))

       0        1 
0.618029 0.381971



Step 1: Create a subset excluding 7 variables: name, ticket, cabin, embarked, boat, body, home.dest

titanic <- titanic[-c(3, 8, 10:14)]

When subsetting by column name, the minus sign (-) and the colon (:) cannot be used.
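
One possible way to drop columns by name instead is to build the list of columns to keep with `setdiff()` or `%in%`. The data frame `df` below is a made-up stand-in, not the real Titanic data.

```r
# Toy data frame standing in for titanic (hypothetical values)
df <- data.frame(pclass = c(1, 3), name = c("A", "B"),
                 age = c(29, 2), ticket = c("t1", "t2"))

drop_cols <- c("name", "ticket")

# df[-c("name", "ticket")] fails: unary minus is not defined for character vectors
df_sub <- df[, setdiff(names(df), drop_cols)]
names(df_sub)   # "pclass" "age"

# Equivalent form using a logical index
df_sub2 <- df[, !(names(df) %in% drop_cols)]
```

Name-based selection keeps working if the column order of the CSV changes, which position-based indexing does not.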

str(titanic)

'data.frame': 1309 obs. of  7 variables:


Step 2: Split the dataset 80% : 20% (the problem statement asks for 7:3; this solution uses 8:2)
Training set: titanic_train, test set: titanic_test

idx = sample(nrow(titanic), nrow(titanic)*0.8)
titanic_train = titanic[idx, ]
titanic_test = titanic[-idx, ]
dim(titanic_train) #1047    7
dim(titanic_test) #262   7
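
Because `sample()` is random, the row counts above stay the same from run to run, but the selected rows (and every metric computed later) change. A minimal sketch of a repeatable split, assuming an arbitrary seed of 123 and a toy data frame with the same number of rows as titanic:

```r
set.seed(123)                       # arbitrary seed, only to make the split repeatable
toy <- data.frame(id = 1:1309)      # same row count as titanic, made-up content

n_train <- floor(nrow(toy) * 0.8)   # 1047.2 -> 1047, made explicit here
idx <- sample(nrow(toy), n_train)

train <- toy[idx, , drop = FALSE]
test  <- toy[-idx, , drop = FALSE]
dim(train)   # 1047 1
dim(test)    # 262 1
```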


 
Step 3: Build the classification model: dependent variable = survived, independent variables = all remaining variables

library(rpart)
model = rpart(survived ~ ., data = titanic_train)
model


  
Step 4: Visualize the decision tree: using the 2-3 most important variables, describe when survival probability is high

library(rpart.plot) # provides rpart.plot()
rpart.plot(model)

[Commentary] Survival probability is high for passengers who are female, younger than 9.5 years, and in passenger class 2 or lower (1st or 2nd class).


Step 5: Model validation/evaluation (0: Negative, 1: Positive): accuracy, precision, recall, F1 score

y_pred = predict(model, titanic_test, type = 'class')
y_true = titanic_test$survived


Confusion matrix

t = table(y_true, y_pred)
t

            y_pred
y_true       0(N)   1(P)
      0(N) 137(TN)  26(FP) = 163
      1(P)  26(FN)  73(TP) = 99

accuracy = (t[1,1] + t[2,2]) / sum(t) # (TN + TP) / total
accuracy #0.801526

precision = t[2,2] / sum(t[,2]) # TP / (TP + FP)
precision #0.7373737

recall = t[2,2] / sum(t[2,]) # TP / (TP + FN)
recall #0.7373737

f1_score = 2 * ((precision * recall) / (precision + recall))
f1_score #0.7373737
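
The four metrics can also be wrapped in one small function. This is only a sketch: `clf_metrics` is a hypothetical helper name, and it assumes a 2 x 2 table with observed classes in rows, predicted classes in columns, and the positive class in position 2. The matrix is retyped from the output above.

```r
# Confusion matrix from the output above (rows = y_true, columns = y_pred)
cm <- matrix(c(137, 26,
                26, 73), nrow = 2, byrow = TRUE,
             dimnames = list(y_true = c("0", "1"), y_pred = c("0", "1")))

# Hypothetical helper: treats the second level ("1") as the positive class
clf_metrics <- function(t) {
  tp <- t[2, 2]; fp <- t[1, 2]; fn <- t[2, 1]
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = sum(diag(t)) / sum(t),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

round(clf_metrics(cm), 4)
```

Precision, recall, and F1 coincide here only because FP and FN happen to be equal (26 each); with a different random split they will generally differ.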

 

 


  

02. Using the weather data, carry out a classification analysis with a decision tree, following the steps below.
Condition 1) Build the classification model with the rpart() function
Condition 2) y variable: RainTomorrow; x variables: all remaining variables except variables 1, 6, 8, and 14
Condition 3) Use the model visualization to identify the x variables that influence y the most
Condition 4) Categorize as 'Yes Rain' if the probability of rain is 50% or higher, and as 'No Rain' if it is below 50%

Step 1: Load the data

library(rpart) # build the model
library(rpart.plot) # visualize the classification tree

setwd("c:/ITWILL/2_Rwork/data")
weather = read.csv("weather.csv", header=TRUE) 
str(weather)

'data.frame': 366 obs. of  15 variables:

Exclude chr-type variables: y = RainTomorrow

weather <- weather[-c(1, 6, 8, 14)]
str(weather)

'data.frame': 366 obs. of  11 variables:

Convert the y variable to a factor

weather$RainTomorrow <- as.factor(weather$RainTomorrow)
table(weather$RainTomorrow)




Step 2: Sample the data: 70% vs 30%

idx <- sample(nrow(weather), nrow(weather)*0.7)
weather_train <- weather[idx, ] # model training
weather_test <- weather[-idx, ] # model evaluation


Step 3: Build the classification model

model <- rpart(RainTomorrow ~ ., data = weather_train)


Step 4: Visualize the classification model - identify the important variables

rpart.plot(model)

[Commentary] Humidity (Humidity), wind gust speed (WindGustSpeed), and pressure (Pressure) are among the variables that affect rain.


Step 5: Categorize the predicted probabilities ('Yes Rain', 'No Rain')

pred_rate <- predict(model, weather_test) # predict class probabilities
range(pred_rate) # 0.04268293 0.95731707 : range of probabilities
dim(pred_rate) # 110   2(No  Yes) -> (probability of no rain, probability of rain)


Class assignment

y_pred <- ifelse(pred_rate[, 2] >= 0.5, 'Yes Rain', 'No Rain') # 50% or higher -> 'Yes Rain'
table(y_pred)

No Rain Yes Rain 
   101        9 

Step 6: Create the confusion matrix and compute the classification accuracy

t = table(y_pred, weather_test$RainTomorrow)
t
# y_pred     No Yes : observed values

No Rain  88  13
Yes Rain  2   7

acc = (t[1,1] + t[2,2]) / sum(t)
cat('Classification accuracy =', acc) # Classification accuracy = 0.8636364
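
With a small test set it can happen that no row is predicted as 'Yes Rain'; `table()` then returns a single row and `t[2,2]` raises a subscript error. A sketch of one way to guard against that by fixing the factor levels first. The probabilities `p_yes` and labels `obs` are made up for illustration, and the 0.5 cutoff follows the 50% rule in the problem statement.

```r
# Made-up predicted probabilities of rain and observed labels
p_yes <- c(0.10, 0.20, 0.05, 0.30, 0.45)
obs   <- factor(c("No", "No", "No", "Yes", "Yes"), levels = c("No", "Yes"))

# Fixed levels keep the table 2 x 2 even though nothing reaches the cutoff
y_pred <- factor(ifelse(p_yes >= 0.5, "Yes Rain", "No Rain"),
                 levels = c("No Rain", "Yes Rain"))

t <- table(y_pred, obs)
t
acc <- sum(diag(t)) / sum(t)
acc   # 0.6
```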

 

 




03. Using the Boston dataset, build a regression tree model step by step.

 

Step 1: Load the dataset

library(MASS)
data("Boston")
str(Boston)

$ crim   : per-capita crime rate num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn     : proportion of residential land zoned for lots over 25,000 sq. ft. num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus  : proportion of land occupied by non-retail business num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas   : Charles River dummy variable (1: tract bounds the river) int  0 0 0 0 0 0 0 0 0 0 ...
$ nox    : nitrogen oxides concentration num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm     : average number of rooms num  6.58 6.42 7.18 7 7.15 ...
$ age    : proportion of old houses num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis    : accessibility index to employment centres num  4.09 4.97 4.97 6.06 6.06 ...
$ rad    : accessibility index to highways int  1 2 2 3 3 3 5 5 5 5 ...
$ tax    : property tax rate num  296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: pupil/teacher ratio num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black  : proportion of Black residents num  397 397 393 395 397 ...
$ lstat  : proportion of lower-status population num  4.98 9.14 4.03 2.94 5.33 ...
$ medv   : house price (unit: 1,000 dollars) num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

library(rpart) # build the model
library(rpart.plot) # visualize the decision tree


Step 2: Build the regression tree model: y variable: medv, x variables: the remaining 13 variables

model <- rpart(medv ~ ., data = Boston)
model


Step 3: Visualize the regression tree - identify the important variables

rpart.plot(model)

[Commentary] The more rooms (rm) and the lower the proportion of lower-status population (lstat), the higher the house price.


Step 4: Evaluate the model with 3-fold cross-validation

library(cvTools)


1) Sampling for K-fold cross-validation

cross <- cvFolds(nrow(Boston), K = 3)
cross # Fold   Index
cross$subsets[cross$which==1, 1]
cross$subsets[cross$which==2, 1]
cross$subsets[cross$which==3, 1]
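
For intuition, a similar fold assignment can be mimicked in base R. This is only a sketch of the idea (shuffle the row indices, then deal them into K groups), not a description of how `cvFolds()` works internally; the seed is arbitrary.

```r
set.seed(1)          # arbitrary seed
n <- 506             # number of rows in Boston
K <- 3

shuffled <- sample(n)                        # random permutation of row indices
fold_id  <- rep(seq_len(K), length.out = n)  # 1 2 3 1 2 3 ...

folds <- split(shuffled, fold_id)            # list of K index vectors
sapply(folds, length)                        # 169 169 168
```

Each row index lands in exactly one fold, so every observation is used for testing once and for training K - 1 times.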


2) K-fold cross-validation

n = 1:3
R2_score <- numeric() # evaluation results

for(k in n){ # 1 2 3
  idx <- cross$subsets[cross$which==k, 1]
  test <- Boston[idx, ] # test set (1 fold)
  train <- Boston[-idx, ] # training set (2 folds)

  # build the model
  model = rpart(medv ~ ., data = train)

  # predict y (numeric values)
  y_pred <- predict(model, test)
  y_true <- test$medv

  # evaluate the model: R2 score (squared correlation)
  R2_score[k] <- cor(y_true, y_pred)^2
}

3) Cross-validation result: R2 score (for the case where y is not scaled)

cat('R2 score =', mean(R2_score))
#R2 score = 0.7114659
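
Note that `cor(y_true, y_pred)^2` is the squared Pearson correlation. It is not always equal to the coefficient of determination 1 - SSE/SST, because correlation ignores a constant bias or rescaling in the predictions. A small sketch with made-up vectors (not Boston results) shows the gap:

```r
# Made-up values for illustration
y_true <- c(10, 20, 30, 40, 50)
y_pred <- y_true + 5          # perfectly correlated, but biased by +5

r2_cor <- cor(y_true, y_pred)^2

sse <- sum((y_true - y_pred)^2)          # residual sum of squares
sst <- sum((y_true - mean(y_true))^2)    # total sum of squares
r2_det <- 1 - sse / sst

r2_cor   # 1
r2_det   # 0.875
```

Reporting both values is a cheap check: a large gap between them suggests systematically shifted or scaled predictions.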