39. R Ensemble Model Exercises

LEE_BOMB 2021. 10. 25. 18:28

01. Using the weather dataset, build a RandomForest model step by step.
<Step 1> Exclude columns 1, 2, 22, and 23
<Step 2> y variable: convert RainTomorrow to a binary (factor) variable
<Step 3> Build the model: 400 trees, 4 candidate variables per split
<Step 4> Compute classification accuracy from the confusion matrix
<Step 5> Identify the top 3 variables contributing to classification accuracy

library(randomForest) # provides randomForest() and varImpPlot()
setwd("C:/ITWILL/2_Rwork/data")
weatherAUS = read.csv('weatherAUS.csv')

<Step 1> Exclude columns 1, 2, 22, and 23

df = weatherAUS[, c(-1, -2, -22, -23)] # negative indices drop columns by position
str(df)


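<Step 2> y variable: convert RainTomorrow to a binary (factor) variable

sqrt(ncol(df)) # 4.242641, which motivates mtry = 4 in Step 3

df$RainTomorrow = as.factor(df$RainTomorrow) # a factor response makes randomForest() fit a classifier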
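
The square root above is the usual rule of thumb for `mtry` in classification forests: `randomForest()` defaults to floor(sqrt(p)), where p is the number of predictors. A minimal sketch on a made-up data frame (the `toy` frame and its generated column names are hypothetical) that counts predictors without the response. The 17 predictors follow from sqrt(ncol(df)) = 4.242641, i.e. 18 columns including the response.

```r
# Toy stand-in for df: 17 predictor columns plus the response (values are made up).
toy <- as.data.frame(matrix(0, nrow = 5, ncol = 17))
toy$RainTomorrow <- factor(c("No", "Yes", "No", "No", "Yes"))

p <- ncol(toy) - 1              # predictors only; the response is not a split candidate
mtry_default <- floor(sqrt(p))  # rule of thumb for classification
mtry_default                    # 4
```

Counting the response as well (sqrt(18)) rounds down to the same value here, so mtry = 4 is consistent either way.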



<Step 3> Build the model: 400 trees, 4 candidate variables per split

model_weather = randomForest(RainTomorrow ~ .,
                              data = df,
                              ntree = 400, mtry = 4,
                              importance = TRUE,
                              na.action = na.omit) # rows containing NA are dropped
model_weather # OOB error rate: 14.42%
names(model_weather)
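
`na.action = na.omit` drops every row with at least one missing value before fitting, so the forest may be trained on fewer rows than `nrow(df)`. A small sketch with made-up values (the `toy` frame is hypothetical) showing how to count the rows that would survive:

```r
# Made-up rows: two of the four contain a missing value.
toy <- data.frame(
  Humidity3pm  = c(40, NA, 55, 70),
  Sunshine     = c(8.1, 3.0, NA, 0.5),
  RainTomorrow = factor(c("No", "No", "Yes", "Yes"))
)

kept <- na.omit(toy)       # same filtering that na.action = na.omit applies
nrow(toy)                  # 4
nrow(kept)                 # 2
sum(complete.cases(toy))   # 2, the same count without copying the data
```

Running `sum(complete.cases(df))` on the real data before fitting shows how many observations the model actually sees.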


<Step 4> Compute classification accuracy from the confusion matrix

con = model_weather$confusion
con

# Columns 1:2 hold the counts; column 3 is class.error and must not be summed
acc = (con[1,1]+con[2,2]) / sum(con[, 1:2])
cat('accuracy =', acc)
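
The `confusion` component of a classification forest carries a third column, `class.error`, next to the two count columns, so that column has to stay out of the denominator. A sketch with made-up counts in the same layout (`con_toy` is hypothetical, not output from this model):

```r
# Made-up 2 x 3 matrix shaped like model$confusion: counts plus class.error.
con_toy <- cbind(
  matrix(c(90, 10, 5, 45), nrow = 2,
         dimnames = list(c("No", "Yes"), c("No", "Yes"))),
  class.error = c(5 / 95, 10 / 55)
)

counts <- con_toy[, 1:2]                        # keep only the count columns
acc    <- sum(diag(counts)) / sum(counts)       # 135 / 150
naive  <- sum(diag(counts)) / sum(con_toy)      # denominator inflated by class.error
acc                                             # 0.9
```

With thousands of observations the inflation is small, but the naive value is always slightly below the true accuracy.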


<Step 5> Identify the top 3 variables contributing to classification accuracy

varImpPlot(model_weather)

Humidity3pm (humidity)

Sunshine (sunshine)

WindGustSpeed (wind gust speed)
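
Instead of reading the ranking off the plot, the same ordering can be extracted numerically. With a fitted model, `importance(model_weather, type = 1)` returns the mean decrease in accuracy per variable. The sketch below sorts such a matrix. The helper `top_vars`, the variable names, and the scores are all invented for illustration so that the code runs without the model.

```r
# Hypothetical helper: names of the n variables with the largest MeanDecreaseAccuracy.
top_vars <- function(imp, n = 3) {
  scores <- imp[, "MeanDecreaseAccuracy"]       # named vector, one score per variable
  ranked <- names(sort(scores, decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

# Invented scores in the layout of importance(model, type = 1).
imp_toy <- matrix(c(12.5, 40.2, 3.1, 25.7), ncol = 1,
                  dimnames = list(c("var_a", "var_b", "var_c", "var_d"),
                                  "MeanDecreaseAccuracy"))

top_vars(imp_toy)   # "var_b" "var_d" "var_a"
```

On the real model the call would be `top_vars(importance(model_weather, type = 1))`, assuming the model was fitted with `importance = TRUE` as above.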

 

 

 



02. Referring to the variable list of the loan dataset, build a RandomForest model step by step.
<Step 1> Exclude columns 1 and 5
<Step 2> y variable: convert Personal.Loan to a binary (factor) variable
<Step 3> Build the model: 500 trees, 3 candidate variables per split
<Step 4> Compute classification accuracy from the confusion matrix
<Step 5> Identify the top 3 variables contributing to classification accuracy


<Loan dataset variable list>

bank = read.csv('UniversalBank.csv', stringsAsFactors = FALSE)
str(bank)


<Loan variable list>
'data.frame': 5000 obs. of  14 variables:
$ ID                : customer ID (excluded)  int  1 2 3 4 5 6 7 8 9 10 ...
$ Age               : age  int  25 45 39 35 35 37 53 50 35 34 ...
$ Experience        : work experience  int  1 19 15 9 8 13 27 24 10 9 ...
$ Income            : income  int  49 34 11 100 45 29 72 22 81 180 ...
$ ZIP.Code          : postal code (excluded)  int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
$ Family            : family size  int  4 3 1 1 4 4 2 1 3 1 ...
$ CCAvg             : monthly credit card spending  num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
$ Education         : education level  int  1 1 1 2 2 2 2 3 2 3 ...
$ Mortgage          : mortgage  int  0 0 0 0 0 155 0 0 104 0 ...
$ Personal.Loan     : personal loan (y variable: accepted or rejected)  int  0 0 0 0 0 0 0 0 0 1 ...
$ Securities.Account: securities account  int  1 1 0 0 0 0 0 0 0 0 ...
$ CD.Account        : CD account  int  0 0 0 0 0 0 0 0 0 0 ...
$ Online            : online banking  int  0 0 0 0 0 1 1 0 1 0 ...
$ CreditCard        : credit card  int  0 0 0 0 1 0 0 1 0 0 ...

 


<Step 1> Create a subset excluding the 'ID' and 'ZIP.Code' columns

dim(bank) # 5000   14
bank_df = bank[c(-1, -5)] # drop ID and ZIP.Code
dim(bank_df) # 5000   12
sqrt(11) # 3.316625: 11 predictors, which motivates mtry = 3
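
Dropping columns by position breaks silently if the column order of the file ever changes. Selecting by name is a sturdier alternative. A sketch on a tiny frame that reuses the first three rows of four columns from the listing above (`toy`, `drop_cols`, and `toy_df` are names made up for this example):

```r
# First three rows of four columns from the str() listing.
toy <- data.frame(
  ID       = 1:3,
  Age      = c(25, 45, 39),
  ZIP.Code = c(91107, 90089, 94720),
  Income   = c(49, 34, 11)
)

drop_cols <- c("ID", "ZIP.Code")
toy_df <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]  # keep everything else
names(toy_df)   # "Age" "Income"
```

`drop = FALSE` keeps the result a data frame even when only one column remains.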


<Step 2> y variable: convert Personal.Loan to a binary (factor) variable
Personal.Loan: loan accepted or rejected

bank_df$Personal.Loan = as.factor(bank_df$Personal.Loan) # convert in bank_df, the frame passed to the model
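
Whether `randomForest()` fits a classification or a regression forest depends on the type of the response: a factor gives classification, while a numeric column gives regression, which has no confusion matrix. A quick check on a made-up 0/1 column (the `toy` frame is hypothetical) that is worth running on the modelling frame before fitting:

```r
# Made-up integer response, as read.csv() would deliver it.
toy <- data.frame(Personal.Loan = c(0L, 0L, 1L, 0L, 1L))
was_factor <- is.factor(toy$Personal.Loan)   # FALSE: would be fitted as regression

toy$Personal.Loan <- as.factor(toy$Personal.Loan)
is.factor(toy$Personal.Loan)                 # TRUE: now fitted as classification
levels(toy$Personal.Loan)                    # "0" "1"
table(toy$Personal.Loan)                     # class counts: 3 and 2
```

The `table()` call also shows the class balance, which matters when interpreting overall accuracy.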

 

<Step 3> Build the model: 500 trees, 3 candidate variables per split

model_bank = randomForest(Personal.Loan ~ .,
                           data = bank_df,
                           ntree = 500, mtry = 3,
                           importance = TRUE,
                           na.action = na.omit)
model_bank # prints the OOB error estimate and the confusion matrix


<Step 4> Compute classification accuracy from the confusion matrix

con = model_bank$confusion
con

# Columns 1:2 hold the counts; column 3 is class.error and must not be summed
acc = (con[1,1]+con[2,2]) / sum(con[, 1:2])
cat('accuracy =', acc) # accuracy of about 0.988


<Step 5> Identify the top 3 variables contributing to classification accuracy

varImpPlot(model_bank)

Income > Education > Family