Descriptive Statistics

Descriptive statistics: basic statistics that summarize a data set, used to understand the characteristics of each variable and to make inferences about the population
Representative values: mean, sum, median, mode, quartiles, etc.
Dispersion: variance, standard deviation, minimum, maximum, range, etc.
Shape of the distribution (asymmetry): skewness, kurtosis

Loading the practice file

setwd("C:/ITWILL/2_Rwork/data")
data = read.csv("descriptive.csv", header=TRUE)

head(data) # check the data set

Data mart
resident        gender         age    level             cost             type           survey              pass
residence area  gender         age    education level   living expenses  school type    satisfaction        pass/fail
nominal (1~3)   nominal (1,2)  ratio  ordinal (1,2,3)   ratio            nominal (1,2)  interval (5-point)  nominal (1,2)
* Demographic variables: residence area, gender, age, education level

 

 


1. Descriptive statistics by measurement scale
1) Descriptive statistics for nominal/ordinal variables
A variable coded with numbers that carry no numeric meaning: gender

length(data$gender)
summary(data$gender) # min, max, median, mean: not meaningful for a nominal variable
table(data$gender) # frequency of each gender code; reveals outliers coded 0 and 5
data = subset(data,data$gender == 1 | data$gender == 2) # remove the gender outliers
x = table(data$gender) # store the gender frequencies
x # confirm that the outliers are gone
barplot(x) # categorical (nominal/ordinal) data -> bar chart
prop.table(x) # proportions: each value lies between 0 and 1
y = prop.table(x)
round(y*100, 2) # as percentages (2 decimal places)
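The same filter-then-tabulate steps can be tried without the practice file. The sketch below uses a small made-up vector of gender codes (hypothetical data, not taken from descriptive.csv) in which 0 and 5 stand in for the invalid entries.

```r
# Hypothetical gender codes: 1 and 2 are valid, 0 and 5 are invalid entries
gender <- c(1, 2, 2, 1, 0, 2, 5, 1, 2, 2)

# Keep only the valid codes
valid <- gender[gender %in% c(1, 2)]

freq <- table(valid)                       # counts per category
pct  <- round(prop.table(freq) * 100, 2)   # percentages with 2 decimals

print(freq)
print(pct)
```

Using %in% instead of chained == comparisons keeps the filter short when the list of valid codes grows.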


2) Descriptive statistics for interval-scale variables
A variable whose values are equally spaced (survey): addition and subtraction are meaningful

survey = data$survey
survey

summary(survey) # meaningful for a 5-point satisfaction scale
x1<-table(survey) # frequencies
x1

hist(survey) # interval-scale data -> histogram
pie(x1)


3) Descriptive statistics for ratio-scale variables: the cost variable

summary(data$cost) # summary statistics are meaningful for ratio data; mean = 8.784 before cleaning
mean(data$cost) # NA, because cost contains missing values
data$cost


Data cleaning: remove missing values and outliers

plot(data$cost)
data = subset(data,data$cost >= 2 & data$cost <= 8) # keep 2 <= cost <= 8; rows with NA cost are dropped as well
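Why mean() returns NA, and why the range filter also removes the missing values, is easier to see on a tiny example. The vector below is invented for illustration; only the 2 to 8 range comes from the text above.

```r
# Hypothetical cost values with one missing value and two extreme entries
cost_raw <- c(4.2, 5.1, NA, 6.3, 675, 3.8, -345, 5.9)

mean(cost_raw)                 # NA: a single missing value propagates
mean(cost_raw, na.rm = TRUE)   # computed, but distorted by the extreme values

df <- data.frame(cost = cost_raw)

# subset() keeps rows where the condition is TRUE, so rows where the
# comparison evaluates to NA are dropped together with the outliers
clean <- subset(df, cost >= 2 & cost <= 8)

clean$cost
mean(clean$cost)
```

If only the missing values should be removed while keeping every observed value, na.omit() or the na.rm = TRUE argument is the more direct choice.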


Extracting the cost variable

cost = data$cost
cost

 

 


2. Representative values
1) Mean

mean(cost)

* When the mean is distorted by extreme values, use the median instead

2) Median: sort the data, then take the middle value

median(cost) # 5.4

sort(cost) # ascending order
sort(cost, decreasing = TRUE)


Computing the median by hand

length(cost) #248

Even length: (value at position n/2 + value at position n/2+1) / 2
Odd length: value at position (n+1)/2

idx = length(cost)/2 # index n/2 (n = 248 is even)


Sort, then average the two middle values

cost_sort = sort(cost)
(cost_sort[idx] + cost_sort[idx+1])/2 #5.4
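The even and odd cases can be combined into one small function. This is a minimal sketch; manual_median is a name chosen here for illustration and is not part of the original material.

```r
# Median by hand: sort, then pick the middle value (odd n)
# or average the two middle values (even n)
manual_median <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 0) {
    (s[n / 2] + s[n / 2 + 1]) / 2
  } else {
    s[(n + 1) / 2]
  }
}

manual_median(c(5, 1, 4, 2))      # even length
manual_median(c(5, 1, 4, 2, 3))   # odd length
```

Both calls should agree with the built-in median() on the same input.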


3) Mode: for a continuous variable, read it from hist()

hist(cost) # midpoint of the class with the tallest bar = 6.5
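Instead of reading the tallest bar off the plot, the modal class can be taken from the object that hist() returns. The data and the explicit breaks below are assumptions made for this sketch; with the default breaks the classes may differ.

```r
# Hypothetical continuous values
v <- c(2.5, 3.1, 5.2, 5.5, 5.9, 6.1, 6.4, 6.6, 6.8, 7.2)

# plot = FALSE returns the histogram object without drawing it
h <- hist(v, breaks = seq(2, 8, by = 1), plot = FALSE)

h$counts                                   # frequency of each class
modal_mid <- h$mids[which.max(h$counts)]   # midpoint of the tallest class
modal_mid
```

The result depends on the class width, so the mode of a continuous variable obtained this way is always an approximation.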

<Relationship between mode, median, and mean>
1) mode = median = mean: symmetric
2) mode > median > mean: peak leans to the right, long left tail (negative skew)
3) mode < median < mean: peak leans to the left, long right tail (positive skew)
[Note] Extreme values pull the mean, and to a lesser extent the median, toward the tail, while the mode stays at the peak

4) Sum

sum(cost)


5) Quartiles

quantile(cost, 1/4) # 1st quartile (25%): 4.6
quantile(cost, 3/4) # 3rd quartile (75%): 6.2
quantile(cost)

0%  25%  50%  75% 100% 
2.1  4.6  5.4  6.2  7.9 

 

* IQR = Q3 - Q1 (interquartile range: defines the normal range used when handling outliers)
Normal range: Q1 - 1.5 * IQR ~ Q3 + 1.5 * IQR
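The IQR rule above can be sketched in a few lines. The vector is made up for illustration, and quantile() is used with its default interpolation method, so the quartiles may differ slightly from hand calculations that use another convention.

```r
# Hypothetical values with one obvious outlier
v <- c(3, 4, 5, 5, 6, 6, 7, 20)

q   <- quantile(v, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]          # same value as IQR(v)

lower <- q[[1]] - 1.5 * iqr     # lower fence
upper <- q[[2]] + 1.5 * iqr     # upper fence

outliers <- v[v < lower | v > upper]
outliers
```

With the quartiles reported for cost (Q1 = 4.6, Q3 = 6.2), the same arithmetic gives IQR = 1.6 and a normal range of 2.2 to 8.6.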

 

Sorting data: sort() vs order()
sort(x): sorts by the values of x and returns the sorted values
order(x): sorts by the values of x and returns the row numbers (indices)

x = data$cost
sort(x) # 2.1 ~ 7.9: values
order(x) # 17 ~ 232: indices

x[17] #2.1
x[232] #7.9
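A three-element vector makes the difference between the two functions easy to check by eye. The values are arbitrary.

```r
v <- c(30, 10, 20)

sort(v)        # the values in ascending order
order(v)       # the positions in v that would produce that order
v[order(v)]    # indexing with order() reproduces sort()

# order() is what makes sorting a whole data frame possible
df <- data.frame(id = c("a", "b", "c"), score = v)
df[order(df$score, decreasing = TRUE), ]   # rows sorted by score, descending
```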


ex. Sorting the whole data set by one variable (cost)

dim(data)
head(data)
tail(data)

data_order = data[order(data$cost), ] # ascending order
# descending order: order(data$cost, decreasing = TRUE)
head(data_order)
tail(data_order)

 



3. Dispersion (measures how far the values spread from the mean; the closer to 0, the more tightly the data cluster around the mean)
1) Variance

var(x) # sample variance (divides by n-1): 1.296826


Variance formula (population variance, dividing by n)

mu = mean(cost)
n = length(cost)
pop_var = sum((cost-mu)^2) / n # dividing by n-1 instead reproduces var(cost)
pop_var #1.291597
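The two variance values quoted in this section differ because var() divides by n - 1 while the hand formula divides by n. A small made-up vector shows the relationship.

```r
# Hypothetical values with mean 5
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)

ss <- sum((v - mean(v))^2)    # sum of squared deviations

pop_var  <- ss / n            # population variance
samp_var <- ss / (n - 1)      # sample variance, what var() returns

pop_var
samp_var
var(v)
var(v) * (n - 1) / n          # converts var() back to the population variance
```

For n = 248 the factor (n - 1) / n is about 0.996, which accounts for the gap between 1.296826 and 1.291597.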


2) Standard deviation

sd(cost) #1.138783: the standard deviation is the positive square root of the variance
sqrt(var(cost)) #1.138783


Standard deviation -> variance

sd(cost) ^ 2 #1.296826


3) Minimum / maximum / range

min(cost) # minimum: 2.1
max(cost) # maximum: 7.9
range(cost) # range (min ~ max): 2.1 7.9


4) Standard score (z-score) = (X - mean) / standard deviation
Puts values on a common scale so that they can be compared

ex. Hong Gildong scored 70 in Korean (class mean 59, standard deviation 15) and 70 in math (class mean 51, standard deviation 18)

kor_z = (70-59)/15
mat_z = (70-51) / 18
kor_z #0.7333333
mat_z #1.055556

[Interpretation] Relative to the class, the math score of 70 is worth more than the Korean score of 70
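The standard-score formula can be wrapped in a function, and for a whole vector base R's scale() does the same centering and scaling with the vector's own mean and sample standard deviation. z_score and the scores vector are illustrative names and data, not part of the original example.

```r
# Standard score with a known mean and standard deviation
z_score <- function(x, mu, sigma) (x - mu) / sigma

kor_z <- z_score(70, 59, 15)   # Korean
mat_z <- z_score(70, 51, 18)   # math
mat_z > kor_z                  # the math score sits further above its class mean

# Standardizing a whole (hypothetical) vector of scores
scores <- c(50, 60, 70, 80, 90)
z <- as.vector(scale(scores))  # (scores - mean(scores)) / sd(scores)
z
```

After standardizing, the values have mean 0 and standard deviation 1, which is what makes scores from different tests comparable.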

 



4. Asymmetry: using a package

install.packages("moments")  # package for skewness and kurtosis
library(moments)

cost = data$cost # cleaned data
cost


1) Skewness: how far the distribution leans to one side of the mean

skewness(cost) # -0.297234

Greater than 0: peak leans to the left, long right tail
Less than 0: peak leans to the right, long left tail
Equal to 0: symmetric

2) Kurtosis: how peaked the distribution is compared with the normal distribution

kurtosis(cost) # 2.683438

Kurtosis of the normal distribution = 3
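For readers who want to see what sits behind these two numbers, the sketch below computes moment-based skewness and kurtosis in base R. It assumes the usual definitions m3 / m2^(3/2) and m4 / m2^2, with central moments divided by n; to my understanding this matches what the moments package reports, but that is an assumption worth checking against its documentation. The function names are made up here.

```r
# k-th central moment, dividing by n
central_moment <- function(v, k) mean((v - mean(v))^k)

# Moment-based skewness: 0 for a symmetric sample
skew_manual <- function(v) central_moment(v, 3) / central_moment(v, 2)^1.5

# Moment-based kurtosis: about 3 for a normal distribution
kurt_manual <- function(v) central_moment(v, 4) / central_moment(v, 2)^2

skew_manual(c(1, 2, 3, 4, 5))    # symmetric sample
skew_manual(c(1, 1, 1, 10))      # long right tail -> positive
skew_manual(c(1, 10, 10, 10))    # long left tail  -> negative
kurt_manual(c(1, 2, 3, 4, 5))    # flat sample, well below 3
```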

3) Histogram: checking symmetry

hist(cost)







Density curve and normal distribution curve
Step 1. Histogram on the probability-density scale

hist(cost, freq = F) # freq = F switches the y-axis from counts to density

(Probability) density curve: a density estimate that smooths the histogram
Density: the area under the curve (its integral) over an interval is the probability of that interval
Density estimation: estimates the shape of the underlying probability distribution

Kernel density curve

?density # KDE[Kernel Density Estimation]
lines(density(cost), col='blue')


Step 2. Normal distribution curve

?dnorm # Normal Distribution

Compute the normal probability density from the mean and standard deviation of cost

# curve() supplies its own x values over the current plot range, so no x vector is needed
curve(dnorm(x, mean(cost), sd(cost)), col='red', add = T) # dnorm: normal density with the sample mean and sd

[Interpretation] Skewness < 0, so the peak leans to the right (long left tail). Kurtosis is below the normal value of 3, so the shape is flatter than the normal curve. In other words, cost deviates from a normal distribution.

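Step 3. QQ-plot

qqnorm(cost, main = 'cost QQ-plot') # observed values against theoretical normal quantiles
qqline(cost, col='red') # reference line the points would follow under normality

[Interpretation] There is a slight departure from the normal distribution.

Step 4. Normality test
Null hypothesis: the data do not differ from a normal distribution. Alternative hypothesis: the data differ from a normal distribution.

shapiro.test(cost) #p-value = 0.002959 < 0.05

[Interpretation] The null hypothesis is rejected at the 5% level in favor of the alternative: cost differs from a normal distribution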
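The decision rule in Step 4 can be written out explicitly. The sketch below uses a simulated, deliberately skewed sample (an assumption for illustration, not the cost data) and reads the p-value from the object returned by shapiro.test().

```r
set.seed(1)                  # fixed seed so the sketch is reproducible
skewed <- rexp(200)          # hypothetical, clearly non-normal sample

res   <- shapiro.test(skewed)
alpha <- 0.05                # significance level

if (res$p.value < alpha) {
  message("Reject H0: the sample differs from a normal distribution")
} else {
  message("Fail to reject H0: no evidence against normality")
}
```

A p-value above the threshold does not prove normality; it only means the test found no evidence against it, so the histogram and QQ-plot remain useful alongside the test.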

* Relationship between skewness and the representative values
Skewness > 0, mode < median < mean: peak leans to the left, long right tail
Skewness < 0, mode > median > mean: peak leans to the right, long left tail



5. Writing a descriptive statistics report
Frequency analysis: used in papers to report the demographic characteristics of the sample

1) Residence area

data$resident2[data$resident == 1] = "Special city"
data$resident2[data$resident >=2 & data$resident <=4] = "Metropolitan city"
data$resident2[data$resident == 5] = "City/county/district"

x = table(data$resident2)
prop.table(x) # proportions

y = prop.table(x)
round(y*100, 2) # as percentages (2 decimal places)


2) Gender

data$gender2[data$gender== 1] = "Male"
data$gender2[data$gender== 2] = "Female"

x = table(data$gender2)
prop.table(x) # proportions

y = prop.table(x)
round(y*100, 2) # as percentages (2 decimal places)


3) Age

summary(data$age)# 40 ~ 69
data$age2[data$age <= 45] = "Middle-aged"
data$age2[data$age >=46 & data$age <=59] = "Late middle-aged"
data$age2[data$age >= 60] = "Senior"

x = table(data$age2)
prop.table(x) # proportions

y = prop.table(x)
round(y*100, 2) # as percentages (2 decimal places)
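Recoding a numeric variable into bands, as done for age above, can also be written with cut(). This is an alternative sketch rather than the method used in the text; the ages are made up, and the band labels are illustrative.

```r
# Hypothetical ages at the band boundaries
age <- c(40, 45, 46, 59, 60, 69)

# Intervals are right-closed by default: (-Inf, 45], (45, 59], (59, Inf)
age_group <- cut(age,
                 breaks = c(-Inf, 45, 59, Inf),
                 labels = c("Middle-aged", "Late middle-aged", "Senior"))

table(age_group)
round(prop.table(table(age_group)) * 100, 2)
```

cut() returns a factor whose levels keep the band order, so the frequency table is printed from youngest to oldest instead of alphabetically.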


4) Education level

data$level2[data$level== 1] = "High school graduate"
data$level2[data$level== 2] = "College graduate"
data$level2[data$level== 3] = "Graduate school graduate"

x = table(data$level2)
prop.table(x) # proportions
y = prop.table(x)
round(y*100, 2) # as percentages (2 decimal places)


5) Pass/fail

data$pass2[data$pass== 1] <-"Pass"
data$pass2[data$pass== 2] <-"Fail"
x = table(data$pass2)
prop.table(x) # proportions: each value lies between 0 and 1
y = prop.table(x)
round(y*100, 2) # as percentages (2 decimal places)
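The table / prop.table / round sequence is repeated for every variable in this section, so it can be collected into one helper. freq_report is a hypothetical name introduced for this sketch, and the input vector is made up.

```r
# Frequency report for one categorical vector: counts and percentages
freq_report <- function(v) {
  counts <- table(v)
  data.frame(
    category = names(counts),
    n        = as.vector(counts),
    percent  = round(as.vector(prop.table(counts)) * 100, 2)
  )
}

result <- freq_report(c("Pass", "Fail", "Pass", "Pass"))
result
```

Returning a data frame rather than printing makes it easy to stack the reports for several variables into one table for the write-up.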

 

 

 

 

 

Up/Down Sampling

1. sample(x, size): sampling without replacement (the default)

sample(x=10:20, size=5, replace = FALSE) # e.g. 20 11 10 15 14 (varies from run to run)
sample(c(10:20, 30:40), 10) # e.g. 35 16 13 12 34 20 31 19 30 38
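Because sample() is random, the outputs shown in the comments change on every run. A short sketch of how set.seed() makes a draw repeatable, and of how replace changes the behaviour:

```r
set.seed(123)
a <- sample(10:20, size = 5)   # without replacement: no repeated values

set.seed(123)
b <- sample(10:20, size = 5)   # same seed, same draw

identical(a, b)
anyDuplicated(a) == 0

# With replacement the same value may be drawn more than once,
# so the sample can be larger than the population
d <- sample(1:3, size = 10, replace = TRUE)
length(d)

# Without replacement, asking for more values than exist is an error
too_many <- tryCatch(sample(1:3, size = 5), error = function(e) "error")
too_many
```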


2. Up/down sampling
Resampling so that the classes of the y variable appear in equal proportions (up-sampling draws with replacement)
When it is needed: when a model should be trained on classes in equal proportions

install.packages('caret')
library(caret)

weather = read.csv('weather.csv')
dim(weather) #366  15
str(weather) 
table(weather$RainTomorrow)

No Yes 
300  66

Convert the y variable to a factor

weather$RainTomorrow = as.factor(weather$RainTomorrow)
str(weather) # $RainTomorrow : Factor


Exclude the y variable

weather_df = subset(weather, select = -RainTomorrow)
dim(weather_df) #366  14


Up sample: matches the majority class of y (the y variable is appended as Class)

up_weather = upSample(weather_df, weather$RainTomorrow)
str(up_weather) # 600 obs
dim(up_weather) #600  15


Down sample: matches the minority class of y (the y variable is appended as Class)

down_weather = downSample(weather_df, weather$RainTomorrow)
str(down_weather) # 132 obs

table(down_weather$Class)

No Yes 
66  66

Rename the y variable

cols = names(down_weather)
cols
cols[15] = 'RainTomorrow' # rename only the column at index 15

names(down_weather) = cols
str(down_weather)
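To see what up- and down-sampling do to the class counts without installing caret, the idea can be sketched in base R. This is an illustration of the technique on made-up data, not caret's actual implementation; pick and the other names are hypothetical.

```r
set.seed(42)

# Hypothetical imbalanced data: 8 "No" rows and 2 "Yes" rows
df <- data.frame(x = 1:10, y = factor(c(rep("No", 8), rep("Yes", 2))))

idx_by_class <- split(seq_len(nrow(df)), df$y)   # row numbers per class
n_min <- min(lengths(idx_by_class))
n_max <- max(lengths(idx_by_class))

# Draw n row numbers from idx (sample.int avoids the length-1 pitfall of sample())
pick <- function(idx, n, replace) idx[sample.int(length(idx), n, replace = replace)]

# Down-sampling: keep n_min rows of every class, without replacement
down_idx <- unlist(lapply(idx_by_class, pick, n = n_min, replace = FALSE))

# Up-sampling: keep all rows, then add draws with replacement up to n_max
up_idx <- unlist(lapply(idx_by_class, function(i) {
  c(i, pick(i, n_max - length(i), replace = TRUE))
}))

down_df <- df[down_idx, ]
up_df   <- df[up_idx, ]

table(down_df$y)
table(up_df$y)
```

As far as I know, caret's upSample() and downSample() also accept a yname argument that sets the name of the appended outcome column, which would make the renaming step unnecessary; check the caret documentation before relying on it.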
