What is EDA?

Exploratory Data Analysis (EDA)

The process of observing and understanding collected data from multiple angles

The process of grasping data intuitively through graphs and statistical methods

 

 

 

 

 

The EDA process

1) Identify the purpose of the analysis and the characteristics of the variables

Variables that cannot be counted (categorical) vs. variables that can be counted or measured (numerical)

2) Inspect and preprocess the data set

Check for and clean up missing values and outliers

3) Observe the values of individual variables

Summary statistics, basic visualization

4) Discover patterns by focusing on the relationships between variables

Correlation, advanced visualization
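
The four steps above can be sketched on a tiny made-up data frame. This is only an illustration: `toy` and its columns are hypothetical and are not part of dataset.csv.

```r
# Hypothetical toy data: one categorical and two numerical variables
toy <- data.frame(
  gender = factor(c("M", "F", "F", "M", NA)),
  age    = c(23, 35, NA, 41, 29),
  price  = c(3.2, 5.1, 4.8, 250, 4.0)   # 250 is an obvious outlier
)

# 1) Variable characteristics: categorical vs numerical
var_types <- sapply(toy, function(col) if (is.numeric(col)) "numerical" else "categorical")

# 2) Data set check: number of missing values per column
na_counts <- colSums(is.na(toy))

# 3) Individual variables: summary statistics
summary(toy$price)

# 4) Relationship between two numerical variables (rows with an NA are skipped)
r <- cor(toy$age, toy$price, use = "complete.obs")
```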

 

 

 

 

 

Looking at the data set

dataset # print the whole data set

View(dataset) # opens in a separate data viewer window

head(dataset) # first 6 rows of the data set

tail(dataset) # last 6 rows of the data set

head(dataset, 10) # first 10 rows

names(dataset) # variable (column) names

attributes(dataset) # names (column names), class, row.names (row names)

str(dataset) # show the structure of the data

 

 

 

 

 

Scale

The way values are assigned to a variable

The unit in which a variable is measured (the answer options a respondent can choose from)

Qualitative scales (categorical variables)
- Nominal scale: numbers with no numeric meaning that only stand for a name or category
  ex) male/female
- Ordinal scale: numbers that express rank or order among the measured objects; they carry no meaning beyond that order
  ex) ranking of preferences

Quantitative scales (continuous variables)
- Interval scale: numbers whose levels are equally spaced (addition and subtraction are meaningful)
  ex) which bracket an annual income falls into
- Ratio scale: an interval scale that also has an absolute zero, so ratios can be computed (all four arithmetic operations)
  ex) age in years
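
As a rough sketch of how these scales are usually represented in R (the variable names and values below are made up for illustration): nominal data as an unordered factor, ordinal data as an ordered factor, and interval/ratio data as a plain numeric vector.

```r
# Nominal scale: unordered factor (the codes 1/2 are only labels)
gender <- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("male", "female"))

# Ordinal scale: ordered factor (order is meaningful, distances are not)
satisfaction <- factor(c("low", "high", "mid"),
                       levels = c("low", "mid", "high"), ordered = TRUE)

# Interval / ratio scale: numeric vector (arithmetic is meaningful)
age <- c(23, 35, 41)

gender_is_ordered <- is.ordered(gender)               # FALSE
low_below_high    <- satisfaction[1] < satisfaction[2] # TRUE: "low" < "high"
mean_age          <- mean(age)                         # 33
```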

Detailed definitions of the scales : https://blog.naver.com/tlrror9496/222048149959

 

 

 

 

 

Understanding the collected data

1. Exploratory data lookup
Reading the practice data

setwd("C:/ITWILL/2_Rwork/data")
dataset <- read.csv("dataset.csv", header=TRUE) # when the file has a header row
# dataset.csv - relationship between columns and scales


1) Data lookup
Looking up data for exploratory data analysis

(1) Data set structure

names(dataset) # variable (column) names
attributes(dataset) # attribute information: names(), class of the object, row names (row.names)
str(dataset) # show the data structure
dim(dataset) # dimensions: 300 (rows, observations) 7 (columns)
nrow(dataset) # number of observations: 300
ncol(dataset) # number of columns: 7
length(dataset) # number of columns: 7
length(dataset$resident) # 300


(2) Viewing the data set
Viewing all the data

dataset # same as print(dataset); the console limits how much it prints, so only part of the data is shown
View(dataset) # opens a spreadsheet-style viewer window


Quick look, including column names

head(dataset) # first 6 observations
tail(dataset) # last 6 observations


(3) Selecting columns
Format) dataframe$column_name -> returns a vector (1-dimensional)

dataset$age
dataset$resident
length(dataset$age) # number of values: 300


Format) dataframe["column_name"] -> returns a data.frame (2-dimensional)

dataset['age'] # returned as a data frame
dataset["gender"]
dataset["price"]
dataset[c("age", "gender", 'price')] # select several variables at once


Format) dataframe[index] : select elements by position (index)

dataset[2] # second column, gender
dataset[6] # sixth column, price
dataset[3,] # the entire 3rd observation (row)
dataset[,3] # the entire 3rd variable (column), returned as a vector (= dataset$job)
dataset[3] # the entire 3rd variable (column), returned as a data.frame (= dataset['job'])


Selecting two or more columns from dataset

names(dataset)

dataset['resident':'job'] # error: a range cannot be built from column names
dataset[1:3] # a range of columns works with numeric indexes

dataset[c("job","price")]
dataset[c(2,6)]

dataset[c(1,2,3)]
dataset[c(1:3)] # = dataset[1:3]
dataset[c(2,4:6,3,1)] # columns 2, 4 to 6, 3 and 1, printed in that order
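
The difference between the selection forms above can be checked on a small stand-in data frame (`df` below is hypothetical, not dataset.csv). The last two lines sketch one way to get a name-based column range by converting the names to positions first.

```r
# Hypothetical stand-in for the first three columns of the data set
df <- data.frame(resident = c(1, 2, 3), gender = c(1, 2, 1), job = c(3, 1, 2))

v  <- df$job                 # vector
d1 <- df["job"]              # data.frame with one column
d2 <- df[3]                  # same result, selected by position
v2 <- df[, 3]                # vector again (drop = TRUE is the default)
d3 <- df[, 3, drop = FALSE]  # keep the data.frame shape

# A range by name: translate the names to positions, then use the ':' operator
cols <- match("resident", names(df)):match("job", names(df))
d4 <- df[cols]
```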

 

 

 

 

 

Finding and handling missing values (NA)

1) Finding (checking for) missing values

summary(dataset) # summary statistics; the NA's entry is the number of missing values
summary(dataset$price) # summary statistics for a single variable


(2) A specific variable

table(is.na(dataset$price)) # frequency table; the TRUE count is the number of missing values


(3) Using a graph : when there are many variables

install.packages('VIM')
library(VIM) # provides the aggr() function

aggr(dataset, prop=FALSE, numbers=TRUE)

* Left plot : number of NAs in each variable
* Right plot : number of NAs for each combination of variables
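
Without the VIM package, the same counts can be obtained in base R. A minimal sketch on a made-up data frame (`df` and its columns are assumptions for illustration):

```r
# Hypothetical data frame with a few missing values
df <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"), c = 1:4)

na_per_column <- colSums(is.na(df))         # NA count for each variable
n_complete    <- sum(complete.cases(df))    # rows with no NA at all
rows_with_na  <- which(!complete.cases(df)) # positions of rows containing an NA
```

`colSums(is.na(...))` corresponds to the left plot of `aggr()`, and `complete.cases()` tells how many rows `na.omit()` would keep.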


2) Handling missing values - removal

(1) Removing missing values from a specific column

length(dataset$resident) # 300 values
resident = na.omit(dataset$resident)
length(resident) # 279 values (21 missing values removed)


(2) Removing missing values across all columns : causes data loss

dim(dataset) # 300 (observations) 7 (variables)
dataset2=na.omit(dataset)
dim(dataset2) # 209 observations (91 rows containing an NA removed)

 

3) Handling missing values - replacing with a constant (0)
Format) dataset$price = ifelse(condition, value when TRUE (replacement for the NA), value when FALSE)

x = (dataset$price) # x holds the price column as a vector

summary(dataset$price) # NA's = 30
dataset$price2 = ifelse(is.na(x), 0, x)

summary(dataset$price2) # no NA's


4) Handling missing values : replacing with a statistic (mean, median, mode)

dataset$price3 = ifelse(is.na(x), median(x, na.rm=TRUE), x) # replace missing values with the median
summary(dataset$price3) # no NA's

dataset[c('price','price2','price3')] # with NAs, NA replaced by 0, NA replaced by the median
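
The code above only shows the median. A minimal sketch of the mean and mode variants follows, on a made-up vector; `stat_mode` is a hypothetical helper, since base R's `mode()` returns the storage type of an object rather than the most frequent value.

```r
x <- c(3, 5, NA, 5, 9, NA, 4)   # made-up values with two NAs

# Mean imputation
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Mode imputation: count the values with table() and take the most frequent one
stat_mode <- function(v) {
  counts <- table(v)                            # NAs are dropped by default
  as.numeric(names(counts)[which.max(counts)])  # first value with the top count
}
x_mode <- ifelse(is.na(x), stat_mode(x), x)
```

If several values tie for the highest count, this helper returns the smallest of them, which is a choice made here for simplicity.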


5) Machine-learning imputation : replace missing values with predictions obtained by training a model
Applies to : otherwise complete data sets whose variables are related to one another (e.g. iris)

install.packages('mice')
library(mice)


Create a new DF from the first 30 rows only

iris_df=head(iris, 30)


Add missing values

iris_df
dim(iris_df) # 30 rows, 5 columns

iris_df[1,1] = NA # row 1, column 1 : 5.1 -> NA
iris_df[3,3] = NA # row 3, column 3 : 1.3 -> NA

help(mice) # mice(data set, m = number of imputed data sets)
model = mice(iris_df, m=5) # builds 5 imputed data sets (the number of iterations is set separately by maxit)
iris_df2 = complete(model) # fill in the NAs with the model's predictions (first imputed data set by default)

iris_df2 # [1,1] -> 5.0, [3,3] -> 1.9 after imputation

* The imputed values may differ depending on the system environment and the random seed.

 

Why replace missing values with another value?
- When only a few values are missing, you can delete them and use the rest, but real data often contains a large number of missing values.
Deleting many missing values causes data loss, and the affected variable is often one that is essential or important.
- There is no single right answer in data analysis! The task is to find the best of several methods.

 

 

 

 

 

3. Finding and cleaning outliers

Outlier : a value that falls far outside the normal range
Effect of outliers : the regression line is fitted incorrectly and the analysis results are distorted

1) Categorical (nominal/ordinal) variables

names(dataset)


Finding outliers

table(dataset$gender) # 0 and 5 are outliers. 1: male, 2: female
pie(table(dataset$gender))


Cleaning : removing outliers

dataset = subset(dataset, gender==1 | gender==2) # new data set that keeps only the rows where gender is male or female
pie(table(dataset$gender))

subset vs ifelse
subset : returns a data.frame (2-dimensional)
ifelse : returns a vector (1-dimensional)
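
A small sketch of that difference, on made-up data (`df` is hypothetical): `subset()` drops whole rows and keeps the data.frame shape, while `ifelse()` works element by element on a single vector and keeps its length.

```r
df <- data.frame(gender = c(1, 2, 0, 5, 2), price = c(3, 4, 5, 6, 7))

# subset(): rows with an invalid gender code are removed
kept <- subset(df, gender %in% c(1, 2))

# ifelse(): invalid codes are replaced by NA, the vector keeps its length
recoded <- ifelse(df$gender %in% c(1, 2), df$gender, NA)
```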

 

2) Continuous (interval/ratio) variables

dataset$price # ratio scale

(1) Handling outliers based on a known normal range
Finding outliers

plot(dataset$price) # scatter plot

summary(dataset$price) # summary statistics

Removing outliers (when the normal range is known; normal range : 2 to 8)

dataset2 = subset(dataset, price>=2 & price<=8)
plot(dataset2$price)


dim(dataset2) # 251 rows, 7 columns; 52 outliers removed

 

(2) Finding outliers using the interquartile range (IQR)

* IQR : interquartile range = third quartile (Q3) - first quartile (Q1)

* Statistically, a value is treated as an outlier if it lies in the top or bottom 0.3%, or if it falls more than 1.5 * IQR below Q1 or above Q3

Finding outliers

boxplot(dataset$price)
boxplot(dataset$price)$stats # check the normal range : 2.1 to 7.9
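
The 1.5 * IQR rule can be written out by hand, as in the sketch below on made-up values. `boxplot.stats()` returns the same kind of information as `boxplot(...)$stats` without drawing a plot; it is based on hinges rather than `quantile()`, so the two approaches can differ slightly on some data.

```r
x <- c(2.1, 3.0, 4.2, 5.0, 5.5, 6.1, 7.9, 25.0, -10.0)  # made-up prices

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1                      # same value as IQR(x)

lower <- q1 - 1.5 * iqr             # lower fence
upper <- q3 + 1.5 * iqr             # upper fence
outliers <- x[x < lower | x > upper]

stats <- boxplot.stats(x)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points beyond the whiskers
```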

 

Replacing outliers : replace with the lower/upper bound

dataset3 = na.omit(dataset) # remove missing values
dim(dataset3) # 206 rows, 7 columns


(1) Using the minimum and maximum values

dataset3$price = ifelse(dataset3$price<2.1, 2.1, dataset3$price) # replace with the lower bound
dataset3$price = ifelse(dataset3$price>7.9, 7.9, dataset3$price) # replace with the upper bound
boxplot(dataset3$price)


Removing outliers

dataset2 = subset(dataset, price>=2.1 & price<=7.9)

 

(2) Using the mean and standard deviation : replace with the lower/upper bound

dataset4 = na.omit(dataset)

avg = mean(dataset4$price) #5.27
std = sd(dataset4$price) #53.15
n = 3 # threshold (number of standard deviations)

minVal = avg - n * std # lower bound
maxVal = avg + n * std # upper bound

dataset4$price = ifelse(dataset4$price<minVal, minVal, dataset4$price) # replace with the lower bound
dataset4$price = ifelse(dataset4$price>maxVal, maxVal, dataset4$price) # replace with the upper bound

boxplot(dataset4$price) # check for outliers
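
The same mean plus or minus n standard deviations clamp can be written in one step with `pmin()` and `pmax()`. This is a sketch on made-up values, not the author's code; `n_sd` is an assumed name for the threshold.

```r
x <- c(rep(c(4, 5, 6), times = 7), 100)   # 21 ordinary values and one extreme value

n_sd  <- 3
lower <- mean(x) - n_sd * sd(x)
upper <- mean(x) + n_sd * sd(x)

# pmax() raises values below the lower bound, pmin() lowers values above the upper bound
clamped <- pmin(pmax(x, lower), upper)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so this rule is less sensitive than the IQR rule on small samples.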

 





4. Recoding
Recoding is done to make the data more readable, to change its scale, or to change how values were originally coded

1) Recoding for readability
Format) dataframe$new_column[boolean expression] <- new value

* boolean : a logical value, TRUE or FALSE (corresponding to the numbers 1/0)

dataset2$resident2[dataset2$resident == 1] <-'1.Seoul'
dataset2$resident2[dataset2$resident == 2] <-'2.Incheon'
dataset2$resident2[dataset2$resident == 3] <-'3.Daejeon'
dataset2$resident2[dataset2$resident == 4] <-'4.Daegu'
dataset2$resident2[dataset2$resident == 5] <-'5.Other (si/gu/gun)'

* Only the values that match each condition are picked out and written into a new derived variable

dataset2[c("resident","resident2")] # select just these 2 columns

table(dataset2$job) # 1, 2, 3 are the categories; 61, 87, 88 are the frequencies of each category

dataset2$job2[dataset2$job == 1] <- 'Civil servant'
dataset2$job2[dataset2$job == 2] <- 'Office worker'
dataset2$job2[dataset2$job == 3] <- 'Self-employed'

table(dataset2$job2) # Civil servant, Office worker, Self-employed (labels are sorted alphabetically automatically)


2) Changing the scale : ratio scale (regression, correlation) -> nominal scale (chi-square (cross-tabulation) test)
Format) dataframe$new_column[boolean expression] = new value

dataset2$age # age is on a ratio scale : it is continuous and supports all four arithmetic operations

dataset2$age2[dataset2$age <= 30] = 'Young'
dataset2$age2[dataset2$age > 30 & dataset2$age <= 55] = 'Middle-aged'
dataset2$age2[dataset2$age > 55] = 'Senior'

table(dataset2$age2) # Middle-aged 110, Senior 56, Young 69
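
The three assignments above can also be expressed with a single `cut()` call. A minimal sketch on made-up ages, using the same boundaries (up to 30, 31 to 55, over 55); the labels are illustrative:

```r
age <- c(22, 30, 31, 55, 56, 70, NA)   # made-up ages, including one NA

# Intervals are right-closed by default: (-Inf, 30], (30, 55], (55, Inf)
age_group <- cut(age,
                 breaks = c(-Inf, 30, 55, Inf),
                 labels = c("young", "middle-aged", "senior"))

table(age_group)   # the NA is left out of the counts
```

Unlike the character column built above, `cut()` returns a factor whose levels keep the young, middle-aged, senior order.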


3) Reverse coding : reversing the order (1->5, 5->1)

survey = dataset2$survey
length(survey) #251
survey[1:5] #1,2,4,2,1

csurvey = 6-survey # vector = constant (scalar) - 1-dimensional array (vector)
length(csurvey)
csurvey[1:5] #5,4,2,4,5


Applying the reverse coding

dataset2$survey = csurvey
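
The constant 6 is simply the lowest code plus the highest code (1 + 5). A small sketch that generalises this to other scales; `reverse_code` is a hypothetical helper name:

```r
# Reverse a Likert-type item: the new value is (low + high) - old value
reverse_code <- function(x, low = 1, high = 5) (low + high) - x

survey  <- c(1, 2, 4, 2, 1)
csurvey <- reverse_code(survey)          # 5 4 2 4 5

# A 7-point scale only needs different bounds
seven_point <- reverse_code(c(1, 4, 7), low = 1, high = 7)   # 7 4 1
```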

 

 

 

 

 

5. Saving the cleaned data
getwd()
setwd("C:/ITWILL/2_Rwork/data")


(1) Saving the data

write.csv(dataset2, 'cleanData.csv', row.names=FALSE, quote=F) # do not save row names; do not wrap strings in quotes


(2) Loading the data

new_data = read.csv('cleanData.csv')
head(new_data)
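
A quick round trip through a temporary file is one way to confirm that nothing is lost when saving. The sketch below uses a made-up data frame and keeps the default quoting, because with `quote = FALSE` a text value that contains a comma would be split into two columns when the file is read back.

```r
df <- data.frame(id = 1:3, job = c("a", "b", NA), price = c(2.5, NA, 7.0))

path <- tempfile(fileext = ".csv")        # temporary file, not the course directory
write.csv(df, path, row.names = FALSE)    # NAs are written as the text NA
back <- read.csv(path, stringsAsFactors = FALSE)

same_shape <- identical(dim(back), dim(df))
```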

 





6. Visualization for exploratory analysis
Using visual methods to understand the relationship between two variables

new_data <- read.csv("new_data.csv", header=TRUE)
new_data
dim(new_data) #  231  15
str(new_data)


1) Nominal scale (categorical/ordinal) vs nominal scale (categorical/ordinal)
Visualizing the residence and gender columns (grouping variable + grouping variable = cross-tabulation)

resident_gender = table(new_data$resident2, new_data$gender2)
resident_gender
gender_resident = table(new_data$gender2, new_data$resident2)
gender_resident


Distribution of residence by gender

barplot(resident_gender, beside=T, horiz=T,
        col = rainbow(5),
        legend = row.names(resident_gender),
        main = 'Distribution of residence by gender')
row.names(resident_gender) # row names


Distribution of gender by residence

barplot(gender_resident, beside=T,
        col=rep(c(2, 4),5), horiz=T,
        legend=c("Male","Female"),
        main = 'Distribution of gender by residence')
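
Before plotting, the cross-tabulation itself can be inspected as counts and as proportions. A minimal sketch with made-up vectors (the values are illustrative only):

```r
resident <- c("Seoul", "Incheon", "Seoul", "Daegu", "Seoul", "Incheon")
gender   <- c("male", "female", "female", "male", "male", "male")

tab <- table(resident, gender)          # rows: residence, columns: gender

col_share <- prop.table(tab, margin = 2) # share of each residence within a gender
with_totals <- addmargins(tab)           # adds row and column sums
```

`margin = 2` normalises each column to 1, which matches reading the bar chart as "distribution of residence by gender".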


2) Ratio scale (continuous) vs nominal scale (categorical/ordinal)
Visualizing age by job type

install.packages("lattice")  # chap08
library(lattice)
# Format) densityplot(~x, data=dataset) : density curve of column x


Distribution of age by job type

densityplot( ~ age, data=new_data, groups = job2, # group the continuous variable (age) by the grouping variable (job2)
             plot.points=T, auto.key = T)
plot.points=T : whether to draw the data points along the bottom, auto.key = T : whether to show the legend


3) Ratio (continuous) vs nominal (categorical/ordinal) vs nominal (categorical/ordinal)
* Purchase amount (continuous) : x column, gender (nominal) : condition, job position (ordinal) : group

(1) Purchase amount by position, conditioned on gender

densityplot(~ price | factor(gender2), data=new_data, # factor(gender2) : one panel per group (female/male)
            groups = position2, plot.points=T, auto.key = T) # the 5 positions are grouped within each panel

* Condition (panel) : gender, group : position

(2) Purchase amount by gender, conditioned on position

densityplot(~ price | factor(position2), data=new_data,
            groups = gender2, plot.points=T, auto.key = T)

* Condition : position (panel), group : gender
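
Since new_data.csv is not available to every reader, the same condition/group layout can be tried on `warpbreaks`, a data set that ships with R. This is only an analogy: `breaks` plays the role of price, `wool` the conditioning variable and `tension` the grouping variable.

```r
library(lattice)

# One panel per wool type (condition), one density curve per tension level (group)
p <- densityplot(~ breaks | wool, data = warpbreaks,
                 groups = tension,
                 plot.points = TRUE,   # draw the observations along the bottom
                 auto.key = TRUE)      # add a legend for the groups
print(p)   # lattice plots are objects; print() draws them
```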

 

 

 

 

 

What is a derived variable?

A new variable created from existing variables

Creating derived variables

(1) Arithmetic operations
(2) 1:1 : existing variable (1) -> new derived variable (1)
(3) 1:N : existing variable (1) -> new derived variables (N)

setwd('C:/ITWILL/2_Rwork/data')
user_data <- read.csv('user_data.csv', header = T)
head(user_data) # user_id age house_type resident job


1) Creating a 1:1 derived variable : 1 existing column -> 1 new column
House types : 0, apartment types : 1 (dummy variable), so the kind of housing can be identified

summary(user_data$house_type) # check for NAs - none
table(user_data$house_type)


Create the dummy

house_type2 <- ifelse(user_data$house_type == 1 | user_data$house_type == 2, 0, 1)


Check the result

house_type2[1:10]


Add the derived variable

user_data$housing_env <- house_type2
head(user_data)


2) Creating 1:N derived variables : lay out the purchased products and payment methods for each id (customer)

pay_data <- read.csv('pay_data.csv', header = T)
head(pay_data,10) # user_id product_type pay_method  price
table(pay_data$product_type)

install.packages('reshape2')
library(reshape2) # provides the dcast() function


(1) Derived variables : total purchase amount per customer for each product type

product_price <- dcast(pay_data, user_id ~ product_type, sum, na.rm=T)
head(product_price, 3) # rows (customer id), columns (product type), sum(price)

names(product_price) <- c('user_id','Groceries(1)','Necessities(2)','Clothing(3)','Sundries(4)','Other(5)')
head(product_price, 3) # check the renamed columns


★(2) Adding the derived variables (merging data.frames) : join★

install.packages('plyr')
library(plyr) # load the package

user_pay_data <- join(user_data, product_price, by='user_id') # by='common variable'
head(user_pay_data,10)


(3) Derived variable for the total purchase amount (arithmetic : sum over the product-type columns)

user_pay_data$total_price <- user_pay_data$`Groceries(1)` + user_pay_data$`Necessities(2)` + user_pay_data$`Clothing(3)` +
  user_pay_data$`Sundries(4)` + user_pay_data$`Other(5)`
head(user_pay_data)
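
A base R sketch of the same two steps, on made-up tables (`users` and `prices` are hypothetical): `merge(..., all.x = TRUE)` performs a left join like the default of `plyr::join()`, and `rowSums(..., na.rm = TRUE)` gives customers without any purchase a total of 0, whereas adding the columns with `+` would give them NA.

```r
users  <- data.frame(user_id = c(1, 2, 3), age = c(25, 40, 33))
prices <- data.frame(user_id = c(1, 3), food = c(100, 50), clothing = c(0, 200))

# Left join: keep every user, fill missing purchase columns with NA
merged <- merge(users, prices, by = "user_id", all.x = TRUE)

# Total per customer; NAs (no purchases) are treated as 0
merged$total <- rowSums(merged[c("food", "clothing")], na.rm = TRUE)
```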


★dcast function example★
Format) dcast(data, formula = y ~ x, function) : y becomes the rows, x the columns, and the function aggregates the values


1) Create vectors

a = rep(1:2, 4) # repeat 1 to 2 four times (8 elements)
b = rep(1:4, 2) # repeat 1 to 4 twice (8 elements)
c = seq(10,80,10) # 10 to 80 in steps of 10
d = seq(100,800,100)


2) Create the DF

df = data.frame(a,b,c,d) # d, the last column, is used as the value below
df
#   a b  c   d
# 1 1 1 10 100
# 2 2 2 20 200
# 3 1 3 30 300
# 4 2 4 40 400
# 5 1 1 50 500
# 6 2 2 60 600
# 7 1 3 70 700
# 8 2 4 80 800


3) Apply dcast

dcast(df, a~b, sum)
#   a   1   2    3    4
# 1 1 600   0 1000    0
# 2 2   0 800    0 1200
# 1 variable -> 4 variables

dcast(df[1:3], a~b, sum) # c is now the last column, so c is used as the value
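
The result can be cross-checked in base R with `xtabs()`, which sums the left-hand variable for each combination of the right-hand variables. This is a verification sketch, not a replacement for `dcast()`: the output is a table, not a data.frame.

```r
df <- data.frame(a = rep(1:2, 4), b = rep(1:4, 2),
                 c = seq(10, 80, 10), d = seq(100, 800, 100))

wide_d <- xtabs(d ~ a + b, data = df)   # sums of d: rows a, columns b
wide_c <- xtabs(c ~ a + b, data = df)   # sums of c: rows a, columns b
```

Here `wide_d` matches the table shown above (600, 1000, 800, 1200 in the non-empty cells), and `wide_c` holds the corresponding sums of `c`.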



