
DAY29. R Comprehensive Review Problems

LEE_BOMB 2021. 10. 26. 20:23
Comprehensive review exercises

 

2012 US presidential election campaign contribution data set

election = read.csv(file.choose(), stringsAsFactors = FALSE) # select election_2012.csv
dim(election) # 1001731      16
str(election) # dim + class

<Data set description> : campaign contributions to the 2012 US presidential candidates ('Romney, Mitt' and 'Obama, Barack')
'data.frame': 1001731 obs. of  16 variables:
3. cand_nm : presidential candidate name
4. contbr_nm : contributor name
5. contbr_city : contributor city
9. contbr_occupation : contributor occupation
10. contb_receipt_amt : contribution amount
11. contb_receipt_dt : contribution date

 

 

 

chapter 01 : Data types and type conversion (date conversion)
[Problem 1] Check the data types of the variables in the election dataset, then change a data type.
Time limit : 5 minutes

1) Check the data types of the cand_nm, contb_receipt_amt, and contb_receipt_dt variables
Hint) Use mode()

mode(election$cand_nm) # "character"
mode(election$contb_receipt_amt) # "numeric"
mode(election$contb_receipt_dt) # "character"
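
Calling mode() once per column works, but the same check can be run over every column at once with sapply(). The following is a minimal sketch on a small made-up data frame; toy_df and its values are hypothetical stand-ins, not rows from election_2012.csv.

```r
# Hypothetical three-row stand-in for the election data (values are made up)
toy_df <- data.frame(
  cand_nm = c("Romney, Mitt", "Obama, Barack", "Obama, Barack"),
  contb_receipt_amt = c(250, 50, 100),
  contb_receipt_dt = c("20-Jun-11", "23-Jun-11", "05-Jul-11"),
  stringsAsFactors = FALSE
)

# Storage mode and class of every column in a single call each
col_modes <- sapply(toy_df, mode)
col_classes <- sapply(toy_df, class)

print(col_modes)
print(col_classes)
```

Here the date column is still reported as "character", which is why the next step converts it explicitly.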


2) Convert the contribution date variable (contb_receipt_dt) to a date type

date = election$contb_receipt_dt 
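date[1:10] # "20-Jun-11" "23-Jun-11" -> US style : day-month-year

# Sys.Date(data) # Error in Sys.Date(data) : unused argument (data)
# Sys.Date() takes no arguments and only returns today's date, so it cannot convert these strings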


Change the locale : Korean -> English

Sys.getlocale() # "LC_COLLATE=Korean_Korea
Sys.setlocale(locale = 'English_USA') # US English (Windows locale name), so that %b matches month names such as "Jun"

 

US style : day-month-year -> Korean style : year-month-day

kdate <- strptime(date, "%d-%b-%y")
kdate[1:10]
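
The locale names 'English_USA' and 'Korean_Korea' are Windows-specific, and strptime() returns a POSIXlt object rather than a plain Date. As a hedged alternative, the sketch below switches only the LC_TIME category to the portable "C" locale and uses as.Date(). The raw_dates vector is a made-up sample in the same day-month-year format, not data read from the file.

```r
# Made-up sample in the same day-month-year format as contb_receipt_dt
raw_dates <- c("20-Jun-11", "23-Jun-11", "05-Jul-11")

# Remember the current LC_TIME setting, then switch to the portable "C" locale
# so that %b matches English month abbreviations
old_time_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

# %d = day of month, %b = abbreviated month name, %y = two-digit year
parsed_dates <- as.Date(raw_dates, format = "%d-%b-%y")

# Restore the original LC_TIME setting
Sys.setlocale("LC_TIME", old_time_locale)

print(parsed_dates)
print(class(parsed_dates))
```

Changing only LC_TIME leaves collation and character handling untouched, so the Korean settings for the rest of the session are not disturbed while the dates are parsed.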


Update the date column

election$contb_receipt_dt <- kdate

Sys.setlocale(locale = 'Korean_Korea') # switch back to the Korean locale

 

 




chapter 02 : Indexing and renaming columns
[Problem 2] From the election dataset, select only the 6 columns listed in the data set description and create a new dataset.
Time limit : 3 minutes

1) Use indexing : Hint) dataset[, c(col_index1, col_index2, ...)]

election_df = election[,c(3:5,9:11)]
dim(election_df)  # 1001731       6

 
2) Rename the columns of election_df : Hint) names(dataset) <- c('col_name1','col_name2', ...)

New column names : 'cand_name','contbr_name','city','occupation','receipt_amt','receipt_date'

names(election_df)
names(election_df) <- c('cand_name','contbr_name','city','occupation','receipt_amt','receipt_date')    
names(election_df)
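
Selecting by position, as in election[,c(3:5,9:11)], depends on the column order of the file. A possible variation is to select by column name instead, which keeps working if the order changes. The sketch below uses a made-up two-row data frame; toy_df, other_col, and the values are hypothetical.

```r
# Hypothetical stand-in with one column we do not want to keep
toy_df <- data.frame(
  other_col = c("x1", "x2"),
  cand_nm = c("Romney, Mitt", "Obama, Barack"),
  contb_receipt_amt = c(250, 50),
  stringsAsFactors = FALSE
)

# Select by name rather than by position
keep_cols <- c("cand_nm", "contb_receipt_amt")
toy_sel <- toy_df[, keep_cols]

# Rename in the same order as keep_cols
names(toy_sel) <- c("cand_name", "receipt_amt")

print(names(toy_sel))
print(dim(toy_sel))
```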

 

 




chapter 03 : Creating subsets
[Problem 3] Create a subset for each of the presidential candidates 'Romney, Mitt' and 'Obama, Barack'.
Time limit : 6 minutes

1) For the candidate name (cand_name), check the distinct candidate names and the number of rows per candidate
Hint) unique() : distinct values, table() : frequency counts

unique(election_df$cand_name) # 13 candidates, including "Romney, Mitt" and "Obama, Barack"
table(election_df$cand_name)


2) Create a subset for each of the presidential candidates 'Romney, Mitt' and 'Obama, Barack'
Hint) subset(dataset, subset = condition)

romney = subset(election_df, subset = cand_name == "Romney, Mitt") # 'Romney, Mitt'
obama = subset(election_df, subset = cand_name == "Obama, Barack")# 'Obama, Barack'


Check the dimensions

dim(romney) # 107229      6
dim(obama) # 593746      6
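
Two separate subset() calls are fine for two candidates. As one possible alternative, the rows can be filtered with %in% and then divided with split(), which returns a list holding one data frame per candidate. The sketch below runs on a made-up four-row data frame; toy_df, the "Other, Candidate" row, and the amounts are hypothetical.

```r
# Hypothetical stand-in for election_df (values are made up)
toy_df <- data.frame(
  cand_name = c("Romney, Mitt", "Obama, Barack", "Obama, Barack", "Other, Candidate"),
  receipt_amt = c(250, 50, 100, 25),
  stringsAsFactors = FALSE
)

# Keep only the two candidates of interest
targets <- c("Romney, Mitt", "Obama, Barack")
two_cands <- toy_df[toy_df$cand_name %in% targets, ]

# One data frame per candidate, stored in a named list
by_cand <- split(two_cands, two_cands$cand_name)

# Row count per candidate, comparable to calling dim() on each subset
row_counts <- sapply(by_cand, nrow)
print(row_counts)
```

Individual subsets are then available as by_cand[["Romney, Mitt"]] and by_cand[["Obama, Barack"]].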


Check the contents

head(romney)
tail(romney)
head(obama) 
tail(obama)