Data Analyst Course / R

DAY16. Unstructured Data in R (Topic / Associated-Word / Sentiment Analysis)

LEE_BOMB 2021. 10. 6. 18:59
Unstructured data processing (text mining)

Step 1: Topic analysis (word frequency)

Step 2: Associated-word analysis (analysis of related words)

Step 3: Sentiment analysis (positive/negative analysis of words)

Characteristics of text mining

Text mining compares social data and other digital data against a pre-built dictionary and analyzes word frequency.

Limitation: building the dictionary is difficult.

KoNLP: Korean natural language processing; it applies the Sejong dictionary (developed by KAIST).

Using a commercial program is recommended.

tm: a text mining package for English text

Data collection: build a data crawling system, or commission a specialist site.

 

Analysis procedure for SNS / document data

Step 1: Topic analysis (word frequency)

Add words to the dictionary through morphological analysis

Compare the dictionary with the text data → word frequency analysis

Visualization: Wordcloud


Step 2: Associated-word analysis (analysis of related words)

Frequency analysis of the words associated with a specific word

Visualization: a network drawn around the target word


Step 3: Sentiment analysis (positive/negative analysis of words)

Visualization: positive (blue), negative (red) -> visualizing dissatisfied customers

* Morphological analysis: splitting a sentence into the smallest units it can be decomposed into
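
The dictionary comparison in Step 1 can be sketched in base R. This is only an illustration: the documents and the dictionary below are made up, and whitespace splitting stands in for real morphological analysis.

```r
# Toy input: three short "documents" (made up for illustration)
docs <- c("big data and social network",
          "social network analysis with R",
          "big data analysis of social data")

# Hypothetical dictionary of terms we want to count
dict <- c("big", "data", "social", "network", "analysis")

# Split each document on whitespace and flatten into one token vector
tokens <- unlist(strsplit(tolower(docs), "\\s+"))

# Keep only tokens found in the dictionary, then count and sort
freq <- sort(table(tokens[tokens %in% dict]), decreasing = TRUE)
print(freq)
```

Words that are not in the dictionary ("and", "with", "of", "r") never reach the count, which is why the quality of the dictionary limits the quality of the analysis.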


Package installation and setup

1) Install KoNLP

install.packages("KoNLP") # installation fails on the latest R version
Warning in install.packages :
package ‘KoNLP’ is not available (for R version 4.0.0)


[How to resolve the error]
How to install a package (KoNLP) that is not provided for the current R version

[Step 1] Close RStudio

[Step 2] Download and install R 3.6.3, a version on which KoNLP works
https://cran.r-project.org/bin/windows/base/old/
Open the site above and click 'R 3.6.3 (February, 2020)'

[Step 3] Start RStudio and check the R version
In the menu [Tools] -> [Global Options] -> [General] tab,
change R version to 64-bit R-3.6.3 -> restart RStudio


[Step 4] Install KoNLP on the older R version

install.packages("https://cran.rstudio.com/bin/windows/contrib/3.4/KoNLP_0.80.1.zip",
                 repos = NULL)

* Installation succeeded when the message "package ‘KoNLP’ successfully unpacked and MD5 sums checked" appeared

* Under C:\Program Files\R you can choose to run either R-3.6.3 or R-4.1.1

* Installed packages can be checked under C:\Users\KIM YOON\Documents\R\win-library

 

2) Install Sejong: the Sejong package that KoNLP depends on

install.packages("Sejong")


3) Install wordcloud

install.packages("wordcloud")


4) Install tm

install.packages("tm")


5) Check the install location

.libPaths()


6) Install and load all KoNLP dependency packages

install.packages(c('hash','rJava','tau','RSQLite','devtools')) # produced 30 warnings


Load the packages for natural language processing

library(hash)
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre1.8.0_151') # Java install path (a JRE here); adjust to your machine
library(rJava)
library(tau)
library(RSQLite)
library(devtools)


7) Load the KoNLP package

library(KoNLP) # loads successfully
library(tm) # for preprocessing
library(wordcloud) # word cloud visualization






1. Topic analysis (text mining)
- Visualization: word cloud based on word frequency

Step 1: Read facebook_bigdata.txt

facebook <- file(file.choose(), encoding="UTF-8")
facebook_data <- readLines(facebook) # read line by line
close(facebook) # close the connection
str(facebook_data) # chr [1:76]
facebook_data[1:6] # check the first 6 sentences

 

 


Step 2: Add new words to the Sejong dictionary
term='word to add', tag='ncn' (noun tag code)

user_dic <- data.frame(term=c("R 프로그래밍","페이스북","김진성","소셜네트워크"), tag='ncn')


Add the new words to the Sejong dictionary: provided by KoNLP

buildDictionary(ext_dic='sejong', user_dic = user_dic)

 

 


Step 3: Define a user function for word extraction
(1) Test the new words registered in the Sejong dictionary

paste(extractNoun('홍길동은 많은 사람과 소통을 위해서 소셜네트워크에 가입하였습니다.'), collapse=" ")


(2) Order of operations in the user-defined function: convert to character -> extract nouns -> join with spaces

exNouns <- function(x) { 
  paste(extractNoun(as.character(x)), collapse=" ")
}


(3) Extract words with exNouns
Form) sapply(vector, function) -> extracts words from each of the 76 sentences (elements) of the vector

facebook_nouns <- sapply(facebook_data, exNouns)

 

 


(4) Word extraction result

str(facebook_nouns) # Named chr [1:76]; attr(*, "names")= chr [1:76]
facebook_nouns[1] # names: the original sentence (printed first); value: the extracted words (printed below it)
facebook_nouns[2]
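
The reason each result carries the original sentence as its name is that sapply() uses a character input vector as names by default. A small sketch with a stand-in function (keep_long_words is hypothetical and is not a noun extractor):

```r
# Stand-in for exNouns(): keeps words longer than 3 characters (NOT a real noun extractor)
keep_long_words <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  paste(words[nchar(words) > 3], collapse = " ")
}

sentences <- c("text mining with R", "word cloud of topics")

# Default: the input strings become the names of the result
named_result <- sapply(sentences, keep_long_words)

# USE.NAMES = FALSE returns a plain character vector
plain_result <- sapply(sentences, keep_long_words, USE.NAMES = FALSE)

names(named_result)
plain_result
```

Passing USE.NAMES = FALSE keeps the printed output shorter when the sentences are long.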

 

 


Step 4: Preprocess the data
(1) Create the corpus: a collection of documents on which text processing can be run

myCorpus <- Corpus(VectorSource(facebook_nouns)) 
myCorpus

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 76

View the corpus contents

inspect(myCorpus[1])  
inspect(myCorpus[2])


(2) Preprocessing: applied to the corpus

myCorpusPrepro <- tm_map(myCorpus, removePunctuation) # remove punctuation
myCorpusPrepro <- tm_map(myCorpusPrepro, removeNumbers) # remove numbers
myCorpusPrepro <- tm_map(myCorpusPrepro, tolower) # convert to lowercase

Exclude English stopwords: stopwords()

myCorpusPrepro <-tm_map(myCorpusPrepro, removeWords, stopwords('english')) # remove stopwords


(3) Check the preprocessing result

myCorpusPrepro # Content:  documents: 76
inspect(myCorpusPrepro[1:3]) # check the result (numbers, punctuation, English text)

 

 


Step 5: Select words (length of 2 or more)
(1) Korean words of 2 to 8 syllables (assuming 2 bytes per Hangul syllable)

myCorpusPrepro_term <- TermDocumentMatrix(myCorpusPrepro, 
                                          control=list(wordLengths=c(4,16))) 

myCorpusPrepro_term
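
The c(4,16) bounds above rest on the assumption that one Hangul syllable takes 2 bytes, which is true for legacy Korean encodings such as CP949 but not for UTF-8, where a syllable takes 3 bytes. Whether wordLengths ends up counting characters or bytes may depend on the tm version and the platform encoding, so it is worth checking against a word of known length. A base R sketch of the difference:

```r
# The two-syllable word "사용", written with \u escapes so the string is UTF-8
word <- "\uc0ac\uc6a9"

n_chars <- nchar(word, type = "chars")  # number of characters (syllables)
n_bytes <- nchar(word, type = "bytes")  # number of bytes in the UTF-8 encoding

n_chars
n_bytes
```

If short Korean words unexpectedly disappear from the matrix, lowering the first bound (for example wordLengths=c(2,8)) is one thing to try.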


(2) Convert the term-document matrix to a plain table: matrix -> data.frame

myTerm_df <- as.data.frame(as.matrix(myCorpusPrepro_term)) 
dim(myTerm_df)

 

 


Step 6: Compute word frequencies
(1) Sort word frequencies in descending order

wordResult <- sort(rowSums(myTerm_df), decreasing=TRUE) 
wordResult[1:10] # top 10 words
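
In a term-document matrix each row is a term and each column a document, so a row sum is the total count of that term over all documents. A toy sketch (the matrix values are made up):

```r
# Toy term-document matrix: rows are terms, columns are documents
tdm <- matrix(c(2, 0, 1,
                0, 1, 0,
                3, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("data", "cloud", "network"),
                              c("doc1", "doc2", "doc3")))

# Total occurrences per term, most frequent first
word_freq <- sort(rowSums(tdm), decreasing = TRUE)
word_freq
```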

 

(2) Remove stopwords

myStopwords <- c(stopwords('english'), "사용") # add words to remove ("사용" means "use")
myCorpusPrepro <-tm_map(myCorpusPrepro, removeWords, myStopwords) # remove stopwords


(3) Select words again and rebuild the term-document matrix

myCorpusPrepro_term <- TermDocumentMatrix(myCorpusPrepro, 
                                          control=list(wordLengths=c(4,16))) # 2 to 8 syllables


(4) Convert the term-document matrix to a data.frame

myTerm_df <- as.data.frame(as.matrix(myCorpusPrepro_term))


(5) Compute word occurrence counts

wordResult <- sort(rowSums(myTerm_df), decreasing=TRUE) 
wordResult[1:10]

 

 


Step 7: Style the word cloud (frequency, color, random order, rotation, etc.)
(1) Get the word names -> names of the frequency vector

myName <- names(wordResult)


(2) Build a data.frame from the word names and frequencies

word.df <- data.frame(word=myName, freq=wordResult) 
str(word.df) # variables: word, freq
head(word.df)


(3) Set word colors and font

pal <- brewer.pal(12,"Paired") # 12 colors
# alternative: pal <- brewer.pal(9,"Set1") # Set1 ~ Set3


Font setting: "맑은 고딕" (Malgun Gothic), "서울남산체 B" (Seoul Namsan B)

windowsFonts(malgun=windowsFont("맑은 고딕"))  # Windows only


(4) Draw the word cloud: size, minimum frequency, order, rotation, color, font

wordcloud(word.df$word, word.df$freq, 
          scale=c(3,1), min.freq=2, random.order=F, 
          rot.per=.1, colors=pal, family="malgun")

 

 

 

Step 8: Chart visualization
(1) Extract the top 10 topics

topWord <- head(sort(wordResult, decreasing=T), 10) # top 10 topics


(2) Create a pie chart

pie(topWord, col=rainbow(10), radius=1) 
# radius=1 : sets the radius (enlarges the chart)


(3) Convert frequencies to percentages

pct <- round(topWord/sum(topWord)*100, 1) # percentage


(4) Combine each word with its percentage

label <- paste(names(topWord), "\n", pct, "%")


(5) Apply the words and percentages as pie chart labels

pie(topWord, main="SNS big data topic analysis", 
    col=rainbow(10), cex=0.8, labels=label)


02. Associated-word analysis (association between words)
 - Visualization: associated-word network

Load the packages for Korean text processing

Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jdk1.8.0_151')
library(rJava) # if loading fails with an error, set the Java path with Sys.setenv() as above
library(KoNLP) # requires the rJava library


1. Read the text file

marketing <- file("c:/ITWILL/2_Rwork/data/marketing.txt", encoding="UTF-8")
marketing2 <- readLines(marketing) # read line by line
close(marketing) # close the connection
head(marketing2) # first 6 lines: check the sentences line by line
str(marketing2) # chr [1:472]
marketing2[1:5]


2. Extract words line by line
* Extract words per line with Map()

lword <- Map(extractNoun, marketing2) # sentence -> extracted words
length(lword) # [1] 472 (number of sentences holding words)
class(lword) # list
str(lword) # List of 472

lword[1]
# $... -> key (the sentence the words came from)
# [1] "근래에"   "시장" -> value (words)
# [33] "가정" -> 33 words in total

lword <- unique(lword) # remove duplicate sentences
length(lword) # [1] 353 (119 removed)

lword <- sapply(lword, unique) # remove duplicate words
length(lword)

 

List-handling functions: unique / sapply

lst = list(a=c(1,2,1), b=c(2,3,2), a=c(1,2,1)) # list(key=value, key=value)
lst # $a is duplicated

unique(lst) # removes duplicate elements (sentences: the whole vector must match)
sapply(lst, unique) # removes duplicate values (words) within each element
#      a b a
#[1,]  1 2 1
#[2,]  2 3 2
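
Note that sapply() returned a matrix in the example above because every element ended up with the same number of unique values. A short sketch of when that simplification happens, and of lapply() as an alternative that always returns a list (the toy lists are made up):

```r
same_len <- list(a = c(1, 2, 1), b = c(2, 3, 2))  # 2 unique values in each element
diff_len <- list(a = c(1, 2, 1), b = c(2, 3, 4))  # 2 and 3 unique values

simplified <- sapply(same_len, unique)  # equal lengths: simplified to a matrix
kept_list  <- sapply(diff_len, unique)  # unequal lengths: stays a list
always_lst <- lapply(same_len, unique)  # lapply never simplifies

is.matrix(simplified)
is.list(kept_list)
is.list(always_lst)
```

With real sentences the word counts almost always differ, so sapply() keeps the list, but lapply() removes the dependence on that coincidence.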

 

3. Preprocessing (select the needed words)
1) Define a filter function for Korean words of 2 to 4 syllables

filter1 <- function(x){
  nchar(x) <= 4 && nchar(x) >= 2 && is.hangul(x)
}


2) Filter(f,x) -> applies filter1() to each element of vector x

filter2 <- function(x){
  Filter(filter1, x)
}

is.hangul() : provided by the KoNLP package
Filter(f, x) : base
nchar() : base -> returns the number of characters
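
The same filtering idea can be tried without KoNLP. In the sketch below, is_hangul_only() is a hypothetical stand-in for is.hangul() built on a regular expression; it is an assumption for illustration and may not match KoNLP's behavior in every case.

```r
# Stand-in for KoNLP's is.hangul(): TRUE when the string has only Hangul syllables (assumption)
is_hangul_only <- function(x) grepl("^[\uac00-\ud7a3]+$", x, perl = TRUE)

# Keep words that are 2 to 4 characters long and written in Hangul
keep_word <- function(x) {
  n <- nchar(x)
  n >= 2 && n <= 4 && is_hangul_only(x)
}

# 시장, R, 가, marketing, 마케팅 (Korean written with \u escapes)
words <- c("\uc2dc\uc7a5", "R", "\uac00", "marketing", "\ub9c8\ucf00\ud305")

kept <- Filter(keep_word, words)
kept
```

Filter() calls the predicate once per element, which is why the scalar operator && is safe inside keep_word().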

3) Filter the words of each line

lword_final <- sapply(lword, filter2)
lword_final # check the filtered words (2 to 4 syllables); e.g. [[353]] shows the words kept from the 353rd sentence, [1]~[71]


4. Create transactions: convert the words to transactions for association analysis
Install the arules package

install.packages("arules")
library(arules)


* Features provided by the arules package
Adult, Groceries datasets
as(), apriori(), inspect(), labels(), crossTable() (comparable to table())

# Form) as(dataset, 'class') : type conversion (e.g. dataset to class)

wordtran <- as(lword_final, "transactions") # an error occurs if the word lists contain duplicates
wordtran
# 353 transactions (rows) and : number of transactions (sentences)
# 2349 items (columns) : number of items (words)

View the transactions -> words in each transaction

inspect(wordtran)

 

5. Derive association rules between words
Generate association rules from the transaction data by applying support and confidence thresholds
Form) apriori(data, parameter = NULL, appearance = NULL, control = NULL)

tranrules <- apriori(wordtran, parameter=list(supp=0.25, conf=0.05)) # supp: support, conf: confidence
inspect(tranrules) # generated rules = association rules. lhs: antecedent, rhs: consequent
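
To make the two thresholds concrete, support and confidence can be computed by hand on a few toy transactions. This is a base R sketch with made-up data and hypothetical helper names; it does not reproduce how apriori() searches for rules.

```r
# Toy transactions: each element is the set of words in one sentence (made up)
transactions <- list(
  c("marketing", "strategy", "brand"),
  c("marketing", "brand"),
  c("strategy", "market"),
  c("marketing", "strategy")
)

# Support: share of transactions that contain every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(t) all(items %in% t), logical(1)))
}

# Confidence of lhs -> rhs: support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, trans) {
  support(c(lhs, rhs), trans) / support(lhs, trans)
}

supp_rule <- support(c("brand", "marketing"), transactions)   # {brand} and {marketing} together
conf_rule <- confidence("brand", "marketing", transactions)   # {brand} -> {marketing}
supp_rule
conf_rule
```

Here "brand" and "marketing" appear together in 2 of 4 transactions, and every transaction with "brand" also has "marketing".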


6. Visualize the associated words (associated-word network)
(1) Restructure the data: association rules -> extract labels

rules <- labels(tranrules, ruleSep=" ")  
rules


Rule strings -> split on spaces into a list

rules <- sapply(rules, strsplit, " ",  USE.NAMES=F) # convert to list
rules


list -> matrix

rulemat <- do.call("rbind", rules)
rulemat
class(rulemat)
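
The label -> list -> matrix conversion can be followed on a few toy strings. The labels below are made up in the "{lhs} {rhs}" shape; the sketch also drops empty-antecedent rows by value, as a possible alternative to hard-coded row numbers.

```r
# Toy rule labels in the "{lhs} {rhs}" shape (made-up words)
rule_labels <- c("{brand} {marketing}", "{strategy} {marketing}", "{} {marketing}")

pairs <- strsplit(rule_labels, " ", fixed = TRUE)  # list of c(lhs, rhs)
edge_mat <- do.call(rbind, pairs)                  # one row per rule, two columns

# Drop rules whose left-hand side is empty
edge_mat <- edge_mat[edge_mat[, 1] != "{}", , drop = FALSE]
edge_mat
```

Splitting on a space assumes that no word contains a space; a rule with several antecedent items stays in one cell because its items are separated by commas.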


(2) Install the igraph package for visualizing associated words

install.packages("igraph") # provides graph.edgelist(), plot.igraph(), closeness()
library(igraph)


(3) Build the edge list: the associated words become the vertices of the graph

ruleg <- graph.edgelist(rulemat[c(12:59),], directed=F) # rows [1,]~[11,] with an empty "{}" lhs are excluded
ruleg


(4) Plot the edge list
X11() # open a pop-up graphics window

plot.igraph(ruleg, vertex.label=V(ruleg)$name,
            vertex.label.cex=1.2, vertex.label.color='black', 
            vertex.size=20, vertex.color='green', vertex.frame.color='blue')

* Items on the outside of the graph are antecedents; items toward the center are consequents

 

Consequent: '마케팅' (marketing)

sub_tran_rules = subset(tranrules, rhs %in% '마케팅')
sub_tran_rules # set of 14 rules


Restructure the data: association rules -> extract labels

rules <- labels(sub_tran_rules, ruleSep=" ")  
rules


Rule strings -> split on spaces into a list

rules <- sapply(rules, strsplit, " ",  USE.NAMES=F) # convert to list
rules


list -> matrix

rulemat <- do.call("rbind", rules)
rulemat
class(rulemat)


Build the edge list: the associated words become the vertices of the graph

ruleg <- graph.edgelist(rulemat[c(2:14),], directed=F) # row [1,], which has no antecedent, is excluded
ruleg


Plot the edge list

X11() # open a pop-up graphics window
plot.igraph(ruleg, vertex.label=V(ruleg)$name,
            vertex.label.cex=1.2, vertex.label.color='black', 
            vertex.size=20, vertex.color='green', vertex.frame.color='blue')


Result: a graph centered on the consequent '마케팅'


03. Sentiment analysis

1. Load the data

setwd("C:/ITWILL/2_Rwork/data")

data<-read.csv("reviews.csv") 
str(data)
# 'data.frame': 100 obs. of  2 variables:
# $ company, $ review

head(data,2)


2. Add words to the dictionaries
Load the English positive/negative word dictionaries

posDic <- readLines("posDic.txt")
negDic <- readLines("negDic.txt")
length(posDic) # 2006
length(negDic) # 4783


Add positive/negative words

posDic.final <-c(posDic, 'victor')
negDic.final <-c(negDic, 'vanquished')


3. Define the sentiment analysis function: sentimental

(1) Load packages for string processing

library(plyr) # provides laply()
library(stringr) # provides str_split()


(2) Define the sentiment analysis function (understanding this function is important!)

sentimental = function(sentences, posDic, negDic){
  
  # sentence: sentence to analyze, posDic: positive dictionary, negDic: negative dictionary
  scores = laply(sentences, function(sentence, posDic, negDic) {

    # Preprocess the sentence: gsub('pattern', 'replacement', text)
    sentence = gsub('[[:punct:]]', '', sentence) # remove punctuation
    sentence = gsub('[[:cntrl:]]', '', sentence) # remove control characters
    sentence = gsub('\\d+', '', sentence) # remove digits
    sentence = tolower(sentence) # lowercase everything (the dictionary words are all lowercase)

    # sentence -> words
    word.list = str_split(sentence, '\\s+') # split on whitespace; \\s+ matches one or more whitespace characters
    words = unlist(word.list) # unlist(): flatten the list into a vector

    # words vs dictionaries: position of each word in the dictionary, NA if absent
    pos.matches = match(words, posDic)
    neg.matches = match(words, negDic)

    # turn the positions into TRUE (found) / FALSE (not found)
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

    # positive count minus negative count
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, posDic, negDic)

  scores.df = data.frame(score=scores, text=sentences) # a high score is positive, a low score is negative
  return(scores.df)
}
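
The scoring logic can also be tried on a small scale without plyr or stringr. The sketch below is a base R variant, not the function above: score_sentences() is a hypothetical name, and the dictionaries and reviews are tiny made-up samples.

```r
# Base R variant of the scoring idea: positive matches minus negative matches per sentence
score_sentences <- function(sentences, pos_words, neg_words) {
  vapply(sentences, function(s) {
    # strip punctuation, control characters and digits, then lowercase
    s <- tolower(gsub("[[:punct:][:cntrl:][:digit:]]", "", s))
    words <- unlist(strsplit(s, "\\s+"))
    sum(words %in% pos_words) - sum(words %in% neg_words)
  }, numeric(1), USE.NAMES = FALSE)
}

pos_words <- c("good", "great", "love")  # tiny made-up dictionaries
neg_words <- c("bad", "slow", "hate")

reviews <- c("Great phone, I love it!",
             "Bad battery and slow screen.",
             "It is a phone.")

scores <- score_sentences(reviews, pos_words, neg_words)
scores
```

Using %in% gives the same TRUE/FALSE vector as !is.na(match(...)) in one step.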


4. Sentiment analysis: run on every record of the second variable (review)

result<-sentimental(data[,2], posDic.final, negDic.final) # 2nd column of data (review), final positive/negative dictionaries
head(result)
#score
#1     0  (the 1st sentence scores 0)
#2     0
#3     7 (the 3rd sentence is rated positively)
#4    -1
#5     3
#6     1

names(result) # "score" "text" 
dim(result) # 100   2
result$text
result$score # score for each of the 100 lines after applying the positive/negative dictionaries


Add a color column based on the score

result$color[result$score >=1] <- "blue"
result$color[result$score ==0] <- "green"
result$color[result$score < 0] <- "red"


View the sentiment analysis result as charts

plot(result$score, col=result$color) # scatter plot with colors
barplot(result$score, col=result$color, main ="Sentiment analysis result") # bar chart


5. Positive/negative analysis of the words
(1) Frequency of each sentiment

table(result$color)

(2) Recode the score column

result$remark[result$score >=1] <- "긍정" # positive
result$remark[result$score ==0] <- "중립" # neutral
result$remark[result$score < 0] <- "부정" # negative

sentiment_result<- table(result$remark)
sentiment_result
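
The recoding can also be done in one step with cut(), which fixes the order of the categories so that chart colors can be matched to them by position. A sketch using the six example scores shown by head(result); the English labels and the color mapping are choices made for this illustration.

```r
scores <- c(0, 0, 7, -1, 3, 1)  # the six example scores shown by head(result)

# Three bins for integer scores: below 0, exactly 0, 1 or more
sentiment <- cut(scores, breaks = c(-Inf, -1, 0, Inf),
                 labels = c("negative", "neutral", "positive"))

counts <- table(sentiment)
counts

# Colors listed in the same order as the factor levels
sentiment_colors <- c("red", "green", "blue")
# pie(counts, col = sentiment_colors)
```

Because the levels are set explicitly, the table order does not depend on how the labels sort alphabetically.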


(3) Title, colors, and radius

# colors follow the table order: 긍정 (positive), 부정 (negative), 중립 (neutral)
pie(sentiment_result, main="Sentiment analysis result", 
col=c("blue","red","green"), radius=0.8) # -> 1.2