22. R Text Mining Exercises

LEE_BOMB 2021. 10. 6. 20:47

01. For the Trump speech (trump.txt) and the Obama speech (obama.txt), draw a word cloud of the words that occur at least 2 times.
[Step 1], [Step 4] ~ [Step 8]

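library(tm)           # Corpus, tm_map, TermDocumentMatrix
library(wordcloud)    # wordcloud
library(RColorBrewer) # brewer.pal

obama <- file(file.choose(), encoding="UTF-8")
obama_data <- readLines(obama)
str(obama_data) # chr [1:496]
obama_data[1:6]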
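file.choose() opens a dialog, so the script above cannot run unattended. A minimal sketch of reading from a fixed path instead; the temporary file and its two lines are made up here and only stand in for obama.txt.

```r
# Stand-in for obama.txt: a temporary file with two made-up lines (assumption).
path <- tempfile(fileext = ".txt")
writeLines(c("Hello, Chicago!", "Yes we can."), path)

# readLines() accepts a path directly; encoding = "UTF-8" marks the
# strings that are read as UTF-8.
speech <- readLines(path, encoding = "UTF-8")

length(speech)  # number of lines read
speech[1]
```

Replacing `path` with the real location of the speech file should give the same kind of character vector as `obama_data`.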

Corpus

myCorpus <- Corpus(VectorSource(obama_data)) 
myCorpus

inspect(myCorpus[100]) 

myCorpusPrepro <- tm_map(myCorpus, removePunctuation) # remove punctuation
myCorpusPrepro <- tm_map(myCorpusPrepro, removeNumbers) # remove numbers
myCorpusPrepro <- tm_map(myCorpusPrepro, content_transformer(tolower)) # convert to lowercase

stopwords('english')
myCorpusPrepro <- tm_map(myCorpusPrepro, removeWords, stopwords('english')) # remove stopwords
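To see what these four steps do to a single line, here is a rough base-R approximation. It is not tm's exact implementation; the sentence and the two-word stopword list are made up for illustration.

```r
# One made-up sentence and a tiny hand-picked stopword list (both assumptions;
# stopwords('english') in tm is much longer).
x <- "In 2008, WE said: Yes, we can!"
my_stop <- c("in", "we")

x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
x <- gsub("[[:digit:]]+", "", x)   # drop digits
x <- tolower(x)                    # lowercase before matching stopwords
pattern <- paste0("\\b(", paste(my_stop, collapse = "|"), ")\\b")
x <- gsub(pattern, "", x, perl = TRUE)  # drop whole-word stopwords
x <- trimws(gsub("\\s+", " ", x))  # tidy leftover whitespace (extra step)

x
```

The order matters: lowercasing before stopword removal is what lets "WE" match the lowercase entry "we".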

Check the preprocessing result

myCorpusPrepro # Content:  documents: 76
inspect(myCorpusPrepro[1:3])

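Term selection
(1) Word length of 2 to 8 syllables for Korean (one Hangul character = 2 bytes); here wordLengths=c(2,8) keeps English words of 2 to 8 characters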

myCorpusPrepro_term <- TermDocumentMatrix(myCorpusPrepro,
                       control=list(wordLengths=c(2,8))) 

myCorpusPrepro_term
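The effect of a length window such as wordLengths=c(2,8) can be sketched in base R. The counts below are made up, and lengths are counted in characters with nchar(), which is an assumption about how the bounds apply.

```r
# Made-up term counts (assumption), named by term.
freq <- c(a = 5, we = 4, change = 3, responsibility = 2, america = 1)

len  <- nchar(names(freq))      # characters per term
keep <- len >= 2 & len <= 8     # same bounds as wordLengths = c(2, 8)
freq[keep]
```

Here "a" is too short and "responsibility" is too long, so only the other three terms remain.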

(2) Convert the term-document matrix to plain format: matrix -> data.frame

myTerm_df <- as.data.frame(as.matrix(myCorpusPrepro_term)) 
dim(myTerm_df) #1021 496

Word frequency
(1) Sort word frequencies in descending order

wordResult <- sort(rowSums(myTerm_df), decreasing=TRUE) 
wordResult[1:10] # top 10 words
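What rowSums() adds up is easier to see on a tiny hand-built term-document matrix. This sketch uses base R only; the three documents are made up and assumed to be already cleaned.

```r
# Three made-up, already-cleaned documents (assumption).
docs <- c("hope change hope hope", "change america", "hope america america")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are counts.
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
rownames(tdm) <- terms

# Total count per term across all documents, most frequent first.
word_freq <- sort(rowSums(tdm), decreasing = TRUE)
word_freq
```

Each row total is the number of times that term appears in the whole collection, which is the quantity the word cloud later uses for sizing.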

(2) Remove stopwords

myStopwords <- c(stopwords('english'), 'applause', 'cheers') # add words to remove
myCorpusPrepro <- tm_map(myCorpusPrepro, removeWords, myStopwords) # remove stopwords

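(3) Term selection and plain-format conversion

myCorpusPrepro_term <- TermDocumentMatrix(myCorpusPrepro, 
                                          control=list(wordLengths=c(4,16))) # 4 to 16 characters (2 to 8 Hangul syllables at 2 bytes each)

(4) Convert the term-document matrix to plain format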

myTerm_df <- as.data.frame(as.matrix(myCorpusPrepro_term))

(5) Compute word occurrence frequencies

wordResult <- sort(rowSums(myTerm_df), decreasing=TRUE) 
wordResult[1:10]

Word cloud visualization
(1) Create word names -> names of the frequency vector

myName <- names(wordResult)

(2) Create a data.frame from word names and frequencies

word.df <- data.frame(word=myName, freq=wordResult) 
str(word.df) # word and freq variables
head(word.df)
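Because the frequency vector is named, the resulting data frame carries the words as row names too, which can look redundant in head(). A small sketch with made-up counts:

```r
word_freq <- c(hope = 4, america = 3, change = 2)  # made-up counts (assumption)

df <- data.frame(word = names(word_freq), freq = word_freq)
rownames(df)            # the terms are carried over as row names

# Resetting the row names gives plain 1..n labels.
df_plain <- df
rownames(df_plain) <- NULL
rownames(df_plain)
```

Either form works for plotting, since only the word and freq columns are passed on.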

(3) Set word colors and font

pal <- brewer.pal(12,"Paired") # 12 colors
# pal <- brewer.pal(9,"Set1") # alternative: Set1 ~ Set3

Font settings: "맑은 고딕" (Malgun Gothic), "서울남산체 B"

windowsFonts(malgun=windowsFont("맑은 고딕")) # Windows only

(4) Word cloud visualization: set size, minimum frequency, order, rotation, colors, font

wordcloud(word.df$word, word.df$freq, 
          scale=c(3,1), min.freq=2, random.order=F, 
          rot.per=.1, colors=pal, family="malgun")
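Two details of this call can be checked without drawing anything: which words clear min.freq, and which font family to pass when the script is not running on Windows. A sketch with a made-up frequency table; `font_family` is a name introduced here, and "sans" is just a generic fallback.

```r
# Made-up frequency table (assumption), shaped like word.df above.
word_df <- data.frame(word = c("hope", "america", "change", "yes"),
                      freq = c(4, 3, 2, 1))

# Words that clear min.freq = 2 and would therefore be drawn.
drawn <- word_df[word_df$freq >= 2, ]
drawn$word

# windowsFonts() exists only on Windows; fall back to a generic family elsewhere.
if (.Platform$OS.type == "windows") {
  windowsFonts(malgun = windowsFont("맑은 고딕"))
  font_family <- "malgun"
} else {
  font_family <- "sans"
}
```

Passing `family=font_family` to wordcloud() would then keep the same script usable on other platforms.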

02. Download a dataset in an area of interest from a public data site and draw a word cloud using the words that occur at least 5 times. Public data site: www.data.go.kr or any other site

library(KoNLP) # extractNoun

women <- file(file.choose(), encoding="UTF-8")
women_data <- readLines(women)
str(women_data)

Remove designated stopwords (chosen by checking the frequency list)

women_data = gsub("여성","",women_data)
women_data = gsub("우리","",women_data)
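When the list of words to strip grows, one gsub() call per word gets repetitive. A sketch that removes several words in one pass; the two input lines are made up and `drop_words` is a name introduced here.

```r
# Two made-up lines (assumption) and the words to strip.
lines_in   <- c("우리 사회의 여성 정책", "여성 일자리 지원")
drop_words <- c("여성", "우리")

# One alternation pattern removes every listed word in a single pass.
pattern <- paste(drop_words, collapse = "|")
cleaned <- gsub(pattern, "", lines_in)

# Collapse the gaps left behind.
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

Note that this matches substrings, just like the two gsub() calls above, so a listed word is also removed when it appears inside a longer word.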

Word extraction

exNouns <- function(x) { # x receives a sentence: 1. convert to character 2. extract nouns 3. paste: join with spaces
  paste(extractNoun(as.character(x)), collapse=" ")
}

women_nouns <- sapply(women_data, exNouns)
women_nouns
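The shape of this step (split each line into tokens, then join them back into one string per line) can be tried without KoNLP. In the sketch below `fake_extract` is a hypothetical stand-in that only splits on spaces; it does not do real noun extraction.

```r
# Hypothetical stand-in for KoNLP::extractNoun(): it only splits on spaces,
# so it does NOT do real noun extraction.
fake_extract <- function(s) strsplit(s, " ", fixed = TRUE)[[1]]

join_tokens <- function(x) {
  paste(fake_extract(as.character(x)), collapse = " ")
}

lines_in <- c("일자리 지원 정책", "교육 기회")   # made-up input (assumption)

# By default sapply() names each result after its input line.
named <- sapply(lines_in, join_tokens)

# USE.NAMES = FALSE returns a plain, unnamed character vector.
joined <- sapply(lines_in, join_tokens, USE.NAMES = FALSE)
joined
```

The unnamed form prints more compactly, which helps when each input line is long.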

Data preprocessing

womenCorpus <- Corpus(VectorSource(women_nouns)) 
womenCorpus

View contents

inspect(womenCorpus[1])

womenCorpusP <- tm_map(womenCorpus, removePunctuation) # remove punctuation
womenCorpusP <- tm_map(womenCorpusP, removeNumbers) # remove numbers
womenCorpusP <- tm_map(womenCorpusP, content_transformer(tolower)) # convert to lowercase

inspect(womenCorpusP[1])

Check the preprocessing result

womenCorpusP
inspect(womenCorpusP[1])

Term selection

womenCorpusP_term <- TermDocumentMatrix(womenCorpusP, 
                                          control=list(wordLengths=c(4,16))) 
womenCorpusP_term

Plain-format conversion

myTerm_df <- as.data.frame(as.matrix(womenCorpusP_term)) 
dim(myTerm_df) #548,1
str(myTerm_df)

Compute frequencies

wordResult <- sort(rowSums(myTerm_df), decreasing=TRUE)
wordResult[1:10]

Visualization
(1) Create word names -> names of the frequency vector

myName <- names(wordResult)

(2) Create a data.frame from word names and frequencies

word.df <- data.frame(word=myName, freq=wordResult) 
str(word.df)

(3) Set word colors and font

pal <- brewer.pal(5,"Paired")
windowsFonts(malgun=windowsFont("맑은 고딕"))

(4) Word cloud visualization: set size, minimum frequency, order, rotation, colors, font

wordcloud(word.df$word, word.df$freq, 
          scale=c(2,1), min.freq=5, random.order=F, 
          rot.per=.1, colors=pal, family="malgun")

myCorpusPrepro_term <- TermDocumentMatrix(myCorpusPrepro, 
                                          control=list(wordLengths=c(4,16))) # 4 to 16 characters

myTerm_df <- as.data.frame(as.matrix(myCorpusPrepro_term)) 

wordResult <- sort(rowSums(myTerm_df), decreasing=TRUE) 
wordResult
