Example: noun wordCloud

1. Read the text file (docs)
2. Paragraph -> extract sentences
3. Sentence -> extract words
4. Preprocess words & count words
5. Word cloud visualization


from konlpy.tag import Kkma #class - morphological analyzer
from wordcloud import WordCloud #class - word cloud visualization



1. Read the text file (docs)

path = r"C:\ITWILL\4_Python-2\workspace\chap10_TextMining\data"
file = open(path + '/text_data.txt', encoding='utf-8')
para = file.read() #string

print(para)
type(para) #str
file.close()




2. ๋ฌธ๋‹จ(๋ฌธ์ž์—ด) -> ๋ฌธ์žฅ(list)

kkma = Kkma()

ex_sents = kkma.sentences(para) #returns a list
print(ex_sents) #['ํ˜•ํƒœ์†Œ ๋ถ„์„์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.', '๋‚˜๋Š” ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ์ข‹์•„ํ•ฉ๋‹ˆ๋‹ค.', '์ง์—…์€ ๋ฐ์ดํ„ฐ ๋ถ„์„ ์ „๋ฌธ๊ฐ€ ์ž…๋‹ˆ๋‹ค.', 'Text mining ๊ธฐ๋ฒ•์€ 2000๋Œ€ ์ดˆ๋ฐ˜์— ๊ฐœ๋ฐœ๋œ ๊ธฐ์ˆ ์ด๋‹ค.']
len(ex_sents) #4


๋ฌธ๋‹จ(๋ฌธ์ž์—ด) -> ๋ช…์‚ฌ(list)

ex_nouns = kkma.nouns(para)
print(ex_nouns) #['ํ˜•ํƒœ์†Œ', '๋ถ„์„', '๋‚˜', '๋ฐ์ดํ„ฐ', '์ง์—…', '์ „๋ฌธ๊ฐ€', '๊ธฐ๋ฒ•', '2000', '2000๋Œ€', '๋Œ€', '์ดˆ๋ฐ˜', '๊ฐœ๋ฐœ', '๊ธฐ์ˆ ']
len(ex_nouns) #13




3. Sentences -> extract words (nouns)

nouns = [] #store nouns (duplicates kept)

for sent in ex_sents : #paragraph -> sentences
    for noun in kkma.nouns(sent) : #sentence -> extract nouns
        nouns.append(noun)

print(nouns) #['ํ˜•ํƒœ์†Œ', '๋ถ„์„', '๋ฐ์ดํ„ฐ', '๋ถ„์„', '์ง์—…', '๋ฐ์ดํ„ฐ', '๋ถ„์„', '์ „๋ฌธ๊ฐ€', '๊ธฐ๋ฒ•', '2000', '2000๋Œ€', '๋Œ€', '์ดˆ๋ฐ˜', '๊ฐœ๋ฐœ', '๊ธฐ์ˆ ']
len(nouns) #15




4. Preprocessing & word count : drop 1-syllable words and numeric tokens

from re import match #used to drop numeric tokens

wc = {} #word counts

for noun in nouns :
    if len(noun) > 1 and not match('^[0-9]', noun) : #word preprocessing
        wc[noun] = wc.get(noun, 0) + 1 #word count

print(wc) #{'ํ˜•ํƒœ์†Œ': 1, '๋ถ„์„': 3, '๋ฐ์ดํ„ฐ': 2, '์ง์—…': 1, '์ „๋ฌธ๊ฐ€': 1, '๊ธฐ๋ฒ•': 1, '์ดˆ๋ฐ˜': 1, '๊ฐœ๋ฐœ': 1, '๊ธฐ์ˆ ': 1}
len(wc) #9




5. Word cloud visualization
1) select the top-5 words

from collections import Counter #class

counter = Counter(wc)
top5_word = counter.most_common(5)
print(top5_word) #[('๋ถ„์„', 3), ('๋ฐ์ดํ„ฐ', 2), ('ํ˜•ํƒœ์†Œ', 1), ('์ง์—…', 1), ('์ „๋ฌธ๊ฐ€', 1)]


2) word cloud

wc = WordCloud(font_path='C:/Windows/Fonts/malgun.ttf',
               width=500, height=400,
               max_words=100, max_font_size=150,
               background_color='white')

wc_result = wc.generate_from_frequencies(dict(top5_word))

import matplotlib.pyplot as plt

plt.imshow(wc_result)
plt.axis('off') #hide the axis ticks
plt.show()






news wordCloud

1. Read the pickle file
2. Extract sentences : Okt
3. Sentences -> extract nouns (words) : Okt
4. Preprocessing : limit word length
5. Word cloud visualization

import pickle #read pickle file
from konlpy.tag import Okt #morphological analyzer
from wordcloud import WordCloud



1. Read the pickle file

path = r'C:\ITWILL\4_Python-2\workspace\chap10_TextMining\data'


file load

file = open(path + '/news_data.pkl', mode='rb')
news_data = pickle.load(file)
file.close()

news_data #[[1day], [2day], ..., [150day]] - nested list
news_data[0] #[1day]
news_data[-1] #[150day]
type(news_data[-1]) #list
len(news_data) #150 -> 150 paragraphs (strings)

okt = Okt() #morphological analyzer




2. ๋ฌธ๋‹จ(๋ฌธ์ž์—ด) - > ๋ฌธ์žฅ(๋ฌธ์ž์—ด)

ex_sents = [] #store sentences
for sent in news_data[:150] : #150day
    para = sent[0] #extract the paragraph
    sents = okt.normalize(para)
    ex_sents.append(sents)

len(ex_sents) #150




3. Sentences -> extract nouns

nouns = [] #store nouns

for sent in ex_sents : #paragraph -> sentence
    for noun in okt.nouns(sent) : #sentence -> extract nouns
        nouns.append(noun) #store the noun

print(nouns)
len(nouns) #1129




4. Preprocessing (drop 1-syllable words) & word count

wc = {} #word counts

for noun in nouns :
    if len(noun) > 1 :
        wc[noun] = wc.get(noun, 0) + 1 #dict keys are unique

print(wc)
len(wc) #732




5. Word cloud visualization
1) select the top-N words

from collections import Counter #class
counter = Counter(wc)
top100_word = counter.most_common(100)
print(top100_word)


2) word cloud

wc = WordCloud(font_path='C:/Windows/Fonts/malgun.ttf',
               width=500, height=400, max_words=100, max_font_size=150,
               background_color='white')

wc_result = wc.generate_from_frequencies(dict(top100_word))

import matplotlib.pyplot as plt
plt.imshow(wc_result)
plt.axis('off') #hide the axis ticks
plt.show()
plt.show()






๋ฌธ์„œ ์œ ์‚ฌ๋„ (Document Similarity)

๋ฌธ์„œ ์œ ์‚ฌ๋„
๋ฌธ์„œ์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ•˜๋Š” ์ผ์€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์˜ ์ฃผ์š” ์ฃผ์ œ ์ค‘ ํ•˜๋‚˜
๋ฌธ์„œ๋“ค ๊ฐ„์— ๋™์ผํ•œ ๋‹จ์–ด ๋˜๋Š” ๋น„์Šทํ•œ ๋‹จ์–ด๋ฅผ ์ด์šฉํ•˜์—ฌ ์œ ์‚ฌํ•œ ๋ฌธ์„œ ๊ฒ€์ƒ‰
๋ฌธ์„œ ์œ ์‚ฌ๋„ ์„ฑ๋Šฅ ๊ฒฐ์ • ์š”์ธ
- ๊ฐ ๋ฌธ์„œ์˜ ๋‹จ์–ด๋“ค์„ ์ˆ˜์น˜ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•(DTM, Word2Vec ๋“ฑ),
- ๋ฌธ์„œ ๊ฐ„์˜ ๋‹จ์–ด๋“ค์˜ ์œ ์‚ฌ๋„ ๋ฐฉ๋ฒ•(์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ, ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๋“ฑ)

* Word embedding : a numeric representation of words; it is needed because computers cannot understand text directly, so every algorithm that operates on text data works with numbers instead.


๋ฌธ์„œ ์œ ์‚ฌ๋„ ์œ ํ˜•
๋ฌธ์„œ ๊ฐ„์˜ ๋‹จ์–ด๋“ค์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•
1. ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„(Cosine Similarity)
- ๋‘ ๋ฒกํ„ฐ ๊ฐ„์˜ ์ฝ”์‚ฌ์ธ ๊ฐ๋„๋ฅผ ์ด์šฉํ•œ ์œ ์‚ฌ๋„

2. ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ(Euclidean distance) - sqrt(sum((p-q)^2))
- ํ”ผํƒ€๊ณ ๋ผ์Šค์˜ ์ •๋ฆฌ์— ์˜ํ•œ ๋‘ ์  ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ

3. ์ž์นด๋“œ ์œ ์‚ฌ๋„(Jaccard similarity)
- ๋‘ ๋ฌธ์„œ์—์„œ ๊ณตํ†ต์œผ๋กœ ๋“ค์–ด๊ฐ„ ๋‹จ์–ด์˜ ๋น„์œจ
- ๋‘ ๋ฌธ์„œ ํ•ฉ์ง‘ํ•ฉ -> ๊ต์ง‘ํ•ฉ์˜ ๋น„์œจ ex) len(๊ต์ง‘ํ•ฉ ๋‹จ์–ด) / len(ํ•ฉ์ง‘ํ•ฉ ๋‹จ์–ด)
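The set-based definition above can be sketched as follows; the two sample documents are hypothetical:

```python
# hypothetical documents, tokenized by whitespace
doc1 = "data analysis is fun".split()
doc2 = "I like data analysis".split()

# Jaccard similarity = |intersection| / |union| of the two word sets
s1, s2 = set(doc1), set(doc2)
jaccard = len(s1 & s2) / len(s1 | s2)
print(jaccard)  # 2 shared words / 6 unique words = 1/3
```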


์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„
๋‘ ๋ฒกํ„ฐ ๊ฐ„์˜ ์ฝ”์‚ฌ์ธ ๊ฐ๋„๋ฅผ ์ด์šฉํ•œ ์œ ์‚ฌ๋„
๋‘ ๋ฒกํ„ฐ์˜ ๋ฐฉํ–ฅ์ด ์™„์ „ํžˆ ๋™์ผํ•œ ๊ฒฝ์šฐ 1์˜ ๊ฐ’, 90๋„ ๊ฐ 0, 180๋„ ๊ฐ -1 ๊ฐ’


Cosine similarity between two vectors A and B:

similarity = cos(theta) = (A . B) / (||A|| * ||B||)

* ||A|| : the magnitude/length (norm) of vector A
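The formula can be sketched in plain Python; vectors A and B are hypothetical word counts, with B chosen as A scaled by 2 so the similarity comes out exactly 1:

```python
import math

# hypothetical word-count vectors; B points in the same direction as A
A = [1, 1, 0, 2]
B = [2, 2, 0, 4]

dot = sum(a * b for a, b in zip(A, B))     # A . B
norm_a = math.sqrt(sum(a * a for a in A))  # ||A||
norm_b = math.sqrt(sum(b * b for b in B))  # ||B||

cos_sim = dot / (norm_a * norm_b)
print(cos_sim)  # 1.0 (identical direction)
```

Because only the angle matters, cosine similarity ignores document length, which is why it is commonly preferred over Euclidean distance for DTM vectors.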





