Example: noun wordCloud

1. Read the text file (docs)
2. Paragraph -> extract sentences
3. Sentence -> extract words
4. Preprocess words & count words
5. Word cloud visualization


from konlpy.tag import Kkma #class - morphological analyzer
from wordcloud import WordCloud #class - word cloud visualization



1. Read the text file (docs)

path = r"C:\ITWILL\4_Python-2\workspace\chap10_TextMining\data"
file = open(path + '/text_data.txt', encoding='utf-8')
para = file.read() #string

print(para)
type(para) #str
file.close()




2. ๋ฌธ๋‹จ(๋ฌธ์ž์—ด) -> ๋ฌธ์žฅ(list)

kkma = Kkma()

ex_sents = kkma.sentences(para) #returns a list
print(ex_sents) #['ํ˜•ํƒœ์†Œ ๋ถ„์„์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.', '๋‚˜๋Š” ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ์ข‹์•„ํ•ฉ๋‹ˆ๋‹ค.', '์ง์—…์€ ๋ฐ์ดํ„ฐ ๋ถ„์„ ์ „๋ฌธ๊ฐ€ ์ž…๋‹ˆ๋‹ค.', 'Text mining ๊ธฐ๋ฒ•์€ 2000๋Œ€ ์ดˆ๋ฐ˜์— ๊ฐœ๋ฐœ๋œ ๊ธฐ์ˆ ์ด๋‹ค.']
len(ex_sents) #4


๋ฌธ๋‹จ(๋ฌธ์ž์—ด) -> ๋ช…์‚ฌ(list)

ex_nouns = kkma.nouns(para)
print(ex_nouns) #['ํ˜•ํƒœ์†Œ', '๋ถ„์„', '๋‚˜', '๋ฐ์ดํ„ฐ', '์ง์—…', '์ „๋ฌธ๊ฐ€', '๊ธฐ๋ฒ•', '2000', '2000๋Œ€', '๋Œ€', '์ดˆ๋ฐ˜', '๊ฐœ๋ฐœ', '๊ธฐ์ˆ ']
len(ex_nouns) #13




3. Sentences -> extract words (nouns)

nouns = [] #store nouns (duplicates kept)

for sent in ex_sents : #paragraph -> sentences
    for noun in kkma.nouns(sent) : #sentence -> extract nouns
        nouns.append(noun)

print(nouns) #['ํ˜•ํƒœ์†Œ', '๋ถ„์„', '๋ฐ์ดํ„ฐ', '๋ถ„์„', '์ง์—…', '๋ฐ์ดํ„ฐ', '๋ถ„์„', '์ „๋ฌธ๊ฐ€', '๊ธฐ๋ฒ•', '2000', '2000๋Œ€', '๋Œ€', '์ดˆ๋ฐ˜', '๊ฐœ๋ฐœ', '๊ธฐ์ˆ ']
len(nouns) #15




4. Preprocessing & word count : drop 1-syllable words and numeric tokens

from re import match #used to drop numeric tokens

wc = {} #word counts

for noun in nouns :
    if len(noun) > 1 and not match('^[0-9]', noun) : #word preprocessing
        wc[noun] = wc.get(noun, 0) + 1 #word count

print(wc) #{'ํ˜•ํƒœ์†Œ': 1, '๋ถ„์„': 3, '๋ฐ์ดํ„ฐ': 2, '์ง์—…': 1, '์ „๋ฌธ๊ฐ€': 1, '๊ธฐ๋ฒ•': 1, '์ดˆ๋ฐ˜': 1, '๊ฐœ๋ฐœ': 1, '๊ธฐ์ˆ ': 1}
len(wc) #9




5. Word cloud visualization
1) select the top-5 words

from collections import Counter #class

counter = Counter(wc)
top5_word = counter.most_common(5)
print(top5_word) #[('๋ถ„์„', 3), ('๋ฐ์ดํ„ฐ', 2), ('ํ˜•ํƒœ์†Œ', 1), ('์ง์—…', 1), ('์ „๋ฌธ๊ฐ€', 1)]


2) word cloud

wc = WordCloud(font_path='C:/Windows/Fonts/malgun.ttf',
               width=500, height=400,
               max_words=100, max_font_size=150,
               background_color='white')

wc_result = wc.generate_from_frequencies(dict(top5_word))

import matplotlib.pyplot as plt

plt.imshow(wc_result)
plt.axis('off') #hide the axis ticks
plt.show()






news wordCloud

1. Read the pickle file
2. Extract sentences : Okt
3. Sentences -> extract nouns (words) : Okt
4. Preprocessing : limit word length
5. Word cloud visualization

import pickle #read pickle file
from konlpy.tag import Okt #morphological analyzer
from wordcloud import WordCloud



1. Read the pickle file

path = r'C:\ITWILL\4_Python-2\workspace\chap10_TextMining\data'


file load

file = open(path + '/news_data.pkl', mode='rb')
news_data = pickle.load(file)
file.close()

news_data #[[1day], [2day], ..., [150day]] - nested list
news_data[0] #[1day]
news_data[-1] #[150day]
type(news_data[-1]) #list
len(news_data) #150 -> 150 paragraphs (strings)

okt = Okt() #morphological analyzer




2. ๋ฌธ๋‹จ(๋ฌธ์ž์—ด) - > ๋ฌธ์žฅ(๋ฌธ์ž์—ด)

ex_sents = [] #store sentences
for sent in news_data[:150] : #150day
    para = sent[0] #extract the paragraph
    sents = okt.normalize(para)
    ex_sents.append(sents)

len(ex_sents) #150




3. Sentences -> extract nouns

nouns = [] #store nouns

for sent in ex_sents : #paragraph -> sentence
    for noun in okt.nouns(sent) : #sentence -> extract nouns
        nouns.append(noun) #store the noun

print(nouns)
len(nouns) #1129




4. Preprocessing (drop 1-syllable words) & word count

wc = {} #word counts

for noun in nouns :
    if len(noun) > 1 :
        wc[noun] = wc.get(noun, 0) + 1 #dict keys are unique

print(wc)
len(wc) #732




5. Word cloud visualization
1) select the top-N words

from collections import Counter #class
counter = Counter(wc)
top100_word = counter.most_common(100)
print(top100_word)


2) word cloud

wc = WordCloud(font_path='C:/Windows/Fonts/malgun.ttf',
               width=500, height=400, max_words=100, max_font_size=150,
               background_color='white')

wc_result = wc.generate_from_frequencies(dict(top100_word))

import matplotlib.pyplot as plt
plt.imshow(wc_result)
plt.axis('off') #hide the axis ticks
plt.show()
plt.show()






๋ฌธ์„œ ์œ ์‚ฌ๋„ (Document Similarity)

๋ฌธ์„œ ์œ ์‚ฌ๋„
๋ฌธ์„œ์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ•˜๋Š” ์ผ์€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์˜ ์ฃผ์š” ์ฃผ์ œ ์ค‘ ํ•˜๋‚˜
๋ฌธ์„œ๋“ค ๊ฐ„์— ๋™์ผํ•œ ๋‹จ์–ด ๋˜๋Š” ๋น„์Šทํ•œ ๋‹จ์–ด๋ฅผ ์ด์šฉํ•˜์—ฌ ์œ ์‚ฌํ•œ ๋ฌธ์„œ ๊ฒ€์ƒ‰
๋ฌธ์„œ ์œ ์‚ฌ๋„ ์„ฑ๋Šฅ ๊ฒฐ์ • ์š”์ธ
- ๊ฐ ๋ฌธ์„œ์˜ ๋‹จ์–ด๋“ค์„ ์ˆ˜์น˜ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•(DTM, Word2Vec ๋“ฑ),
- ๋ฌธ์„œ ๊ฐ„์˜ ๋‹จ์–ด๋“ค์˜ ์œ ์‚ฌ๋„ ๋ฐฉ๋ฒ•(์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ, ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๋“ฑ)

* Word embedding : a numeric representation of words; it is needed because computers cannot understand text directly, so every algorithm that operates on text data works with numbers instead.


๋ฌธ์„œ ์œ ์‚ฌ๋„ ์œ ํ˜•
๋ฌธ์„œ ๊ฐ„์˜ ๋‹จ์–ด๋“ค์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•
1. ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„(Cosine Similarity)
- ๋‘ ๋ฒกํ„ฐ ๊ฐ„์˜ ์ฝ”์‚ฌ์ธ ๊ฐ๋„๋ฅผ ์ด์šฉํ•œ ์œ ์‚ฌ๋„

2. ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ(Euclidean distance) - sqrt(sum((p-q)^2))
- ํ”ผํƒ€๊ณ ๋ผ์Šค์˜ ์ •๋ฆฌ์— ์˜ํ•œ ๋‘ ์  ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ

3. ์ž์นด๋“œ ์œ ์‚ฌ๋„(Jaccard similarity)
- ๋‘ ๋ฌธ์„œ์—์„œ ๊ณตํ†ต์œผ๋กœ ๋“ค์–ด๊ฐ„ ๋‹จ์–ด์˜ ๋น„์œจ
- ๋‘ ๋ฌธ์„œ ํ•ฉ์ง‘ํ•ฉ -> ๊ต์ง‘ํ•ฉ์˜ ๋น„์œจ ex) len(๊ต์ง‘ํ•ฉ ๋‹จ์–ด) / len(ํ•ฉ์ง‘ํ•ฉ ๋‹จ์–ด)
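The set-based definition above can be sketched as follows; the two sample documents are hypothetical:

```python
# hypothetical documents, tokenized by whitespace
doc1 = "data analysis is fun".split()
doc2 = "I like data analysis".split()

# Jaccard similarity = |intersection| / |union| of the two word sets
s1, s2 = set(doc1), set(doc2)
jaccard = len(s1 & s2) / len(s1 | s2)
print(jaccard)  # 2 shared words / 6 unique words = 1/3
```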


์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„
๋‘ ๋ฒกํ„ฐ ๊ฐ„์˜ ์ฝ”์‚ฌ์ธ ๊ฐ๋„๋ฅผ ์ด์šฉํ•œ ์œ ์‚ฌ๋„
๋‘ ๋ฒกํ„ฐ์˜ ๋ฐฉํ–ฅ์ด ์™„์ „ํžˆ ๋™์ผํ•œ ๊ฒฝ์šฐ 1์˜ ๊ฐ’, 90๋„ ๊ฐ 0, 180๋„ ๊ฐ -1 ๊ฐ’


Cosine similarity between two vectors A and B:

similarity = cos(theta) = (A . B) / (||A|| * ||B||)

* ||A|| : the magnitude/length (norm) of vector A
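The formula can be sketched in plain Python; vectors A and B are hypothetical word counts, with B chosen as A scaled by 2 so the similarity comes out exactly 1:

```python
import math

# hypothetical word-count vectors; B points in the same direction as A
A = [1, 1, 0, 2]
B = [2, 2, 0, 4]

dot = sum(a * b for a, b in zip(A, B))     # A . B
norm_a = math.sqrt(sum(a * a for a in A))  # ||A||
norm_b = math.sqrt(sum(b * b for b in B))  # ||B||

cos_sim = dot / (norm_a * norm_b)
print(cos_sim)  # 1.0 (identical direction)
```

Because only the angle matters, cosine similarity ignores document length, which is why it is commonly preferred over Euclidean distance for DTM vectors.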





