๋ฐ์ดํ„ฐ๋ถ„์„๊ฐ€ ๊ณผ์ •/Python

DAY59. Python Text Mining: Cosine Similarity

LEE_BOMB 2021. 12. 14. 20:19
cosine_similarity

<์ž‘์—…์ ˆ์ฐจ>
1. ๋Œ€์ƒ ๋ฌธ์„œ(์ž์—ฐ์–ด) -> ํฌ์†Œํ–‰๋ ฌ(DTM:๋ฌธ์„œ๋‹จ์–ดํ–‰๋ ฌ)
2. ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ์ ์šฉ 
-> ๋ฌธ์„œ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๋‹จ์–ด๋“ค ๊ฐ„์˜ ์œ ์‚ฌ๋„ ์ธก์ •(-1 ~ +1)

 

from sklearn.feature_extraction.text import TfidfVectorizer #class: sparse matrix (TF-IDF)
from sklearn.metrics.pairwise import cosine_similarity #function: cosine similarity


๋ฌธ์žฅ(sentence) : 3๊ฐœ ๋ฌธ์žฅ(์ž์—ฐ์–ด) 

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

print(sentences)
len(sentences) #3 sentences

 



1. ๋Œ€์ƒ ๋ฌธ์„œ(์ž์—ฐ์–ด) -> ํฌ์†Œํ–‰๋ ฌ(DTM:๋ฌธ์„œ๋‹จ์–ดํ–‰๋ ฌ)

tfidf = TfidfVectorizer() #1) vectorizer (term generator)


๋‹จ์–ด ๋ณด๊ธฐ 

fit = tfidf.fit(sentences) #fit to the sentences 
voca = fit.vocabulary_
print(voca)


#2) ํฌ์†Œํ–‰๋ ฌ(DTM)

sp_mat = tfidf.fit_transform(sentences) #fit and transform the sentences 
print(sp_mat)

  (ํ–‰,์—ด)
  (0, 3) 0.2205828828763741

scipy sparse -> numpy array 

sp_mat_arr = sp_mat.toarray()
print(sp_mat_arr)
sp_mat_arr.shape #(3, 31) -> (number of documents, number of terms)

 



2. ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ์ ์šฉ
1) ๊ฒ€์ƒ‰์ฟผ๋ฆฌ : ๊ฒ€์ƒ‰ํ•  ๋ฌธ์„œ 

query = ['green plant in his study']


2) ํฌ์†Œํ–‰๋ ฌ(DTM)

query_sp_mat = tfidf.transform(query) #note: transform(), not fit_transform() - reuse the fitted vocabulary


numpy ํ–‰๋ ฌ 

query_sp_mat_arr = query_sp_mat.toarray()


3) ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๊ณ„์‚ฐ 

sim = cosine_similarity(query_sp_mat_arr, sp_mat_arr)
print(sim) #[[0.25069697 0.74327606 0.24964024]]
sim.shape #(1, 3)


2d -> 1d

sim1d = sim.reshape(3)
sim1d # [0.25069697, 0.74327606, 0.24964024]


4) ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ(์ƒ‰์ธ ๊ธฐ์ค€) 

sim_idx = sim1d.argsort()[::-1] #[1, 0, 2]
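argsort() returns the indices that would sort the array in ascending order; the [::-1] slice reverses them so the most similar document comes first. A quick numpy illustration using rounded versions of the similarity values printed above:

```python
import numpy as np

scores = np.array([0.25, 0.74, 0.24])  # one similarity score per document
order = scores.argsort()[::-1]         # indices sorted by score, highest first

print(order)          # [1 0 2]
print(scores[order])  # [0.74 0.25 0.24]
```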


5) query์™€ ๊ฐ€์žฅ ์œ ์‚ฌ๋„๊ฐ€ ๋†’์€ ์ˆœ์œผ๋กœ ๋ฌธ์žฅ ๊ฒ€์ƒ‰

for idx in sim_idx : 
    print(f'similarity : {sim1d[idx]}, sentence : {sentences[idx]}')


movie recommendation

์œ ์‚ฌ ๋ฌธ์„œ ๊ฒ€์ƒ‰ ์‹œ์Šคํ…œ 

์˜ํ™” ๊ฒ€์ƒ‰(์ถ”์ฒœ) ์‹œ์Šคํ…œ : ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„ ๊ธฐ๋ฐ˜ 
ex) ์˜ํ™” ํ‚ค์›Œ๋“œ -> ์˜ํ™” ํ›„๊ธฐ ํ…์ŠคํŠธ์—์„œ ๊ด€๋ จ ์˜ํ™” ์ค„๊ฑฐ๋ฆฌ ์ œ๊ณต  

import pandas as pd #csv file read
from sklearn.feature_extraction.text import TfidfVectorizer #class: sparse matrix (TF-IDF) 
from sklearn.metrics.pairwise import cosine_similarity #function: cosine similarity



1. dataset load 

data = pd.read_csv(r'C:\ITWILL\4_Python-2\data\movie_reviews.csv')
data.info()

RangeIndex: 1492 entries, 0 to 1491
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   reviews  1492 non-null   object : movie review 
 1   title    1492 non-null   object : movie title 
 2   label    1492 non-null   int64  : positive/negative 

data.head()




2. ์ „์ฒ˜๋ฆฌ : ๊ฒฐ์ธก์น˜ ์ œ๊ฑฐ 

data_df = data.dropna()
data_df.info()




3. ํฌ์†Œํ–‰๋ ฌ(DTM) : reviews ๋Œ€์ƒ 

reviews = data_df['reviews']
print(reviews)


1) ๋‹จ์–ด์ƒ์„ฑ๊ธฐ-๋ถˆ์šฉ์–ด ์ œ๊ฑฐ 

tfidf = TfidfVectorizer(stop_words='english')


2) ํฌ์†Œํ–‰๋ ฌ(sparse matrix)

movie_sm = tfidf.fit_transform(reviews)
movie_sm.shape #(1492, 34641) - DTM


Convert to numpy array 

movie_sm_arr = movie_sm.toarray()
movie_sm_arr.shape #(1492, 34641) - DTM 
print(movie_sm_arr)

title = data_df['title'] #movie titles


#4. Build query -> sparse matrix -> compute similarity -> recommend Top 5 movies 

def movie_search(query) :
    #1) build the query
    user_query = [query]
    
    #2) query sparse matrix 
    query_sm = tfidf.transform(user_query)
    query_sm_arr = query_sm.toarray() #numpy array 
    
    #3) cosine similarity 
    sim = cosine_similarity(query_sm_arr, movie_sm_arr)
    print(sim.shape) #(1, 1492)
    #2d -> 1d
    sim1d = sim.reshape(1492)
    
    #4) sort indices in descending order
    sim_idx = sim1d.argsort()[::-1] 
    print('top5 index : ', sim_idx[:5])
    #top5 index :  [1281 1304  373  554  260]
    
    #5) recommend the Top 5 movies 
    for idx in sim_idx[:5] :
        print(f'similarity : {sim1d[idx]}, title : {title[idx]}')


ํ•จ์ˆ˜ ํ˜ธ์ถœ : ์˜ํ™”๊ด€๋ จ ํ‚ค์›Œ๋“œ(ํ‚ค๋ณด๋“œ ์ž…๋ ฅ)

movie_search(input('search query input : '))

search query input : action
์œ ์‚ฌ๋„ : 0.20192921485638887, ์˜ํ™”์ œ๋ชฉ : Soldier (1998)
์œ ์‚ฌ๋„ : 0.1958404700223592, ์˜ํ™”์ œ๋ชฉ : Romeo Must Die (2000)
์œ ์‚ฌ๋„ : 0.18885169874338412, ์˜ํ™”์ œ๋ชฉ : Aliens (1986)
์œ ์‚ฌ๋„ : 0.18489066174805405, ์˜ํ™”์ œ๋ชฉ : Speed 2: Cruise Control (1997)
์œ ์‚ฌ๋„ : 0.16658803590038168, ์˜ํ™”์ œ๋ชฉ : Total Recall (1990)

search query input : drama
์œ ์‚ฌ๋„ : 0.1931737274266525, ์˜ํ™”์ œ๋ชฉ : Apollo 13 (1995)
์œ ์‚ฌ๋„ : 0.11796112357272329, ์˜ํ™”์ œ๋ชฉ : Double Jeopardy (1999)
์œ ์‚ฌ๋„ : 0.11374906390472769, ์˜ํ™”์ œ๋ชฉ : Practical Magic (1998)
์œ ์‚ฌ๋„ : 0.11037479275255738, ์˜ํ™”์ œ๋ชฉ : Civil Action, A (1998)
์œ ์‚ฌ๋„ : 0.09607905933279662, ์˜ํ™”์ œ๋ชฉ : Truman Show, The (1998)


word2vec

์œ ์‚ฌ ๋‹จ์–ด ๊ฒ€์ƒ‰ 
1. pip install gensim 
2. spyder ์—์„œ import

Word2Vec ์•Œ๊ณ ๋ฆฌ์ฆ˜ 
1. CBOW 
2. Skip-Gram 

from gensim.models import Word2Vec #similar-word prediction model 

import nltk #NLTK (Natural Language Toolkit): natural language processing tools 
nltk.download('punkt') #nltk data download
from nltk.tokenize import word_tokenize #sentence -> words 
from nltk.tokenize import sent_tokenize #text -> sentences 
import pandas as pd #csv file read



1. dataset load 
์ถœ์ฒ˜ : https://www.kaggle.com/rounakbanik/the-movies-dataset

data = pd.read_csv('C:/ITWILL/4_Python-2/data/movies_metadata.csv') 
data.info()

RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):



2. ๋ณ€์ˆ˜ ์„ ํƒ & ์ „์ฒ˜๋ฆฌ

df = data[['title', 'overview']] #keep only movie title and overview
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44506 entries, 0 to 45465
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     44506 non-null  object : movie title
 1   overview  44506 non-null  object : movie overview

df.head()


  

3. ํ† ํฐ(token) ์ƒ์„ฑ
sentendce -> word

sent = "my name is hong." 
words = word_tokenize(sent) 
print(words) #['my', 'name', 'is', 'hong', '.']
len(words) #5


2) text -> sentence

text = "my name is hong. my hobby is reading."
sents = sent_tokenize(text)
print(sents) #['my name is hong.', 'my hobby is reading.']


3) Build token lists from overview

overview = df['overview'].tolist() #column -> list
overview[:5]
len(overview) #44506

result = [] #token lists
for row in overview :
    words = word_tokenize(row) #sentence -> words
    result.append(words) #[[tokens of doc 1], [tokens of doc 2], ...]
print(result)

result[0] #tokens of the first overview
result[-1] #tokens of the last overview




4. Build the Word2Vec model

model = Word2Vec(sentences=result, window = 5, min_count = 1, sg = 1)

sentences : tokenized documents
window : context window size (number of surrounding words considered)
min_count : minimum word frequency to keep 
sg : 0-CBOW, 1-Skip-Gram



5. ์œ ์‚ฌ ๋‹จ์–ด ๊ฒ€์ƒ‰

def word_search(keyword) :
    search_re = model.wv.most_similar([keyword])
    print('top5 :', search_re[:5])
    
word_search(input('key word input :')) #keywords tried: husband -> woman -> success

key word input : husband
top5 :  [('boyfriend', 0.8590863347053528), 
         ('lover', 0.8467974066734314), 
         ('fiancé', 0.7997056245803833), 
         ('ex-husband', 0.7850815653800964), 
         ('fiance', 0.7803053855895996)]

key word input : woman
top5 :  [('man', 0.8099219799041748), 
         ('girl', 0.7905499339103699), 
         ('schoolgirl', 0.7901395559310913), 
         ('lady', 0.7746134996414185), 
         ('spinster', 0.7675780653953552)]

key word input : success
top5 :  [('fame', 0.8123695850372314),
         ('stardom', 0.7987002730369568), 
         ('commercial', 0.7903648614883423), 
         ('popularity', 0.7882120609283447), 
         ('achieves', 0.7871276140213013)]