
80. Python TextMining Exercises (2)

LEE_BOMB 2021. 12. 14. 23:33
Q4) Using the 'review2' column of the review_data.csv file, compute the word frequencies step by step as follows, and visualize them as a word cloud.

 

import pandas as pd
from konlpy.tag import Okt
from wordcloud import WordCloud # class



1. file load 

review_data = pd.read_csv('c:/ITWILL/4_Python-2/data/review_data.csv', 
                          encoding='utf-8')

review_data.info()

RangeIndex: 34525 entries, 0 to 34524
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       34525 non-null  int64 
 1   review   34525 non-null  object
 2   label    34525 non-null  int64 
 3   review2  34525 non-null  object

Select the review2 column

review = review_data['review2']
len(review) #34525

okt = Okt()



2. Sentence extraction: using the Okt class
# usage: sent = okt.normalize(paragraph) -> str

ex_sent = [okt.normalize(sent) for sent in review ]
len(ex_sent) #34525

 


3. Noun extraction: using the Okt class
# usage: okt.nouns(sentence) -> list of nouns

ex_nouns = [] # collected nouns

for sent in ex_sent : # each normalized sentence
    for noun in okt.nouns(sent) : # each noun in the sentence
        ex_nouns.append(noun) # store the noun

len(ex_nouns) #210,849
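
The exercise also calls for word frequencies and a word cloud. A minimal sketch of the counting step, assuming a flat list of nouns such as ex_nouns. The sample list, the helper name count_words, and the choice to drop one-character nouns are illustrative assumptions, not part of the original solution:

```python
from collections import Counter

def count_words(nouns, min_len=2, top_n=50):
    """Count nouns, skip very short ones, and keep the most frequent."""
    counter = Counter(n for n in nouns if len(n) >= min_len)
    return dict(counter.most_common(top_n))

# Hypothetical stand-in for ex_nouns
sample_nouns = ['영화', '연기', '영화', '배우', '것', '영화', '연기']
word_count = count_words(sample_nouns)
print(word_count)
```

The resulting dictionary could then be passed to WordCloud(font_path=...).generate_from_frequencies(word_count). A font that contains Hangul glyphs has to be supplied through font_path; otherwise the Korean words are likely to render as empty boxes.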


Q5) Using the Korean movie review file (review_data.csv), define a function that takes a keyword as input and searches for related movie reviews under the conditions below.

<Condition 1> Column to use: review2
<Condition 2> Number of documents to use: 1st to 5000th
<Condition 3> Apply cosine similarity: movie review search function
-> Retrieve the top 3 reviews most similar to the search keyword
<Condition 4> Search keywords: 액션영화 (action movie), 시나리오 (screenplay), 중국영화 (Chinese movie)
-> Enter the keywords above one at a time to search for related reviews
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



1. dataset load 

data = pd.read_csv("c:/ITWILL/4_python-2/data/review_data.csv")
data.info() 
'''
 0   id       34525 non-null  int64 
 1   review   34525 non-null  object
 2   label    34525 non-null  int64 
 3   review2  34525 non-null  object -> column to use
'''
print(data.head())



1. Limit the documents to 5,000

review = data.review2[:5000] # 1st to 5000th documents



2. Create the sparse matrix: on the review column

obj = TfidfVectorizer()
sp_mat = obj.fit_transform(review)
sp_mat.shape #(5000, 19361)

 

Convert the sparse matrix to a NumPy array

sp_mat_arr = sp_mat.toarray()
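
Converting to a dense array is optional here, since cosine_similarity also accepts SciPy sparse matrices. A rough estimate of what toarray() allocates for the shape reported above, assuming the default float64 dtype of TfidfVectorizer (8 bytes per cell):

```python
# Shape reported by sp_mat.shape
n_docs, n_terms = 5000, 19361
bytes_per_cell = 8  # assumption: float64 values

dense_bytes = n_docs * n_terms * bytes_per_cell
dense_mib = dense_bytes / 1024 ** 2
print(f'{dense_bytes:,} bytes = {dense_mib:.1f} MiB')
```

Under that assumption the dense copy takes several hundred MiB, while the sparse matrix stores only the non-zero weights.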



3. Cosine similarity: movie review search function

def review_search(query) : 
    query_data = [query] # transform() expects an iterable of documents
    query_sm = obj.transform(query_data) # reuse the fitted vocabulary
    query_sm_arr = query_sm.toarray()
    
    sim = cosine_similarity(query_sm_arr, sp_mat_arr)
    sim = sim.squeeze() # 2d -> 1d : drop the axis of length 1
    # 2d(1, 5000) -> 1d(5000,)
    
    sim_idx = sim.argsort()[::-1] # indices sorted by descending similarity
    
    for idx in sim_idx[:3] : # top 3
        print(f'sim : {sim[idx]}, review : {review[idx]}')
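
For reference, the score that cosine_similarity returns for two vectors is their dot product divided by the product of their lengths. A small pure-Python sketch of that formula on made-up three-term vectors (the vectors and the helper name cosine are illustrative only):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = [1.0, 0.0, 0.0]      # query containing only term 0
docs = [[0.6, 0.8, 0.0],     # shares term 0 with the query
        [0.0, 1.0, 0.0],     # no shared term
        [1.0, 0.0, 0.0]]     # same direction as the query

sims = [cosine(query, d) for d in docs]
print(sims)
```

Because TF-IDF weights are non-negative, the scores fall between 0 and 1, and a review that shares no term with the query scores exactly 0.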


    
4. Search keywords: 액션영화 (action movie), 시나리오 (screenplay), 중국영화 (Chinese movie)

review_search(input('Enter a search keyword : '))

Enter a search keyword : 액션영화
sim : 0.5846263639334625, review : 스웨덴식 액션영화 강추
sim : 0.43431544406444184, review : 나 범죄영화나 스릴러영화나 액션영화 디게 좋아하는데
sim : 0.24862959769106963, review : 년대 만들었을 법한 액션영화 감독이 돈이 많은가 보네요어찌 이런 영화를 의도하에 만든건지 심심해서 만든건지비디오영화도 이 정도는 아닌데 감독대단

Enter a search keyword : 시나리오
sim : 0.6444909451577203, review : 최고의영화죠 시나리오 굿
sim : 0.5014644310272237, review : 시나리오 쓰신 분 정말 존경스럽네요 
sim : 0.39110728021204, review : 참신하고 독특한 영화 울나라는 이런 시나리오 못 쓰나요

Enter a search keyword : 중국영화
sim : 0.27677966931755404, review : 요란법석만떨며 시끄럽기만 한 중국영화 스티븐시걸주연의 급 비디오용 영화보는듯 하다 개연성설득력리얼리티는 제로 시나리오는 저 멀리 년대 홍콩액션영화
sim : 0.2723791172588158, review : 갈수록 개판되가는 중국영화 유치하고 내용없음 폼잡다 끝남 말도안되는 무기에 유치한남무 아 그립다 동사서독같은 영화가 이건 류아류작이다
sim : 0.0, review : 주인공이 더 악당인영화재미있을 려고 영화봤는데더 스트레스받는 영화
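
The last result above has a similarity of 0.0: it shares no vocabulary term with the query and is listed only because the top three are printed unconditionally. One possible refinement, sketched here in plain Python with made-up scores (the helper name top_k_nonzero is hypothetical), is to skip zero-score entries:

```python
def top_k_nonzero(scores, k=3):
    """Return (index, score) pairs for the k highest scores, ignoring zeros."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked[:k] if s > 0]

# Made-up similarity scores for five documents
scores = [0.0, 0.2768, 0.0, 0.2724, 0.0]
print(top_k_nonzero(scores))   # → [(1, 0.2768), (3, 0.2724)]
```

The same condition could be applied inside review_search by testing sim[idx] > 0 before printing.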