80. Python TextMining Exercises (2)
Problem 4) Using the 'review2' column of review_data.csv, compute word frequencies step by step as follows and visualize them as a word cloud.
import pandas as pd
from konlpy.tag import Okt
from wordcloud import WordCloud # class
# 1. file load
review_data = pd.read_csv('c:/ITWILL/4_Python-2/data/review_data.csv',
encoding='utf-8')
review_data.info()
RangeIndex: 34525 entries, 0 to 34524
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 34525 non-null int64
1 review 34525 non-null object
2 label 34525 non-null int64
3 review2 34525 non-null object
# Select the review2 column
review = review_data['review2']
len(review) #34525
okt = Okt()
# 2. Sentence extraction : using the Okt class
# sent = okt.normalize(paragraph) # returns str
ex_sent = [okt.normalize(sent) for sent in review ]
len(ex_sent) #34525
# 3. Noun extraction : using the Okt class
# okt.nouns(sentence)
ex_nouns = [] # stores the extracted nouns
for sent in ex_sent : # iterate over sentences
    for noun in okt.nouns(sent) : # extract nouns
        ex_nouns.append(noun) # store each noun
len(ex_nouns) #210,849
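The problem statement also asks for word frequencies and a word cloud, which the steps above stop short of. The following is a minimal sketch of those remaining steps, not the course's own solution. The helper names `word_frequency` and `draw_wordcloud`, the `min_len` and `top_n` thresholds, and the toy noun list are all assumptions made for illustration. A Korean-capable font must be supplied through `font_path` when drawing.

```python
from collections import Counter

def word_frequency(nouns, min_len=2, top_n=50):
    """Count nouns, dropping tokens shorter than min_len, and keep the top_n."""
    kept = [w for w in nouns if len(w) >= min_len]
    return dict(Counter(kept).most_common(top_n))

def draw_wordcloud(freq, font_path):
    """Render a word cloud from a {word: count} dict.

    Requires the wordcloud and matplotlib packages; font_path must point
    to a font that contains Hangul glyphs (hypothetical, system dependent).
    """
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    wc = WordCloud(font_path=font_path, width=500, height=400,
                   background_color='white')
    plt.imshow(wc.generate_from_frequencies(freq))
    plt.axis('off')
    plt.show()

# Toy input standing in for ex_nouns
sample_nouns = ['영화', '영화', '연기', '영화', '것', '연기', '감독']
freq = word_frequency(sample_nouns, min_len=2, top_n=2)
print(freq)  # {'영화': 3, '연기': 2}
```

With the real data, `word_frequency(ex_nouns)` would produce the dictionary to pass to `draw_wordcloud`.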
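Problem 5) Using the Korean movie review file (review_data.csv), define a function that takes a keyword as input and searches for related movie reviews under the conditions below.
<Condition 1> Column to use : review2
<Condition 2> Documents to use : 1st through 5000th
<Condition 3> Apply cosine similarity - movie review search function
-> Retrieve the top 3 reviews with the highest similarity to the search keyword
<Condition 4> Search keywords : 액션영화 (action movie), 시나리오 (screenplay), 중국영화 (Chinese movie)
-> Enter the keywords above one at a time and search for related reviews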
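To make Condition 3 concrete, here is a small pure-Python sketch of what cosine similarity computes and how a top-3 ranking follows from it. It uses raw term counts instead of TF-IDF weights, and the English toy documents and the `cosine` helper are assumptions for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

docs = ['action movie great action', 'boring story', 'great story great movie']
doc_vecs = [Counter(d.split()) for d in docs]
query_vec = Counter('action movie'.split())

scores = [cosine(query_vec, v) for v in doc_vecs]
# Document indices ordered from most to least similar
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
for i in ranked[:3]:
    print(round(scores[i], 3), docs[i])
```

A document that shares no term with the query scores exactly 0.0, which is why a keyword search can return unrelated reviews when fewer than three documents match.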
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# 1. dataset load
data = pd.read_csv("c:/ITWILL/4_python-2/data/review_data.csv")
data.info()
'''
0 id 34525 non-null int64
1 review 34525 non-null object
2 label 34525 non-null int64
3 review2 34525 non-null object -> column to use
'''
print(data.head())
# 1. Select the 5,000 documents to use
review = data.review2[:5000] # 1st through 5000th documents
# 2. Create the sparse matrix : from the review column
obj = TfidfVectorizer()
sp_mat = obj.fit_transform(review)
sp_mat.shape #(5000, 19361)
# Convert the sparse matrix to a dense NumPy array
sp_mat_arr = sp_mat.toarray()
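A quick back-of-the-envelope check of what the dense conversion costs, assuming the matrix keeps `TfidfVectorizer`'s default `float64` dtype. The variable names below are illustrative only.

```python
# Shape reported above for the TF-IDF matrix
n_docs, n_terms = 5000, 19361
bytes_per_value = 8  # float64, assumed to be the dtype in use

dense_bytes = n_docs * n_terms * bytes_per_value
print(f'{dense_bytes / 1e6:.1f} MB')  # 774.4 MB
```

Since `cosine_similarity` also accepts sparse input, passing `sp_mat` directly would be one way to avoid this allocation on larger document sets.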
# 3. Cosine similarity : movie review search function
def review_search(query) :
    query_data = [query]
    query_sm = obj.transform(query_data)
    query_sm_arr = query_sm.toarray()
    sim = cosine_similarity(query_sm_arr, sp_mat_arr)
    sim = sim.squeeze() # 2d -> 1d : remove dimensions of size 1
    # 2d(1, 5000) -> 1d(5000,)
    sim_idx = sim.argsort()[::-1] # indices in descending order of similarity
    for idx in sim_idx[:3] : # top3
        print(f'sim : {sim[idx]}, review : {review[idx]}')
# 4. Search keywords : 액션영화 (action movie), 시나리오 (screenplay), 중국영화 (Chinese movie)
review_search(input('Enter search keyword : '))
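Enter search keyword : 액션영화
sim : 0.5846263639334625, review : A Swedish action movie, highly recommended
sim : 0.43431544406444184, review : I really like crime movies, thrillers, and action movies, but
sim : 0.24862959769106963, review : An action movie that looks like it was made decades ago. The director must have a lot of money [...] Was a movie like this made on purpose, or made out of boredom? Even direct-to-video movies are not this bad. Impressive, director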
Enter search keyword : 시나리오
sim : 0.6444909451577203, review : The best movie, great screenplay
sim : 0.5014644310272237, review : I truly respect the person who wrote the screenplay
sim : 0.39110728021204, review : A fresh and unique movie. Can't our country write screenplays like this?
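Enter search keyword : 중국영화
sim : 0.27677966931755404, review : A Chinese movie that is nothing but fuss and noise. It feels like watching a direct-to-video movie on the level of a Steven Seagal vehicle. Plausibility, persuasiveness, and realism are zero, and the screenplay has been left far behind. A Hong Kong action movie
sim : 0.2723791172588158, review : Chinese movies that keep turning into more of a mess. Childish, no substance, over after striking a few poses. Absurd weapons and rampant childishness. I miss the old days. Is this a movie like Ashes of Time? What even is this
sim : 0.0, review : A movie whose lead is too [...]. I watched it hoping to have fun, but it is a movie that just causes stress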
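The third result for the last keyword has a similarity of 0.0, meaning it shares no term with the query and was returned only to fill the top 3. One possible refinement, sketched here with a hypothetical `top_k` helper and made-up scores, is to drop results at or below a minimum score before taking the top entries.

```python
def top_k(scores, k=3, min_score=0.0):
    """Return indices of the k highest scores, skipping scores <= min_score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in order if scores[i] > min_score][:k]

# Illustrative similarity values: only two documents match the query
sims = [0.0, 0.2768, 0.0, 0.2724, 0.0]
print(top_k(sims))  # [1, 3]
```

Inside `review_search`, the same filter could be applied to `sim` before printing, so that a query with fewer than three matching reviews prints only the matches.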