๋ฐ์ดํ„ฐ๋ถ„์„๊ฐ€ ๊ณผ์ •/Python

DAY52. Python Classification (2) TF-IDF Sparse Matrix (Word Extraction)

LEE_BOMB 2021. 12. 3. 17:32
Tfidf Vectorizer

๋‹จ์–ด์ƒ์„ฑ๊ธฐ(TfidfVectorizer) : ๋ฌธ์žฅ -> ๋‹จ์–ด ์ถ”์ถœ 

TFiDF ๋‹จ์–ด ์ƒ์„ฑ๊ธฐ : TfidfVectorizer  
1. ๋‹จ์–ด ์ƒ์„ฑ๊ธฐ[word tokenizer] : ๋ฌธ์žฅ(sentences) -> ๋‹จ์–ด(word) ์ƒ์„ฑ
2. ๋‹จ์–ด ์‚ฌ์ „[word dictionary] : (word, ๊ณ ์œ ์ˆ˜์น˜)
3. ํฌ์†Œํ–‰๋ ฌ[sparse matrix] : ๋‹จ์–ด ์ถœํ˜„ ๋น„์œจ์— ์˜ํ•ด์„œ ๊ฐ€์ค‘์น˜ ์ ์šฉ ํ–‰๋ ฌ
1) TF ๊ฐ€์ค‘์น˜ : ๋‹จ์–ด์ถœํ˜„๋นˆ๋„์ˆ˜  
2) TFiDF ๊ฐ€์ค‘์น˜ : ๋‹จ์–ด์ถœํ˜„๋นˆ๋„์ˆ˜(TF) x ๋ฌธ์„œ์ถœํ˜„๋นˆ๋„์ˆ˜์˜ ์—ญ์ˆ˜(iDF) 
์‚ฌ์šฉ๋ถ„์•ผ : ๋ฌธ์„œ๋ถ„๋ฅ˜๊ธฐ์—์„œ ์‚ฌ์šฉ๋  ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ   
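To make the TF-IDF weight concrete, the sketch below recomputes one row of weights by hand for the three example sentences used in this post. It assumes scikit-learn's documented defaults for TfidfVectorizer (lowercasing, tokens of two or more word characters, raw counts as TF, smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row). The helper names tokenize and tfidf_row are made up for this illustration and are not part of scikit-learn.

```python
import math
import re
from collections import Counter

docs = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. "
    "Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away "
    "from his office last week.",
]

def tokenize(text):
    # Assumed default tokenization: lowercase, keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency: in how many documents each word occurs
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_row(tokens):
    tf = Counter(tokens)
    # Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
    raw = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, c in tf.items()}
    # L2-normalize the row so that its weights have unit length
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

row = tfidf_row(tokenized[1])  # second sentence
print(row["green"], row["has"])
```

Under these assumptions "green", which occurs in all three sentences, gets a smaller weight (about 0.269) than "has", which occurs in only one (about 0.456). That is the point of the IDF factor: words that appear everywhere carry less information.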


from sklearn.feature_extraction.text import TfidfVectorizer #class



Input text (sentences): 3 sentences

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

print(sentences)

 



1. Word generator [word tokenizer]

tfidf = TfidfVectorizer()
tfidf

 



2. Word dictionary

fit = tfidf.fit(sentences) # fit on the sentences
voca = fit.vocabulary_
print(voca) # {'word': index} - indices follow the alphabetical order of the words
len(voca) #31
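The mapping stored in vocabulary_ can be pictured as "collect the distinct words, sort them, then number them from 0". The following is a minimal sketch of that idea, assuming the default tokenization (lowercase, tokens of two or more word characters); build_vocabulary is a hypothetical helper, not a scikit-learn function.

```python
import re

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. "
    "Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away "
    "from his office last week.",
]

def build_vocabulary(texts):
    # Collect the distinct lowercase tokens of 2+ word characters
    words = {w for t in texts for w in re.findall(r"\b\w\w+\b", t.lower())}
    # Number the words in sorted order, starting at 0
    return {word: index for index, word in enumerate(sorted(words))}

voca = build_vocabulary(sentences)
print(len(voca))                                    # 31
print(voca["away"], voca["green"], voca["with"])    # 0 5 30
```

One-letter tokens such as "a" and the "s" left over from "Plum's" are dropped by this tokenization, which is why the dictionary has 31 entries.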

 



3. Sparse matrix

sp_max = tfidf.fit_transform(sentences)
print(sp_max)

 (doc, word)    TF-IDF weight
  (0, 3) 0.2205828828763741
  (0, 16) 0.2205828828763741
  (0, 25) 0.2205828828763741
  (0, 17) 0.2205828828763741

Convert the SciPy sparse matrix to a dense NumPy array

sp_max_arr = sp_max.toarray()
print(sp_max_arr)

[[0.         0.22058288 0.22058288 0.22058288 0.         0.26055961
  0.         0.         0.         0.16775897 0.22058288 0.22058288
  0.         0.         0.44116577 0.22058288 0.22058288 0.22058288
  0.         0.         0.         0.         0.         0.16775897
  0.44116577 0.22058288 0.         0.         0.         0.
  0.22058288]
 [0.         0.         0.         0.         0.         0.26903992
  0.45552418 0.         0.34643788 0.34643788 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.34643788 0.34643788 0.34643788 0.         0.34643788
  0.         0.         0.         0.         0.         0.
  0.        ]
 [0.27054288 0.         0.         0.         0.27054288 0.15978698
  0.         0.27054288 0.20575483 0.         0.         0.
  0.27054288 0.27054288 0.         0.         0.         0.
  0.27054288 0.20575483 0.20575483 0.20575483 0.27054288 0.
  0.         0.         0.27054288 0.27054288 0.27054288 0.27054288
  0.        ]]

sp_max_arr.shape #(3, 31)
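The difference between the two printouts is only the storage layout: the sparse form lists the non-zero (doc, word) cells, while the dense form stores every cell. A small pure-Python sketch of that conversion follows. It uses only the four entries printed above, so the resulting table is deliberately incomplete, and to_dense is a hypothetical helper rather than SciPy's toarray().

```python
# (doc, word) -> weight entries copied from the printed sparse matrix
entries = {
    (0, 3): 0.2205828828763741,
    (0, 16): 0.2205828828763741,
    (0, 25): 0.2205828828763741,
    (0, 17): 0.2205828828763741,
}
n_docs, n_words = 3, 31

def to_dense(entries, n_rows, n_cols):
    # Start from an all-zero table and fill in only the stored cells
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense

dense = to_dense(entries, n_docs, n_words)
print(f"{len(entries)} stored values vs {n_docs * n_words} dense cells")
```

With a 5,000-word vocabulary and thousands of documents, most cells are zero, which is why the vectorizer returns a sparse matrix by default.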

 

 

 

 

 

Tfidf sparseMatrix

<Steps>
1. Load the csv file
2. Preprocess texts and target
3. max features
4. sparse matrix


import pandas as pd # csv file read 
from sklearn.feature_extraction.text import TfidfVectorizer # sparse matrix



1. Load the csv file

path = 'C:/ITWILL/4_Python-II/workspace/chap07_Classification/data' 
spam_data = pd.read_csv(path + '/temp_spam_data.csv',header=None)

print(spam_data)

      0                        1
0   ham    우리나라    대한민국, 우리나라 만세
1  spam      비아그라 500GRAM 정력 최고!
2   ham               나는 대한민국 사람
3  spam  보험료 15000원에 평생 보장 마감 임박
4   ham                   나는 홍길동

(The sample messages are Korean: rows 0, 2 and 4 are ordinary sentences, rows 1 and 3 are advertising-style spam.)

 



2. Preprocess texts and target
1) target preprocessing: dummy variable (ham = 0, spam = 1)

target = spam_data[0]
target


List comprehension (list + for)

target = [0 if t=='ham' else 1  for t in target]
target #  [0, 1, 0, 1, 0]

import string # used for texts preprocessing
def text_prepro(texts): # a collection of sentences
    # Lower case: convert every sentence to lowercase
    texts = [x.lower() for x in texts]
    # Remove punctuation: drop each character that is ASCII punctuation, then rejoin the sentence
    texts = [''.join(ch for ch in st if ch not in string.punctuation) for st in texts]
    # Remove numbers: drop each digit character, then rejoin the sentence
    texts = [''.join(ch for ch in st if ch not in string.digits) for st in texts]
    # Trim extra whitespace: collapse repeated whitespace into single spaces
    texts = [' '.join(x.split()) for x in texts]
    return texts


2) texts preprocessing: remove unwanted characters (extra whitespace, special characters, punctuation, digits)

texts = spam_data[1]
texts # before preprocessing

texts = text_prepro(texts)
texts # after preprocessing
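To see what the preprocessing does to individual messages, here is a small sketch applied to two of the sample rows. It is an assumed equivalent of text_prepro for a single string, written with str.translate so that punctuation and digits are removed in one pass; clean_text is a hypothetical name.

```python
import string

# One translation table that deletes ASCII punctuation and digits
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def clean_text(text):
    # Lowercase, then delete punctuation and digits in a single pass
    text = text.lower().translate(_DROP)
    # Collapse runs of whitespace into single spaces
    return " ".join(text.split())

samples = [
    "우리나라    대한민국, 우리나라 만세",
    "비아그라 500GRAM 정력 최고!",
]
cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that "500GRAM" becomes "gram": the digits are deleted character by character, so the remaining letters stay behind as a token.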

 



3. max features: the number of words used in the sparse matrix

tfidf = TfidfVectorizer() # word generator
fit = tfidf.fit(texts) # fit on the texts
voca = fit.vocabulary_
print(voca)

len(voca) #16 

max_features = len(voca) # use every word

* max_features = 10: builds the sparse matrix from only the 10 most frequent words (ranked by term frequency across the whole corpus)
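The selection that max_features performs can be sketched as "count each word over the whole corpus and keep the most frequent ones". The example below uses the five sample texts as they look after text_prepro. It splits on whitespace as a simplification, top_terms is a hypothetical helper, and the way ties are broken here is not meant to mirror scikit-learn.

```python
from collections import Counter

# The five sample texts after preprocessing
texts = [
    "우리나라 대한민국 우리나라 만세",
    "비아그라 gram 정력 최고",
    "나는 대한민국 사람",
    "보험료 원에 평생 보장 마감 임박",
    "나는 홍길동",
]

def top_terms(texts, max_features):
    # Count every occurrence of every word across the whole corpus
    counts = Counter(word for t in texts for word in t.split())
    # Keep the max_features most frequent words, returned in sorted order
    kept = [word for word, _ in counts.most_common(max_features)]
    return sorted(kept)

print(top_terms(texts, 3))  # ['나는', '대한민국', '우리나라']
```

Only three words occur twice in this tiny corpus, so they are the ones kept when the limit is 3. The ranking is by raw frequency, not by any measure of how useful a word is for telling ham from spam.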

 



4. sparse matrix

tfidf = TfidfVectorizer(max_features=max_features)
sp_mat = tfidf.fit_transform(texts) 

print(sp_mat)


Convert to a dense NumPy array

sp_mat_arr = sp_mat.toarray()
print(sp_mat_arr)

[[0.         0.         0.33939315 0.         0.42066906 0.
  0.         0.         0.         0.84133812 0.         0.
  0.         0.         0.         0.        ]
 [0.5        0.         0.         0.         0.         0.
  0.         0.5        0.         0.         0.         0.
  0.5        0.5        0.         0.        ]
 [0.         0.53177225 0.53177225 0.         0.         0.
  0.         0.         0.659118   0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.40824829 0.         0.40824829
  0.40824829 0.         0.         0.         0.40824829 0.40824829
  0.         0.         0.40824829 0.        ]
 [0.         0.62791376 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.77828292]]

 

 

 

 

 

Tfidf sparseMatrix2

<Steps>
1. Load the csv file [modified]
2. Preprocess texts and target
3. max features [modified]
4. sparse matrix
5. train/test split [added]
6. file save [added]


import pandas as pd # csv file read 
from sklearn.feature_extraction.text import TfidfVectorizer # sparse matrix



1. Load the csv file [modified]

path = 'C:/ITWILL/4_Python-II/workspace/chap07_Classification/data' 
spam_data = pd.read_csv(path + '/temp_spam_data2.csv',header=None)

print(spam_data)

         0                                                  1
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...

 


2. Preprocess texts and target
1) target preprocessing: dummy variable (ham = 0, spam = 1)

target = spam_data[0]
target
# list comprehension (list + for)
target = [0 if t=='ham' else 1  for t in target]
target[:5] # [0, 0, 1, 0, 0]

import string # used for texts preprocessing
def text_prepro(texts): # a collection of sentences
    # Lower case: convert every sentence to lowercase
    texts = [x.lower() for x in texts]
    # Remove punctuation: drop each character that is ASCII punctuation, then rejoin the sentence
    texts = [''.join(ch for ch in st if ch not in string.punctuation) for st in texts]
    # Remove numbers: drop each digit character, then rejoin the sentence
    texts = [''.join(ch for ch in st if ch not in string.digits) for st in texts]
    # Trim extra whitespace: collapse repeated whitespace into single spaces
    texts = [' '.join(x.split()) for x in texts]
    return texts


2) texts preprocessing: remove unwanted characters (extra whitespace, special characters, punctuation, digits)

texts = spam_data[1]
texts # before preprocessing

texts = text_prepro(texts)
texts # after preprocessing




3. max features: the number of words used in the sparse matrix

tfidf = TfidfVectorizer() # word generator
fit = tfidf.fit(texts) # fit on the texts
voca = fit.vocabulary_
print(voca)

len(voca) #8603 

max_features = 5000 # build the sparse matrix from the 5,000 most frequent words

 

 


4. sparse matrix

tfidf = TfidfVectorizer(max_features=max_features)
sp_mat = tfidf.fit_transform(texts) 

print(sp_mat)



Convert to a dense NumPy array

sp_mat_arr = sp_mat.toarray()
print(sp_mat_arr)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

sp_mat_arr.shape #(5574, 5000)




5. train/test split [added]

from sklearn.model_selection import train_test_split 
import numpy as np 

X_train, X_test, y_train, y_test = train_test_split(
    sp_mat_arr, target, test_size=0.3)

X_train.shape #(3901, 5000)
X_test.shape #(1673, 5000)


list -> numpy 

y_train = np.array(y_train)
y_test = np.array(y_test)
y_train.shape #(3901,)
y_test.shape #(1673,)
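The shapes above follow from simple arithmetic: with test_size=0.3 the test set size is rounded up and the training set takes the remainder. The sketch below reproduces those numbers and shows one way to make a shuffled split repeatable with a fixed seed. split_indices is a hypothetical helper and the seed value is arbitrary; in scikit-learn the same purpose is served by the random_state argument of train_test_split.

```python
import math
import random

n_samples, test_size = 5574, 0.3

# The test size is rounded up; the training set takes the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 3901 1673

def split_indices(n, n_test, seed):
    # Shuffle the row indices with a fixed seed so the split is repeatable
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_samples, n_test, seed=42)
print(len(train_idx), len(test_idx))
```

Because the split in this post is made without a fixed seed, rerunning it gives a different partition each time, so the accuracy figures reported further down will vary slightly from run to run.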

 



6. file save [added]: np.save()

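# The four arrays have different shapes, so pack them into a 1-D object array explicitly;
# recent NumPy versions raise an error when asked to build one from a plain tuple.
spam_train_test = np.empty(4, dtype=object)
for i, arr in enumerate((X_train, X_test, y_train, y_test)):
    spam_train_test[i] = arr


Syntax: np.save(file, array)

np.save(path + "/spam_train_test.npy", spam_train_test)


Syntax: np.load(file)

X_train,X_test,y_train,y_test = np.load(path + "/spam_train_test.npy",allow_pickle=True)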
X_train.shape #(3901, 5000)
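Storing several arrays of different shapes in one .npy file relies on pickling, which is why allow_pickle=True is needed when loading. As a possible alternative, np.savez stores each array under its own name in a single .npz archive. The sketch below uses small stand-in arrays and a temporary directory; the shapes and the file location are assumptions for illustration, not the ones used in this post.

```python
import os
import tempfile

import numpy as np

# Small stand-in arrays (the real ones are (3901, 5000), (1673, 5000), ...)
X_train, X_test = np.zeros((7, 4)), np.zeros((3, 4))
y_train, y_test = np.array([0, 1, 0, 0, 1, 0, 0]), np.array([0, 1, 0])

out_file = os.path.join(tempfile.mkdtemp(), "spam_train_test.npz")

# Each array is saved under its keyword name; no pickling is involved
np.savez(out_file, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)

# Load the archive and pull the arrays back out by name
with np.load(out_file) as data:
    X_train2, y_train2 = data["X_train"], data["y_train"]

print(X_train2.shape, y_train2.shape)  # (7, 4) (7,)
```

Loading by name also removes the need to remember the order in which the four arrays were packed.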

 

 

 

 

 

ham spam classifier

Document classifier
NB vs SVM 
NB : fast to train
SVM : higher accuracy


import numpy as np #np.load()
from sklearn.naive_bayes import MultinomialNB #nb model 
from sklearn.svm import SVC #svm model 
from sklearn.metrics import accuracy_score, confusion_matrix # evaluation
import time # timing

  

1. dataset load 

path = 'C:/ITWILL/4_Python-II/workspace/chap07_Classification/data' 
X_train,X_test,y_train,y_test = np.load(path + "/spam_train_test.npy",allow_pickle=True)


input data : X -> TF-IDF feature matrix

X_train.shape #(3901, 5000)
X_test.shape #(1673, 5000)


output data : y - dummy(0 or 1)

y_train[:10] #array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])




NB model 

start = time.time()
nb_model = MultinomialNB().fit(X=X_train, y=y_train)
end = time.time() - start # elapsed training time in seconds
print('Elapsed time : ', end)

y_pred = nb_model.predict(X = X_test)
y_true = y_test

acc = accuracy_score(y_true, y_pred)
print('NB model accuracy :', acc)




SVM model 

start = time.time()
svm_model = SVC(kernel='linear').fit(X=X_train, y=y_train)
end = time.time() - start # elapsed training time in seconds
print('Elapsed time : ', end)

y_pred = svm_model.predict(X = X_test)
y_true = y_test

acc = accuracy_score(y_true, y_pred)
print('SVM model accuracy :', acc)

Elapsed time :  0.10205936431884766
NB model accuracy : 0.9575612671846981
Elapsed time :  7.592423677444458
SVM model accuracy : 0.9760908547519426
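One thing worth keeping in mind about the workflow in this post: the TF-IDF vocabulary and IDF weights were fitted on all messages before the train/test split, so the test rows had some influence on the features. A common way to avoid that is to put the vectorizer and the classifier in one pipeline and fit it on the training texts only. The sketch below shows the idea with a tiny made-up corpus; the texts and labels are invented for illustration and are not from the spam dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only
train_texts = [
    "free prize win now",
    "win cash free entry",
    "see you at lunch",
    "are you home yet",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
test_texts = ["free cash prize", "lunch at home"]

# The vectorizer is fitted on the training texts only,
# then reused unchanged to transform the test texts
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

pred = model.predict(test_texts)
print(pred)
```

The pipeline also keeps the feature matrix sparse between the two steps, so there is no need to call toarray() before training.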

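Class imbalance: the test set has far more ham (1,438) than spam (235), so accuracy alone is not enough; check the confusion matrix of the SVM predictions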

con_mat = confusion_matrix(y_true, y_pred)
print(con_mat)

          predicted 0  predicted 1    row total
actual 0  [[1437            1]      = 1438
actual 1   [  39          196]]     = 235

Precision: of the messages predicted as yes (1), the share that are actually yes (1)

p = con_mat[1,1] / con_mat[:,1].sum() #0.9949238578680203


Recall = sensitivity: of the observed YES (1) messages, the share predicted as YES (1)

r = con_mat[1,1] / con_mat[1,:].sum() #0.8340425531914893


f1 score: harmonic mean of precision and recall

f1_score = 2 * ((p*r) / (p+r))
print('f1 score =', f1_score) #f1 score = 0.9074074074074074
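The same figures can be checked without NumPy by naming the four cells of the confusion matrix. This sketch assumes the layout used by scikit-learn's confusion_matrix (rows are the actual classes, columns are the predicted classes) and copies the counts from the SVM run above.

```python
# Confusion matrix from the SVM run: rows = actual, columns = predicted
con_mat = [[1437, 1],
           [39, 196]]

tn, fp = con_mat[0]  # actual ham: correctly kept, wrongly flagged
fn, tp = con_mat[1]  # actual spam: missed, correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the messages predicted spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.9761 precision=0.9949 recall=0.8340 f1=0.9074
```

Accuracy looks high mainly because ham dominates the test set; the recall of about 0.83 shows that 39 of the 235 spam messages still got through. scikit-learn also ships precision_score, recall_score, f1_score and classification_report in sklearn.metrics for the same calculations.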