DAY52. Python Classification (2) TF-IDF Sparse Matrix (Word Extraction)

LEE_BOMB 2021. 12. 3. 17:32

Tfidf Vectorizer

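Word generator (TfidfVectorizer): sentence -> word extraction

What TfidfVectorizer does
1. Word tokenizer: splits sentences into words
2. Word dictionary: (word, unique index) pairs
3. Sparse matrix: a document-by-word matrix whose entries are weighted by how often each word occurs
1) TF weight: term frequency, i.e. how many times a word occurs in a document
2) TF-IDF weight: term frequency (TF) x inverse document frequency (IDF)
Use case: text preprocessing for a document classifier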
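
The weighting in item 3 can be made concrete with a small hand computation. The sketch below assumes scikit-learn's documented defaults for TfidfVectorizer (smooth_idf=True, so idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The toy documents and the helper name tfidf_rows are made up for illustration, and tokenization is a plain whitespace split rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already lowercased and free of punctuation
docs = ["green plant study", "green plant", "green office"]

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2 row normalization (assumed sklearn defaults)."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})  # alphabetical word index
    n = len(docs)
    # document frequency: number of documents that contain each word
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)                                  # raw term counts
        raw = [tf[w] * idf[w] for w in vocab]              # TF x IDF
        norm = math.sqrt(sum(v * v for v in raw))          # L2 length of the row
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 4) for v in row])
```

"green" appears in every document, so its IDF is the minimum value 1 and it receives the smallest non-zero weight in each row, while rarer words such as "study" are weighted up.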

from sklearn.feature_extraction.text import TfidfVectorizer #class

Sentences: 3 sample sentences

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

print(sentences)

1. Word tokenizer

tfidf = TfidfVectorizer()
tfidf

2. Word dictionary

fit = tfidf.fit(sentences) #fit on the sentences
voca = fit.vocabulary_
print(voca) #{'word': index} - indices follow the alphabetical order of the words
len(voca) #31

3. Sparse matrix

sp_max = tfidf.fit_transform(sentences)
print(sp_max)

(doc, word)    TF-IDF weight
  (0, 3) 0.2205828828763741
  (0, 16) 0.2205828828763741
  (0, 25) 0.2205828828763741
  (0, 17) 0.2205828828763741
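
The (doc, word) pairs printed above are column indices, which are hard to read on their own. A minimal sketch of mapping them back to words follows; the two toy sentences and the names index_to_word and entries are illustrative assumptions, not part of the original example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # hypothetical toy sentences
    "green plant in the study",
    "green plant in the office",
]
tfidf = TfidfVectorizer()
sp = tfidf.fit_transform(docs)

# invert vocabulary_ ({'word': column}) into a list indexed by column
index_to_word = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

coo = sp.tocoo()  # COO format exposes parallel row, col and data arrays
entries = [(int(d), index_to_word[c], float(v))
           for d, c, v in zip(coo.row, coo.col, coo.data)]
for doc_id, word, weight in entries:
    print(doc_id, word, round(weight, 4))
```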

Convert the SciPy sparse matrix to a dense NumPy array

sp_max_arr = sp_max.toarray()
print(sp_max_arr)

[[0.         0.22058288 0.22058288 0.22058288 0.         0.26055961
  0.         0.         0.         0.16775897 0.22058288 0.22058288
  0.         0.         0.44116577 0.22058288 0.22058288 0.22058288
  0.         0.         0.         0.         0.         0.16775897
  0.44116577 0.22058288 0.         0.         0.         0.
  0.22058288]
[0.         0.         0.         0.         0.         0.26903992
  0.45552418 0.         0.34643788 0.34643788 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.34643788 0.34643788 0.34643788 0.         0.34643788
  0.         0.         0.         0.         0.         0.
  0.        ]
[0.27054288 0.         0.         0.         0.27054288 0.15978698
  0.         0.27054288 0.20575483 0.         0.         0.
  0.27054288 0.27054288 0.         0.         0.         0.
  0.27054288 0.20575483 0.20575483 0.20575483 0.27054288 0.
  0.         0.         0.27054288 0.27054288 0.27054288 0.27054288
  0.        ]]

sp_max_arr.shape #(3, 31)

Tfidf sparseMatrix

<Workflow>
1. Load the csv file
2. Preprocess texts and target
3. max features
4. sparse matrix

import pandas as pd # csv file read 
from sklearn.feature_extraction.text import TfidfVectorizer #sparse matrix

1. Load the csv file

path = 'C:/ITWILL/4_Python-II/workspace/chap07_Classification/data' 
spam_data = pd.read_csv(path + '/temp_spam_data.csv',header=None)

print(spam_data)

      0                        1
0   ham    우리나라    대한민국, 우리나라 만세
1  spam      비아그라 500GRAM 정력 최고!
2   ham               나는 대한민국 사람
3  spam  보험료 15000원에 평생 보장 마감 임박
4   ham                   나는 홍길동

2. Preprocess texts and target
1) target preprocessing: dummy variable (ham -> 0, spam -> 1)

target = spam_data[0]
target

list comprehension

target = [0 if t=='ham' else 1  for t in target]
target #  [0, 1, 0, 1, 0]
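
One caveat with the comprehension above: every value other than 'ham' becomes 1, including typos or labels with stray whitespace. A minimal sketch of a stricter encoding follows, assuming the only valid labels are 'ham' and 'spam'; the helper encode_labels and the sample list are hypothetical.

```python
# Hypothetical label column, standing in for spam_data[0]
labels = ['ham', 'spam', 'ham', 'spam', 'ham']

label_to_id = {'ham': 0, 'spam': 1}  # explicit mapping (assumed label set)

def encode_labels(labels, mapping=label_to_id):
    """Return integer targets; raise on any label outside the mapping."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return [mapping[t] for t in labels]

target = encode_labels(labels)
print(target)  # [0, 1, 0, 1, 0]
```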

import string #texts preprocessing
def text_prepro(texts): #texts: a sequence of sentences
    #Lower case: lowercase each sentence (only affects English letters)
    texts = [x.lower() for x in texts]
    #Remove punctuation: drop every character found in string.punctuation
    texts = [''.join(ch for ch in st if ch not in string.punctuation) for st in texts]
    #Remove numbers: drop every digit character
    texts = [''.join(ch for ch in st if ch not in string.digits) for st in texts]
    #Trim extra whitespace: collapse runs of whitespace into single spaces
    texts = [' '.join(x.split()) for x in texts]
    return texts
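
The same cleaning steps can be done in one pass per sentence with str.translate. This is only a sketch of an equivalent variant; the name text_prepro_fast and the sample sentences are hypothetical. Note that string.punctuation and string.digits cover ASCII characters only, so Korean text passes through unchanged.

```python
import string

# One translation table that deletes ASCII punctuation and digits
_DROP = str.maketrans('', '', string.punctuation + string.digits)

def text_prepro_fast(texts):
    """Lowercase, strip punctuation and digits, then squeeze whitespace."""
    return [' '.join(x.lower().translate(_DROP).split()) for x in texts]

samples = ["Free entry in 2 a wkly comp!!", "보험료 15000원에   평생 보장"]
print(text_prepro_fast(samples))
# ['free entry in a wkly comp', '보험료 원에 평생 보장']
```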

2) texts preprocessing: remove noise (extra whitespace, special characters, punctuation, digits)

texts = spam_data[1]
texts #before preprocessing

texts = text_prepro(texts)
texts #after preprocessing

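3. max features: number of words used in the sparse matrix

tfidf = TfidfVectorizer() #word generator
fit = tfidf.fit(texts) #fit on the texts
voca = fit.vocabulary_
print(voca)

len(voca) #16 

max_features = len(voca) #use every word

* max_features = 10: build the sparse matrix from only the 10 words with the highest term frequency across the corpus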
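
To see what max_features keeps, the sketch below fits the vectorizer twice on a tiny made-up corpus. scikit-learn documents that max_features keeps the top terms ordered by term frequency across the corpus; the documents and the names full and top2 are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # hypothetical corpus: apple x4, banana x2, cherry x1, date x1
    "apple apple banana",
    "apple banana cherry",
    "apple date",
]

full = TfidfVectorizer().fit(docs)                # no limit: every word is kept
top2 = TfidfVectorizer(max_features=2).fit(docs)  # only the 2 most frequent words

print(sorted(full.vocabulary_))
print(sorted(top2.vocabulary_))
```

The second vocabulary contains only "apple" and "banana", so "important" here means frequent, not high TF-IDF weight.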

4. sparse matrix

tfidf = TfidfVectorizer(max_features=max_features)
sp_mat = tfidf.fit_transform(texts) 

print(sp_mat)

numpy matrix

sp_mat_arr = sp_mat.toarray()
print(sp_mat_arr)

[[0.         0.         0.33939315 0.         0.42066906 0.
  0.         0.         0.         0.84133812 0.         0.
  0.         0.         0.         0.        ]
[0.5        0.         0.         0.         0.         0.
  0.         0.5        0.         0.         0.         0.
  0.5        0.5        0.         0.        ]
[0.         0.53177225 0.53177225 0.         0.         0.
  0.         0.         0.659118   0.         0.         0.
  0.         0.         0.         0.        ]
[0.         0.         0.         0.40824829 0.         0.40824829
  0.40824829 0.         0.         0.         0.40824829 0.40824829
  0.         0.         0.40824829 0.        ]
[0.         0.62791376 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.77828292]]

Tfidf sparseMatrix2

<Workflow>
1. Load the csv file [changed]
2. Preprocess texts and target
3. max features [changed]
4. sparse matrix
5. train/test split [added]
6. file save [added]

import pandas as pd # csv file read 
from sklearn.feature_extraction.text import TfidfVectorizer #sparse matrix

1. Load the csv file [changed]

path = 'C:/ITWILL/4_Python-II/workspace/chap07_Classification/data' 
spam_data = pd.read_csv(path + '/temp_spam_data2.csv',header=None)

print(spam_data)

         0                                                  1
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...

2. Preprocess texts and target
1) target preprocessing: dummy variable (ham -> 0, spam -> 1)

target = spam_data[0]
target
#list comprehension
target = [0 if t=='ham' else 1  for t in target]
target #[0, 0, 1, 0, 0, ...] (5,574 values)

import string #texts preprocessing
def text_prepro(texts): #texts: a sequence of sentences
    #Lower case: lowercase each sentence
    texts = [x.lower() for x in texts]
    #Remove punctuation: drop every character found in string.punctuation
    texts = [''.join(ch for ch in st if ch not in string.punctuation) for st in texts]
    #Remove numbers: drop every digit character
    texts = [''.join(ch for ch in st if ch not in string.digits) for st in texts]
    #Trim extra whitespace: collapse runs of whitespace into single spaces
    texts = [' '.join(x.split()) for x in texts]
    return texts

2) texts preprocessing: remove noise (extra whitespace, special characters, punctuation, digits)

texts = spam_data[1]
texts #before preprocessing

texts = text_prepro(texts)
texts #after preprocessing

3. max features: number of words used in the sparse matrix

tfidf = TfidfVectorizer() #word generator
fit = tfidf.fit(texts) #fit on the texts
voca = fit.vocabulary_
print(voca)

len(voca) #8603 

max_features = 5000 #keep the 5,000 most frequent words for the sparse matrix

4. sparse matrix

tfidf = TfidfVectorizer(max_features=max_features)
sp_mat = tfidf.fit_transform(texts) 

print(sp_mat)

numpy matrix

sp_mat_arr = sp_mat.toarray()
print(sp_mat_arr)

[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]

sp_mat_arr.shape #(5574, 5000)
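
Calling .toarray() stores every zero explicitly, so it is worth estimating the cost. The arithmetic below uses the shape reported above; the sparse estimate rests on an assumed average of 15 non-zero weights per message, which is an illustrative guess and not a measured value.

```python
rows, cols = 5574, 5000      # shape reported above
bytes_per_value = 8          # float64, the dtype TfidfVectorizer produces by default

dense_bytes = rows * cols * bytes_per_value
print(dense_bytes / 1e6, "MB")   # 222.96 MB

# Rough CSR estimate. ASSUMPTION: about 15 non-zero weights per message
# (illustrative guess, not measured) and 4-byte integer indices.
nnz_per_row = 15
sparse_bytes = rows * nnz_per_row * (bytes_per_value + 4) + (rows + 1) * 4
print(round(sparse_bytes / 1e6, 2), "MB")   # 1.03 MB
```

train_test_split, MultinomialNB and SVC all accept SciPy sparse input, so the dense conversion is optional when memory is tight.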

5. train/test split [added]

from sklearn.model_selection import train_test_split 
import numpy as np 

X_train, X_test, y_train, y_test = train_test_split(
    sp_mat_arr, target, test_size=0.3)

X_train.shape #(3901, 5000)
X_test.shape #(1673, 5000)

list -> numpy

y_train = np.array(y_train)
y_test = np.array(y_test)
y_train.shape #(3901,)
y_test.shape #(1673,)
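
The split above is random on every run and does not control the ham/spam ratio in each part. A hedged sketch of a reproducible, stratified split follows; the stand-in data and the seed value 123 are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 20% positives
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,        # keep the class ratio the same in train and test
    random_state=123,  # fixed seed so the split is repeatable
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Because the inputs here are already NumPy arrays, the returned y parts are arrays as well and need no separate np.array() conversion.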

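6. file save [added]: np.save()

#the four arrays have different shapes, so pack them into an object array explicitly
#(NumPy 1.24 and later raise ValueError when np.save is given the bare tuple)
spam_train_test = np.empty(4, dtype=object)
for i, arr in enumerate((X_train, X_test, y_train, y_test)):
    spam_train_test[i] = arr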

np.save("file", object)

np.save(path + "/spam_train_test.npy", spam_train_test)

np.load("file")

X_train,X_test,y_train,y_test = np.load(path + "/spam_train_test.npy",allow_pickle=True)
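
An alternative that avoids pickling is to store the arrays by name in a single .npz archive. The sketch below uses small stand-in arrays and a temporary directory; the file name and the variable names ending in 2 are hypothetical.

```python
import os
import tempfile
import numpy as np

# Hypothetical small arrays standing in for the real train/test split
X_train = np.arange(28, dtype=float).reshape(7, 4)
X_test = np.arange(12, dtype=float).reshape(3, 4)
y_train = np.array([0, 1, 0, 0, 1, 0, 0])
y_test = np.array([0, 1, 0])

out_file = os.path.join(tempfile.mkdtemp(), "spam_train_test.npz")

# each keyword becomes a named array inside the archive
np.savez_compressed(out_file, X_train=X_train, X_test=X_test,
                    y_train=y_train, y_test=y_test)

with np.load(out_file) as data:  # plain numeric arrays need no allow_pickle
    X_train2 = data["X_train"]
    y_test2 = data["y_test"]

print(X_train2.shape, y_test2.shape)  # (7, 4) (3,)
```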
X_train.shape #(3901, 5000)

ham spam classifier

Document classifier
NB vs SVM
NB: fast to compute
SVM: higher accuracy

import numpy as np #np.load()
from sklearn.naive_bayes import MultinomialNB #nb model 
from sklearn.svm import SVC #svm model 
from sklearn.metrics import accuracy_score, confusion_matrix #evaluation
import time #timing

1. dataset load

path = 'C:/ITWILL/4_Python-II/workspace/chap07_Classification/data' 
X_train,X_test,y_train,y_test = np.load(path + "/spam_train_test.npy",allow_pickle=True)

input data : X -> sparse matrix (TF-IDF weights, stored as a dense array)

X_train.shape #(3901, 5000)
X_test.shape #(1673, 5000)

output data : y - dummy(0 or 1)

y_train[:10] #array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

NB model

start = time.time()
nb_model = MultinomialNB().fit(X=X_train, y=y_train)
end = time.time() - start
print('Training time :', end)

y_pred = nb_model.predict(X = X_test)
y_true = y_test

acc = accuracy_score(y_true, y_pred)
print('NB model accuracy :', acc)

SVM model

start = time.time()
svm_model = SVC(kernel='linear').fit(X=X_train, y=y_train)
end = time.time() - start
print('Training time :', end)

y_pred = svm_model.predict(X = X_test)
y_true = y_test

acc = accuracy_score(y_true, y_pred)
print('SVM model accuracy :', acc)

Training time : 0.10205936431884766
NB model accuracy : 0.9575612671846981
Training time : 7.592423677444458
SVM model accuracy : 0.9760908547519426
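
In the workflow above the vectorizer was fitted on all messages before the split, so the vocabulary and IDF values also reflect the test messages. One way to keep them separate is to fit the vectorizer and the classifier together on the training texts only. The following is a sketch using scikit-learn's make_pipeline; the toy messages, labels and the name clf are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training messages (1 = spam, 0 = ham)
train_texts = [
    "free entry win cash prize",
    "win free ticket now",
    "ok see you at lunch",
    "are you coming home tonight",
]
train_y = [1, 1, 0, 0]

# the vectorizer learns its vocabulary and IDF from the training texts only
clf = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
clf.fit(train_texts, train_y)

# unseen messages are transformed with the already-fitted vocabulary
new_texts = ["free cash prize", "see you tonight"]
pred = clf.predict(new_texts)
print(pred)
```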

Class imbalance: confusion matrix of the SVM predictions

con_mat = confusion_matrix(y_true, y_pred)
print(con_mat)

                  predicted 0   predicted 1   row total
actual 0 (ham)           1437             1        1438
actual 1 (spam)            39           196         235

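Precision: of the messages predicted as yes(1), the share that are actually yes(1)

p = con_mat[1,1] / con_mat[:,1].sum() #0.9949238578680203

Recall (sensitivity): of the messages that are actually yes(1), the share predicted as yes(1)

r = con_mat[1,1] / con_mat[1,:].sum() #0.8340425531914893

f1 score: harmonic mean of precision and recall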

f1_score = 2 * ((p*r) / (p+r))
print('f1 score =', f1_score) #f1 score = 0.9074074074074074
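
As a cross-check, the same metrics can be recomputed from the four counts in the confusion matrix above without any library. The variable names tn, fp, fn and tp are introduced here for illustration.

```python
# Counts taken from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 1437, 1
fn, tp = 39, 196

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # how many predicted spam messages really are spam
recall = tp / (tp + fn)      # how many real spam messages were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9761 0.9949 0.834 0.9074
```

Accuracy alone looks high because ham dominates the test set; recall shows that 39 of the 235 spam messages were missed.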