Text vectorization?
- A preprocessing step that converts text into numeric vectors
- Deep learning models can only process numeric data
Methods
Text → words → word vector conversion
Text → characters → character vector conversion
Text → N-grams (groups of words or characters) → N-gram vector conversion
* N-gram: a group of N consecutive words or characters (extracted by sliding over the text one word or character at a time); a small sketch follows below.
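A minimal sketch of word-level N-gram extraction, assuming whitespace tokenization (the helper name is illustrative, not from the original notes):

```python
# Slide a window of n words over the text, moving one word at a time.
def word_ngrams(text, n=2):
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams('the dog sat on the table', n=2))
# ['the dog', 'dog sat', 'sat on', 'on the', 'the table']
```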
Vector conversion methods (token → numeric vector)

| | One-hot encoding (sparse matrix) | Word embedding (dense matrix) |
|---|---|---|
| Definition | Distinguishes each item with a single 1 among many 0s; most values of the vector or matrix are 0 (sparse matrix) | Represents words as vectors in a more compact space that reflects word meaning; stores more information in fewer dimensions (dense matrix) |
| Characteristics | 1) Easy to encode, simple representation 2) High-dimensional when the vocabulary is large (word count = dimension count) | 1) The vector dimension is not the vocabulary size but a user-set value (64, 128, 256, … 1024); in the process the 0/1 entries become real numbers 2) The dense vectors are learned by the deep learning optimization algorithm |
| Representation | Manual | Learned from training data |
| Problems / improvements | Ignores word meaning entirely; the vector length equals the total word count, so the representation is very sparse and the many zeros waste space | Even with little data, a pretrained embedding can be reused to achieve high performance |
| Types | | LSA, Word2Vec, GloVe, FastText, etc. |
One-hot vector vs. word embedding vector
- e.g., with 5574 sentences in total and a vocabulary of 8630 words (a shape comparison sketch follows below)
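To make the size difference concrete, a rough shape comparison using those numbers (the embedding dimension and padded length are illustrative choices):

```python
# One-hot / sparse layout: one column per vocabulary word (+1 for the reserved index 0).
n_docs, vocab = 5574, 8630
sparse_shape = (n_docs, vocab)             # (5574, 8630), mostly zeros

# Embedding / dense layout: each word index maps to a small real-valued vector.
maxlen, embed_dim = 158, 64                # padded sentence length, user-chosen dimension
dense_shape = (n_docs, maxlen, embed_dim)  # (5574, 158, 64), every entry carries information
```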

Feature extraction for supervised learning
data input (label) -> feature extraction (0 or 1) -> algorithm -> model
TF-IDF (Term Frequency - Inverse Document Frequency) weighting
A weighting scheme used in information retrieval and text mining.
Given a collection of documents, it is a statistical measure of how important a word is within a particular document.
ex) To build a spam mail classifier, the text fed to the model is weighted by each word's occurrence ratio per document to build a sparse matrix, which is then used as the model input:
```python
one_hot_encoding = token.texts_to_matrix(texts, mode='tfidf')
```
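For intuition, a minimal hand-rolled version of the classic tf*idf weighting (Keras's exact smoothed variant is worked out further below):

```python
import math

# Classic tf*idf: term frequency within the document times log(total docs / docs containing the term).
docs = [['the', 'dog', 'sat', 'on', 'the', 'table'],
        ['the', 'dog', 'is', 'my', 'poodle']]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)         # term frequency in this document
    df = sum(1 for d in docs if word in d)  # number of documents containing the word
    return tf * math.log(len(docs) / df)

print(tf_idf('the', docs[0], docs))  # 0.0     -> appears in every doc, carries no signal
print(tf_idf('sat', docs[0], docs))  # ~0.1155 -> unique to doc 0, weighted up
```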
Word embedding pipeline
word → unique word integer → Embedding layer → embedding vector
Every word of an input sequence fed to the Embedding layer is first converted to an integer index, e.g. when
{'the' : 1, 'cat' : 2, 'sat' : 3, 'on' : 4, 'mat' : 5, 'dog' : 6, 'ate' : 7, 'my' : 8, 'homework' : 9, … }
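A minimal sketch of what the Embedding layer then does with those indices (the dimensions here are illustrative):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# 10 possible indices (0-9), each mapped to a trainable 4-dimensional dense vector.
embedding = Embedding(input_dim=10, output_dim=4)

seq = np.array([[1, 2, 3, 4, 5]])  # 'the cat sat on mat' as integer indices
vectors = embedding(seq)           # table lookup: index -> dense vector
print(vectors.shape)               # (1, 5, 4) - (batch, sequence length, embedding dim)
```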

Vectorizing
Procedure
Step 1: token generation: text -> word extraction
Step 2: word index: word -> unique integer (stands in for the word)
Step 3: padding: equalize the word length of each sentence (based on maxlen)
Step 4: encoding: data to feed the deep learning model (numeric vectors)
```python
from tensorflow.keras.preprocessing.text import Tokenizer         # token generator
from tensorflow.keras.preprocessing.sequence import pad_sequences # padding
from tensorflow.keras.utils import to_categorical                 # one-hot encoding
```
see text_sample.txt
```python
texts = ['The dog sat on the table.', 'The dog is my Poodle.']

# token generator
tokenizer = Tokenizer()

# Step 1: token generation
tokenizer.fit_on_texts(texts)
token = tokenizer.word_index  # token dict
print(token)  # {'word': unique integer} : unique integer -> word index
'''
{'the': 1, 'dog': 2, 'sat': 3, 'on': 4, 'table': 5, 'is': 6, 'my': 7, 'poodle': 8}
'''
print('total vocabulary size =', len(token))  # total vocabulary size = 8
```
Step 2: word index -> integer conversion (word-order index)
```python
seq_vector = tokenizer.texts_to_sequences(texts)
print(seq_vector)
'''
[[1, 2, 3, 4, 1, 5], [1, 2, 6, 7, 8]]
'''
```
Step 3: padding: based on maxlen
```python
lens = [len(sent) for sent in seq_vector]
maxlen = max(lens)
print(maxlen)  # 6
```
max length: the maximum number of words per sentence
ex) maxlen = 10: every sentence is fitted to 10 words
-> shorter sentences are padded with 0
-> words beyond 10 are truncated
```python
padding = pad_sequences(seq_vector, maxlen=maxlen)
print(padding)
'''
[[1 2 3 4 1 5]   - 6 words
 [0 1 2 6 7 8]]  - 6 words
'''
```
Step 4: encoding: one-hot encoding (binary)
```python
one_hot = to_categorical(padding)
print(one_hot)
'''
[[[0. 1. 0. 0. 0. 0. 0. 0. 0.]    - the
  [0. 0. 1. 0. 0. 0. 0. 0. 0.]    - dog
  [0. 0. 0. 1. 0. 0. 0. 0. 0.]    - sat
  [0. 0. 0. 0. 1. 0. 0. 0. 0.]    - on
  [0. 1. 0. 0. 0. 0. 0. 0. 0.]    - the
  [0. 0. 0. 0. 0. 1. 0. 0. 0.]]   - table

 [[1. 0. 0. 0. 0. 0. 0. 0. 0.]    - padding
  [0. 1. 0. 0. 0. 0. 0. 0. 0.]    - the
  [0. 0. 1. 0. 0. 0. 0. 0. 0.]    - dog
  [0. 0. 0. 0. 0. 0. 1. 0. 0.]    - is
  [0. 0. 0. 0. 0. 0. 0. 1. 0.]    - my
  [0. 0. 0. 0. 0. 0. 0. 0. 1.]]]  - Poodle
'''
one_hot.shape  # (2, 6, 9) - (docs, words per doc, vocabulary size + 1)
```
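As a quick sanity check (not in the original notes), the one-hot rows can be mapped back to words with the tokenizer built above:

```python
import numpy as np

# Recover the integer indices from the one-hot rows and look up the words.
indices = np.argmax(one_hot[0], axis=1)            # [1 2 3 4 1 5]
print([tokenizer.index_word[i] for i in indices])  # ['the', 'dog', 'sat', 'on', 'the', 'table']
```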
feature extract
1. text -> feature extraction
-> texts -> sparse matrix: data fed to the deep learning model
-> weighting methods: occurrence (binary), frequency (count), ratio (freq), tf*idf (term frequency * 1/document frequency)
2. num_words
- sets the dimension of the sparse matrix (number of words)
ex) num_words=500: keep the 500 most important words out of the full vocabulary
```python
from tensorflow.keras.preprocessing.text import Tokenizer  # token generator
```
see text_sample.txt
```python
texts = ['The dog sat on the table.', 'The dog is my Poodle.']

# token generator
tokenizer = Tokenizer()

# Step 1: token generation
tokenizer.fit_on_texts(texts)
token = tokenizer.word_index  # token dict
print(token)  # {'word': unique integer} : unique integer -> word index
'''
{'the': 1, 'dog': 2, 'sat': 3, 'on': 4, 'table': 5, 'is': 6, 'my': 7, 'poodle': 8}
'''
print('total vocabulary size =', len(token))  # total vocabulary size = 8
```
1. Sparse matrix: texts -> feature extraction
1) word occurrence (binary)
```python
binary_mat = tokenizer.texts_to_matrix(texts=texts, mode='binary')
print(binary_mat)  # DTM (document-term matrix)
'''
[[0. 1. 1. 1. 1. 1. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 1. 1. 1.]]
'''
binary_mat.shape  # (2, 9) - (docs, words + 1)
```
2) word frequency (count)
```python
count_mat = tokenizer.texts_to_matrix(texts=texts, mode='count')
print(count_mat)
'''
[[0. 2. 1. 1. 1. 1. 0. 0. 0.]   - the : 2
 [0. 1. 1. 0. 0. 0. 1. 1. 1.]]
'''
```
3) word occurrence ratio (freq)
```python
freq_mat = tokenizer.texts_to_matrix(texts=texts, mode='freq')
print(freq_mat)
'''
[[0.         0.33333333 0.16666667 0.16666667 0.16666667 0.16666667
  0.         0.         0.        ]
 [0.         0.2        0.2        0.         0.         0.
  0.2        0.2        0.2       ]]
'''
# the = 0.3333 -> 2/6 (2 occurrences among 6 words)
# dog = 0.1666 -> 1/6
```
4) tf*idf weighting: tf * idf (total docs / docs containing the word)
```python
tfidf_mat = tokenizer.texts_to_matrix(texts=texts, mode='tfidf')
print(tfidf_mat)  # the : 0.8649 in doc 1 vs 0.5108 in doc 2
'''
[[0.         0.86490296 0.51082562 0.69314718 0.69314718 0.69314718
  0.         0.         0.        ]
 [0.         0.51082562 0.51082562 0.         0.         0.
  0.69314718 0.69314718 0.69314718]]
'''
# rough intuition with classic tf*idf: tf = 0.333;
# tf * (2/1) = 0.666 for a word in one document, tf * (2/2) = 0.333 for a word in both
```
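The exact values above follow the smoothed variant used by the keras_preprocessing Tokenizer; a small sketch reproducing them (the formula is my reading of the library source, so treat it as an assumption):

```python
import math

# Assumed keras_preprocessing 'tfidf' mode: tf = 1 + ln(count),
# idf = ln(1 + num_docs / (1 + docs_containing_word))
def keras_tfidf(count, num_docs, docs_with_word):
    tf = 1 + math.log(count)
    idf = math.log(1 + num_docs / (1 + docs_with_word))
    return tf * idf

print(keras_tfidf(2, 2, 2))  # 'the' in doc 1 -> 0.8649...
print(keras_tfidf(1, 2, 2))  # 'dog'          -> 0.5108...
print(keras_tfidf(1, 2, 1))  # 'sat'          -> 0.6931...
```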
2. num_words: limits the number of words in the sparse matrix
```python
# token generator: keep the 5 most frequent words (num_words = word count + 1)
tokenizer = Tokenizer(num_words=6)
tokenizer.fit_on_texts(texts)  # fit on texts
tfidf_max = tokenizer.texts_to_matrix(texts, mode='tfidf')
print(tfidf_max)
'''
[[0.         0.86490296 0.51082562 0.69314718 0.69314718 0.69314718]
 [0.         0.51082562 0.51082562 0.         0.         0.        ]]
'''
tfidf_max.shape  # (2, 6) - (docs, words + 1)
```
feature classifier
- sparse matrix (encoding) + DNN model
<Procedure>
1. csv file read
2. preprocess texts and label (0 or 1)
3. select num_words = 4000
4. texts -> sparse matrix: feature extraction
5. train/val split
6. DNN model
```python
import pandas as pd  # csv file read
import numpy as np   # list -> array
import string        # text preprocessing
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer  # token generator

# DNN model
from tensorflow.keras.models import Sequential  # model
from tensorflow.keras.layers import Dense       # layer

path = r'C:\ITWILL\5_Tensorflow\workspace\chap08_TextVectorizing_RNN\data'
```
1. csv file read
```python
spam_data = pd.read_csv(path + '/temp_spam_data2.csv', header=None)
spam_data.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       5574 non-null   object  - label (ham/spam) : y variable
 1   1       5574 non-null   object  - texts : x variable
'''
```
```python
label = spam_data[0]
texts = spam_data[1]

label.value_counts()
'''
ham     4827  -> 0
spam     747  -> 1
'''
texts
```
2. preprocess texts and label (0 or 1)
1) label preprocessing
```python
label = [1 if lab == 'spam' else 0 for lab in label]  # list comprehension
label[:10]  # [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# list -> numpy array
label = np.array(label)
label.shape  # (5574,)
```
2) texts preprocessing
```python
def text_prepro(texts):  # see text_sample.txt
    # lower case
    texts = [x.lower() for x in texts]
    # remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
    # remove numbers
    texts = [''.join(c for c in x if c not in string.digits) for x in texts]
    # trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    return texts

# call: preprocess texts
texts = text_prepro(texts)
print(texts)
```
3. select num_words = 4000
```python
tokenizer = Tokenizer()  # build the sparse matrix from the full vocabulary
# (this first version keeps every word; the 4000-word limit is applied in the embedding version below)
tokenizer.fit_on_texts(texts)  # fit on texts
words = tokenizer.index_word   # {integer: word} dict
print(words)
print('total word count:', len(words))  # total word count: 8629
```
4. sparse matrix: feature extraction
```python
x_data = tokenizer.texts_to_matrix(texts, mode='tfidf')
x_data.shape  # (5574, 8630) - (docs, words + 1)
```
5. train_test_split: 80 vs 20
```python
x_train, x_val, y_train, y_val = train_test_split(
    x_data, label, test_size=0.2)
x_train.shape  # (4459, 8630)
x_val.shape    # (1115, 8630)
y_train.shape  # (4459,)
y_val.shape    # (1115,)
```
6. DNN model
```python
model = Sequential()
input_shape = (8630,)

# hidden layer 1 : w[8630, 64]
model.add(Dense(units=64, input_shape=input_shape, activation='relu'))  # layer 1

# hidden layer 2 : w[64, 32]
model.add(Dense(units=32, activation='relu'))  # layer 2

# output layer : binary classifier
model.add(Dense(units=1, activation='sigmoid'))  # layer 3

model.summary()
'''
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_2 (Dense)              (None, 64)                552384 = [8630*64]+64
_________________________________________________________________
dense_3 (Dense)              (None, 32)                2080 = [64*32]+32
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33 = [32*1]+1
=================================================================
Total params: 554,497
Trainable params: 554,497
'''
```
7. model compile: configure the training process
```python
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```
8. model training
```python
model.fit(x=x_train, y=y_train,            # training set
          epochs=10,                        # number of training epochs
          verbose=1,                        # print progress
          validation_data=(x_val, y_val))   # validation set
```
9. model evaluation
```python
print('=' * 30)
model.evaluate(x=x_val, y=y_val)
# 0s 1ms/step - loss: 0.1051 - accuracy: 0.9874
```
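As a usage sketch (not part of the original notes), new messages can be classified by pushing them through the same tokenizer pipeline; the `new_texts` examples are hypothetical:

```python
# Hypothetical new messages; reuse the fitted tokenizer and trained model.
new_texts = text_prepro(['You won a free prize, call now!',
                         'See you at lunch tomorrow'])
new_x = tokenizer.texts_to_matrix(new_texts, mode='tfidf')  # same (n, 8630) layout
pred = model.predict(new_x)
print((pred > 0.5).astype(int))  # 1 = spam, 0 = ham
```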
word embedding
* see the feature classifier section above
word embedding (encoding) + DNN model
encoding: the result of text preprocessing
1. sparse matrix
texts -> sparse matrix (encoding) -> DNN model
2. word embedding (dense matrix)
texts -> integer index -> padding -> Embedding layer (encoding) -> DNN model
Embedding(input_dim, output_dim, input_length)
input_dim : vocabulary size + 1
output_dim : embedding vector dimension (32, 64, ...)
input_length : number of words per padded sentence (maxlen)
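A minimal shape check for this second path, with illustrative sizes (vocabulary of 100 words + 1, 8-dimensional vectors, sentences padded to 5 words):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten

m = Sequential()
m.add(Embedding(input_dim=101, output_dim=8, input_length=5))
m.add(Flatten())  # (None, 5, 8) -> (None, 40): ready for Dense layers

x = np.array([[0, 0, 7, 42, 99]])  # one padded sentence of integer indices
print(m.predict(x).shape)          # (1, 40)
```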
```python
import pandas as pd  # csv file read
import numpy as np   # list -> array
import string        # text preprocessing
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer          # token generator
from tensorflow.keras.preprocessing.sequence import pad_sequences  # [added] padding

# DNN model
from tensorflow.keras.models import Sequential                 # model
from tensorflow.keras.layers import Dense, Embedding, Flatten  # [added] layers
import time  # time check

path = r'C:\ITWILL\5_Tensorflow\workspace\chap08_TextVectorizing_RNN\data'
```
1. csv file read
```python
spam_data = pd.read_csv(path + '/temp_spam_data2.csv', header=None)
spam_data.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       5574 non-null   object  - label (ham/spam) : y variable
 1   1       5574 non-null   object  - texts : x variable
'''

label = spam_data[0]
texts = spam_data[1]

label.value_counts()
'''
ham     4827  -> 0
spam     747  -> 1
'''
texts
```
2. preprocess texts and label (0 or 1)
1) label preprocessing
```python
label = [1 if lab == 'spam' else 0 for lab in label]  # list comprehension
label[:10]  # [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# list -> numpy array
label = np.array(label)
label.shape  # (5574,)
```
2) texts preprocessing
```python
def text_prepro(texts):  # see text_sample.txt
    # lower case
    texts = [x.lower() for x in texts]
    # remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
    # remove numbers
    texts = [''.join(c for c in x if c not in string.digits) for x in texts]
    # trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    return texts

# call: preprocess texts
texts = text_prepro(texts)
print(texts)
```
3. select num_words = 4000
```python
# tokenizer = Tokenizer()              # 1st run: full vocabulary
tokenizer = Tokenizer(num_words=4000)  # 2nd run: limit to the 4000 most frequent words
tokenizer.fit_on_texts(texts)          # fit on texts
words = tokenizer.index_word           # {integer: word} dict
print(words)
print('total word count:', len(words))  # total word count: 8629 (index_word keeps the full vocabulary)
input_dim = len(words) + 1               # vocabulary size + 1
```
4. sparse matrix: feature extraction (kept for comparison; superseded by the padding step below)
```python
x_data = tokenizer.texts_to_matrix(texts, mode='tfidf')
x_data.shape  # (5574, 8630) - (docs, words + 1) -> (5574, 4000) with num_words=4000
```
[added] 4. integer index: word numbering
```python
seq_result = tokenizer.texts_to_sequences(texts)
print(seq_result)

lens = [len(sent) for sent in seq_result]
maxlen = max(lens)
maxlen  # 158
```
[added] 5. padding: equalize word length to maxlen
```python
x_data = pad_sequences(seq_result, maxlen=maxlen)
x_data.shape  # (5574, 158) - (sentences, word length)
```
6. train_test_split: 80 vs 20
```python
x_train, x_val, y_train, y_val = train_test_split(
    x_data, label, test_size=0.2)
x_train.shape  # (4459, 158)
x_val.shape    # (1115, 158)
y_train.shape  # (4459,)
y_val.shape    # (1115,)
```
7. DNN model
```python
model = Sequential()

# [added] 8. Embedding layer : layer 1 - encoding
model.add(Embedding(input_dim=input_dim, output_dim=32, input_length=maxlen))
# input_dim    : vocabulary size + 1
# output_dim   : embedding vector dimension (32, 64, ...)
# input_length : number of words per (padded) sentence (maxlen)

# [added] 2d -> 1d : flatten for the Dense layers
model.add(Flatten())
```
```python
# hidden layer 1 : w[5056, 64] - maxlen*32 after Flatten (vs w[8630, 64] in the sparse version)
model.add(Dense(units=64, activation='relu'))  # layer 2

# hidden layer 2 : w[64, 32]
model.add(Dense(units=32, activation='relu'))  # layer 3

# output layer : binary classifier
model.add(Dense(units=1, activation='sigmoid'))  # layer 4

model.summary()
'''
expected summary for this embedding model (layer names may differ between runs):
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 158, 32)           276160 = [8630*32]
_________________________________________________________________
flatten (Flatten)            (None, 5056)              0
_________________________________________________________________
dense (Dense)                (None, 64)                323648 = [5056*64]+64
_________________________________________________________________
dense_1 (Dense)              (None, 32)                2080 = [64*32]+32
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33 = [32*1]+1
=================================================================
Total params: 601,921
Trainable params: 601,921
'''
```
```python
start = time.time()  # start of the timing check
```
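The notes end at the timing marker; a plausible completion mirroring steps 7-9 of the sparse-matrix script (compile, fit, evaluate, plus the elapsed-time check that `start` sets up) would be:

```python
# Assumed continuation, following the same pattern as the sparse-matrix model.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x=x_train, y=y_train,            # training set
          epochs=10,                        # number of training epochs
          verbose=1,                        # print progress
          validation_data=(x_val, y_val))   # validation set

model.evaluate(x=x_val, y=y_val)
print('training + evaluation time:', time.time() - start, 'seconds')
```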