What is text vectorization?
- A preprocessing step that converts text into numeric vectors
- Deep learning models can only process numeric data


Methods
Text → words → word vectors
Text → characters → character vectors
Text → N-grams (groups of words or characters) → N-gram vectors
* N-gram : a group of N consecutive words or characters, extracted by sliding through the text one word or character at a time (see the sketch below)
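For example, a minimal sketch of extracting word bigrams (N=2), sliding one token at a time:

def ngrams(tokens, n):
    # every window of n consecutive tokens, shifted one step at a time
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams('the dog sat on the table'.split(), 2))
# [('the', 'dog'), ('dog', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'table')]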


Vector conversion methods (token → numeric vector)

One-hot encoding (sparse matrix)
- Definition : marks each item with a single 1 among many 0s; a sparse matrix, i.e. a vector or matrix whose values are mostly 0
- Features : 1) easy to encode, simple to interpret
             2) high-dimensional when the vocabulary is large (word count = dimension count)
- How built : by hand
- Problems : ignores word meaning entirely, and since the vector length equals the total word count the representation is extremely sparse; the many zeros in the matrix waste space

Word embedding (dense matrix)
- Definition : represents words as vectors in a denser space while taking word meaning into account; a dense matrix that stores more information in fewer dimensions
- Features : 1) the vector dimension is not the vocabulary size but a user-chosen value (64, 128, 256, … 1024); in this process the 0/1 values become real numbers
             2) the dense vector values are learned by the deep learning optimization algorithm
- How built : learned from training data
- Improvement : a pretrained embedding can reach high performance even with less data
- Types : LSA, Word2Vec, GloVe, FastText, etc.
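As an illustration of one of the embedding families listed above, a minimal sketch of training a tiny Word2Vec model with gensim (gensim 4.x API; gensim is not used elsewhere in this post):

from gensim.models import Word2Vec

sents = [['the', 'dog', 'sat', 'on', 'the', 'table'],
         ['the', 'dog', 'is', 'my', 'poodle']]
w2v = Word2Vec(sentences=sents, vector_size=8, window=2, min_count=1)
print(w2v.wv['dog'])   # an 8-dimensional dense, real-valued vector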



One-hot vector vs word embedding vector
- with 5,574 sentences in total and a total vocabulary size of 8,630 (see the size sketch below)
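A rough size sketch under those numbers (the maxlen=158 and 32-dimension values are the ones used in the embedding script later in this post):

n_docs, n_cols = 5574, 8630     # one-hot / sparse DTM : one column per word
print(n_docs * n_cols)          # 48,103,620 cells, mostly zeros

maxlen, embed_dim = 158, 32     # embedding path : padded integer indices
print(n_docs * maxlen)          # 880,692 indices fed to the Embedding layer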



์ง€๋„ํ•™์Šต์„ ์œ„ํ•œ ํŠน์ง• ์ถ”์ถœ
data ์ž…๋ ฅ (label) -> Features ์ถ”์ถœ (0 or 1) -> ์•Œ๊ณ ๋ฆฌ์ฆ˜ -> model


TF-IDF (Term Frequency - Inverse Document Frequency) weighting
A weighting scheme used in information retrieval and text mining.
Given a collection of documents, it is a statistical measure of how important a word is within a particular document: tf counts how often a word occurs in a document, while idf down-weights words that appear in many documents.

ex) To build a spam mail classifier, weight the input text by each word's occurrence relative to the documents it appears in, build a sparse matrix, and feed that matrix to the model as input.

tfidf_mat = tokenizer.texts_to_matrix(texts, mode='tfidf')   # called on the Tokenizer object, not on the word_index dict



Word embedding pipeline
word → unique word integer → Embedding layer → embedding vector
Every word of the input sequences fed to the Embedding layer is first converted to an integer index

Given {'the' : 1, 'cat' : 2, 'sat' : 3, 'on' : 4, 'mat' : 5, 'dog' : 6, 'ate' : 7, 'my' : 8, 'homework' : 9, … } :
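A minimal sketch of that pipeline, assuming the index dict above: the sentence 'the cat sat on the mat' becomes the index sequence [1, 2, 3, 4, 1, 5], and an (untrained) Embedding layer maps each index to a dense vector:

import numpy as np
from tensorflow.keras.layers import Embedding

idx = {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
sent = [idx[w] for w in 'the cat sat on the mat'.split()]   # [1, 2, 3, 4, 1, 5]

emb = Embedding(input_dim=10, output_dim=8)   # vocab size 10, 8-dim vectors
vecs = emb(np.array([sent]))                  # batch of 1 sentence
print(vecs.shape)                             # (1, 6, 8) - (batch, words, dim)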





Vectorizing

Procedure
Step 1 : token generation : text -> extract words
Step 2 : word-order index : word -> unique integer (stands in for the word)
Step 3 : padding : equalize the word length of each sentence (based on maxlen)
Step 4 : encoding : data to feed the deep learning model (numeric vectors)


from tensorflow.keras.preprocessing.text import Tokenizer           # token generator
from tensorflow.keras.preprocessing.sequence import pad_sequences   # padding
from tensorflow.keras.utils import to_categorical                   # one-hot encoding


see text_sample.txt

texts = ['The dog sat on the table.', 'The dog is my Poodle.']


ํ† ํฐ ์ƒ์„ฑ๊ธฐ

tokenizer = Tokenizer()


Step 1 : token generation

tokenizer.fit_on_texts(texts)
token = tokenizer.word_index   # token dict : {'word': unique integer} -> word-order index
print(token)
'''
{'the': 1, 'dog': 2, 'sat': 3, 'on': 4, 'table': 5, 'is': 6, 'my': 7, 'poodle': 8}
'''
print('total word count =', len(token))   # total word count = 8


Step 2 : word-order index -> integer conversion

seq_vector = tokenizer.texts_to_sequences(texts)
print(seq_vector)
'''
[[1, 2, 3, 4, 1, 5], [1, 2, 6, 7, 8]]
'''


Step 3 : padding : based on maxlen

lens = [len(sent) for sent in seq_vector]
maxlen = max(lens)
print(maxlen)   # 6

max length : sets the maximum number of words
ex) maxlen = 10 : every sentence is fitted to 10 words
-> shorter sentences are filled with 0s
-> words beyond 10 are cut off

padding = pad_sequences(seq_vector, maxlen=maxlen)   # default padding='pre' : zeros are added at the front
print(padding)
'''
[[1 2 3 4 1 5]  - 6 words
 [0 1 2 6 7 8]] - 6 words
'''


Step 4 : encoding : one-hot encoding (binary)

one_hot = to_categorical(padding)
print(one_hot)

[[[0. 1. 0. 0. 0. 0. 0. 0. 0.]   - the
  [0. 0. 1. 0. 0. 0. 0. 0. 0.]   - dog
  [0. 0. 0. 1. 0. 0. 0. 0. 0.]   - sat
  [0. 0. 0. 0. 1. 0. 0. 0. 0.]   - on
  [0. 1. 0. 0. 0. 0. 0. 0. 0.]   - the
  [0. 0. 0. 0. 0. 1. 0. 0. 0.]]  - table

 [[1. 0. 0. 0. 0. 0. 0. 0. 0.]   - padding
  [0. 1. 0. 0. 0. 0. 0. 0. 0.]   - the
  [0. 0. 1. 0. 0. 0. 0. 0. 0.]   - dog
  [0. 0. 0. 0. 0. 0. 1. 0. 0.]   - is
  [0. 0. 0. 0. 0. 0. 0. 1. 0.]   - my
  [0. 0. 0. 0. 0. 0. 0. 0. 1.]]] - Poodle

one_hot.shape   # (2, 6, 9) - (sentences, words, total words + 1)





Feature extraction

1. text -> feature extraction
-> texts -> sparse matrix : data to feed the deep learning model
-> weighting methods : occurrence (binary), occurrence count, occurrence ratio, tf*idf (term frequency * inverse document frequency)
2. num_words
- sets the number of columns of the sparse matrix (number of words)
ex) num_words=500 : keep the 500 most important words out of the full vocabulary

from tensorflow.keras.preprocessing.text import Tokenizer   # token generator


see text_sample.txt

texts = ['The dog sat on the table.', 'The dog is my Poodle.']


Token generator

tokenizer = Tokenizer()


Step 1 : token generation

tokenizer.fit_on_texts(texts)
token = tokenizer.word_index   # token dict : {'word': unique integer} -> word-order index
print(token)
'''
{'the': 1, 'dog': 2, 'sat': 3, 'on': 4, 'table': 5, 'is': 6, 'my': 7, 'poodle': 8}
'''
print('total word count =', len(token))   # total word count = 8



1. Sparse matrix : texts -> feature extraction
1) word occurrence (binary)

binary_mat = tokenizer.texts_to_matrix(texts=texts, mode='binary')
print(binary_mat)   # DTM (document-term matrix); column 0 is reserved, so it stays 0
'''
[[0. 1. 1. 1. 1. 1. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 1. 1. 1.]]
'''
binary_mat.shape   # (2, 9) - (docs, words+1)


2) word occurrence count

count_mat = tokenizer.texts_to_matrix(texts=texts, mode='count')
print(count_mat)
'''
[[0. 2. 1. 1. 1. 1. 0. 0. 0.]   - 'the' : 2
 [0. 1. 1. 0. 0. 0. 1. 1. 1.]]
'''


3) word occurrence ratio

freq_mat = tokenizer.texts_to_matrix(texts=texts, mode='freq')
print(freq_mat)

[[0.         0.33333333 0.16666667 0.16666667 0.16666667 0.16666667
  0.         0.         0.        ]
 [0.         0.2        0.2        0.         0.         0.
  0.2        0.2        0.2       ]]

hand check : each value = (word count in the document) / (number of words in the document)
the = 0.3333

2/6   # 0.3333333333333333

dog = 0.1666

1/6   # 0.16666666666666666


4) tf*idf weighting : tf * idf (idf based on total docs / docs containing the word; Keras applies a smoothed log form, checked below)

tfidf_mat = tokenizer.texts_to_matrix(texts=texts, mode='tfidf')
print(tfidf_mat)   # 'the' : 0.8649 (doc 1) -> 0.5108 (doc 2)

[[0.         0.86490296 0.51082562 0.69314718 0.69314718 0.69314718
  0.         0.         0.        ]
 [0.         0.51082562 0.51082562 0.         0.         0.
  0.69314718 0.69314718 0.69314718]]

hand check : tfidf = (1 + ln tf) * ln(1 + N / (1 + df)), with N = 2 documents
'the' in doc 1 : (1 + ln 2) * ln(1 + 2/3) = 1.6931 * 0.5108 ≈ 0.8649
'sat' in doc 1 : (1 + ln 1) * ln(1 + 2/2) = 1.0000 * 0.6931 ≈ 0.6931
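A minimal sketch to verify those numbers, assuming the smoothed formula Keras uses internally for texts_to_matrix(mode='tfidf'):

import numpy as np

def keras_tfidf(tf, df, n_docs):
    # smoothed tf-idf : (1 + ln tf) * ln(1 + N / (1 + df))
    return (1 + np.log(tf)) * np.log(1 + n_docs / (1 + df))

print(keras_tfidf(2, 2, 2))   # 'the', doc 1 -> 0.8649...
print(keras_tfidf(1, 2, 2))   # 'dog', doc 1 -> 0.5108...
print(keras_tfidf(1, 1, 2))   # 'sat', doc 1 -> 0.6931...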




2. num_words : limiting the word length of the sparse matrix
Token generator

tokenizer = Tokenizer(num_words=6)   # keeps the 5 most frequent words (word count + 1, incl. the reserved column 0)
tokenizer.fit_on_texts(texts)        # fit on the texts
tfidf_max = tokenizer.texts_to_matrix(texts, mode='tfidf')
print(tfidf_max)
'''
[[0. 0.86490296 0.51082562 0.69314718 0.69314718 0.69314718]
 [0. 0.51082562 0.51082562 0. 0. 0. ]]
'''
tfidf_max.shape   # (2, 6) - (docs, words+1)





Feature classifier

- sparse matrix (encoding) + DNN model

<Procedure>
1. read the csv file
2. preprocess texts and labels (0 or 1)
3. limit to num_words = 4000
4. texts -> sparse matrix : feature extraction
5. train/validation split
6. DNN model


import pandas as pd   # csv file read
import numpy as np    # list -> array
import string         # texts preprocessing
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer   # token generator
# DNN model
from tensorflow.keras.models import Sequential   # model
from tensorflow.keras.layers import Dense        # layer

path = r'C:\ITWILL\5_Tensorflow\workspace\chap08_TextVectorizing_RNN\data'



1. read the csv file

spam_data = pd.read_csv(path + '/temp_spam_data2.csv', header=None)
spam_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       5574 non-null   object   - label (ham/spam) : y variable
 1   1       5574 non-null   object   - texts : x variable

label = spam_data[0]
texts = spam_data[1]
label.value_counts()

ham     4827   -> 0
spam     747   -> 1

texts




2. preprocess texts and labels (0 or 1)
1) label preprocessing

label = [1 if lab == 'spam' else 0 for lab in label]   # list comprehension
label[:10]   # [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]


list -> numpy

label = np.array(label)
label.shape   # (5574,)


2) texts preprocessing

def text_prepro(texts):   # see text_sample.txt
    # Lower case
    texts = [x.lower() for x in texts]
    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
    # Remove numbers
    texts = [''.join(c for c in x if c not in string.digits) for x in texts]
    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    return texts


Function call : texts preprocessing

texts = text_prepro(texts)
print(texts)
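A quick sanity check on a toy string (a hypothetical input, not from the dataset):

print(text_prepro(['Hello, World! Call 0900 now.']))   # ['hello world call now']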




3. token generation (this first script uses the full vocabulary; the num_words = 4000 limit is applied in the word-embedding script below)

tokenizer = Tokenizer()          # sparse matrix built from the full vocabulary
tokenizer.fit_on_texts(texts)    # fit on the texts
words = tokenizer.index_word     # returns {integer: 'word'}
print(words)
print('total word count : ', len(words))   # total word count :  8629




4. Sparse matrix : feature extraction

x_data = tokenizer.texts_to_matrix(texts, mode='tfidf')
x_data.shape   # (5574, 8630) - (docs, words+1)


5. train_test_split : 80 vs 20

x_train, x_val, y_train, y_val = train_test_split(x_data, label, test_size=0.2)
x_train.shape   # (4459, 8630)
x_val.shape     # (1115, 8630)
y_train.shape   # (4459,)
y_val.shape     # (1115,)




6. DNN model

model = Sequential()
input_shape = (8630,)


hidden layer1 : w[8630, 64]

model.add(Dense(units=64, input_shape=input_shape, activation='relu'))   # layer 1


hidden layer2 : w[64, 32]

model.add(Dense(units=32, activation='relu'))   # layer 2


output layer : binary classifier

model.add(Dense(units=1, activation='sigmoid'))   # layer 3
model.summary()

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 64) 552384=[8630*64]+64
_________________________________________________________________
dense_3 (Dense) (None, 32) 2080=[64*32]+32
_________________________________________________________________
dense_4 (Dense) (None, 1) 33=[32*1]+1
=================================================================
Total params: 554,497
Trainable params: 554,497



7. model compile : configure the learning process

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


8. model training

model.fit(x=x_train, y=y_train,             # training set
          epochs=10,                        # number of training epochs
          verbose=1,                        # print progress
          validation_data=(x_val, y_val))   # validation set


9. model evaluation

print('='*30)
model.evaluate(x=x_val, y=y_val)   # 0s 1ms/step - loss: 0.1051 - accuracy: 0.9874
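A hypothetical usage sketch: classifying a new message with the fitted tokenizer and model (the message text is made up):

new_texts = text_prepro(['Congratulations! You won a free prize, call now'])
new_x = tokenizer.texts_to_matrix(new_texts, mode='tfidf')   # same 8630-column layout as training
pred = model.predict(new_x)
print('spam' if pred[0][0] > 0.5 else 'ham')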





Word embedding

* see the feature classifier above
word embedding (encoding) + DNN model

encoding : the result of text preprocessing
1. sparse matrix
   texts -> sparse matrix (encoding) -> DNN model
2. word embedding (dense matrix)
   texts -> integer index -> padding -> Embedding layer (encoding) -> DNN model

Embedding(input_dim, output_dim, input_length)
input_dim : total word count + 1
output_dim : embedding vector dimension (32, 64, ...)

import pandas as pd   # csv file read
import numpy as np    # list -> array
import string         # texts preprocessing
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer           # token generator
from tensorflow.keras.preprocessing.sequence import pad_sequences   # [added] padding


DNN model

from tensorflow.keras.models import Sequential                  # model
from tensorflow.keras.layers import Dense, Embedding, Flatten   # [added] layers
import time   # time check

path = r'C:\ITWILL\5_Tensorflow\workspace\chap08_TextVectorizing_RNN\data'



1. read the csv file

spam_data = pd.read_csv(path + '/temp_spam_data2.csv', header=None)
spam_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       5574 non-null   object   - label (ham/spam) : y variable
 1   1       5574 non-null   object   - texts : x variable

label = spam_data[0]
texts = spam_data[1]
label.value_counts()
'''
ham     4827   -> 0
spam     747   -> 1
'''
texts




2. preprocess texts and labels (0 or 1)
1) label preprocessing

label = [1 if lab == 'spam' else 0 for lab in label]   # list comprehension
label[:10]   # [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]


list -> numpy

label = np.array(label)
label.shape   # (5574,)


2) texts preprocessing

def text_prepro(texts):   # see text_sample.txt
    # Lower case
    texts = [x.lower() for x in texts]
    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
    # Remove numbers
    texts = [''.join(c for c in x if c not in string.digits) for x in texts]
    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    return texts


Function call : texts preprocessing

texts = text_prepro(texts)
print(texts)




3. limit to num_words = 4000
# tokenizer = Tokenizer()   # first run : sparse matrix from the full vocabulary

tokenizer = Tokenizer(num_words=4000)   # second run : limit to the 4,000 most frequent words
tokenizer.fit_on_texts(texts)           # fit on the texts
words = tokenizer.index_word            # word dict (index_word keeps the full index regardless of num_words)
print(words)
print('total word count : ', len(words))   # total word count :  8629
input_dim = len(words) + 1                  # total word count + 1




4. Sparse matrix : feature extraction

x_data = tokenizer.texts_to_matrix(texts, mode='tfidf')
x_data.shape   # (5574, 8630) with the full vocabulary -> (5574, 4000) with num_words=4000


[added] 4. integer indexing : word order

seq_result = tokenizer.texts_to_sequences(texts)
print(seq_result)
lens = [len(sent) for sent in seq_result]
lens
maxlen = max(lens)
maxlen   # 158




[added] 5. padding : equalize word length based on maxlen

x_data = pad_sequences(seq_result, maxlen=maxlen)
x_data.shape   # (5574, 158) - (sentences, word length)




6. train_test_split : 80 vs 20

x_train, x_val, y_train, y_val = train_test_split(x_data, label, test_size=0.2)
x_train.shape   # (4459, 158)
x_val.shape     # (1115, 158)
y_train.shape   # (4459,)
y_val.shape     # (1115,)




7. DNN model

model = Sequential()




[added] 8. Embedding layer : layer 1 - encoding

model.add(Embedding(input_dim=input_dim, output_dim=32, input_length=maxlen))

input_dim : total word count + 1
output_dim : embedding vector dimension (32, 64, ...)
input_length : word length of each sentence (maxlen)

[added] flatten 2d -> 1d : (158, 32) -> (158*32 = 5056,)

model.add(Flatten())


hidden layer1 : w[5056, 64] (5056 = 158*32 after Flatten; cf. w[8630, 64] in the sparse-matrix model)

model.add(Dense(units=64, activation='relu'))   # layer 2


hidden layer2 : w[64, 32]

model.add(Dense(units=32, activation='relu'))   # layer 3


output layer : binary classifier

model.add(Dense(units=1, activation='sigmoid'))   # layer 4
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 158, 32)           276160=[8630*32]
_________________________________________________________________
flatten (Flatten)            (None, 5056)              0
_________________________________________________________________
dense (Dense)                (None, 64)                323648=[5056*64]+64
_________________________________________________________________
dense_1 (Dense)              (None, 32)                2080=[64*32]+32
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33=[32*1]+1
=================================================================
Total params: 601,921
Trainable params: 601,921

start = time.time()
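A minimal sketch of the remaining steps, assuming the same compile/fit settings as the sparse-matrix classifier above, so the elapsed training time can be measured from start:

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(x=x_train, y=y_train,
          epochs=10, verbose=1,
          validation_data=(x_val, y_val))

end = time.time()
print('training time :', end - start, 'seconds')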


