01. Ensemble Models

An ensemble model builds multiple classifiers and combines their predictions to improve predictive power.

It classifies better than a single model (Decision Tree), mitigating overfitting and poor prediction accuracy.

 

 

Single Tree vs. Ensemble Model

Decision tree | Random Forest (package for building ensemble models)
Creates a single training set from one data set | Creates multiple training sets from the same data set by random sampling with replacement
Builds one classification tree in a single training pass and predicts the target variable | Builds multiple trees over repeated training passes and combines them to make the final prediction of the target variable
Applies the resulting classification model to the test data to predict the target variable | (as above)

 

 

Ensemble Algorithms

Types: Bagging, Boosting

Procedure for building an ensemble model: generate training sets from the full data > train a model on each training set > combine the trained models into an ensemble

* Note, however, that the results are harder to interpret and prediction takes longer.

 

 

Comparison of Ensemble Learning Algorithms

Item | Bagging | Boosting
In common | Training sets are generated from the full data set by random sampling with replacement (bootstrap)
Difference | Parallel learning: the results of the models are combined and decided by vote | Sequential learning: the current model's weights are passed on to the next model
Characteristic | Training sets are drawn from a uniform probability distribution | Training sets emphasize cases that are hard to classify
Strength | Robust to overfitting | High accuracy
Weakness | Lower accuracy in certain regions | Vulnerable to outliers
R package | randomForest | xgboost

 

 

Ensemble Learning Algorithms

Bagging concentrates on building a general-purpose model, while Boosting is an ensemble method that focuses on getting hard-to-predict cases right.

 

 

 

 

 

02. Bootstrap

A technique that generates training data sets for an ensemble model by sampling with replacement from the original data set.

It artificially increases the amount of data and helps even out a data set whose distribution is uneven.
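As a minimal sketch of the idea, the base-R snippet below draws one bootstrap sample from a toy vector. The values and the names `original`, `boot_idx`, and `boot_sample` are made up for illustration; the only essential piece is `sample()` with `replace = TRUE`.

```r
set.seed(42)                       # fixed seed so the draw is reproducible
original <- c(10, 20, 30, 40, 50)  # toy data set (hypothetical values)
n <- length(original)

# Sample n indices WITH replacement: some repeat, some never appear.
boot_idx <- sample(n, size = n, replace = TRUE)
boot_sample <- original[boot_idx]

print(boot_idx)
print(boot_sample)
```

The bootstrap sample has the same size as the original data, but its composition changes from draw to draw, which is what gives each model in an ensemble a slightly different view of the data.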

 

 

 

 

 

03. Bagging Algorithm

Bagging: Bootstrap Aggregating ("combining bags")

Bagging uses the bootstrap to generate slightly different training data sets, builds a model (a trained tree) on each, and combines (aggregates) the results.

1. Start with a full data set of D observations.

2. Draw D observations with replacement from the full data set to create D1 (a bag).

3. Train a tree model on D1.

4. Repeat steps 2-3 until bags D1 through Dm and their m tree models have been created.

5. Predict y by combining the m trees:
- Quantitative response (regression tree): average the trees' predictions
- Qualitative response (classification tree): take a vote among the trees
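The five steps can be sketched in base R as follows. This is only an illustration under stated assumptions: a one-split "stump" stands in for a full tree so that no extra package is needed, the built-in `cars` data set (speed, dist) plays the role of the D observations, and `fit_stump` and `predict_stump` are hypothetical helper names, not functions from any library.

```r
set.seed(1)

# A one-split regression "stump": a deliberately tiny stand-in for a full tree.
fit_stump <- function(x, y) {
  vals <- sort(unique(x))
  if (length(vals) < 2) {                        # no split possible
    return(list(cut = Inf, left = mean(y), right = mean(y)))
  }
  mids <- (head(vals, -1) + tail(vals, -1)) / 2  # candidate cut points
  sse <- sapply(mids, function(cc) {
    l <- y[x <= cc]
    r <- y[x > cc]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)  # squared error of the split
  })
  best <- mids[which.min(sse)]
  list(cut = best, left = mean(y[x <= best]), right = mean(y[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

D <- nrow(cars)  # step 1: D observations
m <- 25          # number of bags / models

# Steps 2-4: draw D rows with replacement, fit one model per bag.
stumps <- lapply(seq_len(m), function(i) {
  idx <- sample(D, size = D, replace = TRUE)
  fit_stump(cars$speed[idx], cars$dist[idx])
})

# Step 5 (regression): average the m predictions for every observation.
preds <- sapply(stumps, function(s) predict_stump(s, cars$speed))  # D x m matrix
bagged <- rowMeans(preds)
print(head(bagged))
```

For a classification problem, the last step would take a vote across the columns of `preds` instead of a row mean.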

 

 

 

 

 

04. Boosting Algorithm

An algorithm that repeatedly generates new classification rules while concentrating on the observations that were misclassified (sequential learning).

It combines weak predictive models to build a strong one.

Misclassified observations receive higher weights and correctly classified observations receive lower weights, which improves the accuracy of the predictive model.
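The reweighting idea can be illustrated with one round of an AdaBoost-style update. The text above does not name a specific boosting algorithm, so AdaBoost is an assumption here, and the labels and predictions are invented toy values.

```r
# Toy labels (+1 / -1) and the predictions of one weak model (hypothetical).
y_true <- c(1,  1, -1, -1, 1)
y_pred <- c(1, -1, -1, -1, 1)   # the second observation is misclassified

w <- rep(1 / length(y_true), length(y_true))  # start with equal weights
miss <- y_true != y_pred

err   <- sum(w[miss])                  # weighted error rate of the weak model
alpha <- 0.5 * log((1 - err) / err)    # importance of this weak model

# Raise the weights of misclassified cases, lower the others, then renormalize.
w_new <- w * exp(ifelse(miss, alpha, -alpha))
w_new <- w_new / sum(w_new)
print(round(w_new, 3))
```

After the update the single misclassified observation carries half of the total weight, so the next model in the sequence is pushed to get it right.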

 

 

 

 

 

05. RandomForest

Characteristics

Similar to the bagging algorithm (an ensemble model)

Supports both regression and classification

No separate tuning (scaling) step is required

Compensates for the weaknesses of a single tree model (performance, overfitting)

Processing time grows with large data sets (a drawback)

Can be parallelized across multiple cores

 

Differences

Bagging: every tree grown on a bootstrap sample may use all independent (explanatory) variables

Random forest: only a randomly selected subset of size a of the independent (explanatory) variables is considered as split candidates

Number of explanatory variables: a = sqrt(number of explanatory variables) (e.g., about 4 out of 15 variables)

When there are many explanatory variables, this reduces the chance that highly correlated variables are mixed into the same tree
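A small sketch of the variable-subsampling step, using the column names of the built-in `iris` data. The names `predictors` and `candidates` are illustrative, and rounding sqrt(p) down is just one possible convention.

```r
set.seed(7)
predictors <- setdiff(names(iris), "Species")  # the 4 explanatory variables
a <- floor(sqrt(length(predictors)))           # sqrt(4) = 2 candidates

# Plain bagging would consider all of `predictors`;
# a random forest draws a fresh subset like this one for each split.
candidates <- sample(predictors, size = a)     # drawn without replacement
print(candidates)
```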

 

 

Random Forest Model-Building Steps

1. Create bootstrap samples as in bagging (data samples in which duplicates are allowed).

2. Train a Decision Tree on a bootstrap sample and make predictions with it.

3. Repeat steps 1 and 2 enough times to build many Decision Trees and collect their predictions.

* This yields a model with good classification performance that is robust to noise (guards against overfitting)

4. Combine the collected predictions to produce the ensemble model's result.

* Regression tree: decided by the mean of the predictions

* Classification tree: decided by majority vote
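The final combination step can be shown with a handful of invented per-tree predictions for a single observation. All values below are hypothetical.

```r
# Hypothetical predictions from five classification trees for one observation
votes <- c("versicolor", "virginica", "versicolor", "setosa", "versicolor")
vote_counts <- table(votes)
majority <- names(vote_counts)[which.max(vote_counts)]
print(majority)   # class with the most votes

# Hypothetical predictions from five regression trees for one observation
tree_preds <- c(21.5, 23.0, 22.1, 24.4, 22.0)
average <- mean(tree_preds)
print(average)    # ensemble prediction is the mean
```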

 

 

Why Use Random Forest

1. A single decision tree carries a high risk of overfitting.

2. With bagging, the target variable is predicted from the trees' mean, probability, or vote.

3. The bias of the trees is preserved while the variance decreases, which gives high accuracy and mitigates overfitting.

4. It is a classification technique that fits distributed processing in big-data systems.

* Drawing multiple training sets, growing a tree on each, and combining them to predict the target variable is a structure well suited to distributed processing systems.

 

 

Random Forest Model Evaluation

OOB error (Out-of-Bag error)

OOB (Out-of-Bag): the observations that are left out of a training set when it is drawn at random with replacement by the bootstrap

OOB error: each classifier predicts the observations that were not in its training set, the predictions are compared with the actual values, and the resulting error rates are averaged to evaluate the model
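To make the out-of-bag set concrete, the sketch below draws one bootstrap sample of row indices and lists the rows that were never picked. The size `n = 150` is chosen only to mirror a small data set; the variable names are illustrative.

```r
set.seed(123)
n <- 150

in_bag <- sample(n, size = n, replace = TRUE)  # rows used to train one tree
oob    <- setdiff(seq_len(n), in_bag)          # rows that tree never saw

# Each row is missed with probability (1 - 1/n)^n, roughly exp(-1) or 0.37,
# so about a third of the data is available to test each tree.
print(length(oob) / n)
```

Because every tree has its own out-of-bag rows, averaging the errors made on them gives an evaluation that does not require a separate test set.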

 

 

Number of Trees and Number of Explanatory Variables

1. How large should the number of bootstrap samples m be?

An m of 100 or more is usually sufficient, but a larger value (400-500) can be used to reduce the test error.

2. What is a suitable number of explanatory variables a for a random forest?

Classification tree: a = square root of the number of explanatory variables p

Regression tree: a = p / 3

* The optimal value of a can vary with the correlation among the variables.
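The two rules of thumb can be wrapped in a small helper. `rule_of_thumb_a` is a hypothetical name, not a function from any package, and it returns the unrounded value because how to round (down, up, or to the nearest integer) is a choice left to the analyst.

```r
# Rule-of-thumb number of candidate variables for p explanatory variables.
rule_of_thumb_a <- function(p, type = c("classification", "regression")) {
  type <- match.arg(type)
  if (type == "classification") sqrt(p) else p / 3
}

print(rule_of_thumb_a(4,  "classification"))  # sqrt(4)
print(rule_of_thumb_a(15, "classification"))  # sqrt(15)
print(rule_of_thumb_a(13, "regression"))      # 13 / 3
```

Since the best value depends on how correlated the variables are, these numbers are a starting point, and trying a few neighboring values while watching the OOB error is a reasonable way to settle on one.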

 

 

 

 

 

[Exercise 1]

install.packages('randomForest')
library(randomForest) # provides the randomForest() function

data(iris)


1. Build the random forest model
Format) randomForest(y ~ x, data, ntree=500, mtry) -> y: dependent variable, x: independent variables
Because y is categorical, a classification tree is built

model = randomForest(Species~., data=iris, # use the full data set
                     ntree=500, mtry=2, na.action=na.omit) # 500 bootstrap data sets (duplicates allowed)
model

ntree : number of trees to grow
mtry : number of explanatory (independent) variables (sqrt(p) for a classification tree, p/3 for a regression tree)
OOB estimate of  error rate: 4.67% -> model evaluation

 

sqrt(4) # number of explanatory variables: 2

acc = (50+47+46) / nrow(iris) # 0.9533333
1-acc # OOB error : 0.04666667




2. Inspect the model

names(model) # 19 components

con = model$confusion
con


Predicted y

pred = model$predicted # predictions based on the out-of-bag samples
pred


Observed y (ground truth)

model$y




3. Variable importance

model2 = randomForest(Species ~ ., data=iris, 
                      ntree=500, mtry=2, 
                      importance = TRUE,
                      na.action=na.omit)
model2


View the importance information for the variables that influence y

model2$importance

MeanDecreaseAccuracy : contribution to improving classification accuracy
[Interpretation] Petal.Length and Petal.Width have comparatively large values -> they are the variables that affect accuracy.

MeanDecreaseGini : contribution to reducing impurity
[Interpretation] Petal.Length and Petal.Width are the variables that contribute most to reducing impurity.

varImpPlot(model2)

* When there are many independent variables, this provides key information for selecting the main variables without reducing dimensionality

 

 

 

 

 

[Exercise 2] Regression tree (y variable: ratio scale)

library(MASS)
data("Boston")
str(Boston)

#506 obs. of 14 variables / all numeric -> suitable for a regression tree
#crim : per-capita crime rate by town
#zn : proportion of residential land zoned for lots over 25,000 sq. ft.
#indus : proportion of land occupied by non-retail business
#chas : Charles River dummy variable (1: tract bounds the river, 0: otherwise)
#nox : nitrogen oxides concentration (parts per 10 million)
#rm : average number of rooms per dwelling
#age : proportion of owner-occupied units built before 1940
#dis : accessibility (weighted distance) to five Boston employment centers
#rad : index of accessibility to highways
#tax : property-tax rate per $10,000
#ptratio : pupil-teacher ratio by town
#black : index based on the proportion of Black residents by town
#lstat : percentage of lower-status population
#medv(y) : median value of owner-occupied homes (unit: $1,000)

Boston$medv # unscaled values

p = 13 # 13 independent variables
mtry = p / 3 
mtry # 4.333333 -> use 5 independent variables (rounded up)
boston_model <- randomForest(medv ~ ., data = Boston,
                             ntree = 500, mtry = 5,
                             importance = TRUE,
                             na.action=na.omit) # how to handle missing values: drop them

boston_model

Mean of squared residuals 9.763377 : evaluates the model by MSE (mean squared error)
% Var explained 88.43 : evaluates the model by the R2 score

names(boston_model)


Predicted vs. observed values

y_pred = boston_model$predicted
y_true = boston_model$y


R2 score

R2_score = cor(y_true, y_pred)^2
R2_score #0.8887293
MSE = mean((y_true - y_pred)^2)
MSE #9.763377

* When y has not been scaled, evaluate with the R2 score (MSE depends on the units of y)
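Two common ways of computing an R2 score do not give the same number, which may explain why the squared correlation (0.8887) differs from the printed "% Var explained" (88.43). The sketch below compares them on invented toy values. The assumption, worth checking against the package documentation, is that the printed figure is of the second kind, 1 - MSE / Var(y).

```r
# Toy observed and predicted values (hypothetical)
y_true <- c(3, 5, 7, 9)
y_pred <- c(2.5, 5.5, 6.5, 9.5)
n <- length(y_true)

mse <- mean((y_true - y_pred)^2)

# Definition 1: squared correlation between observed and predicted values
r2_cor <- cor(y_true, y_pred)^2

# Definition 2: share of variance explained, 1 - MSE / Var(y),
# using the population variance (divide by n rather than n - 1)
r2_var <- 1 - mse / (var(y_true) * (n - 1) / n)

print(c(mse = mse, r2_cor = r2_cor, r2_var = r2_var))
```

The squared correlation ignores any systematic offset or scaling of the predictions, so it is never smaller than the variance-explained version on the same data.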


Check variable importance

importance(boston_model)
varImpPlot(boston_model)

 

 

 

 

 

[Exercise 3] Classification tree (y variable: categorical)

titanic = read.csv(file.choose(), stringsAsFactors = TRUE) # titanic3.csv; read text columns as factors
str(titanic)

Variables in titanic3.csv
'data.frame': 1309 obs. of 14 variables:
1.pclass : passenger class; 1st, 2nd, and 3rd class stored as 1, 2, 3
2.survived : survival status. survived = 1, dead = 0
3.name : name (excluded)
4.sex : sex. female, male
5.age : age
6.sibsp : number of siblings or spouses aboard
7.parch : number of parents or children aboard
8.ticket : ticket number (excluded)
9.fare : ticket fare
10.cabin : cabin number (excluded)
11.embarked : port of embarkation (excluded) C(Cherbourg), Q(Queenstown), S(Southampton)
12.boat     : (excluded) Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
13.body     : (excluded) int  NA NA NA 135 NA NA NA NA NA 22 ...
14.home.dest: (excluded)

Columns to drop : 3, 8, 10~14

df = titanic[, -c(3, 8, 10:14)]
dim(df) #1309  7 (6 predictors plus the target survived)

sqrt(7) #2.645751 -> 3 (counts all 7 columns; the 6 predictors alone give sqrt(6), about 2.45)


Convert numeric -> factor (dummy variable)

df$survived = factor(df$survived)
titanic_model = randomForest(survived ~ ., data = df,
                   ntree = 500, mtry = 3,
                   importance = TRUE,
                   na.action=na.omit)

titanic_model # OOB error rate: 19.04%
con = titanic_model$confusion

    0   1 class.error
0 548  70   0.1132686
1 129 298   0.3021077

acc = (con[1,1] + con[2,2]) / sum(con[, 1:2]) # exclude the class.error column from the total
acc #0.8095694


Check variable importance

importance(titanic_model)
varImpPlot(titanic_model)
