AI Practitioner Fundamentals
02. Gas Station Market Analysis Using Fuel Price Data

ํ”„๋กœ์ ํŠธ ๋ชฉํ‘œ

<ํ•œ๊ตญ์„์œ ๊ณต์‚ฌ ์ œํ’ˆ๋ณ„ ์ฃผ์œ ์†Œ ํŒ๋งค๊ฐ€๊ฒฉ> ๋ฐ์ดํ„ฐ์˜ ๊ฐ€๊ฒฉ/์ง€์—ญ/๋ธŒ๋žœ๋“œ/์…€ํ”„์—ฌ๋ถ€๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์ฃผ์œ ์†Œ ์‹œ์žฅ ๊ด€๋ จ ์ธ์‚ฌ์ดํŠธ ๋„์ถœ

์‹ค์ œ ์—ฐ๊ตฌ/์‹ค๋ฌด์—์„œ ํ™œ์šฉ๋˜๋Š” ํ•„์ˆ˜ ๋ถ„์„ ๊ณผ์ • ๋ฐ ๊ธฐ๋ฒ•์— ๋Œ€ํ•ด ํ•™์Šต

 

ํ”„๋กœ์ ํŠธ ๋ชฉ์ฐจ

  1. ๋ฐ์ดํ„ฐ Cleansing ๋ฐ Feature Engineering: ๋ถ„์„์„ ์œ„ํ•œ ์‚ฌ์ „ ์ ๊ฒ€ ๋ฐ ๋ฐ์ดํ„ฐ ๊ฐœ๊ด„ ์ดํ•ด
    1.1. 2018๋…„ ๋ฐ์ดํ„ฐ ๊ธฐ์ค€ ๋ฐ์ดํ„ฐ Cleansing ๋ฐ Feature Engineering
    1.2. Cleansing ๋ฐ Feature Engineering ํ•จ์ˆ˜ ์ƒ์„ฑ ๋ฐ ์ „์ฒด ๋…„๋„ ๋ฐ์ดํ„ฐ ์ ์šฉ
    1.3. ์—ฐ๋„๋ณ„ ๋ฐ์ดํ„ฐ Outer Join

  2. ์ฃผ์œ ์†Œ ๊ฐœํ์—… ํ˜„ํ™ฉ ๋ถ„์„: ์—ฐ๋„๋ณ„ ์ฃผ์œ ์†Œ ID ๋น„๊ต๋ฅผ ํ†ตํ•œ ๊ฐœํ์—… ํ˜„ํ™ฉ ๋ถ„์„
    2.1. ์—ฐ๋„๋ณ„ ๊ฐœํ์—… ์ˆ˜์น˜ ๋ถ„์„

  3. ๋ธŒ๋žœ๋“œ ๋ถ„์„: ๋ธŒ๋žœ๋“œ๋ณ„ ๊ฐ€๊ฒฉ๊ฒฝ์Ÿ๋ ฅ ๋ฐ ์‹œ์žฅ์ ์œ ์œจ ๋ถ„์„
    3.1. ์ฃผ์š” ๋ธŒ๋žœ๋“œ๋ณ„ ๊ฐ€๊ฒฉ Line Plot ๋ถ„์„
    3.2. ์ฃผ์š” ๋ธŒ๋žœ๋“œ๋ณ„ ์ง€๋‚œ 4๋…„๊ฐ„ ์‹œ์žฅ ์ ์œ ์œจ Stacked Bar Plot ๋ฐ Heatmap ๋ถ„์„

  4. ๊ฐ€๊ฒฉ ๋ถ„์„: ์ฃผ์œ ์†Œ ๋ฐ ์ง€์—ญ ๋ณ„ ๊ฐ€๊ฒฉ ํŽธ์ฐจ ๋ถ„์„
    4.1. ๊ฐ€๊ฒฉ ๋ถ„ํฌ Boxplot
    4.2. ์ง€์—ญ๋ณ„ ๊ฐ€๊ฒฉ ๋ถ„ํฌ Boxplot (Multiple Columns)

 

๋ฐ์ดํ„ฐ ์ถœ์ฒ˜

  • https://www.data.go.kr/data/15044628/fileData.do
  • Opinet ์œ ๊ฐ€๋‚ด๋ ค๋ฐ›๊ธฐ: 2018 ~ 2021๋…„ 4๊ฐœ๋…„์— ๋Œ€ํ•ด ๊ฐ๊ฐ 6์›” 1์ผ~7์ผ ๋ฐ์ดํ„ฐ ์ถ”์ถœ
  • ํ”„๋กœ์ ํŠธ์— ํ•„์š”ํ•œ ์ปฌ๋Ÿผ๋งŒ ์ถ”์ถœ

 

 

 

 

 

 

1. ๋ฐ์ดํ„ฐ Cleansing ๋ฐ Feature Engineering

ํ•„์š”ํ•œ ํŒจํ‚ค์ง€๋ฅผ importํ•œ ํ›„ ๋ถ„์„์„ ์œ„ํ•œ ์‚ฌ์ „ ์ ๊ฒ€๊ณผ ํ•จ๊ป˜ ๋ฐ์ดํ„ฐ ๊ฐœ๊ด„์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

 

1.1. Checking and Cleansing the 2018 Data

import numpy as np 
import pandas as pd 
import seaborn as sns
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

font_dirs = ['/usr/share/fonts/truetype/nanum', ]
font_files = fm.findSystemFonts(fontpaths=font_dirs)
for font_file in font_files:
    fm.fontManager.addfont(font_file)

plt.rcParams['font.family'] = 'NanumBarunGothic'
plt.rcParams['axes.unicode_minus']=False

f18 = pd.read_csv('./data/๊ณผ๊ฑฐ_ํŒ๋งค๊ฐ€๊ฒฉ(์ฃผ์œ ์†Œ)_2018.csv')

f18.head()

          ๋ฒˆํ˜ธ          ๊ธฐ๊ฐ„       ์ง€์—ญ      ์ƒํ‘œ  ์…€ํ”„์—ฌ๋ถ€    ํœ˜๋ฐœ์œ 
0  ๊ธฐ์ค€ : ์ผ๊ฐ„(20180601~20180607)  NaN       NaN      NaN   NaN     NaN
1  A0011536  20180601.0  ๊ฐ•์› ๊ฐ•๋ฆ‰์‹œ  SK์—๋„ˆ์ง€   ์…€ํ”„   1579.0
2  A0011536  20180602.0  ๊ฐ•์› ๊ฐ•๋ฆ‰์‹œ  SK์—๋„ˆ์ง€   ์…€ํ”„   1579.0
3  A0011536  20180603.0  ๊ฐ•์› ๊ฐ•๋ฆ‰์‹œ  SK์—๋„ˆ์ง€   ์…€ํ”„   1579.0
4  A0011536  20180604.0  ๊ฐ•์› ๊ฐ•๋ฆ‰์‹œ  SK์—๋„ˆ์ง€   ์…€ํ”„   1579.0
# drop row 0 (a metadata row, not an actual record)
f18 = f18.drop(0)

# per-column null check — no nulls found
f18.isna().sum()

๋ฒˆํ˜ธ      0
๊ธฐ๊ฐ„      0
์ง€์—ญ      0
์ƒํ‘œ      0
์…€ํ”„์—ฌ๋ถ€    0
ํœ˜๋ฐœ์œ      0
dtype: int64

# include='all': also report statistics for categorical columns
f18.describe(include='all')

  • There are 11673 unique ๋ฒˆํ˜ธ (station IDs), each recorded at most 7 times
  • ๊ธฐ๊ฐ„ (period) is being read as a number
  • 229 unique ์ง€์—ญ (region) values — too granular to use directly
  • 9 unique ์ƒํ‘œ (brand) values — a manageable number
  • ์…€ํ”„์—ฌ๋ถ€ (self-service flag) takes 2 values, each covering roughly half the rows
  • The minimum ํœ˜๋ฐœ์œ  (gasoline price) is 0, which cannot be a real price

# ๊ธฐ๊ฐ„์„ datetime ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜
f18['๊ธฐ๊ฐ„'] = f18['๊ธฐ๊ฐ„'].apply(lambda x:pd.to_datetime(str(int(x)))) 
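
Because the raw CSV reads ๊ธฐ๊ฐ„ in as floats like `20180601.0`, the `int → str → to_datetime` chain above recovers a real timestamp. A small sketch on toy values:

```python
import pandas as pd

# Toy series mimicking the raw file: dates parsed as floats
period = pd.Series([20180601.0, 20180607.0])
period = period.apply(lambda x: pd.to_datetime(str(int(x))))

print(period.dt.strftime('%Y-%m-%d').tolist())  # ['2018-06-01', '2018-06-07']
```
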

# check how many whitespace-separated tokens ์ง€์—ญ has, then keep only the first (top-level) one as a new column
region_len = f18['์ง€์—ญ'].apply(lambda x: len(x.split()))
print(f"min: {min(region_len)}, max: {max(region_len)}")

f18['์ง€์—ญ2'] = f18['์ง€์—ญ'].apply(lambda x: x.split()[0])
import collections
collections.Counter(f18['์ง€์—ญ2'])

# ํœ˜๋ฐœ์œ ๊ฐ’ 0์ธ ROW ํ™•์ธ
f18.loc[f18['ํœ˜๋ฐœ์œ ']==0].head(10) 

f18.loc[f18['๋ฒˆํ˜ธ']=='A0010629']

# ํœ˜๋ฐœ์œ ๊ฐ’ 0์ธ ROW ์ œ๊ฑฐ
f18 = f18.loc[f18['ํœ˜๋ฐœ์œ ']!=0,:]

f18.describe(include='all',datetime_is_numeric=True)

# per-station consistency check (assumes attributes did not change during the 7 days)
unique_count = f18.groupby('๋ฒˆํ˜ธ')[['์ง€์—ญ','์ƒํ‘œ','์…€ํ”„์—ฌ๋ถ€']].nunique()
unique_count.head()

target = unique_count.loc[(unique_count!=1).sum(axis=1)!=0]
target

f18.loc[f18['๋ฒˆํ˜ธ'].isin(target.index)]

f18 = f18.loc[~f18['๋ฒˆํ˜ธ'].isin(target.index)]

# ์ฃผ์œ ์†Œ๋ณ„ ๋ฐ์ดํ„ฐ ํ†ตํ•ฉ
f18 = f18.groupby('๋ฒˆํ˜ธ')\
    .agg({'์ง€์—ญ':'first','์ง€์—ญ2':'first','์ƒํ‘œ':'first','์…€ํ”„์—ฌ๋ถ€':'first','ํœ˜๋ฐœ์œ ':'mean'})\
    .reset_index() 
    
f18.describe(include='all')
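
The consistency filter and per-station aggregation above can be sketched on hypothetical toy data, where station 'B' changes brand during the week and is therefore dropped:

```python
import pandas as pd

toy = pd.DataFrame({
    '๋ฒˆํ˜ธ': ['A', 'A', 'B', 'B'],
    '์ƒํ‘œ': ['SK์—๋„ˆ์ง€', 'SK์—๋„ˆ์ง€', 'SK์—๋„ˆ์ง€', 'GS์นผํ…์Šค'],
    'ํœ˜๋ฐœ์œ ': [1500.0, 1510.0, 1600.0, 1620.0],
})

# stations whose attributes are not constant over the period
uc = toy.groupby('๋ฒˆํ˜ธ')[['์ƒํ‘œ']].nunique()
bad = uc.loc[(uc != 1).sum(axis=1) != 0].index

# keep consistent stations and collapse each to one row (mean price)
clean = toy.loc[~toy['๋ฒˆํ˜ธ'].isin(bad)]
agg = clean.groupby('๋ฒˆํ˜ธ').agg({'์ƒํ‘œ': 'first', 'ํœ˜๋ฐœ์œ ': 'mean'}).reset_index()
print(agg)  # one row: station A with mean price 1505.0
```
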

 

1.2. Wrapping Cleansing and Feature Engineering in a Function and Applying It to All Years

def preprocess(df):
    df_copy = df.copy()  # snapshot before filtering

    df = df.drop(0)
    df['๊ธฐ๊ฐ„'] = df['๊ธฐ๊ฐ„'].apply(lambda x: pd.to_datetime(str(int(x))))
    df['์ง€์—ญ2'] = df['์ง€์—ญ'].apply(lambda x: x.split()[0])
    df = df.loc[df['ํœ˜๋ฐœ์œ ']!=0,:]
    unique_count = df.groupby('๋ฒˆํ˜ธ')[['์ง€์—ญ','์ƒํ‘œ','์…€ํ”„์—ฌ๋ถ€']].nunique()
    target = unique_count.loc[(unique_count!=1).sum(axis=1)!=0,:]
    df = df.loc[~df['๋ฒˆํ˜ธ'].isin(target.index),:]
    df = df.groupby('๋ฒˆํ˜ธ')\
        .agg({'์ง€์—ญ':'first','์ง€์—ญ2':'first','์ƒํ‘œ':'first','์…€ํ”„์—ฌ๋ถ€':'first','ํœ˜๋ฐœ์œ ':'mean'})\
        .reset_index()

    out = set(df_copy['๋ฒˆํ˜ธ']).difference(set(df['๋ฒˆํ˜ธ']))  # station IDs removed by filtering
    return df, out
    
f_dict = dict()
out_all = set()  # station IDs flagged as anomalous in any year
for year in range(2018, 2022):
    df = pd.read_csv(f'./data/๊ณผ๊ฑฐ_ํŒ๋งค๊ฐ€๊ฒฉ(์ฃผ์œ ์†Œ)_{year}.csv')
    f_dict[year], out = preprocess(df)
    out_all.update(out)

 

1.3. Outer Join of the Yearly Data

# join key: every shared column except the price
key = list(f_dict[2018].columns)
key.remove('ํœ˜๋ฐœ์œ ')
print(key)

m1 = pd.merge(f_dict[2018],f_dict[2019],on=key,how='outer',suffixes=('_2018', '_2019'))
m2 = pd.merge(f_dict[2020],f_dict[2021],on=key,how='outer',suffixes=('_2020', '_2021'))
m = pd.merge(m1,m2,on=key,how='outer')
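
An outer join keeps stations that exist in only one of the two years, leaving NaN in the other year's price column. A minimal sketch with hypothetical per-year frames:

```python
import pandas as pd

# Station 'B' exists only in 2019, so its 2018 price comes back as NaN
y18 = pd.DataFrame({'๋ฒˆํ˜ธ': ['A'], 'ํœ˜๋ฐœ์œ ': [1500.0]})
y19 = pd.DataFrame({'๋ฒˆํ˜ธ': ['A', 'B'], 'ํœ˜๋ฐœ์œ ': [1550.0, 1600.0]})

# suffixes disambiguate the non-key ํœ˜๋ฐœ์œ  column from each year
merged = pd.merge(y18, y19, on='๋ฒˆํ˜ธ', how='outer', suffixes=('_2018', '_2019'))
print(merged.columns.tolist())  # ['๋ฒˆํ˜ธ', 'ํœ˜๋ฐœ์œ _2018', 'ํœ˜๋ฐœ์œ _2019']
```
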

m.head()

m.groupby('๋ฒˆํ˜ธ').size().sort_values(ascending=False).head()

m.loc[m['๋ฒˆํ˜ธ']=='A0019752']

# count station IDs that appear in more than one merged row
(m.groupby('๋ฒˆํ˜ธ').size()>1).sum()

# some stations changed brand between years; drop ์ƒํ‘œ from the join key and merge again
key.remove('์ƒํ‘œ')
key

m1 = pd.merge(f_dict[2018],f_dict[2019],on=key,how='outer',suffixes=('_2018', '_2019'))
m2 = pd.merge(f_dict[2020],f_dict[2021],on=key,how='outer',suffixes=('_2020', '_2021'))
m = pd.merge(m1,m2,on=key,how='outer')

m.head()

size = m.groupby('๋ฒˆํ˜ธ').size().sort_values(ascending=False)
size.head()

# stations still duplicated (another key column changed across years) — inspect, then drop
target = size[size>1].index
m.loc[m['๋ฒˆํ˜ธ'].isin(target)].sort_values('๋ฒˆํ˜ธ')

m = m.loc[~m['๋ฒˆํ˜ธ'].isin(target)]
m.groupby('๋ฒˆํ˜ธ').size().sort_values(ascending=False).head()

# ์ด์ƒ์น˜ ๋ฐœ๊ฒฌ๋˜์—ˆ๋˜ ์ฃผ์œ ์†Œ ํ•„ํ„ฐ๋ง
m = m.loc[[x not in out_all for x in m['๋ฒˆํ˜ธ']]]

m.head()

 

 

 

2. Station Openings and Closures: Comparing Station IDs Across Years

2.1. Counting Openings and Closures per Year

id_dict=dict()
for year in range(2018,2022):
    id_dict[year] = set(m.loc[~m[f'์ƒํ‘œ_{year}'].isna()]['๋ฒˆํ˜ธ'].unique())
    
diff_dict=dict()
for year in range(2018,2021):
    opened = len(id_dict[year+1].difference(id_dict[year]))
    closed = len(id_dict[year].difference(id_dict[year+1]))
    diff_dict[f'{year}_{year+1}']=[opened,closed]
diff_df = pd.DataFrame(diff_dict,index=['OPENED','CLOSED'])

diff_df

diff_df.plot()

diff_df.T.plot(color=['r','b'])
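
The opened/closed counts rest on plain set differences between consecutive years' ID sets: an ID present only in the later year counts as opened, one present only in the earlier year as closed. A sketch with hypothetical IDs:

```python
# Hypothetical station-ID sets for two consecutive years
ids_2018 = {'A0001', 'A0002', 'A0003'}
ids_2019 = {'A0002', 'A0003', 'A0004', 'A0005'}

opened = ids_2019.difference(ids_2018)  # present only in 2019
closed = ids_2018.difference(ids_2019)  # present only in 2018

print(len(opened), len(closed))  # 2 1
```
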

 

 

[Problem 1] Find the number of self-service stations that newly opened in 2020.

id_dict=dict()
for year in range(2018,2022):
    id_dict[year] = set(m.loc[(~m[f'์ƒํ‘œ_{year}'].isna())&(m['์…€ํ”„์—ฌ๋ถ€']=='์…€ํ”„')]['๋ฒˆํ˜ธ'].unique())
diff_dict=dict()
for year in range(2018,2021):
    opened = len(id_dict[year+1].difference(id_dict[year]))
    closed = len(id_dict[year].difference(id_dict[year+1]))
    diff_dict[f'{year}_{year+1}']=[opened,closed]
diff_df = pd.DataFrame(diff_dict,index=['OPENED','CLOSED'])
diff_df

# store the number of self-service stations newly opened in 2020 in quiz_1
# save it as a number
quiz_1 = diff_df['2019_2020']['OPENED']

 

 

 

3. Brand Analysis: Price Competitiveness and Market Share by Brand

3.1. Line Plot of Prices by Major Brand

brand_price_dict=dict()
for year in range(2018,2022):
    brand_price_dict[str(year)]=m.groupby(f'์ƒํ‘œ_{year}')[f'ํœ˜๋ฐœ์œ _{year}'].mean()
    
brand_price_df = pd.DataFrame(brand_price_dict)
brand_price_df

brand_price_df = brand_price_df.drop('SK๊ฐ€์Šค')
brand_price_df.T.plot(figsize=(10,5))

 

3.2. Stacked Bar Plot and Heatmap of Market Share by Major Brand over the Last 4 Years

brand_share_dict=dict()
for year in range(2018,2022):
    brand_share_dict[str(year)]=m.groupby(f'์ƒํ‘œ_{year}').size()
    
brand_share_df = pd.DataFrame(brand_share_dict)
brand_share_df

brand_share_df = brand_share_df.drop('SK๊ฐ€์Šค')
# divide each count by its year's column total to get market shares
brand_ratio_df = brand_share_df.apply(lambda x: x/brand_share_df.sum(), axis=1)
brand_ratio_df = brand_ratio_df.sort_values('2018', ascending=False)

brand_ratio_df
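
The `apply(axis=1)` above works because pandas aligns the row passed in with the column-sum Series by year label, so every cell ends up divided by its year's total; it is equivalent to the shorter `counts / counts.sum()`. A sketch on hypothetical counts:

```python
import pandas as pd

counts = pd.DataFrame({'2018': [60, 40], '2019': [70, 30]},
                      index=['SK์—๋„ˆ์ง€', 'GS์นผํ…์Šค'])

# per-row division still aligns on the year labels, giving column-wise shares
ratio = counts.apply(lambda x: x / counts.sum(), axis=1)

print(ratio.loc['SK์—๋„ˆ์ง€', '2018'])  # 0.6
```
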

brand_ratio_df.T.plot(kind='bar',stacked=True,rot=0,figsize=(10,5))
plt.legend(bbox_to_anchor=(1, 1))

 

plt.figure(figsize=(10,5))
sns.heatmap(brand_ratio_df, cmap= 'RdBu_r', linewidths=1, linecolor='black',annot=True)

 

[Problem 2] Split the 2019 stations into self-service and regular (์ผ๋ฐ˜) stations and find the proportion of regular stations.

self_share_dict = m.loc[~m['์ƒํ‘œ_2019'].isna()].groupby('์…€ํ”„์—ฌ๋ถ€').size()
self_ratio_dict = self_share_dict/self_share_dict.sum()
self_ratio_dict

# store the 2019 market share of regular (์ผ๋ฐ˜) stations in quiz_2
# round the ratio to two decimal places and submit it as a number
quiz_2 = round(self_ratio_dict['์ผ๋ฐ˜'],2)
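
The quiz boils down to a groupby-size followed by normalization. A sketch on hypothetical rows:

```python
import pandas as pd

toy = pd.DataFrame({'์…€ํ”„์—ฌ๋ถ€': ['์…€ํ”„', '์ผ๋ฐ˜', '์ผ๋ฐ˜', '์ผ๋ฐ˜']})

share = toy.groupby('์…€ํ”„์—ฌ๋ถ€').size()  # counts per category
ratio = share / share.sum()              # shares summing to 1

print(round(ratio['์ผ๋ฐ˜'], 2))  # 0.75
```
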

 

 

 

 

4. Price Analysis: Price Variation by Station and Region

4.1. Boxplot of the Price Distribution

sns.boxplot(x=m['ํœ˜๋ฐœ์œ _2021'])

 

4.2. Boxplots of the Price Distribution by Region (Multiple Columns)

# order the regions by their median 2021 price, highest first
boxplot_order = m.groupby('์ง€์—ญ2')['ํœ˜๋ฐœ์œ _2021'].median().sort_values(ascending=False).index
plt.figure(figsize=(15,7))
sns.boxplot(x="์ง€์—ญ2", y="ํœ˜๋ฐœ์œ _2021", data=m, orient="v", order=boxplot_order)
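
Sorting the boxplots by each region's median makes the price gradient easier to read; the `order` index passed to seaborn is computed the same way here on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({'์ง€์—ญ2': ['์„œ์šธ', '์„œ์šธ', '๋ถ€์‚ฐ', '๋ถ€์‚ฐ'],
                    'ํœ˜๋ฐœ์œ _2021': [2000.0, 2100.0, 1800.0, 1900.0]})

# regions sorted by median price, highest first — usable as seaborn's `order=`
order = toy.groupby('์ง€์—ญ2')['ํœ˜๋ฐœ์œ _2021'].median() \
           .sort_values(ascending=False).index

print(list(order))  # ['์„œ์šธ', '๋ถ€์‚ฐ']
```
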

 

 
