LEE_BOMB 2021. 11. 25. 17:56
groupby

์ง‘๋‹จ๋ณ€์ˆ˜ (=๋ฒ”์ฃผํ˜• ๋ฒˆ์ˆ˜, ์นดํ…Œ๊ณ ๋ฆฌ)๋ฅผ ์ด์šฉํ•œ ์ž๋ฃŒ ์ฒ˜๋ฆฌ

1. ์ง‘๋‹จ๋ณ€์ˆ˜ ๊ธฐ์ค€ subset ๋งŒ๋“ค๊ธฐ
2. ์ง‘๋‹จ๋ณ€์ˆ˜ ๊ธฐ์ค€ ๊ทธ๋ฃน, ํ†ต๊ณ„๋Ÿ‰ ๊ตฌํ•˜๊ธฐ
3. ๊ทธ๋ฃน๊ณผ ํ†ต๊ณ„๋Ÿ‰ ์‹œ๊ฐํ™”



import pandas as pd #csv file
import matplotlib.pyplot as plt #๊ทธ๋ฃน๊ณผ ํ†ต๊ณ„๋Ÿ‰ ์‹œ๊ฐํ™”

 

dataset load & ๋ณ€์ˆ˜ ํ™•์ธ

wine = pd.read_csv(r'C:\ITWILL\4_Python-2\data\winequality_both.csv')
wine.info()


๊ณต๋ฐฑ -> '_' ๊ต์ฒด

wine.columns = wine.columns.str.replace(' ','_')
wine.columns

wine.fixed_acidity


5๊ฐœ ๋ณ€์ˆ˜ ์„ ํƒ : subset ๋งŒ๋“ค๊ธฐ

wine_df = wine.iloc[:,[0,1,4,11,12]] #iloc : ์œ„์น˜๊ธฐ๋ฐ˜ ํ–‰๋ ฌ ์„ ํƒ / ํ–‰ ์ „์ฒด ์„ ํƒ / ์—ด ์ƒ‰์ธ
wine_df.info()

0   type            6497 non-null   object 
1   fixed_acidity   6497 non-null   float64
2   residual_sugar  6497 non-null   float64
3   alcohol         6497 non-null   float64
4   quality         6497 non-null   int64  
  
์นผ๋Ÿผ๋ช… ๋ณ€๊ฒฝ (์ „์ฒด๋ณ€๊ฒฝ)

wine_df.columns = ['type', 'acidity', 'sugar', 'alcohol', 'quality']


์นผ๋Ÿผ๋ช… ๋ณ€๊ฒฝ (์ผ๋ถ€๋ถ„ ๋ณ€๊ฒฝ) : {'old':'new'}

columns = {'fixed_acidity' : 'acidity', 'residual_sugar' : 'sugar'}
wine_df = wine_df.rename(columns = columns)
wine_df.info()

 0   type     6497 non-null   object 
 1   acidity  6497 non-null   float64 -> ๋ณ€๊ฒฝ
 2   sugar    6497 non-null   float64 -> ๋ณ€๊ฒฝ
 3   alcohol  6497 non-null   float64
 4   quality  6497 non-null   int64

์ง‘๋‹จ๋ณ€์ˆ˜ ํ™•์ธ : ์™€์ธ ์œ ํ˜•

wine_df.type.unique() #array(['red', 'white'], dtype=object)
wine_df.type.nunique() #2
wine_df['type'].value_counts()

white    4898
red      1599

์ด์‚ฐ๋ณ€์ˆ˜ ํ™•์ธ : ์™€์ธ ํ’ˆ์งˆ

wine_df.quality.unique() #[5, 6, 7, 4, 8, 3, 9]
wine_df.quality.value_counts()

6    2836 -> 6์˜ ํ€„๋ฆฌํ‹ฐ๋ฅผ ๊ฐ€์ง„ ์™€์ธ์˜ ๋นˆ๋„์ˆ˜๊ฐ€ ์ œ์ผ ๋งŽ๋‹ค
5    2138
7    1079
4     216
8     193
3      30
9       5



1. ์ง‘๋‹จ๋ณ€์ˆ˜ ๊ธฐ์ค€ subset ๋งŒ๋“ค๊ธฐ
1) 1๊ฐœ ์ง‘๋‹จ ๊ธฐ์ค€

red_wine = wine_df[wine['type']=='red'] #red์™€์ธ ์„ ํƒ
red_wine.shape() #(1599, 5)

red_wine.head()


2) 2๊ฐœ ์ด์ƒ ์ง‘๋‹จ ๊ธฐ์ค€

two_wine = wine_df[wine['type'].isin(['red', 'white'])] #type ์นผ๋Ÿผ์˜ red or white ์กฐ๊ฑด ์ง€์ •ํ•ด subset ๋งŒ๋“ค๊ธฐ
two_wine.shape #(6497, 5)
two_wine.head() #red์™€์ธ ํ™•์ธ
two_wine.tail() #white์™€์ธ ํ™•์ธ

two_wine['type'].value_counts() #white 4898 / red 1599


3) ์ง‘๋‹จ๋ณ€์ˆ˜ ๊ธฐ์ค€ ํŠน์ • ๋ณ€์ˆ˜ ์„ ํƒ : 1์ฐจ์›
์ง‘๋‹จ1 : ๋ ˆ๋“œ์™€์ธ ํ’ˆ์งˆ

red_wine_quality = wine_df.loc[wine['type']=='red', 'quality'] #๋ช…์นญ ๊ธฐ๋ฐ˜ ํ–‰๋ ฌ ์„ ํƒ
red_wine_quality.shape #(1599,) : 2์ฐจ์› -> 1์ฐจ์›. quality๊ฐ€ ๊ฐ€์ง„ vector๊ธธ์ด๋งŒ ๋ฐ˜ํ™˜


์ง‘๋‹จ2 : ํ™”์ดํŠธ์™€์ธ ํ’ˆ์งˆ

white_wine_quality = wine_df.loc[wine['type']=='white', 'quality']
white_wine_quality.shape #(4898,)




2. ์ง‘๋‹จ๋ณ€์ˆ˜ ๊ธฐ์ค€ ๊ทธ๋ฃน & ํ†ต๊ณ„๋Ÿ‰
1) ์ง‘๋‹จ๋ณ€์ˆ˜ 1๊ฐœ ์ด์šฉ ๊ทธ๋ฃนํ™”
ํ˜•์‹) DF.groupby('์ง‘๋‹จ๋ณ€์ˆ˜')

type_group = wine_df.groupby('type')
print(type_group) #DataFrameGroupBy object

dir(type_group) #ํ˜ธ์ถœ ๊ฐ€๋Šฅํ•œ ๋ฉ”์„œ๋“œ ํ™•์ธ
type_group.size() #์ง‘๋‹จ ๋ณ„ ๋นˆ๋„์ˆ˜

red      1599
white    4898

2) ์ง‘๋‹จ๋ณ„ ํ†ต๊ณ„๋Ÿ‰ : group + apply

type_group.mean()

        acidity     sugar    alcohol   quality -> ํ‰๊ท ์— ๋Œ€ํ•œ ํ†ต๊ณ„๋Ÿ‰ ์ œ๊ณต
type                                           -> type์— ์˜ํ•ด ๋‘ ์ง‘๋‹จ์œผ๋กœ ๋‚˜๋‰จ
red    8.319637  2.538806  10.422983  5.636023
white  6.854788  6.391415  10.514267  5.877909

type_group.std() #ํ‘œ์ค€ํŽธ์ฐจ

        acidity     sugar   alcohol   quality
type                                         
red    1.741096  1.409928  1.065668  0.807569
white  0.843868  5.072058  1.230621  0.885639

ํŠน์ • ๋ณ€์ˆ˜ ๋Œ€์ƒ ํ†ต๊ณ„๋Ÿ‰

type_group['quality'].mean()
type_group['quality'].describe() #์š”์•ฝํ†ต๊ณ„๋Ÿ‰


2) ์ง‘๋‹จ๋ณ€์ˆ˜ 2๊ฐœ ์ด์šฉ
ํ˜•์‹) DF.groupby(['์ง‘๋‹จ๋ณ€์ˆ˜1', '์ง‘๋‹จ๋ณ€์ˆ˜2']) : 1์ฐจ -> 2์ฐจ

wine_group = wine_df.groupby(['type','quality'])
wine_group.size()

group_size = wine_group.size()
group_size.shape #(13,) : 1์ฐจ์›


1์ฐจ์› -> 2์ฐจ์›

group_size_2d = group_size.unstack()
group_size_2d

quality     3      4       5       6      7      8    9
type                                                   
red      10.0   53.0   681.0   638.0  199.0   18.0  NaN -> ๊ฒฐ์ธก์น˜ ๋ฐœ์ƒ
white    20.0  163.0  1457.0  2198.0  880.0  175.0  5.0



3. ๊ทธ๋ฃน & ํ†ต๊ณ„๋Ÿ‰ ์‹œ๊ฐํ™”
1) ์™€์ธ ์œ ํ˜•๋ณ„ ํ’ˆ์งˆ : ๋นˆ๋„์ˆ˜ ๊ธฐ์ค€

type(group_size_2d) #pandas.core.frame.DataFrame

group_size_2d.plot(kind='barh', stacked=True,
                   title='red vs white wind quality')
plt.show

 

2) ๊ทธ๋ฃน ํ†ต๊ณ„ ์‹œ๊ฐํ™”

type_mean = type_group.mean()
type_mean[['sugar', 'alcohol']].plot(kind='bar')
plt.show()

 

 

 

 

 

apply

๊ทธ๋ฃน ๊ฐ์ฒด์— ์™ธ๋ถ€ํ•จ์ˆ˜ ์ ์šฉ(apply)
1. group.apply (์™ธ๋ถ€ํ•จ์ˆ˜)
group.member() vs gropy.apply(์™ธ๋ถ€ํ•จ์ˆ˜)
2. group.agg ([์™ธ๋ถ€ํ•จ์ˆ˜1, ์™ธ๋ถ€ํ•จ์ˆ˜2, ...])

dataset load
import seaborn as sns

iris = sns.load_dataset('iris')
iris.info()


ํŠน์ • ์นผ๋Ÿผ 1๊ฐœ ๋Œ€์ƒ ๊ทธ๋ฃน ๊ฐ์ฒด

iris_grp = iris['sepal_length'].groupby(iris['species'])

iris_grp.size()

setosa        50
versicolor    50
virginica     50

dir(iris_grp)
iris_grp.mean() #group.member()

setosa        5.006
versicolor    5.936
virginica     6.588

์‚ฌ์šฉ์ž ์ •์˜ ํ•จ์ˆ˜

def avg(group) :
    return group.mean()

def avg(diff) :
    return group.max() - group.min()

 



1. group.apply(์™ธ๋ถ€ํ•จ์ˆ˜)

iris_grp.apply(sum) #๋‚ด์žฅํ•จ์ˆ˜
iris_grp.apply(avg) #์‚ฌ์šฉ์žํ•จ์ˆ˜
iris_grp.apply(max) #๋‚ด์žฅํ•จ์ˆ˜
iris_grp.apply(min) #๋‚ด์žฅํ•จ์ˆ˜




2. group.agg(['๋‚ด์žฅํ•จ์ˆ˜', ์‚ฌ์šฉ์ž ์ •์˜ํ•จ์ˆ˜, '๋‚ด๋ถ€ํ•จ์ˆ˜'])

iris_grp.agg(['sum', avg, 'max', 'min', diff])

              sum    avg  max  min
species                           
setosa      250.3  5.006  5.8  4.3
versicolor  296.8  5.936  7.0  4.9
virginica   329.4  6.588  7.9  4.9

 

 

 

 

pivot_table

ํ”ผ๋ฒ—ํ…Œ์ด๋ธ”(pivot table) 
DF ๊ฐ์ฒด๋ฅผ ๋Œ€์ƒ์œผ๋กœ ํ–‰๊ณผ ์—ด ๊ทธ๋ฆฌ๊ณ  ๊ต์ฐจ ์…€์— ํ‘œ์‹œ๋  ์นผ๋Ÿผ์„ ์ง€์ •ํ•˜์—ฌ ๋งŒ๋“ค์–ด์ง„ ํ…Œ์ด๋ธ” 
ํ˜•์‹) pivot_table(DF, values='๊ต์ฐจ์…€ ์นผ๋Ÿผ',
index = 'ํ–‰ ์นผ๋Ÿผ', columns = '์—ด ์นผ๋Ÿผ'
,aggFunc = '๊ต์ฐจ์…€์— ์ ์šฉ๋  ํ•จ์ˆ˜')   
        

import pandas as pd



csv file load 

pivot_data = pd.read_csv(r'C:\ITWILL\4_Python-2\data\pivot_data.csv')
pivot_data.info()

 0   year     8 non-null      int64  : ๋…„๋„
 1   quarter  8 non-null      object : ๋ถ„๊ธฐ 
 2   size     8 non-null      object : ๋งค์ถœ๊ทœ๋ชจ
 3   price    8 non-null      int64  : ๋งค์ถœ์•ก 

ํ”ผ๋ฒ—ํ…Œ์ด๋ธ”(pivot table) 

ptable = pd.pivot_table(pivot_data, values='price',
                index = ['year','quarter'], columns = 'size'
                ,aggfunc= 'sum')

print(ptable)

size          LARGE  SMALL
year quarter              
2016 1Q        2000   1000
     2Q        2500   1200
2017 3Q        2200   1300
     4Q        2800   2300

type(ptable) #pandas.core.frame.DataFram

ptable.plot(kind='barh', stacked=True)