Problem 1) Using the titanic dataset, run a chi-square test as follows.
<Step 1> Build a cross-tabulation from the survival (survived) and social class (pclass) variables
<Step 2> Print the chi-square test statistic, p-value, degrees of freedom, and expected frequencies
<Step 3> Interpret the result of the hypothesis test

 

import seaborn as sn
import pandas as pd
from scipy import stats # probability distributions and statistical tests


Load the titanic dataset

titanic = sn.load_dataset('titanic')
print(titanic.info())



1. Cross-tabulation

tab = pd.crosstab(index=titanic.survived, 
                  columns=titanic.pclass)
print(tab)

pclass      1   2    3
survived              
0          80  97  372
1         136  87  119

 


2. Chi-square test statistic, p-value, degrees of freedom, expected frequencies

chi2, pvalue, df, evalue = stats.chi2_contingency(observed=tab) # two-way chi-square test of independence

print('chi2 = %.6f, pvalue = %.8f, d.f = %d'%(chi2, pvalue, df))
# chi2 = 102.888989, pvalue = 0.00000000, d.f = 2

pvalue # 4.549251711298793e-23
print(evalue) # expected frequencies under independence

 


3. Interpretation of the hypothesis test
The p-value (about 4.5e-23) is far below 0.05, so the null hypothesis of independence is rejected: survival and social class (pclass) are significantly associated.
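To make the expected frequencies and the test statistic concrete, the sketch below recomputes them by hand from the counts in the cross-tabulation printed above. The array is typed in from that output rather than loaded through seaborn, and the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the cross-tabulation above
# rows: survived = 0, 1; columns: pclass = 1, 2, 3
observed = np.array([[80, 97, 372],
                     [136, 87, 119]])

row_tot = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_tot = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_tot @ col_tot / n

# Pearson chi-square statistic and its degrees of freedom
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)  # upper-tail probability

# Cross-check against scipy
chi2, pvalue, df, evalue = stats.chi2_contingency(observed)
print(round(chi2_manual, 6), dof, p_manual)
print(np.allclose(expected, evalue))
```

The manual statistic should agree with the chi2 value reported above, since no continuity correction is applied when the degrees of freedom exceed 1.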

 

 

 

 

 

Problem 2) Using the winequality-both.csv dataset, do the following.

<Condition 1> Build a cross-tabulation from the quality and type columns
<Condition 2> Sort the cross-tabulation in descending order of the white wine counts
<Condition 3> Two-sample mean test on quality for red vs. white wine -> print each group's mean
<Condition 4> Print the correlation coefficients between the alcohol column and the other columns

 

import pandas as pd
import os 
from scipy import stats


os.chdir('c:/itwill/4_python-ii/data')
wine = pd.read_csv('winequality-both.csv')
print(wine.info())



<Condition 1> Build a cross-tabulation from the quality and type columns

wine_tab = pd.crosstab(index=wine['quality'], 
                       columns=wine['type'])

print(wine_tab)

type     red  white
quality            
3         10     20
4         53    163
5        681   1457
6        638   2198
7        199    880
8         18    175
9          0      5

<Condition 2> Sort the cross-tabulation in descending order of the white wine counts

wine_tab_sort = wine_tab.sort_values('white', ascending=False)
print(wine_tab_sort)

type     red  white
quality            
6        638   2198
5        681   1457
7        199    880
8         18    175
4         53    163
3         10     20
9          0      5

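<Condition 3> Two-sample mean test on quality for red vs. white wine

red_wine = wine.loc[wine['type']=='red', 'quality']
white_wine = wine.loc[wine['type']=='white', 'quality']
two_sample = stats.ttest_ind(red_wine, white_wine) # independent two-sample t-test (equal variances assumed)

print(two_sample)
print('t statistic = %.3f, pvalue = %.5f'%(two_sample.statistic, two_sample.pvalue))
print('red mean = %.3f, white mean = %.3f'%(red_wine.mean(), white_wine.mean())) # mean of each group

* t statistic = -9.686, pvalue = 0.00000 < 0.05, so the mean quality of red and white wines differs significantly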
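The quality scores of each group can be rebuilt from the counts in the cross-tabulation above, which allows the test to be rerun without the CSV file. The sketch below does that, prints the group means asked for in Condition 3, and also runs Welch's t-test (equal_var=False) as a variant that does not assume equal variances. Welch's test is an addition for comparison, not part of the original solution.

```python
import numpy as np
from scipy import stats

# Counts copied from the quality x type cross-tabulation above
quality = np.array([3, 4, 5, 6, 7, 8, 9])
red_counts = np.array([10, 53, 681, 638, 199, 18, 0])
white_counts = np.array([20, 163, 1457, 2198, 880, 175, 5])

# Expand the counts back into one quality score per wine
red = np.repeat(quality, red_counts)
white = np.repeat(quality, white_counts)

print('red mean = %.3f, white mean = %.3f' % (red.mean(), white.mean()))

# Pooled-variance t-test (the default used in the solution above)
pooled = stats.ttest_ind(red, white)
# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(red, white, equal_var=False)

print('pooled t = %.3f' % pooled.statistic)
print('Welch  t = %.3f, pvalue = %.3g' % (welch.statistic, welch.pvalue))
```

Both versions lead to the same conclusion here; the group sizes differ a lot (1599 vs. 4898), which is the usual reason to check the Welch variant as well.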

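<Condition 4> Print the correlation coefficients between the alcohol column and the other columns

corr = wine.corr(numeric_only=True) # skip the non-numeric type column (needed in pandas 2.x)
print(corr)


print(corr['alcohol'])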
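The alcohol column of the correlation matrix also contains the trivial correlation of alcohol with itself. A minimal sketch of dropping that entry and ranking the rest is shown below; the toy DataFrame is hypothetical and only stands in for the wine table, so its numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine table
toy = pd.DataFrame({
    'type': ['red', 'red', 'white', 'white', 'white'],
    'alcohol': [9.4, 9.8, 10.1, 11.2, 12.8],
    'density': [0.9978, 0.9968, 0.9951, 0.9940, 0.9912],
    'quality': [5, 5, 6, 6, 7],
})

# numeric_only=True leaves out the string column 'type'
corr = toy.corr(numeric_only=True)

# Remove the self-correlation and sort from most positive to most negative
alcohol_corr = corr['alcohol'].drop('alcohol').sort_values(ascending=False)
print(alcohol_corr)
```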

 

 

 

 

 

 

 

Problem 3) Using the score dataset, carry out a hypothesis test with a simple linear regression model.

Null hypothesis : academy has no effect on score.
Alternative hypothesis : academy has an effect on score.

<Condition 1> y variable : score, x variable : academy
<Condition 2> Fit the regression model and check the results (regression coefficient, explanatory power, pvalue, standard error)
<Condition 3> Visualize the data with the regression line

 

from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt  # visualization of the regression model


Load the dataset

score = pd.read_csv(r'c:/itwill/4_python-2/data/score_iq.csv')
print(score.info())
print(score.head())



1. Variable selection

x = score.academy # independent variable
y = score.score # dependent variable

 


2. Simple linear regression (stats)

model = stats.linregress(x, y)
print(model)
print('x slope : ', model.slope)
print('y intercept :', model.intercept)
print('explanatory power (R^2) : ', model.rvalue**2) # rvalue itself is the correlation coefficient r
print('p-value : ', model.pvalue) # p-value of the test that the slope is zero
print('x standard error :' , model.stderr)
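The fields returned by linregress are linked: the explanatory power is the square of rvalue, and the p-value comes from a t-test of the slope with n - 2 degrees of freedom. The sketch below checks both relations on a small made-up sample; the x and y values are hypothetical and are not taken from score_iq.csv.

```python
import numpy as np
from scipy import stats

# Hypothetical data; not the score_iq.csv values
x = np.array([0, 1, 1, 2, 2, 3, 3, 4], dtype=float)
y = np.array([55, 62, 60, 68, 71, 75, 79, 85], dtype=float)

model = stats.linregress(x, y)

# Explanatory power: coefficient of determination
r_squared = model.rvalue ** 2

# t statistic of the slope and its two-sided p-value (n - 2 degrees of freedom)
t_stat = model.slope / model.stderr
dof = len(x) - 2
p_manual = 2 * stats.t.sf(abs(t_stat), dof)

print('R^2 =', r_squared)
print('t =', t_stat)
print('p (manual) =', p_manual, ' p (linregress) =', model.pvalue)
```

If the p-value is below 0.05, the null hypothesis that academy has no effect on score would be rejected.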

 

 

3. Regression line visualization

a = model.slope
b = model.intercept
y_pred = (x*a) + b # predicted values


Scatter plot

plt.plot(score['academy'], score['score'], 'b.') # blue points: observed data
plt.plot(score['academy'], y_pred, 'r.-') # red line: fitted values
plt.title('regression plotting') # title
plt.legend(['x y scatter', 'regression line']) # legend
plt.show()

 

 

 

 

 

 

Problem 4) Using the iris.csv dataset, build a multiple linear regression model.

<Condition 1> Replace the '.' in the column names with '_'
iris.columns = iris.columns.str.replace('.', '_', regex=False)
<Condition 2> Build the model formula
y variable : 1st column, x variables : 2nd ~ 3rd columns
<Condition 3> Check the regression coefficients
<Condition 4> Check and interpret the regression results : use the summary() function

   

import pandas as pd
import statsmodels.formula.api as sm # multiple regression model


Load the dataset

iris = pd.read_csv('c:/itwill/4_python-ii/data/iris.csv')
print(iris.head())



1. Rename the iris columns

iris.columns = iris.columns.str.replace('.', '_', regex=False) # regex=False: treat '.' as a literal dot
print(iris.info())



2. Build the formula and fit the multiple regression model

iris_model = sm.ols(formula="Sepal_Length ~ Sepal_Width+Petal_Length", data=iris).fit()
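Condition 2 describes the formula by column position (1st column as y, 2nd and 3rd as x), so the formula string can also be assembled from iris.columns instead of being typed out. The sketch below shows one way to do that; the two-row DataFrame is a hypothetical stand-in for iris.csv, and its dotted column names are assumed from the formula used above.

```python
import pandas as pd

# Hypothetical two-row stand-in for iris.csv with dotted column names (assumed)
iris = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2, 'setosa'], [7.0, 3.2, 4.7, 1.4, 'versicolor']],
    columns=['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'],
)

# Literal replacement of '.' with '_' so the names are valid in a formula
iris.columns = iris.columns.str.replace('.', '_', regex=False)

# y: 1st column, x: 2nd and 3rd columns
cols = iris.columns
formula = cols[0] + ' ~ ' + ' + '.join(cols[1:3])
print(formula)  # Sepal_Length ~ Sepal_Width + Petal_Length
```

The resulting string can be passed as the formula argument of sm.ols in place of the hand-written one.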



3. Check the regression coefficients

print(iris_model.params)

Intercept       2.249140
Sepal_Width     0.595525
Petal_Length    0.471920


4. Check the regression results

print(iris_model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           Sepal_Length   R-squared:                       0.840
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     386.4
Date:                Tue, 30 Nov 2021   Prob (F-statistic):           2.93e-59
Time:                        21:02:08   Log-Likelihood:                -46.513
No. Observations:                 150   AIC:                             99.03
Df Residuals:                     147   BIC:                             108.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        2.2491      0.248      9.070      0.000       1.759       2.739
Sepal_Width      0.5955      0.069      8.590      0.000       0.459       0.733
Petal_Length     0.4719      0.017     27.569      0.000       0.438       0.506
==============================================================================
Omnibus:                        0.164   Durbin-Watson:                   2.021
Prob(Omnibus):                  0.921   Jarque-Bera (JB):                0.319
Skew:                          -0.044   Prob(JB):                        0.853
Kurtosis:                       2.792   Cond. No.                         48.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
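Reading the summary: R-squared is 0.840, the F-test p-value is 2.93e-59, and both predictors have P>|t| reported as 0.000, so the model as a whole and each coefficient are significant at the 0.05 level. The fitted equation is Sepal_Length = 2.2491 + 0.5955 * Sepal_Width + 0.4719 * Petal_Length. As a small illustration, the sketch below applies that equation to one flower; the coefficients are copied from the params output above, while the function name and the input values are hypothetical.

```python
# Coefficients copied from the iris_model.params output above
intercept = 2.249140
b_sepal_width = 0.595525
b_petal_length = 0.471920

def predict_sepal_length(sepal_width, petal_length):
    """Apply the fitted linear equation to one observation."""
    return intercept + b_sepal_width * sepal_width + b_petal_length * petal_length

# Hypothetical flower: Sepal_Width = 3.5, Petal_Length = 1.4
pred = predict_sepal_length(3.5, 1.4)
print(round(pred, 4))
```

With the fitted model object available, iris_model.predict on a DataFrame holding the same two columns should give the same value up to rounding of the coefficients.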
