66. Python Statistics (SciPy) Practice Problems

LEE_BOMB 2021. 11. 30. 21:02

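Problem 1) Using the titanic dataset, run a chi-square test as follows.
<Step 1> Build a contingency table from the survival (survived) and passenger class (pclass) variables
<Step 2> Print the chi-square test statistic, p-value, degrees of freedom, and expected frequencies
<Step 3> Interpret the hypothesis test result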

import seaborn as sn
import pandas as pd
from scipy import stats # probability distributions and tests

Load the titanic dataset

titanic = sn.load_dataset('titanic')
print(titanic.info())

1. Contingency table

tab = pd.crosstab(index=titanic.survived, 
                  columns=titanic.pclass)
print(tab)

pclass      1   2    3
survived
0          80  97  372
1         136  87  119

2. Chi-square test statistic, p-value, degrees of freedom, expected frequencies

chi2, pvalue, df, evalue = stats.chi2_contingency(observed=tab) # two-way chi-square test of independence

print('chi2 = %.6f, pvalue = %.8f, d.f = %d'%(chi2, pvalue, df))
# chi2 = 102.888989, pvalue = 0.00000000, d.f = 2

pvalue # 4.549251711298793e-23
print(evalue) # expected frequencies under independence

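3. Interpretation of the hypothesis test
The p-value (about 4.55e-23) is far below 0.05, so the null hypothesis of independence is rejected: survival and passenger class are significantly associated.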
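To see where the expected frequencies and the degrees of freedom come from, here is a minimal sketch that rebuilds them by hand from the crosstab printed above and cross-checks the result against SciPy. The counts are typed in from that output, so no dataset download is needed; the variable names are my own.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab above
# rows: survived = 0, 1; columns: pclass = 1, 2, 3
observed = np.array([[80, 97, 372],
                     [136, 87, 119]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / n

# Pearson chi-square statistic and degrees of freedom (rows - 1) * (cols - 1)
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

# Cross-check against SciPy
chi2, pvalue, dof, evalue = stats.chi2_contingency(observed)
print(chi2_manual, dof_manual, p_manual)
print(np.allclose(expected, evalue))
```

SciPy applies the Yates continuity correction only when the table has one degree of freedom, so for this 2x3 table the hand calculation and chi2_contingency should agree.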

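Problem 2) Using the winequality-both.csv dataset, do the following.

<Condition 1> Build a contingency table from the quality and type columns
<Condition 2> Sort the contingency table in descending order of the white wine counts
<Condition 3> Two-sample test of mean quality for red vs. white wine -> print each group's mean
<Condition 4> Print the correlation coefficients between alcohol and the other columns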

import pandas as pd
import os 
from scipy import stats


os.chdir('c:/itwill/4_python-ii/data')
wine = pd.read_csv('winequality-both.csv')
print(wine.info())

<Condition 1> Build a contingency table from the quality and type columns

wine_tab = pd.crosstab(index=wine['quality'], 
                       columns=wine['type'])

print(wine_tab)

type     red  white
quality
3         10     20
4         53    163
5        681   1457
6        638   2198
7        199    880
8         18    175
9          0      5

<Condition 2> Sort the contingency table in descending order of the white wine counts

wine_tab_sort = wine_tab.sort_values('white', ascending=False)
print(wine_tab_sort)

type     red  white
quality
6        638   2198
5        681   1457
7        199    880
8         18    175
4         53    163
3         10     20
9          0      5

<Condition 3> Two-sample test of mean quality for red vs. white wine

red_wine = wine.loc[wine['type']=='red', 'quality']
white_wine = wine.loc[wine['type']=='white', 'quality']
two_sample = stats.ttest_ind(red_wine, white_wine) 

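print(two_sample)
print('t-test statistic = %.3f, pvalue = %.5f'%(two_sample.statistic, two_sample.pvalue))
print('red mean = %.3f, white mean = %.3f'%(red_wine.mean(), white_wine.mean())) # mean quality per group

* t-test statistic = -9.686, pvalue = 0.00000 < 0.05, so the mean quality of the two groups differs significantly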
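The test above can be reproduced without the CSV file, because the contingency table already lists how many wines received each quality score. The sketch below expands those counts back into one score per wine, prints the group means, and also runs Welch's t-test (equal_var=False) as a variant that does not assume equal variances. Welch's test is my addition for comparison, not part of the original solution.

```python
import numpy as np
from scipy import stats

# Quality counts copied from the contingency table above (quality 3..9)
quality = np.arange(3, 10)
red_counts = np.array([10, 53, 681, 638, 199, 18, 0])
white_counts = np.array([20, 163, 1457, 2198, 880, 175, 5])

# Expand the counts back into one quality score per wine
red = np.repeat(quality, red_counts)
white = np.repeat(quality, white_counts)

print('red mean = %.3f, white mean = %.3f' % (red.mean(), white.mean()))

# Student's t-test (equal variances assumed), as in the solution above
pooled = stats.ttest_ind(red, white)
# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(red, white, equal_var=False)

print('pooled: t = %.3f, p = %.3g' % (pooled.statistic, pooled.pvalue))
print('welch : t = %.3f, p = %.3g' % (welch.statistic, welch.pvalue))
```

With group sizes as unequal as these (1599 red vs. 4898 white), checking that both variants lead to the same conclusion is a cheap sanity check.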

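<Condition 4> Print the correlation coefficients between alcohol and the other columns

corr = wine.corr(numeric_only=True) # skip the string column 'type'; pandas 2.x raises an error without this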
print(corr)


print(corr['alcohol'])
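When only the alcohol column of the correlation matrix matters, it helps to drop alcohol's correlation with itself and order the rest by strength. The sketch below shows one way to do that. The rows are made-up toy values for illustration only, not the real wine data, and select_dtypes is used here as an alternative way of keeping the string column out of corr().

```python
import pandas as pd

# Made-up toy rows for illustration only; not the real wine data
toy = pd.DataFrame({
    'type': ['red', 'red', 'white', 'white', 'white'],
    'alcohol': [9.4, 9.8, 10.1, 11.2, 12.0],
    'density': [0.9978, 0.9968, 0.9951, 0.9934, 0.9920],
    'quality': [5, 5, 6, 6, 7],
})

# Keep numeric columns only so the string column 'type' is not passed to corr()
corr = toy.select_dtypes('number').corr()

# Correlations with alcohol, excluding alcohol itself, strongest (by absolute value) first
alcohol_corr = corr['alcohol'].drop('alcohol')
order = alcohol_corr.abs().sort_values(ascending=False).index
alcohol_corr = alcohol_corr.reindex(order)
print(alcohol_corr)
```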

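Problem 3) Using the score dataset, carry out a hypothesis test with a simple linear regression model.

Null hypothesis: academy has no effect on score.
Alternative hypothesis: academy has an effect on score.

<Condition 1> y variable: score, x variable: academy
<Condition 2> Fit the regression model and check the results (regression coefficient, explanatory power, p-value, standard error)
<Condition 3> Visualize the data with the regression line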

from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt  # plotting for the regression model

Load the dataset

score = pd.read_csv(r'c:/itwill/4_python-2/data/score_iq.csv')
print(score.info())
print(score.head())

1. Select variables

x = score.academy # independent variable
y = score.score # dependent variable

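2. Simple linear regression (stats)

model = stats.linregress(x, y)
print(model)
print('slope of x : ', model.slope) 
print('y intercept :', model.intercept)
print('explanatory power (R^2) : ', model.rvalue ** 2) # rvalue is the correlation coefficient r, so square it
print('p-value : ', model.pvalue) # two-sided t-test of H0: slope = 0
print('standard error of slope :' , model.stderr)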
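The fields returned by linregress relate to each other in a simple way, which the sketch below checks: the explanatory power is rvalue squared, and the p-value is a two-sided t-test on slope / stderr with n - 2 degrees of freedom. The data here is synthetic (generated with a fixed seed) because score_iq.csv is not available to the reader; it is not the real academy/score data.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only (not the score_iq.csv values)
rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=50).astype(float)
y = 60 + 5 * x + rng.normal(0, 10, size=50)

model = stats.linregress(x, y)

# Coefficient of determination: square of the correlation coefficient
r_squared = model.rvalue ** 2

# The reported p-value comes from a t-test of H0: slope = 0
# with n - 2 degrees of freedom
t_stat = model.slope / model.stderr
p_manual = 2 * stats.t.sf(abs(t_stat), df=len(x) - 2)

print('R^2 = %.4f' % r_squared)
print('t = %.3f, p (manual) = %.3e, p (linregress) = %.3e'
      % (t_stat, p_manual, model.pvalue))
```

If the p-value is below 0.05, the null hypothesis that the slope is zero is rejected, which is the decision Problem 3 asks for.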

3. Visualize the regression line

a = model.slope
b = model.intercept
y_pred = (x*a) + b # predicted values

Scatter plot

plt.plot(score['academy'], score['score'], 'b.') # blue points: observed data
plt.plot(score['academy'], y_pred, 'r.-') # red line: fitted values
plt.title('regression plotting') # title
plt.legend(['x y scatter', 'regression line']) # legend
plt.show()

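Problem 4) Using the iris.csv dataset, build a multiple linear regression model.

<Condition 1> Replace '.' in the column names with '_'
iris.columns = iris.columns.str.replace('.', '_', regex=False)
<Condition 2> Build the model formula
y variable: 1st column, x variables: 2nd to 3rd columns
<Condition 3> Check the regression coefficients
<Condition 4> Check and interpret the regression results: use the summary() function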

import pandas as pd
import statsmodels.formula.api as sm # multiple regression model

Load the dataset

iris = pd.read_csv('c:/itwill/4_python-ii/data/iris.csv')
print(iris.head())

1. Rename the iris columns

iris.columns = iris.columns.str.replace('.', '_', regex=False) # treat '.' as a literal dot, not a regex wildcard
print(iris.info())
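The renaming step matters because the formula syntax reads '.' as part of an expression, and because '.' is also the regex wildcard, so a literal replacement is the safer choice. A minimal sketch of the literal replacement follows. The first three column names match the formula used in this problem; 'Petal.Width' and 'Species' are my assumption about the rest of iris.csv.

```python
import pandas as pd

# Column names as they would appear in a typical iris.csv
# ('Petal.Width' and 'Species' are assumed, not shown in the post)
cols = pd.Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length',
                 'Petal.Width', 'Species'])

# regex=False replaces the literal '.' character only
renamed = cols.str.replace('.', '_', regex=False)
print(list(renamed))
# ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
```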

2. Build the formula and fit the multiple regression model

iris_model = sm.ols(formula="Sepal_Length ~ Sepal_Width+Petal_Length", data=iris).fit()

3. Check the regression coefficients

print(iris_model.params)

Intercept       2.249140
Sepal_Width     0.595525
Petal_Length    0.471920

4. Check the regression results

print(iris_model.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:           Sepal_Length   R-squared:                       0.840
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     386.4
Date:                Tue, 30 Nov 2021   Prob (F-statistic):           2.93e-59
Time:                        21:02:08   Log-Likelihood:                -46.513
No. Observations:                 150   AIC:                             99.03
Df Residuals:                     147   BIC:                             108.1
Df Model:                           2
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        2.2491      0.248      9.070      0.000       1.759       2.739
Sepal_Width      0.5955      0.069      8.590      0.000       0.459       0.733
Petal_Length     0.4719      0.017     27.569      0.000       0.438       0.506
==============================================================================
Omnibus:                        0.164   Durbin-Watson:                   2.021
Prob(Omnibus):                  0.921   Jarque-Bera (JB):                0.319
Skew:                          -0.044   Prob(JB):                        0.853
Kurtosis:                       2.792   Cond. No.                         48.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
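As an aid to reading the summary, the sketch below recomputes three of its figures from other numbers in the same table: adjusted R-squared, the overall F-statistic, and the t value of the intercept. The inputs are the rounded values as printed above, so the results should land close to the table's figures but will not match to every digit.

```python
# Figures copied from the summary table above (rounded as printed)
r_squared = 0.840
n_obs = 150      # No. Observations
k = 2            # Df Model (number of predictors)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)

# Overall F-statistic for H0: all slope coefficients are zero
f_stat = (r_squared / k) / ((1 - r_squared) / (n_obs - k - 1))

# t value of a coefficient = coef / std err (Intercept row)
t_intercept = 2.2491 / 0.248

print('Adj. R-squared ~ %.3f' % adj_r_squared)
print('F-statistic ~ %.1f' % f_stat)
print('t (Intercept) ~ %.2f' % t_intercept)
```

Read this way, the table says the two predictors explain about 84% of the variance in Sepal_Length, the model as a whole is significant (Prob (F-statistic) = 2.93e-59), and each coefficient differs significantly from zero (P>|t| printed as 0.000).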
