DAY49. Python Statistics & SciPy (chi-square test, t-test, covariance, regression analysis)

LEE_BOMB 2021. 11. 30. 17:55

statistics

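The statistics module
Descriptive statistics: central tendency, dispersion, skewness/kurtosis, etc.

import statistics as st  # descriptive statistics
import pandas as pd  # read CSV files

Descriptive statistics apply to ratio-scale or interval-scale variables.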

dataset = pd.read_csv(r'C:\ITWILL\4_Python-II\data\descriptive.csv')
dataset.info()

x = dataset['cost']  # purchase cost
x

1. Measures of central tendency

print('mean =', st.mean(x))
print('median =', st.median(x))
print('mode =', st.mode(x))  # mode = 6.0

x.value_counts()  # 6.0 occurs 21 times

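2. Dispersion: variance, standard deviation, quartiles

var = st.variance(x)  # sample variance (n - 1 denominator)
var  # equals std ** 2 for the standard deviation below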

std = st.stdev(x)
std # 1.1421532501476073

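st.quantiles(x)  # defaults: n=4, method='exclusive'
# [4.425000000000001, 5.4, 6.2]

x.describe()  # pandas interpolates like method='inclusive', so the 25% value differs slightly

25%        4.475000
50%        5.400000
75%        6.200000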
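The gap between 4.425 from st.quantiles and 4.475 from describe() comes from the interpolation method, not from the data. Here is a minimal sketch on made-up numbers (the list below is hypothetical and is not the cost column) comparing the two methods that statistics.quantiles offers:

```python
import statistics as st

# Hypothetical data, chosen only to make the arithmetic easy to follow
data = [1, 2, 3, 4, 5, 6, 7, 8]

# Default: treats the data as a sample that may not contain the population extremes
exclusive = st.quantiles(data, n=4, method='exclusive')

# Treats min and max of the data as the 0th and 100th percentiles;
# this is the same linear interpolation pandas describe() uses by default
inclusive = st.quantiles(data, n=4, method='inclusive')

print(exclusive)  # [2.25, 4.5, 6.75]
print(inclusive)  # [2.75, 4.5, 6.25]
```

The median agrees under both methods; only the outer quartiles move.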

import scipy.stats as sts

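3. Skewness / kurtosis
skewness = 0 : symmetric (e.g. a normal distribution)
skewness > 0 : mass concentrated on the left, long right tail
skewness < 0 : mass concentrated on the right, long left tail

sts.skew(x)  # -0.1531779106237012

Kurtosis of a normal distribution = 0 (Fisher) or 3 (Pearson)

sts.kurtosis(x, fisher=True)   # Fisher (excess) kurtosis: normal = 0
sts.kurtosis(x, fisher=False)  # Pearson kurtosis: normal = 3

Normal distribution = 0 or 3
kurtosis > normal : sharper peak with heavier tails
kurtosis < normal : flatter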
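To see where these numbers come from, here is a rough sketch that computes skewness and both kurtosis conventions from central moments using only the standard library. It assumes the n-denominator (biased) moments, which is what the scipy functions use by default; the data and the helper name central_moment are made up for illustration.

```python
def central_moment(values, k):
    """k-th central moment with an n denominator (biased estimate)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

# Hypothetical data with one large value, so the right tail is long
data = [1, 2, 3, 4, 10]

m2 = central_moment(data, 2)
skewness = central_moment(data, 3) / m2 ** 1.5
kurt_pearson = central_moment(data, 4) / m2 ** 2   # normal distribution = 3
kurt_fisher = kurt_pearson - 3                     # normal distribution = 0

print(skewness)      # positive, because of the long right tail
print(kurt_pearson)
print(kurt_fisher)
```

The two kurtosis values always differ by exactly 3, which is why the fisher flag only shifts the reference point.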

Histogram + density curve

import seaborn as sn 
sn.displot(data=x, kde=True)

chisquare_test

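Goodness-of-fit test for a single variable: one-way
Test of independence between two categorical variables: two-way

from scipy import stats  # probability distributions + hypothesis tests
import numpy as np  # numerical computation
import pandas as pd  # read CSV files

1. One-way chi-square (one variable): goodness-of-fit test
Null hypothesis: the observed and expected frequencies do not differ.
Alternative hypothesis: the observed and expected frequencies differ.

real_data = [4, 6, 17, 16, 8, 9]  # observed frequencies
exp_data = [10, 10, 10, 10, 10, 10]  # expected frequencies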

chis = stats.chisquare(real_data, exp_data)
print(chis)

Power_divergenceResult(statistic=14.200000000000001, pvalue=0.014387678176921308)

print('test statistic =', chis[0])
print('p-value =', chis[1])

p-value = 0.014387678176921308

real_arr = np.array(real_data)
exp_arr = np.array(exp_data)

statistic = sum((real_arr - exp_arr)**2 / exp_arr)
statistic # 14.200000000000001

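[Interpretation] The p-value (0.014) is below 0.05, so the null hypothesis is rejected: the observed frequencies differ significantly from the expected ones.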
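The same decision can be reached by comparing the statistic with a critical value instead of reading the p-value. This is a small sketch, assuming a 5% significance level (the level is my assumption; the post does not state one explicitly):

```python
from scipy import stats

observed = [4, 6, 17, 16, 8, 9]
expected = [10, 10, 10, 10, 10, 10]

# Test statistic: sum of (observed - expected)^2 / expected
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                    # degrees of freedom = categories - 1
critical = stats.chi2.ppf(1 - 0.05, df)   # critical value at the assumed 5% level
pvalue = stats.chi2.sf(statistic, df)     # upper-tail probability

print(statistic, critical, pvalue)
print('reject H0' if statistic > critical else 'fail to reject H0')
```

Both routes agree: the statistic exceeds the critical value exactly when the p-value falls below the chosen level.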

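2. Two-way chi-square (two variables): contingency table
Null hypothesis: education level and smoking are not associated. (rejected)
Alternative hypothesis: education level and smoking are associated. (accepted)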

smoke = pd.read_csv(r'C:\ITWILL\4_Python-II\data\smoke.csv')
smoke.info()

0 education 355 non-null int64
1 smoking 355 non-null int64

smoke.education.value_counts()

1    211
3     92
2     52

smoke.smoking.value_counts()

2    141
1    116
3     98

Step 1: select the variables

education = smoke.education
smoking = smoke.smoking

Step 2: contingency table (observed frequencies)

tab = pd.crosstab(index=education, columns=smoking, margins=True)
tab

smoking      1    2   3  All
education
1           51   92  68  211
2           22   21   9   52
3           43   28  21   92
All        116  141  98  355

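Step 3: chi-square test on the contingency table

# Drop the 'All' margins: chi2_contingency expects only the observed cells
obs = tab.iloc[:-1, :-1]
chi2, pvalue, df, evalue = stats.chi2_contingency(observed=obs)

print('test statistic: %.6f, p-value: %.6f, degrees of freedom: %d' % (chi2, pvalue, df))

test statistic: 18.910916, p-value: 0.000818, degrees of freedom: 4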

Step 4: expected frequencies

print(evalue)

[[68.94647887 83.8056338  58.24788732]
 [16.9915493  20.65352113 14.35492958]
 [30.06197183 36.54084507 25.3971831 ]]

[Interpretation] The p-value (0.000818) is below 0.05, so education level and smoking are significantly associated.

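Expected frequency for the cell with observed count 51 = (row total * column total) / grand total

e11 = (211 * 116) / 355
e11  # 68.94647887323944

Chi-square contribution of that cell = (observed - expected)**2 / expected; summing this over all cells gives the test statistic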

e11_ratio = (51.0-e11)**2 / e11 
e11_ratio # 4.671393074906378
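The single-cell calculation extends to the whole table at once. As a sketch, the observed counts below are copied from the contingency table printed earlier (without the All margins), and the expected table and statistic are rebuilt with plain NumPy:

```python
import numpy as np

# Observed counts from the education x smoking table (margins excluded)
observed = np.array([[51, 92, 68],
                     [22, 21,  9],
                     [43, 28, 21]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count of each cell = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Per-cell contributions and their sum (the chi-square statistic)
contrib = (observed - expected) ** 2 / expected
chi2 = contrib.sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected.round(4))
print(round(chi2, 6), dof)
```

The result should line up with the statistic and the 4 degrees of freedom reported by chi2_contingency, which is a handy check that the margins were not fed into the test by mistake.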

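t-test

Hypothesis tests based on the t distribution

t-test: the population is normal and the population variance is unknown
z-test: the population is normal and the population variance is known
1. One-sample mean test: test of the population mean
2. Two-sample mean test
3. Paired two-sample test

from scipy import stats  # t-tests
import numpy as np  # sampling
import pandas as pd  # read CSV files

1. One-sample mean test: average male height (population mean) of 175.5 cm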

sample_data = np.random.uniform(172,179, size=29) 
print(sample_data)

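Descriptive statistics

print('mean height =', sample_data.mean())

One-sample test of the mean

one_group_test = stats.ttest_1samp(sample_data, 175.5)
print('t statistic = %.3f, pvalue = %.5f'%(one_group_test))

t statistic = 0.381, pvalue = 0.70619
[Interpretation] With p = 0.706 > 0.05 the null hypothesis is not rejected: the sample mean does not differ significantly from the population mean. The sample is random, so the numbers change on every run.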
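For reference, the one-sample t statistic is just the distance between the sample mean and the hypothesized mean, measured in standard errors. A minimal sketch with a made-up sample (these five heights are hypothetical, not the random sample above):

```python
import math
import statistics as st

# Hypothetical heights in cm
sample = [174, 176, 175, 177, 173]
mu0 = 175.5  # hypothesized population mean

n = len(sample)
mean = st.mean(sample)
s = st.stdev(sample)            # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)           # standard error of the mean
t_stat = (mean - mu0) / se      # compared against a t distribution with n - 1 df

print(t_stat)
```

stats.ttest_1samp performs this same calculation and then looks up the two-sided p-value.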

2. Two-sample mean test: difference in mean scores between women and men

female_score = np.random.uniform(50, 100, size=30)  # women
male_score = np.random.uniform(45, 95, size=30)  # men

two_sample = stats.ttest_ind(female_score, male_score)
print(two_sample)
print('two-sample t statistic = %.3f, pvalue = %.3f'%(two_sample))

two-sample t statistic = 0.321, pvalue = 0.750
[Interpretation] The mean scores of women and men do not differ significantly.
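By default ttest_ind assumes the two groups have equal variances. When that is doubtful, Welch's t-test is available through the equal_var parameter. A sketch with a fixed seed (the seed and the generated scores are my own choices, used only to make the run repeatable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
a = rng.uniform(50, 100, size=30)         # hypothetical group A scores
b = rng.uniform(45, 95, size=30)          # hypothetical group B scores

pooled = stats.ttest_ind(a, b)                   # assumes equal variances
welch = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test

print(pooled)
print(welch)
```

With equal group sizes, as here, the two t statistics coincide and only the degrees of freedom (and therefore the p-value) can differ.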

Using file data: test of the mean difference in practical-exam scores by teaching method

sample = pd.read_csv('c:/itwill/4_python-ii/data/two_sample.csv')
print(sample.info())

two_df = sample[['method', 'score']]
print(two_df)

two_df.method.value_counts()

1 120
2 120

Subsets by teaching method

method1 = two_df[two_df.method==1]
method2 = two_df[two_df.method==2]

method1.info()

0 method 120 non-null int64
1 score 88 non-null float64

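Extract the score column

score1 = method1.score
score2 = method2.score

Preprocessing: handle missing values

# Missing scores are replaced with 0 here; this pulls both group means toward 0,
# so dropping them with dropna() is usually the safer choice.
score1 = score1.fillna(0)
score2 = score2.fillna(0)

Two-sample test of the mean difference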

two_sample = stats.ttest_ind(score1, score2)
print(two_sample)

Ttest_indResult(statistic=-0.7833843755616479, pvalue=0.4341802874737909)
[Interpretation] With p = 0.434 > 0.05, the mean scores under the two teaching methods do not differ significantly.
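How missing scores are handled matters for a mean comparison. A toy illustration with a made-up series (not the two_sample.csv data) shows how fillna(0) and dropna() lead to different means:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
scores = pd.Series([80.0, 90.0, np.nan, 70.0])

filled = scores.fillna(0)     # the missing score is counted as a real 0
dropped = scores.dropna()     # the missing score is excluded

print(filled.mean(), len(filled))    # 60.0 4
print(dropped.mean(), len(dropped))  # 80.0 3
```

Since method1 has 32 missing scores out of 120, it may be worth rerunning the test on score1.dropna() and score2.dropna() to see whether the conclusion holds.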

3. Paired two samples: weight change from before taking the drug (about 65) to after (about 60)

before = np.random.randint(60, 65, size=30)  # integers from 60 to 64
after = np.random.randint(59, 64, size=30)   # integers from 59 to 63

paired_sample = stats.ttest_rel(before, after)
print(paired_sample)
print('t statistic = %.5f, pvalue = %.5f'%paired_sample)
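A paired test is equivalent to a one-sample test on the per-subject differences against 0. The sketch below checks this on simulated weights; the seed, the means and the spreads are all assumptions made for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                 # fixed seed for repeatability
before = rng.normal(65, 2, size=30)            # hypothetical weights before
after = before - rng.normal(1, 1, size=30)     # hypothetical weights after

paired = stats.ttest_rel(before, after)
one_sample = stats.ttest_1samp(before - after, 0)

print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```

This is why the pairing matters: the test only looks at within-subject differences, so rows of before and after must belong to the same person.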

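Covariance vs. correlation coefficient

Common ground: both measure the association between variables (ratio or interval scale)

1. Covariance: a statistic describing how two random variables vary together around their means
Random variables: X, Y
Formula (population): Cov(X, Y) = sum( (X - xmu) * (Y - ymu) ) / n

Cov(X, Y) > 0 : Y tends to increase as X increases
Cov(X, Y) < 0 : Y tends to decrease as X increases
Cov(X, Y) = 0 : no linear relationship (this alone does not imply independence)
Drawback: the value depends on the units of the variables, so a variable measured on a larger scale yields a larger covariance regardless of how strong the relationship is.

2. Correlation coefficient: the covariance divided by both standard deviations (normalized)
Resolves the drawback of the covariance
Same sign as the covariance; the absolute value never exceeds 1 (-1 to 1)
Formula: Corr(X, Y) = Cov(X, Y) / ( std(X) * std(Y) )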
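A quick way to see the scale problem is to change the units of one variable. In this sketch (made-up numbers) the covariance grows by the unit factor while the correlation stays put. The last lines add a case where the covariance is 0 even though one variable is fully determined by the other:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]              # sample covariance (n - 1 denominator)
corr_xy = np.corrcoef(x, y)[0, 1]

# Same data with x expressed in a unit 100 times smaller (e.g. m -> cm)
cov_scaled = np.cov(x * 100, y)[0, 1]
corr_scaled = np.corrcoef(x * 100, y)[0, 1]

print(cov_xy, cov_scaled)      # covariance changes with the units
print(corr_xy, corr_scaled)    # correlation does not

# Zero covariance without independence: v is a function of u
u = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
v = u ** 2
print(np.cov(u, v)[0, 1])
```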

import pandas as pd 
score_iq = pd.read_csv(r'c:/itwill/4_python-ii/data/score_iq.csv')
print(score_iq)

1. Pearson correlation matrix

corr = score_iq.corr(method='pearson')
print(corr)

              sid     score        iq   academy      game        tv
sid      1.000000 -0.014399 -0.007048 -0.004398  0.018806  0.024565
score   -0.014399  1.000000  0.882220  0.896265 -0.298193 -0.819752
iq      -0.007048  0.882220  1.000000  0.671783 -0.031516 -0.585033
academy -0.004398  0.896265  0.671783  1.000000 -0.351315 -0.948551
game     0.018806 -0.298193 -0.031516 -0.351315  1.000000  0.239217
tv       0.024565 -0.819752 -0.585033 -0.948551  0.239217  1.000000

2. Covariance matrix

cov = score_iq.cov()
print(cov)

                 sid      score         iq   academy      game        tv
sid      1887.500000  -4.100671  -2.718121 -0.231544  1.208054  1.432886
score      -4.100671  42.968412  51.337539  7.119911 -2.890201 -7.214586
iq         -2.718121  51.337539  78.807338  7.227293 -0.413691 -6.972975
academy    -0.231544   7.119911   7.227293  1.468680 -0.629530 -1.543400
game        1.208054  -2.890201  -0.413691 -0.629530  2.186309  0.474899
tv          1.432886  -7.214586  -6.972975 -1.543400  0.474899  1.802640

3. Applying the covariance and correlation formulas
1) Sample covariance: Cov(X, Y) = sum( (X - xmu) * (Y - ymu) ) / (n - 1)

X = score_iq['score']
Y = score_iq['iq']

Sample means

muX = X.mean()
muY = Y.mean()

Sample covariance

Cov = sum((X - muX) * (Y - muY)) / (len(X)-1)
print('Cov =', Cov)

2) Correlation coefficient: Corr(X, Y) = Cov(X, Y) / ( std(X) * std(Y) )

stdX = X.std()
stdY = Y.std()

Corr = Cov / (stdX * stdY)
print('Corr =', Corr)

regression

Using the scipy package
1. Simple linear regression
2. Multiple linear regression

from scipy import stats  # regression
import pandas as pd  # read CSV files
import matplotlib.pyplot as plt 

score_iq = pd.read_csv(r'C:\ITWILL\4_Python-II\data\score_iq.csv')
score_iq.info()

1. Simple linear regression: x -> y
1) Select the variables

x = score_iq['iq']  # same as score_iq.iq
y = score_iq['score']

2) Fit the model

model = stats.linregress(x, y) # iq -> score
print(model)

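LinregressResult
(slope=0.6514309527270075,          : slope of x (regression coefficient)
intercept=-2.8564471221974657,      : y-intercept
rvalue=0.8822203446134699,          : correlation coefficient r; R^2 = rvalue**2 (about 0.778) is the coefficient of determination
pvalue=2.8476895206683644e-50,      : test of the null hypothesis that the slope is 0
stderr=0.028577934409305443,        : standard error of the slope
intercept_stderr=3.546211918048538) : standard error of the intercept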
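Note that rvalue is the correlation coefficient, so the explanatory power (R^2) has to be obtained by squaring it. A minimal sketch on made-up data small enough to check by hand (not the score_iq data):

```python
from scipy import stats

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

res = stats.linregress(x, y)
r_squared = res.rvalue ** 2     # coefficient of determination

print(res.slope, res.intercept)
print(res.rvalue, r_squared)
```

For this toy data the fitted line is y = 0.6 * x + 2.2 and R^2 works out to 0.6, noticeably lower than r itself (about 0.77).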

a = model.slope  # slope
b = model.intercept  # intercept

score_iq.head()

Regression equation -> predicted y

X = 140 
Y = 90
y_pred = X*a + b 
print(y_pred) # 88.34388625958358

err = Y - y_pred
print('error = ', err) # error =  1.6561137404164157

Applied to all observations

len(x) # 150
y_pred = x*a + b
print(y_pred)

Observed vs. predicted values

y.mean() # 77.77333333333333
y_pred.mean() # 77.77333333333334
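The two means agree because a least-squares line with an intercept always has residuals that sum to zero. A short sketch with made-up data, using np.polyfit as a stand-in for linregress:

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line
y_pred = slope * x + intercept
residuals = y - y_pred

print(residuals.sum())            # zero up to floating-point error
print(y.mean(), y_pred.mean())    # identical up to floating-point error
```

So matching means is not evidence of a good fit; it holds for any such model. The residual spread or R^2 is what tells how well the line fits.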

2. Visualizing the regression model
Scatter plot

plt.plot(score_iq['iq'], score_iq['score'], 'b.')

Regression line

plt.plot(score_iq['iq'], y_pred, 'r-')
plt.title('line regression')
plt.legend(['x y scatter', 'line regression'])
plt.show()

3. Multiple linear regression: formula interface (y ~ x1 + x2 + ...)
Variable names: replace dots (.) and spaces with '_'

from statsmodels.formula.api import ols

Correlation matrix

corr = score_iq.corr()
corr['score']  # x1 = iq, x2 = academy, x3 = tv

obj = ols(formula='score ~ iq + academy + tv', data=score_iq)
model = obj.fit()
model  # object info

Regression results

print(model.summary())  # equivalent to summary(model) in R

OLS Regression Results
==============================================================================
Dep. Variable:                  score   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.945
Method:                 Least Squares   F-statistic:                     860.1
Date:                Fri, 12 Nov 2021   Prob (F-statistic):           1.50e-92
Time:                        11:18:16   Log-Likelihood:                -274.84
No. Observations:                 150   AIC:                             557.7
Df Residuals:                     146   BIC:                             569.7
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     24.7223      2.332     10.602      0.000      20.114      29.331
iq             0.3742      0.020     19.109      0.000       0.335       0.413
academy        3.2088      0.367      8.733      0.000       2.483       3.935
tv             0.1926      0.303      0.636      0.526      -0.406       0.791
==============================================================================
Omnibus:                       36.802   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               57.833
Skew:                           1.252   Prob(JB):                     2.77e-13
Kurtosis:                       4.728   Cond. No.                     2.32e+03
==============================================================================
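# Adj. R-squared: 0.945 : explanatory power of the model
# Prob (F-statistic): 1.50e-92 : overall significance of the model (< 0.05)
# coef, std err, t, P>|t| : significance test of each x variable (tv, with p = 0.526, is not significant)
# Durbin-Watson: 1.905 : autocorrelation of the residuals (range 0 to 4; values near 2 indicate none)
# Cond. No.: 2.32e+03 : a large condition number can signal multicollinearity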
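For intuition about the Durbin-Watson value, the statistic is the sum of squared successive residual differences divided by the sum of squared residuals. The sketch below uses toy residual sequences; durbin_watson_stat is a hypothetical helper written for this note, not a function from the post or from statsmodels.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

alternating = [1, -1, 1, -1]   # sign flips every step: negative autocorrelation
clustered = [1, 1, -1, -1]     # neighbours alike: positive autocorrelation

print(durbin_watson_stat(alternating))  # 3.0
print(durbin_watson_stat(clustered))    # 1.0
```

Values above 2 point to negative autocorrelation and values below 2 to positive autocorrelation, so the 1.905 in the summary suggests little autocorrelation in the residuals.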

dir(model)

Regression coefficients

model.params

Intercept    24.722251 - y-intercept
iq            0.374196 - slope of x1
academy       3.208802 - slope of x2
tv            0.192573 - slope of x3

Fitted values of y

model.fittedvalues

Regression equation via matrix multiplication (@)

X = score_iq[['iq', 'academy', 'tv']]
X.shape # (150, 3)

import numpy as np 
a = np.array([[0.374196],[3.208802],[0.192573]])
a.shape  # (3, 1): column vector of slopes

b = 24.722251  # intercept
# y = x1*a1 + x2*a2 + x3*a3 + b
y_fitted = X @ a + b
print(y_fitted)
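The coefficients typed in by hand above can also be recovered directly with a least-squares solve, which shows what the matrix form is doing. This sketch uses synthetic, noise-free data with known coefficients (all numbers are made up) and folds the intercept into the matrix as a column of ones:

```python
import numpy as np

rng = np.random.default_rng(0)              # fixed seed for repeatability
X = rng.normal(size=(20, 2))                # two hypothetical predictors
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]     # known intercept and slopes, no noise

# Prepend a column of ones so the intercept becomes the first coefficient
X1 = np.column_stack([np.ones(len(X)), X])

coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_fitted = X1 @ coef                        # same as intercept + X @ slopes

print(coef)   # approximately [ 2.  3. -1.]
```

In the post's setting, the rounded constants could presumably be replaced by values read from model.params, which would avoid the small rounding differences against model.fittedvalues.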

Chart: y vs y_fitted

y = score_iq['score']

plt.plot(y[:50], label='real value')
plt.plot(y_fitted[:50], label='predicted value')
plt.legend(loc= 'best')
plt.show()