DAY44. Python Pandas (2) DataFrame, Descriptive Statistics

LEE_BOMB 2021. 11. 16. 17:14

DataFrame

The two-dimensional data structure in pandas
Similar to a database table
Each column can hold a different data type
Each DataFrame column is a Series
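As a rough illustration of the points above, the sketch below builds a small DataFrame whose columns have different dtypes and checks that a single column comes back as a Series. The height column and all values are made up for this example.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration.
df = pd.DataFrame({
    'name': ['hong', 'lee', 'kang'],     # text column
    'age': [35, 44, 45],                 # integer column
    'height': [175.5, 168.0, 181.2],     # float column (made-up values)
})

print(df.dtypes)             # one dtype per column
col = df['age']              # selecting one column returns a Series
print(type(col).__name__)
```

The frame behaves like a table, while each column keeps its own type and can be handled as an independent Series.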

Method 1

import pandas as pd

Method 2

from pandas import DataFrame

1. Creating a DataFrame object
Create the columns as lists

name = ['hong', 'lee', 'kang', 'yoo']
age = [35, 44, 45, 25]
pay = [350, 450, 500, 250]

Build the data as a dict

data = {'name':name, 'age':age, 'pay':pay}

Create the DataFrame

frame = pd.DataFrame(data=data,
                     columns=['name', 'age', 'pay'])  # columns= fixes the column order (a dict may not keep its order on older Python versions)
print(frame)

   name  age  pay
0  hong   35  350
1   lee   44  450
2  kang   45  500
3   yoo   25  250

Creating the DataFrame, method 2

frame2 = DataFrame(data=data,
                   columns=['name', 'age', 'pay'])
print(frame2)

   name  age  pay
0  hong   35  350
1   lee   44  450
2  kang   45  500
3   yoo   25  250
-> Both construction methods give the same result

Checking DataFrame information

type(frame)  # pandas.core.frame.DataFrame
frame.info()  # similar to str(df) in R

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3 -> number of rows
Data columns (total 3 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   name    4 non-null      object
1   age     4 non-null      int64
2   pay     4 non-null      int64
dtypes: int64(2), object(1)
memory usage: 224.0+ bytes

DF['column'] (pandas) vs DF$column (R)

pay = frame['pay']  # extract a specific column from the DataFrame
type(pay)  # pandas.core.series.Series: a Series object

print('mean pay =', pay.mean())  # mean pay = 387.5

2. Reading a CSV & referencing columns

import os  # change/check the working directory

os.chdir(r'C:\ITWILL\4_Python-2\data')
emp = pd.read_csv('emp.csv', encoding='utf-8')
print(emp.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   No      5 non-null      int64
1   Name    5 non-null      object
2   Pay     5 non-null      int64
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes
None
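If the CSV file is not at hand, the same kind of frame can be built from an in-memory string. This is only a sketch: the CSV text below is reconstructed from the emp table printed later in this post, and the real emp.csv is assumed to have the same layout.

```python
import io
import pandas as pd

# Assumed file contents, reconstructed from the printed emp table.
csv_text = """No,Name,Pay
101,홍길동,150
102,이순신,450
103,강감찬,500
104,유관순,350
105,김유신,400
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
emp = pd.read_csv(io.StringIO(csv_text))
print(emp.shape)
print(emp['Pay'].mean())
```

Reading from a string avoids os.chdir entirely, which can be convenient when trying the later examples without the original data folder.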

print(emp)

1) Single column

emp.No  # DF.column (attribute) style
emp['No']  # index (bracket) style
emp['Pay']

emp['Pay'].plot()  # line plot

2) Multiple columns

emp[['No', 'Pay']]  # nested list
emp[['No', 'Pay']].plot()  # one line per column
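The difference between single and double brackets is easy to miss, so here is a minimal sketch of what each form returns. The three-row emp frame is a shortened stand-in for the CSV data.

```python
import pandas as pd

# Shortened stand-in for emp.csv (first three rows of the printed table).
emp = pd.DataFrame({'No': [101, 102, 103],
                    'Name': ['홍길동', '이순신', '강감찬'],
                    'Pay': [150, 450, 500]})

one = emp['Pay']             # single label -> Series
many = emp[['No', 'Pay']]    # list of labels -> DataFrame
also_df = emp[['Pay']]       # a one-element list still gives a DataFrame

print(type(one).__name__, type(many).__name__, type(also_df).__name__)
```

Passing a list, even with one name in it, keeps the result two-dimensional.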

3. Creating a subset: old DF -> new DF
1) Select specific columns: when there are few columns

subset1 = emp[['Name', 'Pay']]
subset1

  Name  Pay
0  홍길동  150
1  이순신  450
2  강감찬  500
3  유관순  350
4  김유신  400

2) Exclude a specific row

subset2 = emp.drop(1)  # drop the row labeled 1 (the 2nd row)
subset2

    No Name  Pay
0  101  홍길동  150
2  103  강감찬  500
3  104  유관순  350
4  105  김유신  400
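A short sketch of how drop behaves, assuming the emp frame shown in this post: it returns a new DataFrame and leaves the original untouched. The columns= variant is an extra that the post itself does not use.

```python
import pandas as pd

emp = pd.DataFrame({'No': [101, 102, 103, 104, 105],
                    'Name': ['홍길동', '이순신', '강감찬', '유관순', '김유신'],
                    'Pay': [150, 450, 500, 350, 400]})

subset2 = emp.drop(1)                 # removes the row whose label is 1
print(len(emp), len(subset2))         # the original keeps all of its rows
print(list(subset2.index))            # label 1 is missing, the others are unchanged

no_pay = emp.drop(columns=['Pay'])    # the same method can drop columns
print(list(no_pay.columns))
```

Because the result is a copy, it has to be assigned to a name (as with subset2) for the change to be kept.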

3) Select by condition: based on a specific column
e.g. select observations with pay of 400 or more (those below 400 are excluded)

subset3 = emp[emp.Pay >= 400]  # DF[condition]
subset3

    No Name  Pay
1  102  이순신  450
2  103  강감찬  500
4  105  김유신  400
-> The index labels of the original are preserved
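The condition inside the brackets is itself a boolean Series, which the sketch below makes explicit. reset_index and the combined condition are additions for illustration, not part of the original notes.

```python
import pandas as pd

emp = pd.DataFrame({'No': [101, 102, 103, 104, 105],
                    'Name': ['홍길동', '이순신', '강감찬', '유관순', '김유신'],
                    'Pay': [150, 450, 500, 350, 400]})

mask = emp['Pay'] >= 400          # boolean Series, one value per row
subset3 = emp[mask]               # keeps the rows where mask is True
print(list(subset3.index))        # original labels survive the filter

# Renumber from 0 if the old labels are not needed.
renumbered = subset3.reset_index(drop=True)

# Conditions combine with & (and) / | (or); each needs its own parentheses.
both = emp[(emp['Pay'] >= 400) & (emp['No'] > 102)]
print(list(both['No']))
```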

4) Using columns: when there are many columns

iris = pd.read_csv('iris.csv') 
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
#   Column        Non-Null Count  Dtype
---  ------        --------------  -----
0   Sepal.Length  150 non-null    float64
1   Sepal.Width   150 non-null    float64
2   Petal.Length  150 non-null    float64
3   Petal.Width   150 non-null    float64
4   Species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

cols = list(iris.columns)
cols #['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

cols[:4]  # x variables
cols[-1]  # y variable

iris_x = iris[['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']]
iris_x.head()

iris_x = iris[cols[:4]]
iris_x.head()  # same result; the preferred approach

iris_y = iris[cols[-1]]  # same as iris['Species']

Selecting columns whose names contain dots, special characters, or spaces
iris.Sepal.Length  # DF.column style -> not usable (AttributeError)

iris['Sepal.Length']  # Length: 150, dtype: float64

iris['Sepal.Length'].mean()  # 5.843333333333335
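The failure of attribute access on dotted names can be reproduced with a tiny stand-in frame. The two rows below are illustrative values, not the real iris data, and the rename step is an optional workaround that the post does not use.

```python
import pandas as pd

# Two-row stand-in for iris; values are illustrative only.
iris = pd.DataFrame({'Sepal.Length': [5.1, 4.9],
                     'Species': ['setosa', 'setosa']})

try:
    iris.Sepal.Length             # Python looks for an attribute named 'Sepal'
except AttributeError as err:
    print('attribute access failed:', err)

print(iris['Sepal.Length'].mean())   # bracket access works with any name

# Optional: rename to identifier-safe names if attribute access is wanted.
iris2 = iris.rename(columns=lambda c: c.replace('.', '_'))
print(iris2.Sepal_Length.mean())
```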

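4. Row/column referencing in a DataFrame: DF[row, col]
Rows and columns can be sliced with a colon (:)
Form 1) DF.loc[row, col]: label-based
Form 2) DF.iloc[row, col]: position (integer) based

1) loc attribute: label-based indexing (string labels)
    Form) DF.loc[row label, column label]
    Numeric index values are interpreted as labels, not positions

2) iloc attribute: position-based indexing (integer positions)
    Form) DF.iloc[row integer, column integer]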

print(emp)

    No Name  Pay
0  101  홍길동  150
1  102  이순신  450
2  103  강감찬  500
3  104  유관순  350
4  105  김유신  400
Column names: labels (strings)
Row names: integers (the default RangeIndex)

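1) loc attribute: label-based

emp.loc[0, :]  # entire 1st row
emp.loc[0]  # entire 1st row (columns omitted)
emp.loc[0:3]  # 3 is a label, not "the 3rd row": selects rows 1 through 4 (end label included)
emp.loc[0:3, ['No', 'Pay']]  # non-contiguous columns
emp.loc[0:3, 'No':'Pay']  # contiguous columns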

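2) iloc attribute: position (integer) based

emp.iloc[0, :]  # entire 1st row
emp.iloc[0]
emp.iloc[0:3]  # selects rows 1 through 3 (end position excluded)
# emp.iloc[0:3, ['No' : 'Pay']]  # SyntaxError: invalid syntax; strings are not allowed, iloc needs integer positions
emp.iloc[0:4, [0, 2]]  # non-contiguous columns
emp.iloc[0:4, 0:]  # contiguous columns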

3) Example: selecting a box (a block of rows and columns)

emp.loc[[1,3], ['No', 'Pay']]
emp.iloc[[1,3], [0, 2]]

    No  Pay
1  102  450
3  104  350
-> The results are identical
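With the default 0..n-1 index, loc and iloc look interchangeable. The sketch below shows two places where they differ: slice end points, and frames whose labels no longer match positions. It assumes the same emp data printed above; the filtered frame high is introduced only for this example.

```python
import pandas as pd

emp = pd.DataFrame({'No': [101, 102, 103, 104, 105],
                    'Name': ['홍길동', '이순신', '강감찬', '유관순', '김유신'],
                    'Pay': [150, 450, 500, 350, 400]})

print(len(emp.loc[0:3]))     # label slice: end label 3 is included
print(len(emp.iloc[0:3]))    # position slice: end position 3 is excluded

# After filtering, labels and positions no longer coincide.
high = emp[emp['Pay'] >= 400]       # keeps labels 1, 2, 4
first_by_pos = high.iloc[0]         # first row by position (label 1)
row_by_label = high.loc[4]          # row whose label is 4
print(first_by_pos['No'], row_by_label['No'])
```

On such a filtered frame, high.loc[0] would raise a KeyError because no row carries the label 0, while high.iloc[0] still works.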

Descriptive

1. Summary statistics
2. Correlation coefficients

import pandas as pd
import os

os.chdir(r'C:\ITWILL\4_Python-2\data')

product = pd.read_csv('product.csv')
product.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 3 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   a       264 non-null    int64 -> no missing values; integer type, so numeric statistics apply
1   b       264 non-null    int64
2   c       264 non-null    int64
dtypes: int64(3)
memory usage: 6.3 KB

product.head()
product.tail()

1. Summary statistics

product.describe()  # similar to summary(product) in R

Row/column statistics

product.shape #(264, 3)

product.sum(axis=0)  # along the row axis: aggregates within each column (column sums)
product.sum(axis=1)  # along the column axis: aggregates within each row (row sums)
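The axis argument is easier to see on a frame small enough to add up by hand. The values below are made up and stand in for product.csv.

```python
import pandas as pd

# Made-up values; small enough to verify by hand.
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [10, 20, 30]})

col_sums = df.sum(axis=0)    # collapses the rows: one total per column
row_sums = df.sum(axis=1)    # collapses the columns: one total per row

print(col_sums)              # a -> 6, b -> 60
print(row_sums.tolist())     # [11, 22, 33]
```

The axis names the direction that gets collapsed, so axis=0 returns as many values as there are columns and axis=1 as many as there are rows.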

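Dispersion: variance, standard deviation

product.var()  # axis=0 by default; sample variance (ddof=1)
product.std()  # axis=0 by default; sample standard deviation (ddof=1)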
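pandas computes the sample variance (dividing by n-1) unless told otherwise. The sketch below contrasts it with the population version on a made-up Series whose numbers work out evenly.

```python
import pandas as pd

# Made-up values: mean is 5, squared deviations sum to 32.
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = s.var()         # ddof=1 (default): 32 / 7
pop_var = s.var(ddof=0)      # population variance: 32 / 8 = 4.0
pop_std = s.std(ddof=0)      # square root of the population variance: 2.0

print(sample_var, pop_var, pop_std)
```

The default matches R's var() and sd(), whereas NumPy's np.var defaults to ddof=0, so results can differ slightly between the two libraries.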

DF['column name']

product['a'].sum() #773

Frequency counts

product['a'].value_counts()

3    126
4     64
2     37
1     30
5      7

Unique values: the distinct values

product['a'].unique() #array([3, 4, 2, 5, 1], dtype=int64)
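A small sketch of the two calls above on made-up data, plus two closely related options (normalize=True and nunique) that the post does not cover.

```python
import pandas as pd

# Made-up ratings standing in for product['a'].
s = pd.Series([3, 4, 3, 2, 3, 4, 1])

counts = s.value_counts()                  # sorted by count, largest first
shares = s.value_counts(normalize=True)    # relative frequencies instead of counts

print(counts.loc[3], counts.loc[4])        # 3 occurs three times, 4 twice
print(s.nunique())                         # number of distinct values
print(sorted(s.unique()))                  # unique() keeps order of first appearance
```

The index of the value_counts result holds the values themselves, so counts.loc[3] looks up how often the value 3 occurs.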

2. Correlation coefficients

product.corr()  # returns a square correlation matrix

          a         b         c
a  1.000000  0.499209  0.467145
b  0.499209  1.000000  0.766853
c  0.467145  0.766853  1.000000

iris = pd.read_csv('iris.csv')
iris.info()

Select 4 variables (one way to create a subset)

df = iris.iloc[:, :4]
df.shape #(150, 4)

df.corr()

              Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
Sepal.Length      1.000000    -0.117570      0.871754     0.817941
Sepal.Width      -0.117570     1.000000     -0.428440    -0.366126
Petal.Length      0.871754    -0.428440      1.000000     0.962865
Petal.Width       0.817941    -0.366126      0.962865     1.000000
-> Petal.Length and Petal.Width have the strongest correlation (0.962865); using both as predictors can cause multicollinearity.
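Reading the strongest pair off a correlation matrix by eye gets harder as the number of columns grows. One possible way to do it programmatically is sketched below; the data is made up (not the iris values), and the upper-triangle trick is an assumption of this example, not something from the original notes.

```python
import numpy as np
import pandas as pd

# Small made-up data set; y is roughly 2 * x, z is unrelated noise-like data.
df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.1, 3.9, 6.2, 8.0, 9.9],
    'z': [5.0, 3.0, 4.0, 1.0, 2.0],
})
corr = df.corr()

# Keep only the upper triangle (k=1 drops the diagonal of 1.0 values),
# so that each pair of columns appears exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs, drop the masked cells, rank by |r|.
pairs = upper.stack().dropna().abs().sort_values(ascending=False)
print(pairs.head(1))
```

Pairs near the top of this ranking are candidates to check for multicollinearity before fitting a regression model.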