๋ฐ์ดํ„ฐ๋ถ„์„๊ฐ€ ๊ณผ์ •/Python

DAY42. Python Pandas (2)DFmerge, csvExcel, DummyEncoding

LEE_BOMB 2021. 11. 18. 19:40
DF_merge

1. DF๋ณ‘ํ•ฉ(merge) = join
DF1(id) + DF2(id) = DF3

2. DF๊ฒฐํ•ฉ(concat) = cbind, rbind
DF1 + DF2 = DF3

import pandas as pd
import os

os.chdir(r'C:\ITWILL\4_Python-2\data')
wdbc = pd.read_csv('wdbc_data.csv')

wdbc.shape #(569, 32)
cols = list(wdbc.columns)
print(cols)


1. DF๋ณ‘ํ•ฉ(merge) = join

DF1 = wdbc[cols[:16]] #1~15๋ฒˆ์งธ ์นผ๋Ÿผ
DF1.shape #(569, 16)

DF2 = wdbc[cols[16:]]
DF2.shape #(569, 16)


์นผ๋Ÿผ ์ถ”๊ฐ€

DF2['id'] = wdbc.id
DF2.shape #(569, 17)


how="inner" -> ๋‚ด๋ถ€ ์กฐ์ธ(inner join)

DF3 = pd.merge(left=DF1, right=DF2, how="inner", on="id")
DF3.shape #(569, 32)

how = "inner" : ์–‘์ชฝ DF์— ์กด์žฌํ•˜๋Š” id๊ธฐ์ค€ ๋ณ‘ํ•ฉ
how = "outer" : ํ•œ์ชฝ DF์— ์กด์žฌํ•˜๋Š” id๊ธฐ์ค€ ๋ณ‘ํ•ฉ



2. DF๊ฒฐํ•ฉ(concat)

DF1 = wdbc[cols[:16]] #1~16
DF2 = wdbc[cols[:16]] #16~32

DF4 = pd.concat(objs = [DF1, DF2], axis = 1) #axis (์ถ•์˜ ๋ฐฉํ–ฅ) 1์€ ์—ด =cbind
DF4.shape #(569, 32)

 

 

 

 

 

csvExcelIO

1. csv file read
2. data ์ฒ˜๋ฆฌ 
3. csv file write
4. excel file read/write

 

import pandas as pd #csv/excel file read/write
import os # file path 

os.chdir(r'c:\ITWILL\4_Python-2\data')



1. csv file read
1) ์นผ๋Ÿผ๋ช…์ด ์—†๋Š” ๊ฒฝ์šฐ 

st = pd.read_csv('student.csv', header=None)
st #      0     1    2   3 -> ๊ธฐ๋ณธ ์นผ๋Ÿผ๋ช…


์นผ๋Ÿผ๋ช… ์ˆ˜์ • 

col_names = ['sno','name','height','weight']

st.columns =  col_names
print(st)


2) ์นผ๋Ÿผ๋ช…์ด ํŠน์ˆ˜๋ฌธ์ž(.) or ๊ณต๋ฐฑ -> '_'

iris = pd.read_csv('iris.csv')
iris.info()

iris.columns = iris.columns.str.replace('.','_')
iris.info()

iris.Sepal_Length


3) ํŠน์ˆ˜๊ตฌ๋ถ„์ž, ์ฒœ๋‹จ์œ„ ์ฝค๋งˆ 
pd.read_csv('file', delimiter='\t', thousands=',')



2. data ์ฒ˜๋ฆฌ : ํŒŒ์ƒ๋ณ€์ˆ˜ 

print(st)

   sno  name  height  weight
0  101  hong     175      65
1  201   lee     185      85
2  301   kim     173      60
3  401  park     180      70

๋น„๋งŒ๋„์ง€์ˆ˜(BMI)
BMI = ๋ชธ๋ฌด๊ฒŒ / (ํ‚ค**2)
๋ชธ๋ฌด๊ฒŒ ๋‹จ์œ„ : kg
ํ‚ค ๋‹จ์œ„ : cm -> m

175 * 0.01 #1.75

bmi = st['weight'] / (st['height']*0.01)**2
bmi


ํŒŒ์ƒ๋ณ€์ˆ˜1 ์ถ”๊ฐ€ 

st['bmi'] = bmi
print(st)

label = ์ •์ƒ : 18~23, 23์ดˆ๊ณผ : '๋น„๋งŒ', 18๋ฏธ๋งŒ : '์ €์ฒด์ค‘'

label = [] 

for bmi in st.bmi :
    if bmi >= 18 and bmi <= 23 :
        label.append('์ •์ƒ')
    elif bmi > 23 :
        label.append('๋น„๋งŒ')
    else :
        label.append('์ €์ฒด์ค‘')

print(label)


ํŒŒ์ƒ๋ณ€์ˆ˜2 ์ถ”๊ฐ€ 

st['label'] = label

print(st)

 



3. csv file write

type(st) # pandas.core.frame.DataFrame


index=None : ํ–‰ ์ด๋ฆ„ ์ œ์™ธ 

st.to_csv('st_info.csv', index=None, encoding='utf-8') 

new_st = pd.read_csv('st_info.csv', encoding='utf-8')
print(new_st)

 



4. excel file read/write

ex = pd.ExcelFile('sam_kospi.xlsx') #class() -> object
kospi = ex.parse('sam_kospi') #์‹œํŠธ๋ช… 

print(kospi)
kospi.info() #<class 'pandas.core.frame.DataFrame'>


excel file write 

kospi.to_excel('kospi.xlsx', index = None)

 

 

 

 

 

dummy_encoding

1. one hot encoding : 2์ง„์ˆ˜
๋ฒ”์ฃผ ๊ฐœ์ˆ˜ k๊ฐœ -> k๊ฐœ ๊ฐ€๋ณ€์ˆ˜(dummy) ์ƒ์„ฑ
๋ถ„๋ฅ˜๋ถ„์„์—์„œ ์ข…์†๋ณ€์ˆ˜(y๋ณ€์ˆ˜) ๋Œ€์ƒ

2. ๊ฐ€๋ณ€์ˆ˜(dummy)
๋ฒ”์ฃผ ๊ฐœ์ˆ˜ k๊ฐœ -> k-1๊ฐœ ๊ฐ€๋ณ€์ˆ˜ ์ƒ์„ฑ
ํšŒ๊ท€๋ถ„์„์—์„œ ๋…๋ฆฝ๋ณ€์ˆ˜(x๋ณ€์ˆ˜) ๋Œ€์ƒ 
ex. ์„ฑ๋ณ„(๋‚จ/์—ฌ) -> 1๊ฐœ, ํ˜ˆ์•กํ˜•(A/B/O/AB) -> 3๊ฐœ

3. ๋ ˆ์ด๋ธ” ์ธ์ฝ”๋”ฉ(label encoding)
๋ฌธ์žํ˜• -> 10์ง„์ˆ˜ ๋ณ€ํ™˜
ex.๋‹จ์–ด -> ๊ณ ์œ  ์ˆซ์ž ๋ณ€ํ™˜


import pandas as pd
import os

os.chdir(r'C:\ITWILL\4_Python-2\data')
iris = pd.read_csv('iris.csv')
iris.info()


๊ฝƒ์˜ ์ข… : ๋นˆ๋„ ์ˆ˜

iris.Species.value_counts()

versicolor    50
setosa        50
virginica     50


1. one hot encoding : 2์ง„์ˆ˜

iris_one_hot = pd.get_dummies(data=iris, columns = ['Species'])

iris_one_hot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Sepal.Length        150 non-null    float64
 1   Sepal.Width         150 non-null    float64
 2   Petal.Length        150 non-null    float64
 3   Petal.Width         150 non-null    float64
 4   Species_setosa      150 non-null    uint8  
 5   Species_versicolor  150 non-null    uint8  
 6   Species_virginica   150 non-null    uint8  

iris_one_hot.iloc[[0,50,100],4:]

     Species_setosa  Species_versicolor  Species_virginica
0                 1                   0                  0
50                0                   1                  0
100               0                   0                  1
๋ถ„๋ฅ˜๋ถ„์„ : ๋…๋ฆฝ๋ณ€์ˆ˜(x) 4๊ฐœ -> ์ข…์†๋ณ€์ˆ˜(y) 3๊ฐœ


2. ๊ฐ€๋ณ€์ˆ˜(dummy) : k-1๊ฐœ ๊ฐ€๋ณ€์ˆ˜ ์ƒ์„ฑ

iris_dummy = pd.get_dummies(data=iris, columns = ['Species'], drop_first=True)
iris_dummy.info()

 0   Sepal.Length        150 non-null    float64
 1   Sepal.Width         150 non-null    float64
 2   Petal.Length        150 non-null    float64
 3   Petal.Width         150 non-null    float64
 4   Species_versicolor  150 non-null    uint8  -> ๋”๋ฏธ๋ณ€์ˆ˜ : ์นผ๋Ÿผ๋ช…_๊ฐ’
 5   Species_virginica   150 non-null    uint8 
setosa : ๊ธฐ์ค€(base)๋ฅผ ์ œ์™ธํ•œ ๋‚˜๋จธ์ง€ 2๊ฐœ ๋ณ€์ˆ˜ ์ƒ์„ฑ



* ํšŒ๊ท€๋ถ„์„ : ๋…๋ฆฝ๋ณ€์ˆ˜(x) 5๊ฐœ(3๊ฐœ ์—ฐ์†ํ˜•/2๊ฐœ ๋”๋ฏธ), ์ข…์†๋ณ€์ˆ˜(y) 1๊ฐœ 
3. ๋ฒ”์ˆ˜ ๊ธฐ์ค€ ๋ณ€๊ฒฝ 'versicolor(base)' -> 'virginica' -> 'setosa'
1) object๋ฅผ ์ˆœ์„œ ๋ณ€๊ฒฝ ๊ฐ€๋Šฅํ•œ categoryํ˜•์œผ๋กœ ๋ณ€๊ฒฝ

iris['Species2'] = iris['Species'].astype('category') #object -> category
iris.info()


2) ์ˆœ์„œ ๋ณ€๊ฒฝ

iris['Species2'] = iris['Species2'].cat.set_categories(['versicolor', 'virginica', 'setosa'])

['versicolor', 'virginica', 'setosa']

 

iris['Species2'].value_counts()

versicolor    50
virginica     50
setosa        50

3)dummy ๋ณ€์ˆ˜

iris_dummy2 = pd.get_dummies(data=iris, columns = ['Species2'],
                             drop_first=True)
iris_dummy2.info()

 0   Sepal.Length        150 non-null    float64
 1   Sepal.Width         150 non-null    float64
 2   Petal.Length        150 non-null    float64
 3   Petal.Width         150 non-null    float64
 4   Species             150 non-null    object 
 5   Species2_virginica  150 non-null    uint8  
 6   Species2_setosa     150 non-null    uint8  

 
 
 4. ๋ ˆ์ด๋ธ” ์ธ์ฝ”๋”ฉ(label encoding) : ๋ฌธ์ž์—ด -> 10์ง„์ˆ˜ ๋ณ€ํ™˜

iris.info()

iris.Species.value_counts()

versicolor    50
setosa        50
virginica     50

from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt #์‹œ๊ฐํ™” ๋„๊ตฌ


์ƒ์„ฑ์ž -> object

encoder = LabelEncoder()


object.member()

encoder.fit(iris.Species)


data๋ณ€ํ™˜ : label -> 10์ง„์ˆ˜

labels = encoder.transform(iris.Species)
print(labels) #0~2

versicolor    0
setosa        1
virginica     2

plt.scatter(x=iris['Sepal.Length'],
            y=iris['Petal.Length'], c=labels)