DAY42. Python Pandas (2)DFmerge, csvExcel, DummyEncoding
DF_merge
1. DF๋ณํฉ(merge) = join
DF1(id) + DF2(id) = DF3
2. DF๊ฒฐํฉ(concat) = cbind, rbind
DF1 + DF2 = DF3
import pandas as pd
import os
os.chdir(r'C:\ITWILL\4_Python-2\data')
wdbc = pd.read_csv('wdbc_data.csv')
wdbc.shape #(569, 32)
cols = list(wdbc.columns)
print(cols)
1. DF๋ณํฉ(merge) = join
DF1 = wdbc[cols[:16]] #1~15๋ฒ์งธ ์นผ๋ผ
DF1.shape #(569, 16)
DF2 = wdbc[cols[16:]]
DF2.shape #(569, 16)
์นผ๋ผ ์ถ๊ฐ
DF2['id'] = wdbc.id
DF2.shape #(569, 17)
how="inner" -> ๋ด๋ถ ์กฐ์ธ(inner join)
DF3 = pd.merge(left=DF1, right=DF2, how="inner", on="id")
DF3.shape #(569, 32)
how = "inner" : ์์ชฝ DF์ ์กด์ฌํ๋ id๊ธฐ์ค ๋ณํฉ
how = "outer" : ํ์ชฝ DF์ ์กด์ฌํ๋ id๊ธฐ์ค ๋ณํฉ
2. DF๊ฒฐํฉ(concat)
DF1 = wdbc[cols[:16]] #1~16
DF2 = wdbc[cols[:16]] #16~32
DF4 = pd.concat(objs = [DF1, DF2], axis = 1) #axis (์ถ์ ๋ฐฉํฅ) 1์ ์ด =cbind
DF4.shape #(569, 32)
csvExcelIO
1. csv file read
2. data ์ฒ๋ฆฌ
3. csv file write
4. excel file read/write
import pandas as pd #csv/excel file read/write
import os # file path
os.chdir(r'c:\ITWILL\4_Python-2\data')
1. csv file read
1) ์นผ๋ผ๋ช
์ด ์๋ ๊ฒฝ์ฐ
st = pd.read_csv('student.csv', header=None)
st # 0 1 2 3 -> ๊ธฐ๋ณธ ์นผ๋ผ๋ช
์นผ๋ผ๋ช
์์
col_names = ['sno','name','height','weight']
st.columns = col_names
print(st)
2) ์นผ๋ผ๋ช
์ด ํน์๋ฌธ์(.) or ๊ณต๋ฐฑ -> '_'
iris = pd.read_csv('iris.csv')
iris.info()
iris.columns = iris.columns.str.replace('.','_')
iris.info()
iris.Sepal_Length
3) ํน์๊ตฌ๋ถ์, ์ฒ๋จ์ ์ฝค๋ง
pd.read_csv('file', delimiter='\t', thousands=',')
2. data ์ฒ๋ฆฌ : ํ์๋ณ์
print(st)
sno name height weight
0 101 hong 175 65
1 201 lee 185 85
2 301 kim 173 60
3 401 park 180 70
๋น๋ง๋์ง์(BMI)
BMI = ๋ชธ๋ฌด๊ฒ / (ํค**2)
๋ชธ๋ฌด๊ฒ ๋จ์ : kg
ํค ๋จ์ : cm -> m
175 * 0.01 #1.75
bmi = st['weight'] / (st['height']*0.01)**2
bmi
ํ์๋ณ์1 ์ถ๊ฐ
st['bmi'] = bmi
print(st)
label = ์ ์ : 18~23, 23์ด๊ณผ : '๋น๋ง', 18๋ฏธ๋ง : '์ ์ฒด์ค'
label = []
for bmi in st.bmi :
if bmi >= 18 and bmi <= 23 :
label.append('์ ์')
elif bmi > 23 :
label.append('๋น๋ง')
else :
label.append('์ ์ฒด์ค')
print(label)
ํ์๋ณ์2 ์ถ๊ฐ
st['label'] = label
print(st)
3. csv file write
type(st) # pandas.core.frame.DataFrame
index=None : ํ ์ด๋ฆ ์ ์ธ
st.to_csv('st_info.csv', index=None, encoding='utf-8')
new_st = pd.read_csv('st_info.csv', encoding='utf-8')
print(new_st)
4. excel file read/write
ex = pd.ExcelFile('sam_kospi.xlsx') #class() -> object
kospi = ex.parse('sam_kospi') #์ํธ๋ช
print(kospi)
kospi.info() #<class 'pandas.core.frame.DataFrame'>
excel file write
kospi.to_excel('kospi.xlsx', index = None)
dummy_encoding
1. one hot encoding : 2์ง์
๋ฒ์ฃผ ๊ฐ์ k๊ฐ -> k๊ฐ ๊ฐ๋ณ์(dummy) ์์ฑ
๋ถ๋ฅ๋ถ์์์ ์ข
์๋ณ์(y๋ณ์) ๋์
2. ๊ฐ๋ณ์(dummy)
๋ฒ์ฃผ ๊ฐ์ k๊ฐ -> k-1๊ฐ ๊ฐ๋ณ์ ์์ฑ
ํ๊ท๋ถ์์์ ๋
๋ฆฝ๋ณ์(x๋ณ์) ๋์
ex. ์ฑ๋ณ(๋จ/์ฌ) -> 1๊ฐ, ํ์กํ(A/B/O/AB) -> 3๊ฐ
3. ๋ ์ด๋ธ ์ธ์ฝ๋ฉ(label encoding)
๋ฌธ์ํ -> 10์ง์ ๋ณํ
ex.๋จ์ด -> ๊ณ ์ ์ซ์ ๋ณํ
import pandas as pd
import os
os.chdir(r'C:\ITWILL\4_Python-2\data')
iris = pd.read_csv('iris.csv')
iris.info()
๊ฝ์ ์ข
: ๋น๋ ์
iris.Species.value_counts()
versicolor 50
setosa 50
virginica 50
1. one hot encoding : 2์ง์
iris_one_hot = pd.get_dummies(data=iris, columns = ['Species'])
iris_one_hot.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal.Length 150 non-null float64
1 Sepal.Width 150 non-null float64
2 Petal.Length 150 non-null float64
3 Petal.Width 150 non-null float64
4 Species_setosa 150 non-null uint8
5 Species_versicolor 150 non-null uint8
6 Species_virginica 150 non-null uint8
iris_one_hot.iloc[[0,50,100],4:]
Species_setosa Species_versicolor Species_virginica
0 1 0 0
50 0 1 0
100 0 0 1
๋ถ๋ฅ๋ถ์ : ๋
๋ฆฝ๋ณ์(x) 4๊ฐ -> ์ข
์๋ณ์(y) 3๊ฐ
2. ๊ฐ๋ณ์(dummy) : k-1๊ฐ ๊ฐ๋ณ์ ์์ฑ
iris_dummy = pd.get_dummies(data=iris, columns = ['Species'], drop_first=True)
iris_dummy.info()
0 Sepal.Length 150 non-null float64
1 Sepal.Width 150 non-null float64
2 Petal.Length 150 non-null float64
3 Petal.Width 150 non-null float64
4 Species_versicolor 150 non-null uint8 -> ๋๋ฏธ๋ณ์ : ์นผ๋ผ๋ช
_๊ฐ
5 Species_virginica 150 non-null uint8
setosa : ๊ธฐ์ค(base)๋ฅผ ์ ์ธํ ๋๋จธ์ง 2๊ฐ ๋ณ์ ์์ฑ
* ํ๊ท๋ถ์ : ๋
๋ฆฝ๋ณ์(x) 5๊ฐ(3๊ฐ ์ฐ์ํ/2๊ฐ ๋๋ฏธ), ์ข
์๋ณ์(y) 1๊ฐ
3. ๋ฒ์ ๊ธฐ์ค ๋ณ๊ฒฝ 'versicolor(base)' -> 'virginica' -> 'setosa'
1) object๋ฅผ ์์ ๋ณ๊ฒฝ ๊ฐ๋ฅํ categoryํ์ผ๋ก ๋ณ๊ฒฝ
iris['Species2'] = iris['Species'].astype('category') #object -> category
iris.info()
2) ์์ ๋ณ๊ฒฝ
iris['Species2'] = iris['Species2'].cat.set_categories(['versicolor', 'virginica', 'setosa'])
['versicolor', 'virginica', 'setosa']
iris['Species2'].value_counts()
versicolor 50
virginica 50
setosa 50
3)dummy ๋ณ์
iris_dummy2 = pd.get_dummies(data=iris, columns = ['Species2'],
drop_first=True)
iris_dummy2.info()
0 Sepal.Length 150 non-null float64
1 Sepal.Width 150 non-null float64
2 Petal.Length 150 non-null float64
3 Petal.Width 150 non-null float64
4 Species 150 non-null object
5 Species2_virginica 150 non-null uint8
6 Species2_setosa 150 non-null uint8
4. ๋ ์ด๋ธ ์ธ์ฝ๋ฉ(label encoding) : ๋ฌธ์์ด -> 10์ง์ ๋ณํ
iris.info()
iris.Species.value_counts()
versicolor 50
setosa 50
virginica 50
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt #์๊ฐํ ๋๊ตฌ
์์ฑ์ -> object
encoder = LabelEncoder()
object.member()
encoder.fit(iris.Species)
data๋ณํ : label -> 10์ง์
labels = encoder.transform(iris.Species)
print(labels) #0~2
versicolor 0
setosa 1
virginica 2
plt.scatter(x=iris['Sepal.Length'],
y=iris['Petal.Length'], c=labels)