83. 2020 Kaggle Machine Learning & Data Science Survey (1)
Dataset & Learning Objectives
https://www.kaggle.com/c/kaggle-survey-2020/data
kaggle_survey_2020_responses.csv: contains Kaggle users' answers on age, gender, highest level of education, country, experience, main programming language, and more. (39+ questions / 20,036 responses)
Visualizing gender and age
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import set_matplotlib_formats
set_matplotlib_formats("retina")
plt.style.use("seaborn-whitegrid")
Loading the data
raw = pd.read_csv(r"../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", low_memory=False)
raw.shape  # (20037, 355)
Checking the data
raw.head()
Getting row 0 (* iloc: selects by integer position)
question = raw.iloc[0]
question
'''
Time from Start to Finish (seconds) Duration (in seconds)
Q1 What is your age (# years)?
Q2 What is your gender? - Selected Choice
Q3 In which country do you currently reside?
Q4 What is the highest level of formal education ...
...
Q35_B_Part_7 In the next 2 years, do you hope to become mor...
Q35_B_Part_8 In the next 2 years, do you hope to become mor...
Q35_B_Part_9 In the next 2 years, do you hope to become mor...
Q35_B_Part_10 In the next 2 years, do you hope to become mor...
Q35_B_OTHER In the next 2 years, do you hope to become mor...
Name: 0, Length: 355, dtype: object
'''
Removing row 0
answer = raw.drop([0])
answer
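As a minimal sketch of the split above, the toy frame below mimics the file layout: row 0 carries the question text and the remaining rows carry answers. The names `raw_toy`, `question_toy`, and `answer_toy` and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the survey file (values are invented for illustration).
raw_toy = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "18-21"],
    "Q2": ["What is your gender? - Selected Choice", "Man", "Woman"],
})

question_toy = raw_toy.iloc[0]      # Series: column name -> question text
answer_toy = raw_toy.drop(index=0)  # every row except the question row

print(question_toy["Q1"])
print(answer_toy.shape)
```

Note that `drop` keeps the original labels, so the answer rows start at index 1, which matches the `1 to 20036` range reported by `answer.info()`.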
Info on the new object
answer.info()
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20036 entries, 1 to 20036
Columns: 355 entries, Time from Start to Finish (seconds) to Q35_B_OTHER -> 355 columns
dtypes: object(355)
memory usage: 54.4+ MB -> memory usage
'''
Q1 What is your age (# years)?
question["Q1"] #'What is your age (# years)?'
Counting frequencies: .value_counts()
answer["Q1"].value_counts(normalize=True)  # normalize=True: return proportions instead of counts
'''
25-29 0.200190
22-24 0.188960
18-21 0.173138
30-34 0.140297
35-39 0.099371
40-44 0.069724
45-49 0.049311
50-54 0.034837
55-59 0.020513
60-69 0.019864
70+ 0.003793
Name: Q1, dtype: float64
'''
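The difference between raw counts and proportions can be sketched on a small made-up Series (the `ages` values below are not survey data):

```python
import pandas as pd

ages = pd.Series(["25-29", "18-21", "25-29", "22-24"], name="Q1")  # invented answers

counts = ages.value_counts()                # absolute counts, most frequent first
shares = ages.value_counts(normalize=True)  # proportions that sum to 1
by_label = counts.sort_index()              # reorder by age label instead of by count

print(counts)
print(shares)
print(by_label)
```

Sorting by index works for these age bands because the labels happen to sort correctly as strings.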
Sorting by index: .sort_index()
Q1 = answer['Q1'].value_counts().sort_index().plot()  # .plot() draws a line chart
Drawing a bar chart with seaborn's countplot
sns.countplot(data=answer.sort_values("Q1"), x="Q1",  # sort the data by Q1 and put Q1 on the x-axis
              palette="Blues_r").set_title(question["Q1"])
# Blues_r: reversed Blues palette, so bars run from dark to light in plotting order; set_title: chart title
Q2 What is your gender? - Selected Choice
question_no = "Q2"
Q2 = answer[question_no].value_counts()
Q2
'''
Man 15789 -> male
Woman 3878 -> female
Prefer not to say 263 -> no answer
Prefer to self-describe 54
Nonbinary 52
Name: Q2, dtype: int64
'''
Bar chart visualization
sns.countplot(data=answer,
y=question_no).set_title(question[question_no])
Visualizing frequencies across two variables: pd.crosstab
q1q2 = pd.crosstab(answer["Q1"], answer["Q2"])
q1q2[["Man", "Woman"]].plot.bar(rot=0)  # rot=0 keeps the age labels horizontal
q1q2[["Man", "Woman"]].sort_index(ascending=False).plot.barh(figsize=(10, 6), title="Age & Gender")
'''
sort by index -> .sort_index(ascending=False)
change the figure size -> plot.barh(figsize=(10, 6))
set the chart title -> title="Age & Gender"
'''
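What `pd.crosstab` returns is easier to see on a tiny example. The sketch below uses an invented frame `toy` with the same two column names; the counts are not survey results.

```python
import pandas as pd

toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "22-24", "22-24", "22-24"],
    "Q2": ["Man", "Woman", "Man", "Man", "Woman"],
})

# Rows are the unique Q1 values, columns the unique Q2 values,
# and each cell counts how often that combination occurs.
table = pd.crosstab(toy["Q1"], toy["Q2"])
print(table)
```

Because the result is an ordinary DataFrame, column selection such as `table[["Man", "Woman"]]` and `.plot.bar()` work on it directly.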
Drawing the chart with seaborn
plt.figure(figsize=(10, 6))  # adjust the figure size
sns.countplot(data=answer.sort_values("Q1"), x="Q1", hue="Q2").set_title("Age & Gender")
'''
sort by Q1 -> .sort_values("Q1")
Q1 answers on the x-axis -> x="Q1"
color the bars by Q2 -> hue="Q2"
'''
Visualizing region and occupation (writing a function for the repeated work)
Q3 In which country do you currently reside?
question["Q1"] #'What is your age (# years)?'
Defining a function
def show_countplot_by_qno(qno):
    sns.countplot(data=answer, y=qno).set_title(question[qno])
show_countplot_by_qno("Q1")  # passing "Q2" instead plots the second question
Adjusting the function's options to fit the data
def show_countplot_by_qno(qno, fsize=(10,6)):
    plt.figure(figsize=fsize)  # take fsize as a parameter to set the figure size
    sns.countplot(data=answer, y=qno).set_title(question[qno])
show_countplot_by_qno("Q3", fsize=(10,12))  # the default is (10, 6), but this call draws at (10, 12)
Defining a function, part 2
def show_countplot_by_qno(qno, fsize=(10, 6), order=None):
    """
    qno : question_no, ex) Q12 -> question number
    fsize : figsize default (10, 6) -> default figure size
    order : optional order list, default value_counts().index -> sets the bar order
    """
    if order is None:
        order = answer[qno].value_counts().index  # with no order given, bars are sorted by descending frequency
    plt.figure(figsize=fsize)
    sns.countplot(data=answer,
                  y=qno,
                  order=order,
                  palette="Blues_r"
                  ).set_title(question[qno])
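One detail about the `order` default is worth a short sketch. A truthiness test such as `if not order` is fine for a plain list, but pandas raises a ValueError when an Index is evaluated as a boolean, whereas comparing against `None` handles both. The helper `resolve_order` below is hypothetical and only isolates that logic, without any plotting.

```python
import pandas as pd

def resolve_order(values, order=None):
    """Hypothetical helper: return the explicit order, or the categories by descending frequency."""
    if order is None:  # safe for lists and for pandas Index objects alike
        order = values.value_counts().index
    return list(order)

s = pd.Series(["b", "a", "b", "c", "b", "a"])  # invented data

default_order = resolve_order(s)                                     # by frequency
explicit_order = resolve_order(s, order=pd.Index(["a", "b", "c"]))   # caller-supplied
print(default_order)
print(explicit_order)
```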
Visualization
show_countplot_by_qno("Q3", fsize=(12, 12))
Q4 What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
show_countplot_by_qno("Q4")
Q5 Select the title most similar to your current role (or most recent title if retired):
show_countplot_by_qno("Q5")
Q6 For how many years have you been writing code and/or programming?
show_countplot_by_qno("Q6")
Sorting in a specific order
q6_cols = ['I have never written code', '< 1 years', '1-2 years', '3-5 years', '5-10 years',
           '10-20 years', '20+ years']  # the categories as a list, in the desired order
show_countplot_by_qno("Q6", order=q6_cols)  # pass the order option
Getting multiple columns that follow a naming pattern with pandas filter
Q7 What programming languages do you use on a regular basis? (Select all that apply)
show_countplot_by_qno("Q7") #KeyError
'''
From question 7 onward, the function used so far can no longer be used for the analysis.
Why? Question 7 is made up of several sub-items, so there is no single "Q7" column.
'''
Getting the question text
# get the text of questions 1-6
question["Q1"]
# get the text of question 7
question.filter(regex="Q7").iloc[0].split("-")[0]
'''
regex : keep only the items whose label contains "Q7"
split("-") : split the string on the "-" character (returns a list)
[0] : take only element 0 of that list
'''
#'What programming languages do you use on a regular basis? (Select all that apply) '
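Since `filter(regex=...)` matches anywhere in the label, a short pattern can pick up more than intended: `"Q1"` also matches `Q17_Part_1`, for example. The sketch below shows an anchored pattern as one possible safeguard. The Series `question_toy` and its texts are shortened placeholders, not the full survey wording.

```python
import pandas as pd

# Toy question row; the texts are abbreviated placeholders.
question_toy = pd.Series({
    "Q1": "What is your age (# years)?",
    "Q7_Part_1": "What programming languages do you use on a regular basis? - Selected Choice - Python",
    "Q7_Part_2": "What programming languages do you use on a regular basis? - Selected Choice - R",
    "Q17_Part_1": "Placeholder text for another question - Selected Choice - Option",
})

loose = question_toy.filter(regex="Q1")       # substring match: also returns Q17_Part_1
strict = question_toy.filter(regex=r"^Q7_")   # only labels that start with "Q7_"

# Text before the first "-" is the shared question title.
title = strict.iloc[0].split("-")[0].strip()
print(list(loose.index))
print(list(strict.index))
print(title)
```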
Looking at the data
answer_Q7 = answer.filter(regex="Q7")
answer_Q7
Checking null values
answer_Q7.isnull()  # True where the value is null
answer_Q7.notnull()  # True where the value is not null
answer_Q7.notnull().sum()  # number of responses per column
Descriptive statistics: .describe()
answer_Q7_desc = answer_Q7.describe()
answer_Q7_desc
indexing
# extract only the "top" and "count" rows
answer_Q7_count = answer_Q7_desc.loc[["top", "count"]].T  # a single pair of brackets raises a KeyError; .T swaps rows and columns
# make "top" the index
answer_Q7_count = answer_Q7_count.set_index("top")
# sort by count
answer_Q7_count = answer_Q7_count.sort_values("count", ascending=False)
answer_Q7_count
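The describe-then-reshape steps can be traced on a toy multi-select table. The sketch assumes the layout seen in the Q7 columns, where each column holds one option's name or a missing value. The frame `answer_toy` and its values are invented.

```python
import pandas as pd

# Each column is one answer option; None means the option was not selected.
answer_toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", "Python", None],
    "Q7_Part_2": ["R", None, None, None],
    "Q7_Part_3": [None, "SQL", "SQL", None],
})

desc = answer_toy.describe()  # rows: count, unique, top, freq

count_table = (
    desc.loc[["top", "count"]]  # keep the option name and its non-null count
    .T                          # one row per original column
    .set_index("top")           # index by option name instead of column name
    .sort_values("count", ascending=False)
)
print(count_table)
```

This works because every column contains a single distinct value, so `top` is simply that option's name and `count` is the number of respondents who selected it.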
How to get title in multiple choice question
q7_title = question.filter(regex="Q7").iloc[0].split("-")[0]  # find only the question items containing Q7
Frequency chart of regularly used languages
sns.barplot(data=answer_Q7_count,
y=answer_Q7_count.index, x="count", palette="Blues_r").set_title(q7_title)
Reference https://inf.run/h8B9
https://www.kaggle.com/corazzon/how-to-use-pandas-filter-in-survey-eda