
83. 2020 Kaggle Machine Learning & Data Science Survey (1)

LEE_BOMB 2021. 12. 18. 16:32
Dataset & Learning Objectives

https://www.kaggle.com/c/kaggle-survey-2020/data 

kaggle_survey_2020_responses.csv.zip
1.94MB

 

kaggle_survey_2020_responses.csv : responses from Kaggle users on age, gender, highest level of education, country, experience, main programming language, and so on. (39+ questions / 20,036 responses)

 

 

 

 

 

 

Visualizing gender and age
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


import seaborn as sns
import matplotlib.pyplot as plt

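from IPython.display import set_matplotlib_formats
set_matplotlib_formats("retina") #deprecated since IPython 7.23; newer versions provide it in matplotlib_inline.backend_inline

plt.style.use("seaborn-whitegrid") #renamed to "seaborn-v0_8-whitegrid" in matplotlib 3.6+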

 

Loading the data

raw = pd.read_csv(r"../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", low_memory=False)
raw.shape #(20037, 355)

 

Checking the data

raw.head()

 

Getting row 0 (* iloc : selects by integer position)

question = raw.iloc[0]
question
'''
Time from Start to Finish (seconds)                                Duration (in seconds)
Q1                                                           What is your age (# years)?
Q2                                                What is your gender? - Selected Choice
Q3                                             In which country do you currently reside?
Q4                                     What is the highest level of formal education ...
                                                             ...                        
Q35_B_Part_7                           In the next 2 years, do you hope to become mor...
Q35_B_Part_8                           In the next 2 years, do you hope to become mor...
Q35_B_Part_9                           In the next 2 years, do you hope to become mor...
Q35_B_Part_10                          In the next 2 years, do you hope to become mor...
Q35_B_OTHER                            In the next 2 years, do you hope to become mor...
Name: 0, Length: 355, dtype: object
'''

 

Removing row 0

answer = raw.drop([0])
answer
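
The split between the question row and the answer rows can be checked on a toy frame. This is only a sketch: the names toy_raw, toy_question and toy_answer and the values in them are made up for illustration, and only the shape of the data (question text in row 0, responses below) follows the survey file.

```python
import pandas as pd

# Toy frame shaped like the survey file: row 0 holds the question text,
# the rows below hold responses. All values are made up.
toy_raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "18-21"],
    "Q2": ["What is your gender? - Selected Choice", "Man", "Woman"],
})

toy_question = toy_raw.iloc[0]   # Series indexed by column name
toy_answer = toy_raw.drop([0])   # drops index label 0; labels 1 and 2 remain

print(toy_question["Q1"])
print(toy_answer.index.tolist())
```

Note that drop([0]) does not renumber the index, which is why the info() output of the real data runs from 1 to 20036.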

 

Info on the new object

answer.info()
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20036 entries, 1 to 20036
Columns: 355 entries, Time from Start to Finish (seconds) to Q35_B_OTHER -> 355 columns
dtypes: object(355)
memory usage: 54.4+ MB -> memory used
'''

 

 

 

Q1 What is your age (# years)?

question["Q1"] #'What is your age (# years)?'

 

Counting frequencies .value_counts()

answer["Q1"].value_counts(normalize=True) #normalize=True : return proportions instead of counts
'''
25-29    0.200190
22-24    0.188960
18-21    0.173138
30-34    0.140297
35-39    0.099371
40-44    0.069724
45-49    0.049311
50-54    0.034837
55-59    0.020513
60-69    0.019864
70+      0.003793
Name: Q1, dtype: float64
'''
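
A small sketch of what normalize=True changes, on a made-up Series (the name ages and its values are assumptions for illustration). Missing values are left out by default, so the proportions are taken over the answered rows only.

```python
import pandas as pd

# Made-up age brackets; the last entry is a missing response.
ages = pd.Series(["25-29", "25-29", "18-21", "22-24", None])

counts = ages.value_counts()                 # absolute counts, NaN excluded
shares = ages.value_counts(normalize=True)   # counts divided by the non-null total

print(counts)
print(shares)
```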

 

Sorting by index .sort_index()

Q1 = answer['Q1'].value_counts().sort_index().plot() #.plot() draws a line graph

 

 

Drawing a bar chart with seaborn's countplot

sns.countplot(data=answer.sort_values("Q1"), x="Q1", #sort the data by Q1 and put Q1 on the x-axis
              palette="Blues_r").set_title(question["Q1"])
#Blues_r : colors run from dark to light in bar order (by position, not by frequency), set_title : graph title

 

 

 

 

Q2 What is your gender? - Selected Choice

question_no = "Q2"
Q2 = answer[question_no].value_counts()
Q2
'''
Man                        15789 -> male
Woman                       3878 -> female
Prefer not to say            263 -> declined to answer
Prefer to self-describe       54
Nonbinary                     52
Name: Q2, dtype: int64
'''

 

Bar chart visualization

sns.countplot(data=answer, 
              y=question_no).set_title(question[question_no])

 

Visualizing the frequencies of two variables .crosstab

q1q2 = pd.crosstab(answer["Q1"], answer["Q2"])
q1q2[["Man", "Woman"]].plot.bar(rot=0) #rot=0 : draw the tick labels upright (no rotation)

 

q1q2[["Man", "Woman"]].sort_index(ascending=False).plot.barh(figsize=(10, 6), title="Age & Gender")
'''
sort by index -> .sort_index(ascending=False)
change the figure size -> plot.barh(figsize=(10, 6))
set the graph title -> title="Age & Gender"
'''
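
To see what pd.crosstab returns before it is plotted, here is a minimal sketch on made-up data (the frame toy and its values are assumptions). Rows come from the first argument, columns from the second, and each cell is the number of rows with that combination.

```python
import pandas as pd

# Made-up responses: age bracket (Q1) and gender (Q2).
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "22-24", "22-24", "22-24"],
    "Q2": ["Man", "Woman", "Man", "Man", "Woman"],
})

table = pd.crosstab(toy["Q1"], toy["Q2"])  # index: Q1 values, columns: Q2 values
print(table)
```

Selecting table[["Man", "Woman"]] then keeps only those two columns, which is what the plotting calls above rely on.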

 

Drawing the same graph with seaborn

plt.figure(figsize=(10, 6)) #set the figure size
sns.countplot(data=answer.sort_values("Q1"), x="Q1", hue="Q2").set_title("Age & Gender")

'''
sort by Q1 -> .sort_values("Q1")
put the Q1 responses on the x-axis -> x="Q1"
color the bars by Q2 -> hue="Q2"
'''

 

 

 

 

 

Visualizing region and occupation (writing a function for the repeated work)

Q3 In which country do you currently reside?

question["Q3"] #'In which country do you currently reside?'

 

Defining a function

def show_countplot_by_qno(qno):
    sns.countplot(data=answer, y=qno).set_title(question[qno])
    
show_countplot_by_qno("Q1") #passing "Q2" instead draws the graph for the second question

Adjusting the function's parameters to the data

def show_countplot_by_qno(qno, fsize=(10,6)):
    plt.figure(figsize=fsize) #take fsize as a parameter and use it as the figure size
    sns.countplot(data=answer, y=qno).set_title(question[qno])
    
show_countplot_by_qno("Q3", fsize=(10,12)) #the default is (10, 6), but this call draws at (10, 12)

(10, 6) -> (10, 12)

 

Defining a function 2

def show_countplot_by_qno(qno, fsize=(10, 6), order=None):  
    """
    qno : question_no, ex) Q12 -> question number
    fsize : figsize default (10, 6) -> default figure size
    order : optional order list, default value_counts().index -> category order
    """
    if order is None:
        order = answer[qno].value_counts().index #without an explicit order, categories are sorted by descending frequency
        
    plt.figure(figsize=fsize)
    sns.countplot(data=answer, 
                  y=qno,
                  order=order,
                  palette="Blues_r"
                 ).set_title(question[qno])

 

Visualization

show_countplot_by_qno("Q3", fsize=(12, 12))
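
The default-order logic can be pulled out and checked without drawing anything. The helper name resolve_order below is hypothetical and not part of the original notebook. It tests `order is None` so that a pandas Index can also be passed as the order, since asking for the truth value of an Index raises ValueError.

```python
import pandas as pd

def resolve_order(series, order=None):
    """Return the category order to plot: the caller's, else most frequent first."""
    if order is None:
        return series.value_counts().index.tolist()
    return list(order)

# Made-up categories: "b" appears 3 times, "a" twice, "c" once.
toy = pd.Series(["b", "a", "b", "c", "b", "a"])

default_order = resolve_order(toy)
custom_order = resolve_order(toy, order=pd.Index(["c", "b", "a"]))

print(default_order)
print(custom_order)
```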

 

 

 

Q4 🎓 What is the highest level of formal education that you have attained or plan to attain within the next 2 years?

show_countplot_by_qno("Q4")

 

 

 

Q5 Select the title most similar to your current role (or most recent title if retired):

show_countplot_by_qno("Q5")

 

 

 

 

 

Q6 For how many years have you been writing code and/or programming?

show_countplot_by_qno("Q6")

 

Sorting in a specific order

q6_cols = ['I have never written code', '< 1 years', '1-2 years', '3-5 years', '5-10 years',  
       '10-20 years', '20+ years'] #desired category order, given as a list

show_countplot_by_qno("Q6", order=q6_cols) #pass it through the order option
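
As an alternative to passing a list every time, the order can be attached to the data itself with an ordered categorical. This is a different technique from the one used in this post, shown only as a sketch. The names toy_q6 and as_cat and the toy responses are assumptions.

```python
import pandas as pd

q6_order = ['I have never written code', '< 1 years', '1-2 years', '3-5 years',
            '5-10 years', '10-20 years', '20+ years']

# Made-up responses to Q6.
toy_q6 = pd.Series(['3-5 years', '< 1 years', '20+ years', '3-5 years'])

# Ordered categorical: sorting now follows q6_order, not the alphabet.
as_cat = toy_q6.astype(pd.CategoricalDtype(categories=q6_order, ordered=True))

# value_counts on a categorical keeps unobserved categories with a count of 0.
counts = as_cat.value_counts().sort_index()
print(counts)
```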

 

 

 

 

 

Fetching several columns that share a naming pattern with pandas filter

Q7 What programming languages do you use on a regular basis? (Select all that apply)

show_countplot_by_qno("Q7") #KeyError

'''
From question 7 onward, the function used so far cannot be applied.
WHY? Question 7 is multiple choice, so its answers are spread over several columns and there is no single "Q7" column
'''

 

Getting the question text

#get the text of questions 1-6
question["Q1"]

#get the text of question 7
question.filter(regex="Q7").iloc[0].split("-")[0]
'''
regex : keep only the entries whose label contains "Q7"
.iloc[0] : take the first of the matching question texts
split("-") : split the string on the "-" character (returns a list)
[0] : take only the first element of that list
'''
#'What programming languages do you use on a regular basis? (Select all that apply) '
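
The same steps on a toy Series, as a sketch. The labels follow the Qn_Part_k pattern seen in the question output above (e.g. Q35_B_Part_7), but the exact Q7 labels and the " - Selected Choice - ..." suffixes here are assumptions made for illustration.

```python
import pandas as pd

# Toy question row; labels and suffixes are illustrative assumptions.
toy_question = pd.Series({
    "Q6": "For how many years have you been writing code and/or programming?",
    "Q7_Part_1": "What programming languages do you use on a regular basis? "
                 "(Select all that apply) - Selected Choice - Python",
    "Q7_Part_2": "What programming languages do you use on a regular basis? "
                 "(Select all that apply) - Selected Choice - R",
    "Q17_Part_1": "Toy question 17 - Selected Choice - A",
})

q7 = toy_question.filter(regex="Q7")        # matches on the index labels
title = q7.iloc[0].split("-")[0].strip()    # text before the first "-"

print(list(q7.index))
print(title)
```

Splitting on "-" assumes the question text itself contains no hyphen; a question with one would be cut short.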

 

Inspecting the data

answer_Q7 = answer.filter(regex="Q7")
answer_Q7

 

Viewing null values

answer_Q7.isnull() #True where the value is null
answer_Q7.notnull() #True where the value is not null
answer_Q7.notnull().sum() #number of responses per column
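
A sketch of why summing notnull() counts the selections: in a multiple-choice block each column holds either the option name or a missing value, and True sums as 1. The frame toy_q7 and its column labels are made up.

```python
import numpy as np
import pandas as pd

# Made-up multiple-choice block: a cell holds the option name if it was
# selected, NaN otherwise.
toy_q7 = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan],
    "Q7_Part_2": [np.nan, "R", np.nan],
})

selected = toy_q7.notnull().sum()   # True counts as 1, so this is a per-column tally
print(selected)
```

DataFrame.count() gives the same numbers in one call.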

 

Viewing descriptive statistics .describe()

answer_Q7_desc = answer_Q7.describe()
answer_Q7_desc

 

indexing

#extract only the top and count rows
answer_Q7_count = answer_Q7_desc.loc[["top", "count"]].T #double brackets are required: single-bracket .loc["top", "count"] raises KeyError; .T swaps rows and columns

#make top the index
answer_Q7_count = answer_Q7_count.set_index("top")

#sort by count
answer_Q7_count = answer_Q7_count.sort_values("count", ascending=False)

answer_Q7_count
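
The describe-then-reshape steps on a toy block, as a sketch with made-up data (toy_q7 and summary are illustrative names). Because each column holds a single option name, top is that name and count is how often it was selected. The explicit astype(int) is an addition of this sketch: after .T the count column has object dtype, and a numeric dtype is safer for plotting.

```python
import numpy as np
import pandas as pd

# Made-up multiple-choice block, one column per option.
toy_q7 = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan],
    "Q7_Part_2": [np.nan, "R", np.nan],
    "Q7_Part_3": ["SQL", "SQL", "SQL"],
})

desc = toy_q7.describe()                     # rows: count, unique, top, freq
summary = desc.loc[["top", "count"]].T       # one row per column, columns: top, count
summary = summary.set_index("top")           # option name becomes the index
summary["count"] = summary["count"].astype(int)
summary = summary.sort_values("count", ascending=False)

print(summary)
```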

 

 

 

How to get title in multiple choice question

q7_title = question.filter(regex="Q7").iloc[0].split("-")[0] #select only the questions containing Q7 and keep the shared title

 

Frequency graph of the languages used most

sns.barplot(data=answer_Q7_count, 
            y=answer_Q7_count.index, x="count", palette="Blues_r").set_title(q7_title)

 

 

 

 

 


Reference https://inf.run/h8B9
https://www.kaggle.com/corazzon/how-to-use-pandas-filter-in-survey-eda