83. 2020 Kaggle Machine Learning & Data Science Survey (1)

LEE_BOMB 2021. 12. 18. 16:32

Dataset & Learning Goals

https://www.kaggle.com/c/kaggle-survey-2020/data

kaggle_survey_2020_responses.csv : Kaggle users' responses covering age, gender, highest level of education, country, experience, main programming language, and more. (39+ questions / 20,036 responses)

Visualizing Gender and Age

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


import seaborn as sns
import matplotlib.pyplot as plt

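from IPython.display import set_matplotlib_formats  # deprecated in newer IPython; matplotlib_inline.backend_inline provides the same function
set_matplotlib_formats("retina")

plt.style.use("seaborn-whitegrid")  # named "seaborn-v0_8-whitegrid" in matplotlib 3.6+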

Loading the data

raw = pd.read_csv(r"../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", low_memory=False)
raw.shape  # (20037, 355)

Inspecting the data

raw.head()

Getting row 0 ( * iloc : selects by integer position)

question = raw.iloc[0]
question
'''
Time from Start to Finish (seconds)                                Duration (in seconds)
Q1                                                           What is your age (# years)?
Q2                                                What is your gender? - Selected Choice
Q3                                             In which country do you currently reside?
Q4                                     What is the highest level of formal education ...
                                                             ...                        
Q35_B_Part_7                           In the next 2 years, do you hope to become mor...
Q35_B_Part_8                           In the next 2 years, do you hope to become mor...
Q35_B_Part_9                           In the next 2 years, do you hope to become mor...
Q35_B_Part_10                          In the next 2 years, do you hope to become mor...
Q35_B_OTHER                            In the next 2 years, do you hope to become mor...
Name: 0, Length: 355, dtype: object
'''

Dropping row 0

answer = raw.drop([0])
answer
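
The split between the question row and the response rows can be tried on a tiny stand-in table. This is a minimal sketch: `raw_demo` and its two rows of answers are made-up data, not values from the survey file.

```python
import pandas as pd

# Hypothetical miniature of the survey file: row 0 holds the question text,
# the remaining rows hold actual responses.
raw_demo = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "18-21"],
    "Q2": ["What is your gender? - Selected Choice", "Man", "Woman"],
})

question_demo = raw_demo.iloc[0]       # Series: column name -> question text
answer_demo = raw_demo.drop(index=0)   # everything except the question row

print(question_demo["Q1"])
print(answer_demo.shape)
```

Note that `drop` keeps the original index labels, so the responses start at index 1, which matches the `1 to 20036` range that `answer.info()` reports.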

Info on the new object

answer.info()
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20036 entries, 1 to 20036
Columns: 355 entries, Time from Start to Finish (seconds) to Q35_B_OTHER
dtypes: object(355)
memory usage: 54.4+ MB
'''

Q1 What is your age (# years)?

question["Q1"] #'What is your age (# years)?'

Counting frequencies with .value_counts()

answer["Q1"].value_counts(normalize=True)  # normalize=True returns proportions instead of counts
'''
25-29    0.200190
22-24    0.188960
18-21    0.173138
30-34    0.140297
35-39    0.099371
40-44    0.069724
45-49    0.049311
50-54    0.034837
55-59    0.020513
60-69    0.019864
70+      0.003793
Name: Q1, dtype: float64
'''
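
To see the difference between raw counts and proportions, here is a small sketch on made-up age bins (`ages` is a hypothetical Series, not survey data).

```python
import pandas as pd

# Hypothetical responses to an age question
ages = pd.Series(["25-29", "18-21", "25-29", "30-34"], name="Q1")

counts = ages.value_counts()                 # absolute frequencies
shares = ages.value_counts(normalize=True)   # frequencies divided by the total

print(counts)
print(shares)
```

The proportions always add up to 1, which makes them easier to compare across questions with different numbers of respondents.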

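Sorting by index with .sort_index()

Q1 = answer['Q1'].value_counts().sort_index().plot()  # .plot() draws a line chart by default

Drawing a bar chart with seaborn's countplot

sns.countplot(data=answer.sort_values("Q1"), x="Q1",  # sort rows by Q1 so the age bins appear in order on the x-axis
              palette="Blues_r").set_title(question["Q1"])
# Blues_r : the reversed Blues palette, applied in bar order from dark to light; set_title : chart title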

Q2 What is your gender? - Selected Choice

question_no = "Q2"
Q2 = answer[question_no].value_counts()
Q2
'''
Man                        15789
Woman                       3878
Prefer not to say            263
Prefer to self-describe       54
Nonbinary                     52
Name: Q2, dtype: int64
'''

Bar chart

sns.countplot(data=answer, 
              y=question_no).set_title(question[question_no])

Plotting frequencies for two variables with pd.crosstab

q1q2 = pd.crosstab(answer["Q1"], answer["Q2"])
q1q2[["Man", "Woman"]].plot.bar(rot=0)  # rot=0 keeps the x tick labels horizontal

q1q2[["Man", "Woman"]].sort_index(ascending=False).plot.barh(figsize=(10, 6), title="Age & Gender")
'''
Sort by index -> .sort_index(ascending=False)
Change the figure size -> plot.barh(figsize=(10, 6))
Set the chart title -> title="Age & Gender"
'''
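
What `pd.crosstab` returns is easier to see on a handful of rows. A minimal sketch with made-up data (`demo` is hypothetical); the `normalize="index"` call is an extra not used in the notebook, shown only as one way to compare shares within each age bin.

```python
import pandas as pd

# Hypothetical age/gender answers
demo = pd.DataFrame({
    "Q1": ["18-21", "18-21", "25-29", "25-29", "25-29"],
    "Q2": ["Man", "Woman", "Man", "Man", "Woman"],
})

# Rows = age bins, columns = gender, cells = number of respondents
table = pd.crosstab(demo["Q1"], demo["Q2"])

# Same table, but each row is divided by its own total
row_shares = pd.crosstab(demo["Q1"], demo["Q2"], normalize="index")

print(table)
print(row_shares)
```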

Drawing the same chart with seaborn

plt.figure(figsize=(10, 6))  # set the figure size
sns.countplot(data=answer.sort_values("Q1"), x="Q1", hue="Q2").set_title("Age & Gender")

'''
Sort by Q1 -> .sort_values("Q1")
Put the Q1 responses on the x-axis -> x="Q1"
Color the bars by Q2 -> hue="Q2"
'''

Visualizing Country and Job (wrapping repeated work in a function)

Q3 In which country do you currently reside?

question["Q3"] #'In which country do you currently reside?'

Defining a function

def show_countplot_by_qno(qno):
    sns.countplot(data=answer, y=qno).set_title(question[qno])
    
show_countplot_by_qno("Q1")  # pass "Q2" instead to plot the second question

Adjusting the function to the data

def show_countplot_by_qno(qno, fsize=(10,6)):
    plt.figure(figsize=fsize)  # take fsize as a parameter to set the figure size
    sns.countplot(data=answer, y=qno).set_title(question[qno])
    
show_countplot_by_qno("Q3", fsize=(10,12))  # the default is (10, 6), but this call draws at (10, 12)

Defining a function, part 2

def show_countplot_by_qno(qno, fsize=(10, 6), order=None):  
    """
    qno : question number, e.g. "Q12"
    fsize : figure size, default (10, 6)
    order : optional list giving the bar order, default value_counts().index
    """
    if order is None:
        order = answer[qno].value_counts().index  # with no order given, sort by descending frequency
        
    plt.figure(figsize=fsize)
    sns.countplot(data=answer, 
                  y=qno,
                  order=order,
                  palette="Blues_r"
                 ).set_title(question[qno])
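
A note on the default check for `order`: testing an argument's truthiness works for plain lists, but a pandas Index refuses to be converted to a single True/False, so comparing against None is the safer pattern. The sketch below isolates that logic; `resolve_order` and the sample Series are hypothetical names used only for illustration.

```python
import pandas as pd

def resolve_order(values, order=None):
    """Return the category order to plot: the caller's order if given,
    otherwise categories sorted by descending frequency."""
    if order is None:
        order = values.value_counts().index
    return list(order)

answers = pd.Series(["a", "b", "b"])

# A pandas Index has no single truth value, so bool() on it raises ValueError
try:
    bool(pd.Index(["a", "b"]))
    truthiness_is_ambiguous = False
except ValueError:
    truthiness_is_ambiguous = True

print(truthiness_is_ambiguous)
print(resolve_order(answers))                         # frequency order
print(resolve_order(answers, pd.Index(["a", "b"])))   # caller's order
```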

Plotting

show_countplot_by_qno("Q3", fsize=(12, 12))

Q4 🎓 What is the highest level of formal education that you have attained or plan to attain within the next 2 years?

show_countplot_by_qno("Q4")

Q5 Select the title most similar to your current role (or most recent title if retired):

show_countplot_by_qno("Q5")

Q6 For how many years have you been writing code and/or programming?

show_countplot_by_qno("Q6")

Sorting in a specific order

q6_cols = ['I have never written code', '< 1 years', '1-2 years', '3-5 years', '5-10 years',  
       '10-20 years', '20+ years']  # answer categories listed in the desired order

show_countplot_by_qno("Q6", order=q6_cols)  # pass the list through the order option
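
The same custom ordering can be applied to the counts themselves, without plotting. A minimal sketch on made-up answers (`experience` and `order` are hypothetical), using `reindex` to rearrange the result of `value_counts()`:

```python
import pandas as pd

# Hypothetical answers to a "years of coding" question
experience = pd.Series(
    ["1-2 years", "< 1 years", "1-2 years", "20+ years"], name="Q6"
)

# Logical order of the categories, shortest experience first
order = ["< 1 years", "1-2 years", "20+ years"]

counts_by_frequency = experience.value_counts()      # most common first
counts_in_order = counts_by_frequency.reindex(order)  # rearranged to follow `order`

print(counts_in_order)
```

Alphabetical sorting would not help here, since labels such as "< 1 years" and "10-20 years" do not sort in their natural order as strings.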

Selecting multiple regularly named columns with pandas filter

Q7 What programming languages do you use on a regular basis? (Select all that apply)

show_countplot_by_qno("Q7") #KeyError

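'''
From Q7 onward, the function used so far no longer works.
Why? Q7 is a select-all-that-apply question, so its answers are spread over several columns and there is no single "Q7" column.
'''

Getting the question text

# Questions 1-6: index by column name
question["Q1"]

# Question 7: filter the matching columns
question.filter(regex="Q7").iloc[0].split("-")[0]
'''
regex : keep only the entries whose label contains "Q7"
.iloc[0] : take the first match
split("-") : split the text on the "-" character (returns a list)
[0] : take element 0 of that list
'''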
#'What programming languages do you use on a regular basis? (Select all that apply) '
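
The filter-then-split step can be tried on a small stand-in for the question row. This is a sketch under assumptions: the labels `Q7_Part_1`, `Q7_Part_2` and `Q17_Part_1` and the text after the first hyphen are hypothetical, and the anchored pattern `^Q7_` is a stricter variant of the `"Q7"` pattern used in the notebook.

```python
import pandas as pd

# Hypothetical slice of the question row (labels and suffixes are made up)
question_demo = pd.Series({
    "Q7_Part_1": "What programming languages do you use on a regular basis? "
                 "(Select all that apply) - Selected Choice - Python",
    "Q7_Part_2": "What programming languages do you use on a regular basis? "
                 "(Select all that apply) - Selected Choice - R",
    "Q17_Part_1": "Some other multiple-choice question - Selected Choice - A",
})

# filter() on a Series matches against the index labels
q7_questions = question_demo.filter(regex="^Q7_")

# Every part shares the same text before the first hyphen
q7_title_demo = q7_questions.iloc[0].split("-")[0].strip()

print(list(q7_questions.index))
print(q7_title_demo)
```

The `.strip()` call removes the trailing space that is left over before the hyphen.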

Looking at the data

answer_Q7 = answer.filter(regex="Q7")
answer_Q7

Checking for nulls

answer_Q7.isnull()  # True where the value is null
answer_Q7.notnull()  # True where the value is not null
answer_Q7.notnull().sum()  # number of responses per column
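
In a select-all-that-apply question each option has its own column, and a cell is filled only when the respondent ticked that option, so counting non-null cells gives the number of people who chose each option. A small sketch with made-up data (`q7_demo` and its column labels are hypothetical):

```python
import pandas as pd

# Hypothetical multiple-choice block: one column per option,
# None where the respondent did not select it
q7_demo = pd.DataFrame({
    "Q7_Part_1": ["Python", None, "Python"],
    "Q7_Part_2": [None, "R", None],
})

selected_mask = q7_demo.notnull()        # boolean table, True = option selected
selected_counts = selected_mask.sum()    # True counts as 1, summed per column

print(selected_counts)
```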

Descriptive statistics with .describe()

answer_Q7_desc = answer_Q7.describe()
answer_Q7_desc

indexing

# Keep only the top and count rows
answer_Q7_count = answer_Q7_desc.loc[["top", "count"]].T  # a single pair of brackets raises a KeyError; .T swaps rows and columns

# Use top as the index
answer_Q7_count = answer_Q7_count.set_index("top")

# Sort by count
answer_Q7_count = answer_Q7_count.sort_values("count", ascending=False)

answer_Q7_count
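
The whole describe, transpose, re-index, sort chain can be checked end to end on a toy table. This is a minimal sketch: `langs_demo` is made-up data, and the final cast to int is an optional extra, added because `describe()` on text columns returns its counts as generic objects.

```python
import pandas as pd

# Hypothetical multiple-choice block: each column holds one option or None
langs_demo = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", "Python", None],
    "Q7_Part_2": [None, "R", None, None],
    "Q7_Part_3": ["SQL", None, "SQL", None],
})

desc = langs_demo.describe()              # rows: count, unique, top, freq
lang_counts = (
    desc.loc[["top", "count"]]            # keep the option name and its count
        .T                                # one row per column of langs_demo
        .set_index("top")                 # index by the option name
        .sort_values("count", ascending=False)
)
lang_counts["count"] = lang_counts["count"].astype(int)

print(lang_counts)
```

This works because each column contains a single distinct value, so `top` is simply the option's name and `count` is how many respondents selected it.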

Getting the title of a multiple-choice question

q7_title = question.filter(regex="Q7").iloc[0].split("-")[0]  # find the questions whose label contains "Q7"

Frequency chart of regularly used languages

sns.barplot(data=answer_Q7_count, 
            y=answer_Q7_count.index, x="count", palette="Blues_r").set_title(q7_title)

References: https://inf.run/h8B9
https://www.kaggle.com/corazzon/how-to-use-pandas-filter-in-survey-eda