Preprocessing


Overview

1. Definition of preprocessing

Data preprocessing is the process of transforming, reshaping, and cleaning raw data so that it is suitable for analysis.


2. Types of preprocessing

  1. Missing values (handled with an imputer)

  2. Outliers

  3. Normalization

    • Rescales the values to the range 0 to 1
    • x_{new} = \frac{x-x_{min}}{x_{max}-x_{min}}
  4. Standardization

    • Shifts the mean to 0 and scales the standard deviation to 1
    • x_{new} = \frac{x-\mu}{\sigma}
  5. Sampling (over/under sampling)

  6. Feature engineering

    • Creating or combining features
    • Binning, changing the scale
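
The two scaling formulas above can be sketched directly in NumPy. This is only a minimal illustration on made-up values (the array `x` is hypothetical and not part of the Titanic data); the scikit-learn scalers used later in this post do the same job per column.

```python
import numpy as np

# Hypothetical toy values, for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Normalization: (x - x_min) / (x_max - x_min), result lies in [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mu) / sigma, result has mean 0 and std 1
x_std = (x - x.mean()) / x.std()

print(x_norm.tolist())            # [0.0, 0.25, 0.5, 0.75, 1.0]
print(x_std.mean(), x_std.std())  # approximately 0 and 1
```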



Hands-on: Titanic

import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

0. Getting to know the dataset

train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S

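  • PassengerId: passenger ID

  • Survived: survival status (1: survived, 0: died)

  • Pclass: ticket class

  • Name: passenger name

  • Sex: sex

  • Age: age

  • SibSp: number of siblings and spouses aboard

  • Parch: number of parents and children aboard

  • Ticket: ticket number

  • Fare: fare paid

  • Cabin: cabin number

  • Embarked: port of embarkation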



1. Splitting into train / validation sets

STEP 1. Define the features and the label

feature = [
    'Pclass', 'Sex', 'Age', 'Fare'
]

label = [
    'Survived'
]

train[feature].head()
   Pclass     Sex   Age     Fare
0       3    male  22.0   7.2500
1       1  female  38.0  71.2833
2       3  female  26.0   7.9250
3       1  female  35.0  53.1000
4       3    male  35.0   8.0500

train[label].head()
   Survived
0         0
1         1
2         1
3         1
4         0

STEP 2. Split into train / validation sets at a suitable ratio

from sklearn.model_selection import train_test_split

Reference: train_test_split documentation

train_test_split(X, y, test_size=…, random_state=…, shuffle=True)

  • test_size: fraction of the data assigned to the validation set (20% -> 0.2)
  • random_state: random seed, so that the split is reproducible
  • shuffle: whether to shuffle the data before splitting (default True)

x_train, x_valid, y_train, y_valid = train_test_split(train[feature], train[label], test_size=0.2, random_state=30, shuffle=True)

x_train.shape, y_train.shape
((712, 4), (712, 1))

x_valid.shape, y_valid.shape
((179, 4), (179, 1))
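
The sizes above follow from test_size=0.2 on 891 rows: the validation size is rounded up to 179, which leaves 712 rows for training. The sketch below uses a hypothetical toy array (not the Titanic data) to illustrate what random_state is for: two calls with the same seed should give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed twice: the resulting splits should be identical
a_train, a_valid, _, _ = train_test_split(X, y, test_size=0.2, random_state=30)
b_train, b_valid, _, _ = train_test_split(X, y, test_size=0.2, random_state=30)

print(a_train.shape, a_valid.shape)      # (8, 2) (2, 2)
print(np.array_equal(a_valid, b_valid))  # True
```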



2. Handling missing values

2-0. Checking for missing values

Method 1. pandas info()

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Method 2. pandas isnull()
Chaining sum() gives the number of missing values per column at a glance.

train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
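
Raw counts are easier to judge as a share of all rows. As a small sketch on a hypothetical toy frame (the values are made up), taking mean() instead of sum() of the boolean mask gives the missing ratio per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = df.isnull().sum()   # number of missing values per column
missing_ratio = df.isnull().mean()  # fraction of missing values per column

print(missing_count.to_dict())  # {'Age': 2, 'Cabin': 3, 'Fare': 0}
print(missing_ratio.to_dict())  # {'Age': 0.5, 'Cabin': 0.75, 'Fare': 0.0}
```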

Checking the missing values of a single column

train['Age'].isnull().sum()
177

2-1. Handling missing values in numerical columns

1. Using pandas "fillna()": when handling a single column

a. Fill with the number 0

train['Age'].fillna(0).describe()
count    891.000000
mean      23.799293
std       17.596074
min        0.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

b. Fill with a statistic (the mean)

train['Age'].fillna(train['Age'].mean()).describe()
count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64
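
Note that the fillna() calls above only return a filled copy; the column in train is unchanged unless the result is assigned back. A minimal sketch on a hypothetical toy Series, here using the median as the fill statistic instead of the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with two missing entries
age = pd.Series([22.0, np.nan, 26.0, 80.0, np.nan], name="Age")

# fillna() returns a new Series; median of [22, 26, 80] is 26
filled = age.fillna(age.median())

print(age.isnull().sum())     # 2 (the original is untouched)
print(filled.isnull().sum())  # 0
print(filled.tolist())        # [22.0, 26.0, 26.0, 80.0, 26.0]
```

To keep the result, assign it back, for example `age = age.fillna(age.median())`.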

2. Using scikit-learn's "SimpleImputer": when handling two or more columns at once

Reference:

  1. Impute documentation
  2. SimpleImputer documentation

SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

(This is the signature of the scikit-learn version used in this post; newer releases may differ, for example the verbose parameter has since been removed.)

  • strategy: "mean" / "median" / "most_frequent" / "constant"

from sklearn.impute import SimpleImputer

a. Fill with the number 0

# STEP 1. Create the imputer
imputer = SimpleImputer(strategy='constant', fill_value=0)

# STEP 2. fit() learns what is needed to fill the missing values
imputer.fit(train[['Age', 'Pclass']])
SimpleImputer(add_indicator=False, copy=True, fill_value=0, missing_values=nan,
              strategy='constant', verbose=0)

# STEP 3. transform() actually fills in the missing values
result = imputer.transform(train[['Age', 'Pclass']])
result
array([[22.,  3.],
       [38.,  1.],
       [26.,  3.],
       ...,
       [ 0.,  3.],
       [26.,  1.],
       [32.,  3.]])

# STEP 4. Assign the result back to the original data
train[['Age', 'Pclass']] = result

train[['Age', 'Pclass']].isnull().sum()
Age       0
Pclass    0
dtype: int64

fit_transform() is a function that performs fit() and transform() in a single call.

train = pd.read_csv('train.csv')

train[['Age', 'Pclass']].isnull().sum()
Age       177
Pclass      0
dtype: int64

# STEP 1. Create the imputer
imputer = SimpleImputer(strategy='constant', fill_value=0)

# STEP 2. fit and transform
result = imputer.fit_transform(train[['Age', 'Pclass']])

# STEP 3. Assign the result
train[['Age', 'Pclass']] = result

train[['Age', 'Pclass']].isnull().sum()
Age       0
Pclass    0
dtype: int64

train[['Age', 'Pclass']].describe()
              Age      Pclass
count  891.000000  891.000000
mean    23.799293    2.308642
std     17.596074    0.836071
min      0.000000    1.000000
25%      6.000000    2.000000
50%     24.000000    3.000000
75%     35.000000    3.000000
max     80.000000    3.000000
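
Keeping fit() and transform() as separate steps matters once there is more than one dataset: a common pattern is to fit on the training part only and then reuse the learned value on the validation part. The sketch below assumes two hypothetical toy frames (`train_part` and `valid_part` are illustrative names, not variables from this post) and uses the mean strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for a train / validation split
train_part = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
valid_part = pd.DataFrame({"Age": [np.nan, 60.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_part)  # learns mean 30.0, then fills
valid_filled = imputer.transform(valid_part)      # reuses the learned mean

print(imputer.statistics_.tolist())   # [30.0]
print(valid_filled.ravel().tolist())  # [30.0, 60.0]
```

The validation rows are filled with the training mean (30.0), not with a mean computed from the validation data.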

b. Fill with a statistic (the mean)

train = pd.read_csv('train.csv')

train[['Age', 'Pclass']].isnull().sum()
Age       177
Pclass      0
dtype: int64

imputer = SimpleImputer(strategy='mean')
result = imputer.fit_transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result

train[['Age', 'Pclass']].isnull().sum()
Age       0
Pclass    0
dtype: int64

train[['Age', 'Pclass']].describe()
              Age      Pclass
count  891.000000  891.000000
mean    29.699118    2.308642
std     13.002015    0.836071
min      0.420000    1.000000
25%     22.000000    2.000000
50%     29.699118    3.000000
75%     35.000000    3.000000
max     80.000000    3.000000

2-2. Handling missing values in categorical columns

train = pd.read_csv('train.csv')

1. Using pandas "fillna()": when handling a single column

train['Embarked'].fillna('S')
0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object

2. Using scikit-learn's "SimpleImputer": when handling two or more columns at once

imputer = SimpleImputer(strategy='most_frequent')
result = imputer.fit_transform(train[['Embarked', 'Cabin']])
train[['Embarked', 'Cabin']] = result

train[['Embarked', 'Cabin']].isnull().sum()
Embarked    0
Cabin       0
dtype: int64
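
In this dataset Cabin is missing in 687 of 891 rows, so the most_frequent strategy assigns one and the same cabin to most passengers. One possible alternative, shown here only as a sketch on hypothetical toy data, is the constant strategy with a placeholder label; the label 'Unknown' is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; dtype=object keeps the strings and NaN as plain objects
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
}, dtype=object)

# Fill with the most frequent value of each column
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(df)

# Fill with a fixed placeholder instead ('Unknown' is an assumed label)
const_filled = SimpleImputer(strategy="constant", fill_value="Unknown").fit_transform(df)

print(mode_filled[:, 1].tolist())   # ['C85', 'C85', 'C85', 'C85']
print(const_filled[:, 1].tolist())  # ['Unknown', 'C85', 'Unknown', 'Unknown']
```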



3. Label Encoding: converting strings (categorical) to numbers (numerical)

Most machine learning models only accept numeric input, so string-valued data must be converted to numbers before training.

train = pd.read_csv('train.csv')

train['Sex']
0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object

Method 1: Define a convert function yourself

train['Sex'].value_counts()
male      577
female    314
Name: Sex, dtype: int64

# STEP 1. Define the function
def convert(data):
    if data == 'female':
        return 1
    elif data == 'male':
        return 0

# STEP 2. Apply the function
train['Sex'].apply(convert)
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    1
889    0
890    0
Name: Sex, Length: 891, dtype: int64
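
For a fixed mapping like this one, a dictionary passed to Series.map() is a shorter way to get the same result as the convert function. A minimal sketch on a hypothetical toy Series:

```python
import pandas as pd

# Hypothetical toy column
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Same rule as convert(): female -> 1, male -> 0
sex_num = sex.map({"female": 1, "male": 0})

print(sex_num.tolist())  # [0, 1, 1, 0]
```

Values that are not keys of the dictionary become NaN, so this variant makes unexpected categories easy to spot.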

Method 2: Use scikit-learn's "LabelEncoder"

  • Encoding rule: the unique values are sorted alphabetically and numbered 0, 1, 2, …

from sklearn.preprocessing import LabelEncoder

train['Sex'].value_counts()
male      577
female    314
Name: Sex, dtype: int64

le = LabelEncoder()

train['Sex_num'] = le.fit_transform(train['Sex'])

train['Sex_num'].value_counts()
1    577
0    314
Name: Sex_num, dtype: int64

# Check the classes
le.classes_
array(['female', 'male'], dtype=object)

# Numbers -> strings
le.inverse_transform([0, 1, 1, 0, 0, 1, 1])
array(['female', 'male', 'male', 'female', 'female', 'male', 'male'],
      dtype=object)
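
A fitted encoder can also be reused on other data through transform(), which applies the mapping learned in fit(). The sketch below uses hypothetical toy labels; it also shows that a label not seen during fit() is rejected with a ValueError.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on hypothetical toy labels; classes are stored in sorted order
le = LabelEncoder()
le.fit(["S", "C", "Q", "S"])

print(le.classes_.tolist())               # ['C', 'Q', 'S']
print(le.transform(["S", "Q"]).tolist())  # [2, 1]

# A label that was not present during fit() cannot be encoded
unseen_error = None
try:
    le.transform(["X"])
except ValueError as err:
    unseen_error = str(err)
print(unseen_error is not None)           # True
```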

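If the column contains NaN values, LabelEncoder does not work properly: in the scikit-learn version used here it raises a TypeError, as the traceback below shows.

train['Embarked']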
0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object

train['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64

le.fit_transform(train['Embarked'])
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

D:\Anaconda\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
    111         try:
--> 112             res = _encode_python(values, uniques, encode)
    113         except TypeError:


D:\Anaconda\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
     59     if uniques is None:
---> 60         uniques = sorted(set(values))
     61         uniques = np.array(uniques, dtype=values.dtype)


TypeError: '<' not supported between instances of 'float' and 'str'

During handling of the above exception, another exception occurred:


TypeError                                 Traceback (most recent call last)

<ipython-input-38-86525b1fc929> in <module>
----> 1 le.fit_transform(train['Embarked'])


D:\Anaconda\lib\site-packages\sklearn\preprocessing\_label.py in fit_transform(self, y)
    250         """
    251         y = column_or_1d(y, warn=True)
--> 252         self.classes_, y = _encode(y, encode=True)
    253         return y
    254 


D:\Anaconda\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
    112             res = _encode_python(values, uniques, encode)
    113         except TypeError:
--> 114             raise TypeError("argument must be a string or number")
    115         return res
    116     else:


TypeError: argument must be a string or number

# Fill the missing values first ('S' is the most frequent value), then encode
train['Embarked'] = train['Embarked'].fillna('S')

train['Embarked'] = le.fit_transform(train['Embarked'])

train['Embarked']
0      2
1      0
2      2
3      2
4      2
      ..
886    2
887    2
888    2
889    0
890    1
Name: Embarked, Length: 891, dtype: int32

train['Embarked'].value_counts()
2    646
0    168
1     77
Name: Embarked, dtype: int64


4. One-Hot Encoding

pd.get_dummies(df_name['col_name'])

train = pd.read_csv('train.csv')

Looking at "Embarked"

# Check the unique values
train['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64

# Fill the missing values
train['Embarked'] = train['Embarked'].fillna('S')
train['Embarked'].value_counts()
S    646
C    168
Q     77
Name: Embarked, dtype: int64

# Label Encoding (strings to numbers)
train['Embarked_num'] = LabelEncoder().fit_transform(train['Embarked'])
train['Embarked_num'].value_counts()
2    646
0    168
1     77
Name: Embarked_num, dtype: int64

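Embarked holds the initial of the port of embarkation. LabelEncoder converted these values to numbers, but if the data is fed to a model as is, the model can treat those numbers as quantities and learn relationships between them that do not exist.

For example, with 'S' = 2 and 'Q' = 1, the model may end up learning something like Q + Q = S.

To avoid this, each unique value is split off into its own column: the column that matches the row's value is set to True (1) and the remaining columns to False (0). This is one-hot encoding.


train['Embarked'][:6]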
0    S
1    C
2    S
3    S
4    S
5    Q
Name: Embarked, dtype: object

train['Embarked_num'][:6]
0    2
1    0
2    2
3    2
4    2
5    1
Name: Embarked_num, dtype: int32

one_hot = pd.get_dummies(train['Embarked_num'][:6])
one_hot
   0  1  2
0  0  0  1
1  1  0  0
2  0  0  1
3  0  0  1
4  0  0  1
5  0  1  0

one_hot.columns = ['C', 'Q', 'S']
one_hot
   C  Q  S
0  0  0  1
1  1  0  0
2  0  0  1
3  0  0  1
4  0  0  1
5  0  1  0

One-hot encoding is applied to columns of a categorical nature (season, port, sex, type, …).
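
pd.get_dummies() also accepts the string column directly, in which case the category names become the column names and the label-encoding and renaming steps are not needed. A minimal sketch on a hypothetical toy Series; dtype=int is passed so that the result is 0/1 rather than booleans, which some pandas versions return by default.

```python
import pandas as pd

# Hypothetical toy column of port initials
embarked = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# One column per unique value, named after the value itself
one_hot = pd.get_dummies(embarked, dtype=int)

print(one_hot.columns.tolist())  # ['C', 'Q', 'S']
print(one_hot.iloc[0].tolist())  # [0, 0, 1]  (row 0 is 'S')
```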



5. Normalization

Normalization: when columns have different min and max values, normalization puts them on a common scale by mapping each column's min to 0 and its max to 1.

sklearn.preprocessing --> MinMaxScaler()


Example: movie ratings

  • Naver movie ratings (0 to 10 points): [2, 4, 6, 8, 10]
  • Netflix movie ratings (0 to 5 points): [1, 2, 3, 4, 5]

movie = {'naver': [2, 4, 6, 8, 10],
         'netflix': [1, 2, 3, 4, 5]
         }

movie = pd.DataFrame(data=movie)
movie
   naver  netflix
0      2        1
1      4        2
2      6        3
3      8        4
4     10        5

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()

min_max_movie = min_max_scaler.fit_transform(movie)

pd.DataFrame(min_max_movie, columns=['naver', 'netflix'])
   naver  netflix
0   0.00     0.00
1   0.25     0.25
2   0.50     0.50
3   0.75     0.75
4   1.00     1.00
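
The fitted scaler keeps each column's min and max, so ratings that arrive later can be put on the same scale, and scaled values can be mapped back. The sketch below rebuilds the movie frame and applies the scaler to one new pair of ratings; `new_ratings` is a hypothetical example row.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

movie = pd.DataFrame({"naver": [2, 4, 6, 8, 10], "netflix": [1, 2, 3, 4, 5]})
scaler = MinMaxScaler().fit(movie)

# Per-column min and max learned during fit()
print(scaler.data_min_.tolist(), scaler.data_max_.tolist())  # [2.0, 1.0] [10.0, 5.0]

# Hypothetical new ratings: (5 - 2) / 8 = 0.375 and (2.5 - 1) / 4 = 0.375
new_ratings = pd.DataFrame({"naver": [5], "netflix": [2.5]})
scaled_new = scaler.transform(new_ratings)

# inverse_transform() maps the scaled values back to the original scale
restored = scaler.inverse_transform(scaled_new)
print(scaled_new.tolist())  # [[0.375, 0.375]]
```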


6. Standardization (Standard Scaling)

Standardization: transforms the data so that its mean is 0 and its standard deviation is 1.

sklearn.preprocessing --> StandardScaler()

from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()

# Create sample data
x = np.arange(10)
x[9] = 1000  # add an outlier

x.mean(), x.std()
(103.6, 298.8100399919654)

# Standardize the original data
scaled = standard_scaler.fit_transform(x.reshape(-1, 1))

scaled.mean(), scaled.std()
scaled.mean(), scaled.std()
(4.4408920985006264e-17, 1.0)

round(scaled.mean(), 2), scaled.std()  # round the mean
(0.0, 1.0)
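
The scaler's output can be checked against the formula (x - mean) / std computed by hand. One detail worth knowing is that StandardScaler divides by the population standard deviation (ddof=0, the NumPy default), whereas pandas' std() uses the sample standard deviation (ddof=1) by default. A small sketch on the same sample data, stored as floats here:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample data as above
x = np.arange(10, dtype=float)
x[9] = 1000  # outlier

scaled = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

# Manual standardization with the population std (np.std uses ddof=0)
manual = (x - x.mean()) / x.std()
print(np.allclose(scaled, manual))  # True

# pandas defaults to the sample std (ddof=1), which is slightly larger
pandas_std = pd.Series(x).std()
print(pandas_std > x.std())         # True
```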