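Preprocessing
Overview
1. Definition of preprocessing
Data preprocessing is the process of transforming, modifying, and cleaning data so that it is suitable for analysis.
2. Types of preprocessing
Missing values: Imputer
Outliers
Normalization
Rescales values into the range 0 to 1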
$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
Standardization
Shifts the mean to 0 and scales the standard deviation to 1
$$x_{new} = \frac{x - \mu}{\sigma}$$
Sampling (over/under sampling)
Feature engineering
Creating features and computing new ones from existing columns
Binning, rescaling
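The normalization and standardization formulas listed above can be checked by hand. The following is a minimal sketch with NumPy; the sample values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Normalization: (x - x_min) / (x_max - x_min) maps the values into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mu) / sigma gives mean 0 and standard deviation 1.
# np.std uses the population standard deviation (ddof=0) by default.
x_std = (x - x.mean()) / x.std()

print(x_norm)                     # values 0, 0.25, 0.5, 0.75, 1
print(x_std.mean(), x_std.std())  # approximately 0 and 1
```

Both transformations are linear, so the ordering of the values and their relative spacing are preserved; only the scale changes.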
Hands-on: Titanic
import numpy as np
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
0. Exploring the dataset
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
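1. Splitting into train / validation sets
STEP 1. Define the features and the label
feature = ['Pclass', 'Sex', 'Age', 'Fare']
label = ['Survived']  # not shown in the original; inferred from train[label] used below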
   Pclass  Sex     Age   Fare
0  3       male    22.0  7.2500
1  1       female  38.0  71.2833
2  3       female  26.0  7.9250
3  1       female  35.0  53.1000
4  3       male    35.0  8.0500

   Survived
0  0
1  1
2  1
3  1
4  0
STEP 2. Split into train / validation sets at a suitable ratio
from sklearn.model_selection import train_test_split
Reference: train_test_split documentation
train_test_split(X, y, test_size=…, random_state=…, shuffle=True)
test_size: the fraction to assign to the validation set (20% -> 0.2)
random_state: sets the random seed
shuffle: defaults to True; shuffles the data before splitting
x_train, x_valid, y_train, y_valid = train_test_split(train[feature], train[label], test_size=0.2, random_state=30, shuffle=True)
x_train.shape, y_train.shape
((712, 4), (712, 1))
x_valid.shape, y_valid.shape
((179, 4), (179, 1))
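The split above can be reproduced without the Kaggle file. The sketch below uses a synthetic 891-row DataFrame as a stand-in for train.csv (the column values are random and purely illustrative) to show how test_size=0.2 yields the 712 / 179 row counts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic training data (891 rows); values are random.
rng = np.random.default_rng(0)
n_rows = 891
df = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=n_rows),
    'Fare': rng.uniform(0, 100, size=n_rows),
    'Survived': rng.integers(0, 2, size=n_rows),
})

feature = ['Pclass', 'Fare']
label = ['Survived']

# 20% of the rows go to the validation set; random_state fixes the shuffle.
x_train, x_valid, y_train, y_valid = train_test_split(
    df[feature], df[label], test_size=0.2, random_state=30, shuffle=True
)

print(x_train.shape, x_valid.shape)  # (712, 2) (179, 2)
```

Because label is a list, df[label] is a one-column DataFrame, which is why the y shapes have a trailing 1.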
2. Handling missing values
2-0. Checking for missing values
Method 1. pandas info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Method 2. pandas isnull()
Chaining sum() gives the number of missing values per column at a glance.
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Checking missing values in a single column
train['Age'].isnull().sum()
177
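As a small self-contained illustration of the isnull() approach, the sketch below builds a toy DataFrame (hypothetical values, not the real dataset) and counts missing entries per column. Taking mean() instead of sum() gives the missing ratio, which is an addition not used in the original.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries; values are illustrative only.
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() returns a boolean frame; sum() counts the True values per column.
missing_count = df.isnull().sum()

# mean() of the same boolean frame gives the fraction of missing values.
missing_ratio = df.isnull().mean()

print(missing_count)
print(missing_ratio)
```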
2-1. Handling missing values in numerical columns
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
1. Using pandas "fillna()": when handling a single column
a. Fill with the number 0
train['Age'].fillna(0).describe()
count 891.000000
mean 23.799293
std 17.596074
min 0.000000
25% 6.000000
50% 24.000000
75% 35.000000
max 80.000000
Name: Age, dtype: float64
b. Fill with a statistic (the mean)
train['Age'].fillna(train['Age'].mean()).describe()
count 891.000000
mean 29.699118
std 13.002015
min 0.420000
25% 22.000000
50% 29.699118
75% 35.000000
max 80.000000
Name: Age, dtype: float64
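The two describe() outputs above differ because the fill value changes the distribution: filling with 0 pulls the mean down, while filling with the mean keeps the mean but shrinks the spread. A minimal sketch on a made-up five-value Series (not the real Age column) shows the same effect.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
age = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan], name='Age')

filled_zero = age.fillna(0)           # option a: constant 0
filled_mean = age.fillna(age.mean())  # option b: mean of the observed values

print(age.mean())          # 30.0 (NaN is skipped)
print(filled_zero.mean())  # 18.0
print(filled_mean.mean())  # 30.0
print(age.std(), filled_mean.std())  # the spread shrinks after mean filling
```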
2. Using sklearn's "SimpleImputer": when handling two or more columns at once
Reference:
Impute documentation
SimpleImputer documentation
SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
strategy: "mean" / "median" / "most_frequent" / "constant"
from sklearn.impute import SimpleImputer
a. Fill with the number 0
imputer = SimpleImputer(strategy='constant', fill_value=0)
imputer.fit(train[['Age', 'Pclass']])
SimpleImputer(add_indicator=False, copy=True, fill_value=0, missing_values=nan,
strategy='constant', verbose=0)
result = imputer.transform(train[['Age', 'Pclass']])
result
array([[22., 3.],
[38., 1.],
[26., 3.],
...,
[ 0., 3.],
[26., 1.],
[32., 3.]])
train[['Age', 'Pclass']] = result
train[['Age', 'Pclass']].isnull().sum()
Age 0
Pclass 0
dtype: int64
fit_transform() is a function that performs fit() and transform() in a single call.
train = pd.read_csv('train.csv')
train[['Age', 'Pclass']].isnull().sum()
Age 177
Pclass 0
dtype: int64
imputer = SimpleImputer(strategy='constant', fill_value=0)
result = imputer.fit_transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result
train[['Age', 'Pclass']].isnull().sum()
Age 0
Pclass 0
dtype: int64
train[['Age', 'Pclass']].describe()
              Age      Pclass
count  891.000000  891.000000
mean    23.799293    2.308642
std     17.596074    0.836071
min      0.000000    1.000000
25%      6.000000    2.000000
50%     24.000000    3.000000
75%     35.000000    3.000000
max     80.000000    3.000000
b. Fill with a statistic (the mean)
train = pd.read_csv('train.csv')
train[['Age', 'Pclass']].isnull().sum()
Age 177
Pclass 0
dtype: int64
imputer = SimpleImputer(strategy='mean')
result = imputer.fit_transform(train[['Age', 'Pclass']])
train[['Age', 'Pclass']] = result
train[['Age', 'Pclass']].isnull().sum()
Age 0
Pclass 0
dtype: int64
train[['Age', 'Pclass']].describe()
              Age      Pclass
count  891.000000  891.000000
mean    29.699118    2.308642
std     13.002015    0.836071
min      0.420000    1.000000
25%     22.000000    2.000000
50%     29.699118    3.000000
75%     35.000000    3.000000
max     80.000000    3.000000
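The relationship between fit(), transform(), and fit_transform() used in this section can be checked on a toy frame. This is a sketch with made-up values; statistics_ is the SimpleImputer attribute that stores the per-column fill values learned by fit().

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing age; values are illustrative only.
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Pclass': [3, 1, 3, 1],
})

# Two steps: fit() learns the column means, transform() applies them.
two_step = SimpleImputer(strategy='mean')
two_step.fit(df[['Age', 'Pclass']])
result_two_step = two_step.transform(df[['Age', 'Pclass']])

# One step: fit_transform() does both on the same data.
result_one_step = SimpleImputer(strategy='mean').fit_transform(df[['Age', 'Pclass']])

print(two_step.statistics_)                           # learned per-column means
print(np.allclose(result_two_step, result_one_step))  # True
```

Keeping fit() and transform() separate is useful when the fill values should be learned on one dataset (for example the training split) and then applied unchanged to another.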
2-2. Handling missing values in categorical columns
train = pd.read_csv('train.csv')
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
1. Using pandas "fillna()": when handling a single column
train['Embarked'].fillna('S')
0 S
1 C
2 S
3 S
4 S
..
886 S
887 S
888 S
889 C
890 Q
Name: Embarked, Length: 891, dtype: object
2. Using sklearn's "SimpleImputer": when handling two or more columns at once
imputer = SimpleImputer(strategy='most_frequent')
result = imputer.fit_transform(train[['Embarked', 'Cabin']])
train[['Embarked', 'Cabin']] = result
train[['Embarked', 'Cabin']].isnull().sum()
Embarked 0
Cabin 0
dtype: int64
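To see what strategy='most_frequent' actually fills in, here is a minimal sketch on a toy frame with hypothetical values. Each column is filled with its own most frequent value, which fit stores in statistics_.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical columns with missing entries; values are illustrative only.
df = pd.DataFrame({
    'Embarked': ['S', 'C', np.nan, 'S', 'Q'],
    'Cabin': [np.nan, 'C85', 'C85', np.nan, 'E46'],
}, dtype=object)

imputer = SimpleImputer(strategy='most_frequent')
df[['Embarked', 'Cabin']] = imputer.fit_transform(df[['Embarked', 'Cabin']])

print(list(imputer.statistics_))  # most frequent value of each column
print(df)
```

For a column that is mostly empty, such as Cabin in the Titanic data (204 non-null out of 891), this strategy assigns the same value to most rows, so it is worth checking whether the filled column is still informative.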
3. Label Encoding: converting text (categorical) values to numbers (numerical)
For machine learning, all text data must be converted to numbers.
train = pd.read_csv('train.csv')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
0 male
1 female
2 female
3 female
4 male
...
886 male
887 female
888 female
889 male
890 male
Name: Sex, Length: 891, dtype: object
Method 1: define a convert function yourself
train['Sex'].value_counts()
male 577
female 314
Name: Sex, dtype: int64
def convert(data):
    if data == 'female':
        return 1
    elif data == 'male':
        return 0
train['Sex'].apply(convert)
0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 1
889 0
890 0
Name: Sex, Length: 891, dtype: int64
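The hand-written convert function can also be expressed as a dictionary lookup with Series.map. This is an alternative sketch, not the approach used in the original; the sample values are made up.

```python
import pandas as pd

# Small illustrative sample of the Sex column.
sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

def convert(data):
    # Same rule as in the text: female -> 1, male -> 0.
    if data == 'female':
        return 1
    elif data == 'male':
        return 0

# A dict passed to map() applies the same rule without a custom function.
mapping = {'female': 1, 'male': 0}
via_apply = sex.apply(convert)
via_map = sex.map(mapping)

print((via_apply == via_map).all())
```

One difference to keep in mind: values that are missing from the dictionary become NaN with map, whereas convert returns None for them.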
Method 2: use sklearn's "LabelEncoder"
Encoding rule: numbers 0, 1, 2, … are assigned in alphabetical order of the values
from sklearn.preprocessing import LabelEncoder
train['Sex'].value_counts()
male 577
female 314
Name: Sex, dtype: int64
le = LabelEncoder()
train['Sex_num'] = le.fit_transform(train['Sex'])
train['Sex_num'].value_counts()
1 577
0 314
Name: Sex_num, dtype: int64
le.classes_
array(['female', 'male'], dtype=object)
le.inverse_transform([0, 1, 1, 0, 0, 1, 1])
array(['female', 'male', 'male', 'female', 'female', 'male', 'male'],
dtype=object)
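The alphabetical rule and the round trip through inverse_transform can be verified on a short list. The sketch below uses made-up port initials rather than the full column.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values; the encoder sorts the unique labels before numbering them.
ports = ['S', 'C', 'S', 'Q', 'C']

le = LabelEncoder()
encoded = le.fit_transform(ports)

print(list(le.classes_))  # ['C', 'Q', 'S'] -> codes 0, 1, 2
print(list(encoded))      # [2, 0, 2, 1, 0]

# inverse_transform maps the codes back to the original labels.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```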
If the column contains NaN values, LabelEncoder does not work properly.
0 S
1 C
2 S
3 S
4 S
..
886 S
887 S
888 S
889 C
890 Q
Name: Embarked, Length: 891, dtype: object
train['Embarked'].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
le.fit_transform(train['Embarked'])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
D:\Anaconda\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
111 try:
--> 112 res = _encode_python(values, uniques, encode)
113 except TypeError:
D:\Anaconda\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
59 if uniques is None:
---> 60 uniques = sorted(set(values))
61 uniques = np.array(uniques, dtype=values.dtype)
TypeError: '<' not supported between instances of 'float' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-38-86525b1fc929> in <module>
----> 1 le.fit_transform(train['Embarked'])
D:\Anaconda\lib\site-packages\sklearn\preprocessing\_label.py in fit_transform(self, y)
250 """
251 y = column_or_1d(y, warn=True)
--> 252 self.classes_, y = _encode(y, encode=True)
253 return y
254
D:\Anaconda\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
112 res = _encode_python(values, uniques, encode)
113 except TypeError:
--> 114 raise TypeError("argument must be a string or number")
115 return res
116 else:
TypeError: argument must be a string or number
train['Embarked'] = train['Embarked'].fillna('S')
train['Embarked'] = le.fit_transform(train['Embarked'])
0 2
1 0
2 2
3 2
4 2
..
886 2
887 2
888 2
889 0
890 1
Name: Embarked, Length: 891, dtype: int32
train['Embarked'].value_counts()
2 646
0 168
1 77
Name: Embarked, dtype: int64
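The fill-then-encode order shown above can be sketched on a toy Series. Instead of hard-coding 'S', this variant looks up the most frequent value with mode(); that lookup is an assumption added here for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative column with one missing entry.
embarked = pd.Series(['S', 'C', np.nan, 'S', 'Q'], name='Embarked', dtype=object)

# mode() ignores NaN by default; [0] takes the first (most frequent) value.
most_common = embarked.mode()[0]

# Fill first, then encode, so the encoder only sees strings.
filled = embarked.fillna(most_common)
codes = LabelEncoder().fit_transform(filled)

print(most_common)
print(list(codes))
```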
4. One-Hot Encoding
pd.get_dummies(df_name['col_name'])
train = pd.read_csv('train.csv')
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
A closer look at "Embarked"
train['Embarked'].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
train['Embarked'] = train['Embarked'].fillna('S')
train['Embarked'].value_counts()
S 646
C 168
Q 77
Name: Embarked, dtype: int64
train['Embarked_num'] = LabelEncoder().fit_transform(train['Embarked'])
train['Embarked_num'].value_counts()
2 646
0 168
1 77
Name: Embarked_num, dtype: int64
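Embarked holds the initial of the port of embarkation. We converted its values to numbers with LabelEncoder, but if the data is used for training as is, the model can learn spurious relationships among those numbers.
For example, with 'S' = 2 and 'Q' = 1, the model may effectively learn that Q + Q = S.
To avoid this, each unique value is split into its own column: the column that matches the value is set to True (1), and the remaining columns are set to False (0). This is one-hot encoding.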
0 S
1 C
2 S
3 S
4 S
5 Q
Name: Embarked, dtype: object
train['Embarked_num'][:6]
0 2
1 0
2 2
3 2
4 2
5 1
Name: Embarked_num, dtype: int32
one_hot = pd.get_dummies(train['Embarked_num'][:6])
one_hot
   0  1  2
0  0  0  1
1  1  0  0
2  0  0  1
3  0  0  1
4  0  0  1
5  0  1  0
one_hot.columns = ['C', 'Q', 'S']
one_hot
   C  Q  S
0  0  0  1
1  1  0  0
2  0  0  1
3  0  0  1
4  0  0  1
5  0  1  0
One-hot encoding is applied to columns that are categorical in nature (season, port, sex, type, …).
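pd.get_dummies also accepts the string column directly, in which case the new columns are named after the category values and no renaming step is needed. The sketch below uses the first six Embarked values shown earlier; passing dtype=int is a choice made here so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# First six Embarked values from the example above.
embarked = pd.Series(['S', 'C', 'S', 'S', 'S', 'Q'], name='Embarked')

# One column per unique value; exactly one 1 per row.
one_hot = pd.get_dummies(embarked, dtype=int)

print(one_hot)
print(list(one_hot.columns))  # ['C', 'Q', 'S']
```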
5. Normalization
Normalization: when columns have different min and max values, normalization puts them on the same min / max scale.
sklearn.preprocessing --> MinMaxScaler()
Example: movie ratings
Naver movie ratings (0 to 10 points): [2, 4, 6, 8, 10]
Netflix movie ratings (0 to 5 points): [1, 2, 3, 4, 5]
movie = {'naver': [2, 4, 6, 8, 10],
         'netflix': [1, 2, 3, 4, 5]}
movie = pd.DataFrame(data=movie)
movie
   naver  netflix
0  2      1
1  4      2
2  6      3
3  8      4
4  10     5
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
min_max_movie = min_max_scaler.fit_transform(movie)
pd.DataFrame(min_max_movie, columns=['naver', 'netflix'])
   naver  netflix
0  0.00   0.00
1  0.25   0.25
2  0.50   0.50
3  0.75   0.75
4  1.00   1.00
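The scaled result can be cross-checked against the min-max formula applied column by column. This sketch rebuilds the movie ratings frame from the example and compares MinMaxScaler with the manual computation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

movie = pd.DataFrame({'naver': [2, 4, 6, 8, 10], 'netflix': [1, 2, 3, 4, 5]})

# MinMaxScaler works per column and returns a NumPy array.
scaled = MinMaxScaler().fit_transform(movie)

# Manual version of (x - x_min) / (x_max - x_min), also per column.
manual = (movie - movie.min()) / (movie.max() - movie.min())

print(np.allclose(scaled, manual.to_numpy()))
```

After scaling, the two rating systems become directly comparable: both columns run from 0 to 1 in identical steps.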
6. Standard Scaling (Standardization)
Standardization: transforms the data so that its mean is 0 and its standard deviation is 1.
sklearn.preprocessing --> StandardScaler()
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
x = np.arange(10)
x[9] = 1000
x.mean(), x.std()
(103.6, 298.8100399919654)
scaled = standard_scaler.fit_transform(x.reshape(-1, 1))
scaled.mean(), scaled.std()
(4.4408920985006264e-17, 1.0)
round(scaled.mean(), 2), scaled.std()
(0.0, 1.0)
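StandardScaler applies the standardization formula with the population standard deviation, which matches NumPy's default std(). The sketch below repeats the example array and compares the scaler output with the manual computation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same array as in the example: 0..8 plus one large value.
x = np.arange(10)
x[9] = 1000

# StandardScaler expects a 2D array, hence reshape(-1, 1).
scaled = StandardScaler().fit_transform(x.reshape(-1, 1))

# Manual version of (x - mu) / sigma with the population std (ddof=0).
manual = (x - x.mean()) / x.std()

print(np.allclose(scaled.ravel(), manual))
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```

The mean printed earlier as 4.44e-17 is floating-point rounding error, which is why rounding it gives exactly 0.0.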