Heart Attack Data Analysis with ML/DL (1): Data Preprocessing and Oversampling (SMOTE)


Work in progress

Materials by project stage (3 parts in total)

Understanding the data

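Meaning of each data column
- age : age
- sex : sex (1 = male / 0 = female)
- cp : chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
- trestbps : resting blood pressure
- chol : serum cholesterol (mg/dl)
- fbs : fasting blood sugar above 120 mg/dl (1 = true / 0 = false)
- restecg : resting electrocardiogram result (0-2)
- thalach : maximum heart rate achieved
- exang : exercise-induced angina (1 = yes / 0 = no)
- oldpeak : ST depression induced by exercise relative to rest
- slp : slope of the peak exercise ST segment (1, 2, 3)
- ca : number of major vessels colored by fluoroscopy (0-3)
- thal : thallium stress test result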
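
Checking missing values and outliers
- There were no missing values
- The chol column showed a few outliers → rows with values above 500 were removed

```python
import pandas as pd
import seaborn as sns

# Box plot of the chol column to inspect outliers
chol = df.loc[:, ['chol']]
sns.boxplot(x='variable', y='value', data=chol.melt())

# Drop the rows where chol exceeds 500
df.drop(df[df['chol'] > 500].index, inplace=True)
```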
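The two checks described above can be tried end to end on a small table. This is a minimal sketch: the toy values and the `target` label column are hypothetical, and only the `chol` column name and the 500 cutoff come from the post.

```python
import pandas as pd

# Hypothetical toy data; only the chol column name and the 500 cutoff follow the post
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 67],
    "chol": [233, 250, 204, 564, 229],
    "target": [1, 1, 0, 1, 0],  # hypothetical label column
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Keep only the rows at or below the cutoff instead of dropping in place
CHOL_MAX = 500
df_clean = df[df["chol"] <= CHOL_MAX].reset_index(drop=True)
print(len(df), "->", len(df_clean))
```

Filtering into a new frame (`df_clean`) rather than using `inplace=True` keeps the original rows available, which helps when the cutoff needs to be revisited.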

Data oversampling

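The collected dataset was too small, so additional data was needed.
In addition, running Google AutoML requires at least 1,000 data points.
Oversampling was therefore done with SMOTE.
30% of the original data was first held out as test data, and oversampling was applied only to the remaining data.
(1,808 SMOTE samples + 91 test samples = 1,899 samples in total)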
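The order of operations matters here: the test rows are set aside first, and only the remaining rows are oversampled, so no synthetic point is derived from a test row. The sketch below illustrates that order with a simplified NumPy version of the interpolation idea behind SMOTE. It is not the imbalanced-learn implementation; the toy data, the helper name `smote_like`, and `k=5` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced toy data: 70 rows of class 0, 30 rows of class 1
X = np.vstack([rng.normal(0.0, 1.0, size=(70, 2)),
               rng.normal(3.0, 1.0, size=(30, 2))])
y = np.array([0] * 70 + [1] * 30)

# 1) Hold out 30% as test data BEFORE any oversampling
order = rng.permutation(len(X))
n_test = int(round(0.3 * len(X)))
test_idx, train_idx = order[:n_test], order[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]


def smote_like(X_min, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a picked
    minority row and one of its k nearest minority neighbours."""
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# 2) Oversample only the training part until both classes have equal size
counts = np.bincount(y_train)
minority = int(np.argmin(counts))
n_new = int(counts.max() - counts.min())
X_min = X_train[y_train == minority]
X_new = smote_like(X_min, n_new, k=5, rng=rng)
X_bal = np.vstack([X_train, X_new])
y_bal = np.concatenate([y_train, np.full(n_new, minority)])

print("train:", X_train.shape, "balanced:", X_bal.shape, "test:", X_test.shape)
```

Because every synthetic row is a point on a segment between two existing minority rows, the new rows stay inside the range already covered by the training data; they add volume, not new information.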

Code notes

SMOTE

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Model setup (current imbalanced-learn API; older releases took ratio= and kind=)
sm = SMOTE(sampling_strategy='auto')

# Work on NumPy arrays so the concatenation below has consistent shapes
train_data = np.asarray(train_data)
train_label = np.asarray(train_label).ravel()

for i in range(3):
    # Feed the training data to SMOTE; the whole step is repeated 3 times
    # (fit_resample replaces the removed fit_sample method)
    X_resampled, y_resampled = sm.fit_resample(train_data, train_label)
    print('After OverSampling, the shape of train_X: {}'.format(X_resampled.shape))
    print('After OverSampling, the shape of train_y: {}'.format(y_resampled.shape))

    print("After OverSampling, counts of label '1': {}".format(sum(y_resampled == 1)))
    print("After OverSampling, counts of label '0': {}\n".format(sum(y_resampled == 0)))

    # Concatenate the resampled set with the current training set.
    # The resampled set already contains the input rows, so they appear twice.
    train_data = np.concatenate((train_data, X_resampled), axis=0)
    train_label = np.concatenate((train_label, y_resampled), axis=0)

# Result
print("Resulting number of samples", train_data.shape)
print("Resulting number of samples", train_label.shape)
```
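To see how the loop above grows the training set, the class counts can be traced without running SMOTE at all. Each pass balances the current set (both classes are raised to the majority size) and then appends that balanced set to the current one. The sketch below only does this bookkeeping. The starting counts of 114 and 98 rows are assumptions that the post does not state; they were picked because they are consistent with the 1,808 total it reports.

```python
def trace_oversampling(n_class0, n_class1, rounds=3):
    """Trace class counts for: resample to balance, then append to the current set."""
    history = [(n_class0, n_class1)]
    for _ in range(rounds):
        majority = max(n_class0, n_class1)
        # Balanced output of the resampler: both classes at the majority size
        res0, res1 = majority, majority
        # Appending the resampled set to the current set
        n_class0, n_class1 = n_class0 + res0, n_class1 + res1
        history.append((n_class0, n_class1))
    return history


# Assumed starting counts (not stated in the post): 212 training rows, 114 vs 98
for step, (c0, c1) in enumerate(trace_oversampling(114, 98)):
    print(step, c0, c1, c0 + c1)
```

Under these assumed counts the total goes 212, 440, 896, 1808. The classes never become exactly equal, because the imbalanced current set is carried along on every pass, and rows from earlier passes are duplicated rather than newly synthesized.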

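## Caveats

As a working rule of thumb, a dataset of at least 100,000 samples should be secured before doing ML/DL analysis.
In principle, when data is insufficient, additional data should be collected before proceeding.
Because this dataset was borrowed from Kaggle and could not be extended, SMOTE was used instead.
Other heart disease datasets were available, but this one was chosen because the others used terminology that was hard to understand and contained many outliers.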


황준우(JUNU.HWANG)
