Iris Data Analysis¶

Loading the iris data¶

Load iris.xlsx and store it in a DataFrame named iris.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt # same as: from matplotlib import pyplot as plt
import seaborn as sns
iris = pd.read_excel("C:/Users/stat/Downloads/iris.xlsx")

The same analysis can also be done with the sample iris data bundled with the scikit-learn package, which is returned as a Bunch object.

from sklearn.datasets import load_iris
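
As a minimal sketch of that alternative, the Bunch returned by load_iris can be turned into a DataFrame with the same layout as the Excel file. The column names are copied from the iris.columns output in this post; the names bunch and iris_sk are illustrative only and do not appear in the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris()  # Bunch with .data, .target, .target_names, ...

# Build a DataFrame with the same column names as iris.xlsx (assumed layout)
iris_sk = pd.DataFrame(
    bunch.data,
    columns=["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"],
)
# Map the integer targets (0, 1, 2) to their species names
iris_sk["Species"] = bunch.target_names[bunch.target]

print(iris_sk.head())
```

The rest of the post would work on iris_sk just as it does on the frame read from Excel, provided the columns are renamed in the same way.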

iris.head() # show the first 5 rows

iris[0:5] # show the first 5 rows by slicing the index

iris.columns # show the column names of iris

Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')

iris["Species"].value_counts()

setosa        50
virginica     50
versicolor    50
Name: Species, dtype: int64

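Unlike R, where iris[1] returns the first column, the [] operator of a pandas DataFrame does not select a column by its position; columns are selected by name (position-based selection goes through iloc).
Names such as Sepal.Length and Sepal.Width are long, so the columns are renamed to make them easier to use.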

iris.rename(columns={iris.columns[0] : 'SL',
                     iris.columns[1] : 'SW',
                     iris.columns[2] : 'PL',
                     iris.columns[3] : 'PW',
                     iris.columns[4] : 'Y'}, inplace = True) # inplace=True modifies iris itself; without it a renamed copy is returned and iris stays unchanged
iris.head()

iris[["SL", "SW"]][:5] # first 5 rows of Sepal.Length and Sepal.Width
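
To illustrate name-based versus position-based selection, here is a small sketch on a hypothetical three-row frame (demo is not part of the original data). The [] operator selects columns by label, while iloc selects by integer position.

```python
import pandas as pd

# Hypothetical toy frame using the shortened column names
demo = pd.DataFrame({"SL": [5.1, 4.9, 4.7],
                     "SW": [3.5, 3.0, 3.2],
                     "Y": ["setosa", "setosa", "setosa"]})

by_name = demo[["SL", "SW"]][:2]      # columns by name, then the first 2 rows
by_position = demo.iloc[:2, [0, 1]]   # the same cells, selected by integer position

print(by_name.equals(by_position))
```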

EDA¶

st = iris.groupby(iris.Y).mean() # group by Y and compute the mean of each variable within each group
st.columns.name = "변수" # name the columns axis "변수" ("variable")
st
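
What groupby followed by mean does can be seen on a tiny example. This is only a sketch on a hypothetical four-row frame (toy and group_means are illustrative names); each row of the result holds the per-group mean of every numeric column.

```python
import pandas as pd

# Hypothetical toy data: two species with two measurements each
toy = pd.DataFrame({"SL": [5.0, 5.2, 6.0, 6.4],
                    "Y": ["setosa", "setosa", "virginica", "virginica"]})

group_means = toy.groupby("Y").mean()   # one row per species
group_means.columns.name = "variable"   # label for the columns axis

print(group_means)
```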

barplot¶

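No font has been configured yet, so plotting right away renders the Korean text as broken glyphs, as shown below.

st.T.plot.bar(rot=0) # rot: rotation angle of the x-axis tick labels
plt.title("각 변수별 종류의 평균") # "Mean of each variable by species"
plt.xlabel("변수") # "variable": the x-axis holds the variables
plt.ylabel("평균") # "mean": the y-axis holds the group means
plt.ylim(0,8)
plt.show()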

To fix the Korean text, locate the fonts installed on the computer, pick the one you want, and register it with matplotlib.

font_location = "C:/Windows/Fonts/malgunbd.ttf"
font_name = matplotlib.font_manager.FontProperties(fname = font_location).get_name()
matplotlib.rc("font", family = font_name)

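st.T.plot.bar(rot=0) # rot: rotation angle of the x-axis tick labels
plt.title("각 변수별 종류의 평균") # "Mean of each variable by species"
plt.xlabel("변수") # "variable": the x-axis holds the variables
plt.ylabel("평균") # "mean": the y-axis holds the group means
plt.ylim(0,8)
plt.show()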

Boxplot¶

iris[["SL","Y"]].boxplot(by='Y')
plt.tight_layout(pad=2, h_pad=1) # pad: padding around the figure edge, h_pad: vertical padding between adjacent subplots
plt.title("Sepal.Length boxplot by type")
plt.xlabel("type")
plt.ylabel("cm")
plt.show()

iris[["SW","Y"]].boxplot(by='Y')
plt.tight_layout(pad=2, h_pad=1) # pad: padding around the figure edge, h_pad: vertical padding between adjacent subplots
plt.title("Sepal.Width boxplot by type")
plt.xlabel("type")
plt.ylabel("cm")
plt.show()

iris[["PL","Y"]].boxplot(by='Y')
plt.tight_layout(pad=2, h_pad=1) # pad: padding around the figure edge, h_pad: vertical padding between adjacent subplots
plt.title("Petal.Length boxplot by type")
plt.xlabel("type")
plt.ylabel("cm")
plt.show()

iris[["PW","Y"]].boxplot(by='Y')
plt.tight_layout(pad=2, h_pad=1) # pad: padding around the figure edge, h_pad: vertical padding between adjacent subplots
plt.title("Petal.Width boxplot by type")
plt.xlabel("type")
plt.ylabel("cm")
plt.show()

Scatter Plot¶

from copy import copy
exiris = copy(iris)
exiris["id"] = list(range(1,151))
sns.pairplot(exiris.drop("id", axis=1), hue="Y", height=3) # height replaces the deprecated size parameter

<seaborn.axisgrid.PairGrid at 0x217e1458710>

Logistic regression¶

Classification accuracy obtained with a logistic regression model.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Y = iris["Y"] # response variable
X = iris[["SL","SW","PL","PW"]] # explanatory variables

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3) # split the data into a train set (0.7) and a test set (0.3)

log_clf = LogisticRegression() # create the logistic regression model as log_clf
log_clf.fit(X_train,Y_train) # fit the model on the train set
print("Accuracy : ", log_clf.score(X_test, Y_test)) # accuracy on the test set

Accuracy :  0.9555555555555556
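
Because train_test_split shuffles the rows randomly, the accuracy printed above changes from run to run. A minimal sketch of a reproducible variant follows; it uses scikit-learn's bundled iris data rather than the Excel file, and random_state=0, stratify and max_iter=1000 are choices made for this illustration, not settings from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the species proportions in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000)  # a larger max_iter helps the solver converge
clf.fit(X_tr, y_tr)
print("Accuracy : ", clf.score(X_te, y_te))
```

Running the cell twice now gives the same split and therefore the same score.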

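Binary classification¶

Binary classification outcomes¶

True Positive (TP) : an actual T predicted as T
True Negative (TN) : an actual F predicted as F
False Positive (FP) : an actual F predicted as T
False Negative (FN) : an actual T predicted as F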
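
The four outcomes can be counted directly from paired actual and predicted labels. The sketch below uses hypothetical boolean labels; confusion_counts is an illustrative helper, not a scikit-learn function.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for two equally long lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # T predicted as T
    tn = sum(1 for a, p in pairs if not a and not p)  # F predicted as F
    fp = sum(1 for a, p in pairs if not a and p)      # F predicted as T
    fn = sum(1 for a, p in pairs if a and not p)      # T predicted as F
    return tp, tn, fp, fn

# Hypothetical labels for six samples
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

print(confusion_counts(actual, predicted))
```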

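Evaluation scores¶

Accuracy : (TP + TN) / (TP + TN + FP + FN)
- The proportion of all samples that were predicted correctly
- Often the headline measure when tuning a model, although classifiers such as logistic regression are fitted by optimizing a surrogate loss rather than accuracy itself
Precision : TP / (TP + FP)
- The proportion of samples predicted as True that are actually True
Recall : TP / (TP + FN)
- Also called sensitivity
- The proportion of actually True samples that were predicted as True
Fall-Out (false positive rate) : FP / (FP + TN)
- The proportion of actually False samples that were predicted as True
Specificity : 1 - Fall-Out
- The proportion of actually False samples that were predicted as False
f1-score : 2 * precision * recall / (precision + recall)
- The harmonic mean of precision and recall
F-beta score : (1 + B^2) * (precision * recall) / (B^2 * precision + recall)
- The weighted harmonic mean of precision and recall; B = 1 gives the f1-score
support : the number of actual samples in each class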
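
The scores above can be computed from the four outcome counts in a few lines. This is a sketch with hypothetical counts (not taken from the iris results) that assumes every denominator is non-zero; classification_scores is an illustrative name. The (1 + B^2) formula is evaluated with beta=1, which reduces to the f1-score.

```python
def classification_scores(tp, tn, fp, fn, beta=1.0):
    """Compute the evaluation scores from TP, TN, FP and FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fall_out = fp / (fp + tn)
    b2 = beta ** 2
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "fall_out": fall_out,
        "specificity": 1 - fall_out,
        "f_beta": (1 + b2) * precision * recall / (b2 * precision + recall),
    }

# Hypothetical counts for 100 samples
scores = classification_scores(tp=40, tn=45, fp=5, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```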

from sklearn import metrics
expected = Y_test
predicted = log_clf.predict(X_test)
print(metrics.classification_report(expected, predicted)) # per-class and overall performance of the model

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       1.00      0.85      0.92        13
   virginica       0.88      1.00      0.93        14

   micro avg       0.96      0.96      0.96        45
   macro avg       0.96      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45

print(metrics.confusion_matrix(expected, predicted)) # rows: actual classes, columns: predicted classes

[[18  0  0]
 [ 0 11  2]
 [ 0  0 14]]
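
The per-class figures in the classification report can be recovered from the confusion matrix printed above. The sketch below retypes that matrix by hand and treats each class in turn as the positive class; the variable names are illustrative.

```python
labels = ["setosa", "versicolor", "virginica"]
cm = [[18, 0, 0],
      [0, 11, 2],
      [0, 0, 14]]  # rows: actual class, columns: predicted class

per_class = {}
for k, name in enumerate(labels):
    tp = cm[k][k]
    support = sum(cm[k])                         # actual samples of class k
    predicted_as_k = sum(row[k] for row in cm)   # samples predicted as class k
    precision = tp / predicted_as_k
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    per_class[name] = (precision, recall, f1, support)
    print(f"{name:>10} {precision:.2f} {recall:.2f} {f1:.2f} {support}")

# Overall accuracy: correct predictions on the diagonal over all samples
accuracy = sum(cm[k][k] for k in range(len(labels))) / sum(map(sum, cm))
print("accuracy", accuracy)
```

The two versicolor samples predicted as virginica lower versicolor's recall and virginica's precision, which is what the report shows.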

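ROC (Receiver Operating Characteristic) curve¶

The ROC curve visualizes how the false positive rate (Fall-Out) and the recall change as the class decision threshold is varied.
The closer the area under the ROC curve (AUC) is to 1, the better the model separates the two classes.
It also helps decide which model is preferable for the situation in which it will be used.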
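
How the curve arises from a moving threshold can be sketched without any library. The labels and scores below are hypothetical, and roc_points and auc_trapezoid are illustrative helpers that mimic, in simplified form, what a library routine does: each distinct score is used as a threshold, and the area is accumulated with the trapezoid rule.

```python
def roc_points(y_true, scores):
    """Return (fall-out, recall) pairs as the threshold is lowered step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels (1 = positive) and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)
print("AUC =", auc_trapezoid(points))
```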

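A single ROC curve is defined for a binary target, so it cannot be drawn directly for the multinomial model above. The target is therefore converted to a binary one (versicolor versus the rest) and logistic regression is fitted again.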

# 1 for versicolor, 0 for the other two species; astype(int) keeps the labels as integers
Y_new = (Y == "versicolor").astype(int)
X_new = copy(X)

X_train, X_test, Y_train, Y_test = train_test_split(X_new, Y_new, test_size = 0.3) # split the data into a train set (0.7) and a test set (0.3)

log_clf = LogisticRegression() # create the logistic regression model as log_clf
log_clf.fit(X_train,Y_train) # fit the model on the train set
print("Accuracy : ", log_clf.score(X_test, Y_test)) # accuracy on the test set

Accuracy :  0.6

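Interpreting the ROC curve¶

AUC (Area Under the Curve) is the area under the ROC curve.
The curve has to lie above the reference line, which corresponds to an AUC of 0.5, for the test to be meaningful.
AUC grades from a paper published by Muller in 2005:
- 0.90 - 1.00 : Excellent
- 0.80 - 0.90 : Good
- 0.70 - 0.80 : Fair
- 0.60 - 0.70 : Poor
- 0.50 - 0.60 : Fail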
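
The grade table can be written as a small lookup. This is only a sketch: auc_grade is an illustrative helper, and because the listed ranges share their end points, it assumes that a boundary value belongs to the higher grade.

```python
def auc_grade(auc):
    """Map an AUC value to the grade listed above (lower bounds inclusive, by assumption)."""
    if auc >= 0.90:
        return "Excellent"
    if auc >= 0.80:
        return "Good"
    if auc >= 0.70:
        return "Fair"
    if auc >= 0.60:
        return "Poor"
    if auc >= 0.50:
        return "Fail"
    return "Below the reference line"  # not covered by the table

for value in (0.95, 0.75, 0.55):
    print(value, auc_grade(value))
```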

y_pred_proba = log_clf.predict_proba(X_test)[:, 1] # predicted probability of class 1 for each test sample

fpr, tpr, threshold = metrics.roc_curve(Y_test, y_pred_proba) # points of the ROC curve
roc_auc = metrics.auc(fpr, tpr) # area under the curve

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--') # reference line (AUC = 0.5)
plt.xlim([0, 1])
#plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
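
As a cross-check on the two-step computation with roc_curve and auc, scikit-learn also offers roc_auc_score, which returns the area directly. The sketch below uses hypothetical labels and scores rather than the iris test set.

```python
from sklearn import metrics

# Hypothetical labels (1 = positive) and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
area_from_curve = metrics.auc(fpr, tpr)               # area from the curve points
area_direct = metrics.roc_auc_score(y_true, y_score)  # area in one call

print(area_from_curve, area_direct)
```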

Output of iris.head() after renaming the columns:

	SL	SW	PL	PW	Y
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Output of iris[["SL", "SW"]][:5]:

	SL	SW
0	5.1	3.5
1	4.9	3.0
2	4.7	3.2
3	4.6	3.1
4	5.0	3.6

Output of st (group means by Y):

변수	SL	SW	PL	PW
Y
setosa	5.006	3.428	1.462	0.246
versicolor	5.936	2.770	4.260	1.326
virginica	6.588	2.974	5.552	2.026

Ruser

Iris data analysis using Python
