1-3 Unsupervised learning: scikit-learn k-means clustering on the Iris flowers data


coding art 2020. 1. 4. 09:44


Rather than practicing only with the make_blobs data provided by the scikit-learn library, let us apply the k-means clustering technique to the Iris flowers data.

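Chapter 1 of the author's book “파이선 코딩 초보자를 위한 Scikit∙PyTorch 머신러닝” (Scikit and PyTorch Machine Learning for Python Coding Beginners) already covers various classification techniques on the Iris flowers data. Each of those programs begins with code that reads the Iris flowers data from an internet server, so here that loading code is combined with the k-means clustering code to produce a script that also plots the results.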

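The Iris flowers data consists of three species: setosa, versicolor, and virginica. Here only setosa and versicolor are used, 50 samples each, all with known labels. k-means clustering is applied to their petal length and sepal length values, and the resulting clusters are compared against the known labels to judge how accurately the clustering reproduces the classification.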

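Comparing the plot of the known labels (left) with the k-means result (right) shows that one versicolor sample, which lies fairly far from both centroids, has been assigned to the wrong cluster. An error of this size is also common with SVC or SVM, the machine learning methods often said to deliver the best classification performance.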
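The comparison described above can also be made numerically instead of by eye. The following is a minimal sketch, not the author's code: it assumes the copy of the Iris data bundled with scikit-learn (`load_iris`), which avoids the network download and may differ slightly from the UCI file, and it introduces a hypothetical helper `cluster_agreement`. Because k-means assigns cluster ids arbitrarily, the helper tries both possible id-to-label mappings and keeps the better one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris


def cluster_agreement(y_true, y_cluster):
    """Return (matches, accuracy) for two-cluster output against 0/1 labels.

    Hypothetical helper: cluster ids are arbitrary, so both id-to-label
    mappings are tried and the better one is kept.
    """
    y_true = np.asarray(y_true)
    y_cluster = np.asarray(y_cluster)
    same = int(np.sum(y_cluster == y_true))   # ids used as they are
    flipped = len(y_true) - same              # ids 0 and 1 swapped
    matches = max(same, flipped)
    return matches, matches / len(y_true)


# Assumption: bundled dataset; rows 0-49 are setosa (0), 50-99 versicolor (1)
iris = load_iris()
X = iris.data[:100, [0, 2]]    # sepal length, petal length
y = iris.target[:100]

km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=300,
            tol=1e-04, random_state=0)
y_km = km.fit_predict(X)

matches, accuracy = cluster_agreement(y, y_km)
n_wrong = len(y) - matches
print(f"matching samples: {matches} / {len(y)}")
print(f"misassigned samples: {n_wrong}")
print(f"agreement with labels: {accuracy:.2%}")
```

The printed count of misassigned samples can then be checked against what the two scatter plots show.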

# Iris_data_kmean_plot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# ### Plotting the Iris data
# Read the Iris data from the UCI server (the file has no header row)
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases/iris/iris.data', header=None)

# Select setosa (rows 0-49) and versicolor (rows 50-99).
# The labels are kept for reference only; k-means does not use them.
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)

# Extract sepal length (column 0) and petal length (column 2)
X = df.iloc[0:100, [0, 2]].values

# Plot the data colored by the known labels
plt.scatter(X[:50, 0], X[:50, 1],
            color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],
            color='blue', marker='x', label='versicolor')

plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')
plt.show()

# K-means processing
from sklearn.cluster import KMeans

# Two clusters, k-means++ initialization, best of 10 restarts
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=300,
            tol=1e-04, random_state=0)

# Fit the model and get a cluster id (0 or 1) for every sample
y_km = km.fit_predict(X)

# Plot the samples colored by assigned cluster
plt.scatter(X[y_km == 0, 0], X[y_km == 0, 1], s=50, c='lightgreen',
            marker='s', edgecolor='black', label='cluster 1')
plt.scatter(X[y_km == 1, 0], X[y_km == 1, 1], s=50, c='orange',
            marker='o', edgecolor='black', label='cluster 2')
# A third cluster is not used here (only two species are selected)
#plt.scatter(X[y_km == 2, 0], X[y_km == 2, 1], s=50, c='lightblue',
            #marker='v', edgecolor='black', label='cluster 3')

# Mark the cluster centroids
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            s=250, marker='*', c='red', edgecolor='black',
            label='centroids')
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')
plt.show()
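The script plots the clusters but does not list which samples end up in the "wrong" cluster or how far they lie from each centroid. The sketch below is one possible way to inspect that, under stated assumptions: it uses the Iris copy bundled with scikit-learn (`load_iris`) rather than the UCI download, and the names `cluster_to_species`, `y_pred`, and `wrong` are illustrative only. Each cluster is mapped to the species that forms its majority, and `KMeans.transform` supplies the distance from every sample to every centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: bundled dataset; rows 0-49 are setosa (0), 50-99 versicolor (1)
iris = load_iris()
X = iris.data[:100, [0, 2]]    # sepal length, petal length
y = iris.target[:100]

km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=300,
            tol=1e-04, random_state=0)
y_km = km.fit_predict(X)

# Distance from each sample to each centroid, shape (100, 2)
dist = km.transform(X)

# Map each cluster id to the species that is in the majority inside it
cluster_to_species = np.array(
    [np.bincount(y[y_km == k], minlength=2).argmax() for k in range(2)]
)
y_pred = cluster_to_species[y_km]

# Indices of samples whose cluster disagrees with the known label
wrong = np.flatnonzero(y_pred != y)

for i in wrong:
    print(f"sample {i}: {iris.target_names[y[i]]}, "
          f"sepal={X[i, 0]:.1f} cm, petal={X[i, 1]:.1f} cm, "
          f"distances to centroids={np.round(dist[i], 2)}")
print(f"{len(wrong)} sample(s) assigned to the other species' cluster")
```

A sample whose two centroid distances are nearly equal sits close to the cluster boundary, which is where such misassignments tend to occur.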

License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)
