K-MOOC 2주차 머신러닝 빅데이터 분석

K-MOOC 머신러닝 빅데이터 분석 2주차

1차시 머신러닝 알고리즘의 종류

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

2차시 Scikit-learn 활용하기

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Scikit-learn 실습

라이브러리 추가

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import accuracy_score
import seaborn as sns

데이터셋 준비

iris = load_iris()

iris_data = iris.data 
iris_label = iris.target

데이터셋 탐색

iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names) 
iris_df ['label'] = iris.target 

sns.pairplot(iris_df, x_vars=["sepal length (cm)"],  
             y_vars=["sepal width (cm)"], hue="label", height=5) 
<seaborn.axisgrid.PairGrid at 0x7f2cb01c3b80>

png

머신러닝 수행

X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label, test_size=0.2) 

dt_clf = DecisionTreeClassifier() 
knn_clf = KNeighborsClassifier() 

dt_clf.fit(X_train, y_train)
knn_clf.fit(X_train, y_train)

y1_pred = dt_clf.predict(X_test) 
y2_pred = knn_clf.predict(X_test) 

print('DecisionTree 예측정확도 : {0:.4f}'.format(accuracy_score(y_test, y1_pred))) 
print('k-NN 예측정확도 : {0: .4f}'.format(accuracy_score(y_test, y2_pred))) 
DecisionTree 예측정확도 : 0.9333
k-NN 예측정확도 :  0.9000

3차시 K-NN 알고리즘

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

Untitled

실습

k-NN 알고리즘 활용 실습

라이브러리 import

import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/wikibook/machine-learning/2.0/data/csv/basketball_stat.csv") 
df.head() 
Player Pos 3P 2P TRB AST STL BLK
0 Alex Abrines SG 1.4 0.6 1.3 0.6 0.5 0.1
1 Steven Adams C 0.0 4.7 7.7 1.1 1.1 1.0
2 Alexis Ajinca C 0.0 2.3 4.5 0.3 0.5 0.6
3 Chris Andersen C 0.0 0.8 2.6 0.4 0.4 0.6
4 Will Barton SG 1.5 3.5 4.3 3.4 0.8 0.5
df.Pos.value_counts() 
SG    50
C     50
Name: Pos, dtype: int64

데이터 시각화

스틸, 2점슛 데이터 시각화

import matplotlib.pyplot as plt
import seaborn as sns

sns.lmplot(x='STL', y='2P', data=df, fit_reg=False,  
           scatter_kws={"s": 150}, 
           markers=["o", "x"], 
           hue="Pos") 
<seaborn.axisgrid.FacetGrid at 0x7ff871b96950>

png

블로킹, 3점슛 데이터 시각화

sns.lmplot(x='BLK', y='3P', data=df, fit_reg=False,  
           scatter_kws={"s": 150}, 
           markers=["o", "x"], 
           hue="Pos") 
<seaborn.axisgrid.FacetGrid at 0x7ff871b14d10>

png

리바운드, 3점슛 데이터 시각화

sns.lmplot(x='TRB', y='3P', data=df, fit_reg=False,  
           scatter_kws={"s": 150}, 
           markers=["o", "x"], 
           hue="Pos") 
<seaborn.axisgrid.FacetGrid at 0x7ff871a7e890>

png

데이터 다듬기

df.drop(['2P', 'AST', 'STL'], axis=1, inplace = True) 
df.head() 
Player Pos 3P TRB BLK
0 Alex Abrines SG 1.4 1.3 0.1
1 Steven Adams C 0.0 7.7 1.0
2 Alexis Ajinca C 0.0 4.5 0.6
3 Chris Andersen C 0.0 2.6 0.6
4 Will Barton SG 1.5 4.3 0.5

데이터 나누기

from sklearn.model_selection import train_test_split 

train, test = train_test_split(df, test_size=0.2) 

print(train.shape[0])
print(test.shape[0])
80
20

라이브러리 import

import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np

np.random.seed(5)

최적의 kNN 파라미터 찾기

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

max_k_range = train.shape[0] // 2 
k_list = [] 
for i in range(3, max_k_range, 2): 
    k_list.append(i) 
cross_validation_scores = [] 
x_train = train[['3P', 'BLK' , 'TRB']] 
y_train = train[['Pos']] 
for k in k_list:  
    knn = KNeighborsClassifier(n_neighbors=k) 
    scores = cross_val_score(knn, x_train, y_train.values.ravel(), 
                             cv=10, scoring='accuracy') 
    cross_validation_scores.append(scores.mean()) 

cross_validation_scores
[0.925,
 0.9,
 0.9,
 0.8875,
 0.8875,
 0.9,
 0.9,
 0.8875,
 0.8875,
 0.8625,
 0.8625,
 0.8625,
 0.8375,
 0.8375,
 0.8375,
 0.8375,
 0.8125,
 0.8,
 0.7875]

k의 변화에 따른 정확도 시각화

plt.plot(k_list, cross_validation_scores) 
plt.xlabel('the number of k') 
plt.ylabel('Accuracy') 
plt.show() 
cvs = cross_validation_scores 
k = k_list[cvs.index(max(cross_validation_scores))] 
print("The best number of k : " + str(k) ) 

png

The best number of k : 3

k-NN 모델 테스트

from sklearn.metrics import accuracy_score 

knn = KNeighborsClassifier(n_neighbors=k) 

x_train = train[['3P', 'BLK', 'TRB']] 
y_train = train[['Pos']] 

knn.fit(x_train, y_train.values.ravel()) 

x_test = test[['3P', 'BLK', 'TRB']] 
y_test = test[['Pos']] 
pred = knn.predict(x_test) 
print("accuracy : "+  
          str(accuracy_score(y_test.values.ravel(), pred)) ) 

comparison = pd.DataFrame({'prediction':pred, 'ground_truth':y_test.values.ravel()}) 
comparison.head(10) 
accuracy : 0.95
prediction ground_truth
0 C C
1 C C
2 SG SG
3 SG SG
4 C C
5 SG SG
6 SG SG
7 C C
8 C C
9 SG SG

Categories:

Updated:

Leave a comment