Data Analysis with Python - Model Evaluation & Learning Objectives

Hiru_93 2022. 8. 13. 18:16
  • Drawback of in-sample evaluation: it does not tell us how well the trained model will predict new data
  • Solution: split the data into two sets (a training set and a testing set)
  • First we build the model with the training set, then use the testing set to assess the model
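A minimal sketch of the split described above, assuming scikit-learn's train_test_split. The synthetic data, the 30% test size, and the names r2_train and r2_test are assumptions made for illustration, not taken from the course material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x_data = rng.uniform(50, 250, size=(200, 1))
y_data = 100 * x_data[:, 0] + rng.normal(0, 2000, size=200)

# Hold out 30% of the samples as the testing set
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=0
)

lr = LinearRegression()
lr.fit(x_train, y_train)  # build the model on the training set only

r2_train = lr.score(x_train, y_train)  # in-sample R^2
r2_test = lr.score(x_test, y_test)     # out-of-sample R^2 on unseen data
print(len(x_train), len(x_test), r2_train, r2_test)
```

The out-of-sample score is the one that says something about new data; the in-sample score alone does not.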

 

 

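  • With a single split, the more data we use for training, the less remains for testing, so the estimate of the generalization error becomes less reliable. To compensate, we evaluate the model on several different training/testing splits. This is called cross-validation.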

 

  • Code:
  • Arguments: lr = the linear regression object, cv = the number of partitions (folds)
  • If we want the actual predicted values, use the function cross_val_predict()
  • Relationship between polynomial order and error (MSE): if the order is too high, overfitting occurs; if it is too low, underfitting occurs
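The cross-validation calls mentioned in the notes above might look like the following. This is a sketch using scikit-learn's cross_val_score and cross_val_predict; the synthetic data and cv=3 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic data (hypothetical stand-in for the course dataset)
rng = np.random.default_rng(1)
x_data = rng.uniform(50, 250, size=(120, 1))
y_data = 100 * x_data[:, 0] + rng.normal(0, 2000, size=120)

lr = LinearRegression()

# One R^2 score per fold; cv=3 splits the data into 3 partitions
scores = cross_val_score(lr, x_data, y_data, cv=3)
mean_r2 = scores.mean()  # average the folds for a single estimate

# Out-of-fold prediction for every sample: each value is predicted
# by a model that did not see that sample during training
yhat = cross_val_predict(lr, x_data, y_data, cv=3)
print(scores, mean_r2, yhat[:5])
```

cross_val_score returns only the scores, while cross_val_predict returns the predicted values themselves.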

 

 

 

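  • How to calculate R-squared values for different polynomial orders:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# x_train, x_test, y_train, y_test are assumed to come from an earlier train/test split
lr = LinearRegression()
Rsqu_test = []  # create an empty list to store the R^2 values
order = [1, 2, 3, 4]  # create a list containing different polynomial orders (1st through 4th)
for n in order:  # iterate through the list using a loop
    pr = PolynomialFeatures(degree=n)
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    x_test_pr = pr.transform(x_test[['horsepower']])  # reuse the transform fitted on the training data
    lr.fit(x_train_pr, y_train)
    Rsqu_test.append(lr.score(x_test_pr, y_test))  # R^2 on the test data for this order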

 

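  • Ridge regression: used to prevent overfitting
  • To create a ridge regression model, we must specify alpha
  • alpha = 0 -> overfit (no penalty, so the model is the same as ordinary least squares)
  • alpha = 10 -> underfit
  • An alpha around 0.01 works well here (use cross-validation to find a suitable value)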

 

  • Code:

 

 

  • Process:
  • We usually select the alpha with the highest R^2 value. It can also be chosen based on MSE (the lowest MSE).
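The selection process above can be sketched as a loop over candidate alpha values. Everything specific here is an assumption for illustration: the synthetic data, the polynomial degree of 5, the alpha list, and the names ridge_model, best_by_r2 and best_by_mse.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (hypothetical): a quadratic signal plus noise
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=(150, 1))
y = 1 + 2 * x[:, 0] - 3 * x[:, 0] ** 2 + rng.normal(0, 0.3, size=150)

x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Deliberately flexible features so that regularization matters
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train)
x_val_pr = pr.transform(x_val)

alphas = [0.001, 0.01, 0.1, 1, 10]
r2_scores, mse_scores = [], []
for alpha in alphas:
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(x_train_pr, y_train)
    r2_scores.append(ridge_model.score(x_val_pr, y_val))
    mse_scores.append(mean_squared_error(y_val, ridge_model.predict(x_val_pr)))

best_by_r2 = alphas[int(np.argmax(r2_scores))]    # highest validation R^2
best_by_mse = alphas[int(np.argmin(mse_scores))]  # lowest validation MSE
print(best_by_r2, best_by_mse)
```

On a single validation set the two criteria normally pick the same alpha, because R^2 decreases as MSE increases on the same data.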

 

  • Grid search: allows us to scan through multiple free parameters (hyperparameters such as alpha) with a few lines of code

 

  • Three different sets in grid search: Training, Validation, Test
  • The code is as follows:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV  # import Ridge, GridSearchCV

# The dictionary of parameter values
parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000]}]

# Create a ridge regression object
RR = Ridge()

# Create a GridSearchCV object
Grid1 = GridSearchCV(RR, parameters1, cv=4)

# Fit the object (x_data and y_data are assumed to be loaded earlier)
Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)

# Find the best values for the free parameters
Grid1.best_estimator_

# Get the mean score
scores = Grid1.cv_results_
scores['mean_test_score']
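To make the three sets (training, validation, test) concrete, here is a self-contained sketch. It assumes synthetic data in place of the car dataset and a shorter alpha list; best_alpha and test_r2 are illustrative names. GridSearchCV handles the training/validation splits internally through cv, and the test set is held out separately for the final check.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (hypothetical): 4 features, linear signal plus noise
rng = np.random.default_rng(3)
x_data = rng.normal(size=(200, 4))
y_data = x_data @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.5, size=200)

# Test set: held out and never seen during the grid search
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=0
)

# Training/validation: 4-fold cross-validation inside the grid search
parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100]}]
Grid1 = GridSearchCV(Ridge(), parameters1, cv=4)
Grid1.fit(x_train, y_train)

best_alpha = Grid1.best_params_['alpha']               # alpha with the best mean validation score
test_r2 = Grid1.best_estimator_.score(x_test, y_test)  # final check on the test set
print(best_alpha, test_r2)
```

By default GridSearchCV refits the best model on all of the data passed to fit, so best_estimator_ can be scored directly on the test set.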