  • Data Analysis with Python - Module Evaluation & Learning Objectives
    Data Science 2022. 8. 13. 18:16
    • Drawback of in-sample evaluation: it does not tell us how well the trained model predicts new data
    • Solution? Split the data into two sets (training set, testing set)
    • First we build the model with the training set, then use the testing set to assess it
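    The split described above can be sketched as follows. This is a minimal example, assuming scikit-learn's train_test_split and a synthetic stand-in data set (the names x_data and y_data and the 30% test size are illustrative choices, not taken from the original notes).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): y depends linearly on x plus noise
rng = np.random.default_rng(0)
x_data = rng.uniform(50, 250, size=(200, 1))
y_data = 100 * x_data[:, 0] + 5000 + rng.normal(0, 1500, size=200)

# Hold out 30% of the rows for testing; the model never sees them during fitting
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=0
)

lr = LinearRegression()
lr.fit(x_train, y_train)  # build the model on the training set only

r2_train = lr.score(x_train, y_train)  # in-sample R^2
r2_test = lr.score(x_test, y_test)     # out-of-sample R^2
print(r2_train, r2_test)
```

    The out-of-sample score is the one that indicates how the model is likely to behave on new data.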

     

     

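    • The more data we use for training, the less is left for testing, so the estimate of the generalization error depends heavily on which rows happen to land in the test set. To compensate, the model is trained and tested on several different training/testing splits. This is called cross validation.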

     

    • Code:
    • lr = linear regression object, cv = number of partitions (folds)
    • If we want the actual predicted values instead, use the function cross_val_predict()
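    A minimal sketch of both calls, assuming scikit-learn's cross_val_score for the per-fold scores (the notes name only cross_val_predict) and synthetic stand-in data; cv=3 is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(1)
x_data = rng.uniform(50, 250, size=(120, 1))
y_data = 100 * x_data[:, 0] + 5000 + rng.normal(0, 1500, size=120)

lr = LinearRegression()

# One R^2 score per fold; cv=3 splits the data into 3 partitions
scores = cross_val_score(lr, x_data, y_data, cv=3)
print(scores.mean(), scores.std())

# One prediction per row, each made by a model that did not train on that row
yhat = cross_val_predict(lr, x_data, y_data, cv=3)
print(yhat[:5])
```

    The mean of the fold scores is the usual summary of out-of-sample performance; the standard deviation shows how much it varies between splits.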
    • Relationship between polynomial order and the error (MSE): if the order is too high, overfitting occurs; if it is too low, underfitting occurs

     

     

     

    • How to calculate the test R-squared for different polynomial orders?
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    lr = LinearRegression()
    Rsqu_test = []  # create an empty list to store the R^2 values
    order = [1, 2, 3, 4]  # polynomial orders to try (1st through 4th)
    for n in order:  # iterate through the list using a loop
        pr = PolynomialFeatures(degree=n)
        x_train_pr = pr.fit_transform(x_train[['horsepower']])
        x_test_pr = pr.transform(x_test[['horsepower']])  # reuse the transform fitted on the training data
        lr.fit(x_train_pr, y_train)
        Rsqu_test.append(lr.score(x_test_pr, y_test))  # R^2 on the test set

     

    • Ridge regression: used to prevent overfitting
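    • To create a ridge regression model, alpha must be specified
    • alpha = 0 -> no regularization, so a high-order model can still overfit
    • alpha = 10 -> underfit (the coefficients are shrunk too much)
    • An alpha around 0.01 works well in this example (use cross validation to find a suitable value)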

     

    • Code below:

     

     

    • Process:
    • We usually select the alpha with the highest R^2 value. The MSE can also be used to decide (choose the lowest MSE).
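    The selection process above can be sketched as a loop over candidate alphas. This is an illustrative example under stated assumptions: the data is synthetic, and the polynomial degree, the alpha list, and the variable names are hypothetical choices rather than the ones used in the original post.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): a noisy cubic
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=(150, 1))
y = x[:, 0] ** 3 - 2 * x[:, 0] + rng.normal(0, 1, size=150)

x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# Polynomial features, fitted on the training data only
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train)
x_val_pr = pr.transform(x_val)

alphas = [0.001, 0.01, 0.1, 1, 10]  # candidate values (illustrative)
r2_scores, mse_scores = [], []
for alpha in alphas:
    rr = Ridge(alpha=alpha)
    rr.fit(x_train_pr, y_train)
    r2_scores.append(rr.score(x_val_pr, y_val))  # R^2 on held-out data
    mse_scores.append(mean_squared_error(y_val, rr.predict(x_val_pr)))

# Pick the alpha with the highest held-out R^2
best_alpha = alphas[int(np.argmax(r2_scores))]
print(best_alpha)
```

    When both metrics are computed on the same held-out set, the highest R^2 and the lowest MSE point to the same alpha, because R^2 is a decreasing function of the squared error.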

     

    • Grid search: allows us to scan through multiple free parameters with a few lines of code

     

    • Three different sets in grid search: training (fit the model), validation (choose the hyperparameters), test (estimate the final performance)
    • The code is as follows:
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    # The dictionary of parameter values to search
    parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000]}]

    # Create a ridge regression object
    RR = Ridge()

    # Create a GridSearchCV object (4-fold cross validation)
    Grid1 = GridSearchCV(RR, parameters1, cv=4)

    # Fit the object (x_data and y_data are assumed to be the feature DataFrame and the target)
    Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)

    # Find the best values for the free parameters
    Grid1.best_estimator_

    # Get the mean score on the validation folds
    scores = Grid1.cv_results_
    scores['mean_test_score']
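
    To show how the three sets fit together, here is a self-contained sketch on synthetic data (an assumption, since the original data set is not included here). The test set is held out first; GridSearchCV then creates the training and validation folds internally from the remaining rows. The variable names and the shorter alpha list are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): 4 features, linear target plus noise
rng = np.random.default_rng(3)
x_data = rng.normal(size=(200, 4))
y_data = x_data @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(0, 0.5, size=200)

# Test set: held out and never seen during the search
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=0
)

# Training/validation: produced by the 4-fold split inside GridSearchCV
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100]}]
grid = GridSearchCV(Ridge(), parameters, cv=4)
grid.fit(x_train, y_train)

best_alpha = grid.best_params_['alpha']
mean_val_scores = grid.cv_results_['mean_test_score']  # one mean R^2 per alpha

# By default the best model is refit on all training rows; score it on the test set
test_r2 = grid.score(x_test, y_test)
print(best_alpha, test_r2)
```

    Note that the 'mean_test_score' key in cv_results_ refers to the validation folds, not to the held-out test set.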

     

     
