  • Data Analysis with Python - Model Development
    Data Science 2022. 8. 13. 18:08
    • A model can be thought of as a mathematical equation used to predict a value given one or more other values
    • More relevant data → more accurate model
    • Three types of linear regression:
    1. Simple linear regression
    2. Multiple linear regression
    3. Polynomial regression

     

    • Simple linear regression: a method for understanding the relationship between two variables, one predictor (independent) variable and one target (dependent) variable
    • Multiple linear regression: a method for understanding the relationship between one target variable and two or more predictor variables
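The difference between the two can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the car dataset used in these notes; the variable names (`x1`, `x2`, `slr`, `mlr`) and the generating equation are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical, for illustration only): y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x1 - 1.0 * x2

# Simple linear regression: one predictor (must be passed as a 2-D array)
slr = LinearRegression().fit(x1.reshape(-1, 1), y)

# Multiple linear regression: two predictors, one column each
X_multi = np.column_stack([x1, x2])
mlr = LinearRegression().fit(X_multi, y)

print("SLR R^2:", slr.score(x1.reshape(-1, 1), y))
print("MLR R^2:", mlr.score(X_multi, y))
print("MLR intercept:", mlr.intercept_, "coefficients:", mlr.coef_)
```

Because the target here depends on both predictors, the simple model can only explain part of the variation, while the multiple model recovers the generating coefficients.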

     

    Process: feed the training points into the model -> obtain the predicted values

     

     

     

    • Regression plot: shows the strength of the correlation and the direction of the relationship (positive/negative)
    • The code is as follows:
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.regplot(x='highway-mpg', y='price', data=df)
    plt.ylim(0, )  # start the y-axis at 0
    • Code to compute the correlation between the variables:
    df[['peak-rpm', 'highway-mpg', 'price']].corr()
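To see what `corr()` returns, here is a small sketch on a toy DataFrame. The numbers are invented for illustration and are not taken from the actual dataset; only the column names mirror the ones used in these notes.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for this example
toy = pd.DataFrame({
    "highway-mpg": [20, 25, 30, 35, 40],
    "price": [30000, 24000, 18000, 15000, 9000],
})

# DataFrame.corr() computes the pairwise Pearson correlation by default
corr = toy[["highway-mpg", "price"]].corr()
r = corr.loc["highway-mpg", "price"]

print(corr)
print("negative relationship:", r < 0)
```

A coefficient close to -1 indicates a strong negative relationship, close to +1 a strong positive one, and close to 0 little linear relationship.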

     

     

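    • Distribution plots: make it possible to judge the model more accurately than a simple plot, by comparing the distribution of the predicted values with that of the actual values
    • Polynomial regression: a special case of the general linear regression model
    • Curvilinear relationship: obtained by squaring the predictor variable or adding higher-order terms of it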
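A polynomial fit can be sketched with NumPy's `polyfit` and `poly1d`. The data below is hypothetical and follows an exact quadratic, so the fit should recover the generating coefficients up to floating-point error.

```python
import numpy as np

# Hypothetical data following an exact quadratic: y = 1 + 2x + 3x^2
x = np.linspace(-3, 3, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Fit a 2nd-order polynomial; coefficients are returned highest power first
coeffs = np.polyfit(x, y, 2)

# Wrap the coefficients in a callable polynomial
p = np.poly1d(coeffs)

print(coeffs)
print(p(2.0))
```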

     

    • The Pipeline library can be used to simplify the code
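As a rough sketch of how a pipeline shortens the workflow, the example below chains scaling, polynomial feature generation, and a linear model into one object. The step names (`"scale"`, `"polynomial"`, `"model"`) and the data are assumptions chosen for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data: a quadratic relationship with a single predictor
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = 5.0 - 2.0 * X.ravel() + 0.5 * X.ravel() ** 2

# Each step is a (name, estimator) pair; the last step is the model itself
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2)),
    ("model", LinearRegression()),
])

# One fit call runs every transform and then trains the model
pipe.fit(X, y)
y_hat = pipe.predict(X)

print(pipe.score(X, y))
```

Without the pipeline, each transform would need its own `fit_transform` call before training and its own `transform` call before predicting.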

     

     

    • Two important measures to determine the fit of a model:
    1. Mean Squared Error(MSE)

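    The average of the squared differences between the regression line and the actual values (the mean of the red squares in the figure)

    The Python code is as follows: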

    from sklearn.metrics import mean_squared_error
    mean_squared_error(df['price'], y_predict_simple_fit)
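To make the definition concrete, the sketch below computes the MSE by hand and checks it against `mean_squared_error`. The arrays are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Residuals are 1, 0, -1, -2, so the squared residuals are 1, 0, 1, 4
mse_manual = np.mean((y_true - y_pred) ** 2)

# The library function should give the same number
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```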

     

     

    2. R-squared(R^2)

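    Usually takes a value between 0 and 1. The formula is:

    R^2 = 1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2)

    Put simply:

    R^2 = 1 - (MSE of the regression line) / (MSE of the mean of the data)

    In Python: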

    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    X = df[['highway-mpg']]
    Y = df['price']
    lm.fit(X, Y)
    lm.score(X, Y)  # returns R^2

     

    R^2: 0.xxxx. If the value is negative, it can be a sign of overfitting
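The sketch below computes R^2 by hand from the residual and total sums of squares and shows how a negative value can arise. The helper `r_squared` and the arrays are hypothetical; the second prediction is simply a poor model that does worse than predicting the mean, used here only to demonstrate the sign.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
bad_pred = np.array([5.0, 4.0, 3.0, 2.0, 1.0])  # worse than predicting the mean

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    ss_res = np.sum((y - y_hat) ** 2)        # error of the model
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

r2_good = r_squared(y_true, good_pred)
r2_bad = r_squared(y_true, bad_pred)

print(r2_good, r2_score(y_true, good_pred))
print(r2_bad, r2_score(y_true, bad_pred))
```

R^2 drops below 0 whenever the model's squared error exceeds that of the mean, which is why the 0 to 1 range holds only for models that do at least as well as the mean.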

     

    • Check that the predicted values make sense

     

    # First we train the model (X must be 2-D, so select the column with double brackets)

    lm.fit(df[['highway-mpg']], df['price'])

    # Let's predict the price of a car with 30 highway-mpg

    lm.predict(np.array(30.0).reshape(-1,1))

    # Result: $13771.30

    lm.coef_

    # Result: -821.73337832 (coefficient for highway-mpg)

     

     

    • How to generate a sequence of values in a specified range

     

    # Import numpy

    import numpy as np

     

    # We use the numpy function arange to generate a sequence from 1 to 100

    new_input = np.arange(1,101,1).reshape(-1,1) # np.arange(first point of the sequence, endpoint + 1, step size)
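A generated sequence like this is typically passed to `predict` to check whether the outputs stay in a sensible range. The sketch below uses a made-up linear relationship (`y = 100 - 2x`) rather than the fitted car-price model, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data following y = 100 - 2x
X_train = np.array([[10.0], [20.0], [30.0], [40.0]])
y_train = 100.0 - 2.0 * X_train.ravel()
model = LinearRegression().fit(X_train, y_train)

# np.arange(start, stop, step) excludes stop, so this yields 1..100 as a column
new_input = np.arange(1, 101, 1).reshape(-1, 1)
yhat = model.predict(new_input)

print(new_input.shape, new_input[0, 0], new_input[-1, 0])
print(yhat[:3])

# Count predictions that fall below zero, which would not make sense for a price
n_negative = int((yhat < 0).sum())
print("negative predictions:", n_negative)
```

Inputs far outside the training range can produce values such as negative prices, which is a hint that the linear model should not be trusted there.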

     

     

     

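    • A lower mean squared error alone does not necessarily imply a better fit (the MSE on the training data tends to decrease as more variables or higher-order terms are added)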

     

    • When comparing models, the model with the higher R-squared value is a better fit for the data
    • When comparing models, the model with the smallest MSE value is a better fit for the data
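The comparison can be sketched by fitting two models to the same data and printing both measures. The data is hypothetical (an exact quadratic), chosen so that the polynomial model is clearly the better fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data with a quadratic relationship
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x**2

# Model A: straight line
lin = LinearRegression().fit(x.reshape(-1, 1), y)
y_lin = lin.predict(x.reshape(-1, 1))

# Model B: 2nd-order polynomial
y_poly = np.poly1d(np.polyfit(x, y, 2))(x)

# Lower MSE and higher R^2 both point to the polynomial model here
for name, pred in [("linear", y_lin), ("polynomial", y_poly)]:
    print(name, "MSE:", mean_squared_error(y, pred), "R^2:", r2_score(y, pred))
```

Both measures here are computed on the training data, so they say nothing yet about how either model would perform on unseen data.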