Data Science

Data Analysis with Python - Model Development

Hiru_93 2022. 8. 13. 18:08
  • A model can be thought of as a mathematical equation used to predict a value given one or more other values
  • Generally, the more relevant data the model is given, the more accurate it is
  • 3 types of linear regression:
  1. Simple linear regression
  2. Multiple linear regression
  3. Polynomial regression

 

  • Simple linear regression: a method for understanding the relationship between two variables, one predictor (independent) variable and one target (dependent) variable
  • Multiple linear regression: a method for understanding the relationship between one target variable and two or more predictor variables

 

Process: feed the training points into the model -> obtain the predicted values
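
The process above can be sketched with scikit-learn. This is a minimal illustration on synthetic, noise-free data (the arrays and the names `slr`, `mlr`, and `y_hat` are assumptions for this example, not the car dataset used in these notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training points (illustrative only): y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X_multi = rng.uniform(0, 10, size=(50, 2))   # two predictor variables
y = 3.0 + 2.0 * X_multi[:, 0] - 1.0 * X_multi[:, 1]

# Simple linear regression: a single predictor (kept 2D for scikit-learn)
X_single = X_multi[:, [0]]
slr = LinearRegression().fit(X_single, y)

# Multiple linear regression: both predictors
mlr = LinearRegression().fit(X_multi, y)

# Feed a new point into the fitted model to obtain a prediction
y_hat = mlr.predict(np.array([[1.0, 2.0]]))
print(mlr.intercept_, mlr.coef_, y_hat)
```

Because the data is noise-free, the multiple regression should recover the intercept and both coefficients; the simple regression only sees one predictor, so its fit is worse.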

 

 

 

  • Regression plot: shows the strength of the correlation and the direction of the relationship (positive/negative)
  • The code is as follows:
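import matplotlib.pyplot as plt
import seaborn as sns
# Scatter plot of price against highway-mpg with a fitted regression line
sns.regplot(x='highway-mpg', y='price', data=df)
plt.ylim(0,)  # start the y-axis at 0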
  • Code to compute the correlation between variables:
df[['peak-rpm', 'highway-mpg', 'price']].corr()
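
To see what `.corr()` returns, here is a small sketch on a toy DataFrame (the column names and values are hypothetical, chosen so the correlations are easy to verify by eye):

```python
import pandas as pd

# Toy data (hypothetical values): y rises with x, z falls with x
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],  # y = 2x      -> perfect positive correlation
    "z": [9.0, 7.0, 5.0, 3.0, 1.0],   # z = 11 - 2x -> perfect negative correlation
})

corr = toy.corr()  # Pearson correlation by default
print(corr)
```

The sign of each entry gives the direction of the relationship and its magnitude (between 0 and 1) gives the strength.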

 

 

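  • Distribution plots: let us judge the model more accurately than a simple plot, by comparing the distribution of the predicted values with that of the actual values
  • Polynomial regression: a special case of the general linear regression model
  • Curvilinear relationship: what polynomial regression describes, obtained by squaring the predictor variable or adding higher-order terms of it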
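
A minimal sketch of polynomial regression with NumPy's `polyfit` and `poly1d`, assuming a noise-free quadratic as example data (the data and the names `coeffs`, `p`, and `y_at_2` are made up for illustration):

```python
import numpy as np

# Example data: a curvilinear (quadratic) relationship, y = 1 - 2x + 0.5x^2
x = np.linspace(-3, 3, 25)
y = 1.0 - 2.0 * x + 0.5 * x**2

# Fit a 2nd-order polynomial; coefficients come back highest power first
coeffs = np.polyfit(x, y, 2)
p = np.poly1d(coeffs)  # callable polynomial built from the coefficients

y_at_2 = p(2.0)        # predicted value at x = 2
print(p)
```

The model is still linear in its coefficients, which is why polynomial regression counts as a special case of linear regression.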

 

  • The code can be simplified by using the Pipeline class from scikit-learn
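
A sketch of such a pipeline, chaining scaling, polynomial feature expansion, and linear regression into one estimator. The step names (`"scale"`, `"poly"`, `"model"`), the degree, and the synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic data (illustrative only): y = 4 + 3x - 0.25x^2
x = np.linspace(0, 10, 40).reshape(-1, 1)
y = 4.0 + 3.0 * x.ravel() - 0.25 * x.ravel() ** 2

# Each step is a (name, estimator) pair; the last step is the model
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])

pipe.fit(x, y)                            # runs every transform, then fits the model
y_hat = pipe.predict(np.array([[2.0]]))   # same transforms are applied before predicting
print(y_hat)
```

One call to `fit` and one call to `predict` replace the separate transform and fit calls, which is where the simplification comes from.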

 

 

  • Two important measures to determine the fit of a model:
  1. Mean Squared Error(MSE)

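The mean of the squared differences between the actual values and the values predicted by the regression line (in the figure, the average of the red squares)

The Python code is as follows: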

from sklearn.metrics import mean_squared_error
mean_squared_error(df['price'], y_predict_simple_fit)
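
The same quantity can be worked out by hand. A small sketch with made-up numbers (`y_true` and `y_pred` are hypothetical values, not from the car dataset):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # actual values
y_pred = np.array([2.0, 5.0, 9.0])   # values predicted by some model

residuals = y_true - y_pred          # [1, 0, -2]
mse = np.mean(residuals ** 2)        # (1 + 0 + 4) / 3
print(mse)
```

Squaring keeps positive and negative errors from cancelling and penalizes large errors more heavily.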

 

 

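  2. R-squared (R^2)

Usually takes a value between 0 and 1, and is defined as:

R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals of the model and SS_tot is the sum of squared deviations of the actual values from their mean

Put simply:

R^2 = 1 - (MSE of the regression line) / (MSE of the mean of the data)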

 

 

In Python:

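from sklearn.linear_model import LinearRegression

lm = LinearRegression()
X = df[['highway-mpg']]  # predictor, kept as a 2D DataFrame
Y = df['price']          # target
lm.fit(X, Y)
lm.score(X, Y)           # R^2 on the data passed in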

 

R^2: 0.xxxx. A negative value means the model fits worse than simply predicting the mean of the data; this can be a sign of overfitting.
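
To make the range of R^2 concrete, here is a hand-rolled sketch using the standard definition (one minus the residual sum of squares over the total sum of squares). The helper `r_squared` and the sample arrays are hypothetical, written only for this example:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # error of the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                         # predictions match exactly
baseline = r_squared(y_true, np.full(4, y_true.mean()))     # always predict the mean
worse = r_squared(y_true, np.array([4.0, 3.0, 2.0, 1.0]))   # predictions in reversed order
print(perfect, baseline, worse)
```

A perfect fit gives 1, predicting the mean gives 0, and predictions that are worse than the mean push the value below 0.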

 

  • Check that the predicted values make sense

 

# First we train the model

lm.fit(df[['highway-mpg']], df['price'])

 

# Let's predict the price of a car with 30 highway-mpg

lm.predict(np.array(30.0).reshape(-1,1))

 

# Result: $13771.30

lm.coef_

# Result: -821.73337832 (coefficient for highway-mpg)

 

 

  • How to generate a sequence of values in a specified range?

 

# Import numpy

import numpy as np

 

# We use the numpy function arange to generate a sequence from 1 to 100

new_input = np.arange(1, 101, 1).reshape(-1, 1)  # np.arange(first point of the sequence, endpoint + 1 (the stop value is excluded), step size)

 

 

 

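  • A lower mean squared error does not necessarily imply a better fit: the MSE on the training data tends to decrease as more variables or higher-order polynomial terms are added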

 

  • When comparing models, the one with the higher R-squared value is a better fit for the data
  • When comparing models, the one with the smallest MSE value is a better fit for the data
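
A small sketch of why these comparisons need care when they are made on the training data. The data here is synthetic and truly linear with noise; the helper `train_mse`, the seed, and the degrees are assumptions for illustration:

```python
import numpy as np

# Synthetic data: a linear relationship plus random noise
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)

def train_mse(degree):
    """Fit a polynomial of the given degree and return its MSE on the same data."""
    fitted = np.poly1d(np.polyfit(x, y, degree))
    return np.mean((y - fitted(x)) ** 2)

mse_linear = train_mse(1)
mse_cubic = train_mse(3)
print(mse_linear, mse_cubic)
```

The cubic fit can never have a larger training MSE than the linear one, even though the underlying relationship is linear, so a smaller MSE on the training data alone does not show that the more complex model is better.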