Data Analysis with Python - Exploratory Data Analysis


Hiru_93 2022. 8. 13. 17:57

Descriptive statistics: summarize the sample and the measures of the data
How to summarize data with the pandas library:

df.describe()  # NaN values are automatically skipped in these statistics
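As a minimal sketch of what describe() reports, the snippet below builds a tiny DataFrame with one missing value per column. The column names follow the car dataset used later in these notes, but the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
df = pd.DataFrame({
    "horsepower": [111.0, 154.0, np.nan, 102.0],
    "price": [13495.0, 16500.0, 13950.0, np.nan],
})

# count, mean, std, min, quartiles, and max for each numeric column.
# NaN entries are left out, so "count" is 3 for both columns here.
summary = df.describe()
print(summary)
```

The "count" row is a quick way to see how many non-missing values each column has.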

To return counts of unique values:

value_counts()

To put the result into a table (DataFrame):

to_frame()
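The two calls can be chained, roughly as follows. The data is made up, and the column label "value_counts" is just a name chosen for this sketch.

```python
import pandas as pd

# Hypothetical categorical column; the values are invented for illustration.
df = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"]})

# value_counts() returns a Series: one row per unique value, sorted by count.
counts = df["drive-wheels"].value_counts()

# to_frame() turns that Series into a one-column DataFrame.
counts_df = counts.to_frame(name="value_counts")
counts_df.index.name = "drive-wheels"
print(counts_df)
```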

Box plot:

Scatter Plot: shows the relationship between two variables

Predictor/independent variable on x axis
Target variable on y axis

Grouping Data: Does the type of drive system affect price? If so, which drive system contributes most to a car's value?
Using the pandas library:

dataframe.groupby()
df_test = df[['drive-wheels', 'body-style', 'price']]
# Group by the two categorical columns only; price is the value being averaged
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
print(df_grp)
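To see what the grouping produces, here is a small self-contained sketch. The rows are made up; only the column names come from the notes above.

```python
import pandas as pd

# Hypothetical rows; the prices are invented for illustration.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "4wd"],
    "body-style": ["sedan", "sedan", "sedan", "hatchback", "sedan"],
    "price": [10000.0, 12000.0, 20000.0, 18000.0, 9000.0],
})

df_test = df[["drive-wheels", "body-style", "price"]]

# One output row per (drive-wheels, body-style) pair, holding the mean price.
# as_index=False keeps the group keys as ordinary columns.
df_grp = df_test.groupby(["drive-wheels", "body-style"], as_index=False).mean()
print(df_grp)
```

The two fwd sedans collapse into a single row with their average price, which makes it easy to compare drive systems within the same body style.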

Reshaping with pivot for visualization:

df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
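A rough sketch of the pivot step, starting from an already grouped table. The values are made up, and the grouped frame is typed in by hand so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical grouped result (mean price per drive-wheels / body-style pair).
df_grp = pd.DataFrame({
    "drive-wheels": ["4wd", "fwd", "rwd", "rwd"],
    "body-style": ["sedan", "sedan", "hatchback", "sedan"],
    "price": [9000.0, 11000.0, 18000.0, 20000.0],
})

# Rows become drive-wheels, columns become body-style.
# The remaining column (price) fills the cells, under a two-level column header.
df_pivot = df_grp.pivot(index="drive-wheels", columns="body-style")
print(df_pivot)
```

Combinations that never occur in the data (for example a fwd hatchback here) show up as NaN in the pivoted table.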

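Drawing a scatter plot:

import seaborn as sns
import matplotlib.pyplot as plt
# regplot draws the scatter points plus a fitted regression line
sns.regplot(x='engine-size', y='price', data=df)
plt.ylim(0, )  # start the y axis at 0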

Visualizing with a heatmap:
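One possible way to draw such a heatmap is with matplotlib's pcolormesh on a pivoted table. This is only a sketch: the pivot values are made up, and the choice of matplotlib and the "RdBu" colormap is an assumption rather than something stated in these notes.

```python
import matplotlib
# Assumption: a non-interactive backend so the sketch runs without a display.
# Drop this line when working in a notebook.
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical pivot table of mean prices; the values are invented.
df_pivot = pd.DataFrame(
    [[9000.0, 9500.0], [8000.0, 11000.0], [18000.0, 20000.0]],
    index=pd.Index(["4wd", "fwd", "rwd"], name="drive-wheels"),
    columns=pd.Index(["hatchback", "sedan"], name="body-style"),
)

fig, ax = plt.subplots()

# Each cell of the grid is colored by the mean price it holds.
mesh = ax.pcolormesh(df_pivot.values, cmap="RdBu")
fig.colorbar(mesh, ax=ax, label="mean price")

# Put the tick labels at the center of each cell.
ax.set_xticks([i + 0.5 for i in range(df_pivot.shape[1])])
ax.set_xticklabels(df_pivot.columns)
ax.set_yticks([i + 0.5 for i in range(df_pivot.shape[0])])
ax.set_yticklabels(df_pivot.index)
ax.set_xlabel("body-style")
ax.set_ylabel("drive-wheels")
```

In a notebook the figure is displayed automatically; in a script, fig.savefig or plt.show would be needed to see it.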

Correlation: examples of correlated pairs are lung cancer and smoking, or rain and umbrellas.
Correlation doesn't imply causation. An umbrella doesn't cause rain, and smoking is not the sole cause of lung cancer.
Positive linear relationship: positive correlation; the larger the engine size, the higher the price
Negative linear relationship: negative correlation; the higher the fuel economy, the lower the price
Weak linear relationship: little or no correlation; peak rpm and price show no clear relationship
Pearson Correlation: measures how strongly two variables are correlated, and provides both the correlation coefficient and the P-value
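The positive and negative relationships described above can be checked numerically with DataFrame.corr(), which computes Pearson coefficients by default. The numbers below are made up, and "highway-mpg" is a column name assumed here to stand in for fuel economy.

```python
import pandas as pd

# Hypothetical cars; all values are invented for illustration.
df = pd.DataFrame({
    "engine-size": [90.0, 110.0, 130.0, 150.0, 180.0],
    "highway-mpg": [40.0, 36.0, 30.0, 27.0, 22.0],
    "price": [7000.0, 9000.0, 12000.0, 15000.0, 20000.0],
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr)

# Close to +1: price rises with engine size (positive linear relationship).
print(corr.loc["engine-size", "price"])
# Close to -1: price falls as fuel economy rises (negative linear relationship).
print(corr.loc["highway-mpg", "price"])
```

A coefficient near 0 would correspond to the weak relationship case, such as peak rpm versus price.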

How to compute the Pearson correlation:

from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
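A self-contained sketch of the same call, using made-up horsepower and price values. Note that pearsonr does not skip missing values, so rows with NaN would need to be dropped first (for example with dropna).

```python
import pandas as pd
from scipy import stats

# Hypothetical data; the values are invented for illustration.
df = pd.DataFrame({
    "horsepower": [60.0, 80.0, 100.0, 120.0, 150.0, 200.0],
    "price": [6000.0, 8500.0, 10500.0, 13000.0, 17000.0, 24000.0],
})

# Returns the correlation coefficient (between -1 and +1) and the p-value,
# which indicates how likely such a correlation would be by chance alone.
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
print("Pearson coefficient:", pearson_coef)
print("P-value:", p_value)
```

A coefficient close to +1 together with a very small p-value suggests a strong positive linear relationship that is unlikely to be a coincidence of the sample.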

Difference between correlation and causation:
Correlation: a measure of the extent of interdependence between variables
Causation: the relationship between cause and effect between two variables

License: Attribution, No Derivatives