3.6 Lab: Linear Regression
LinearRegression
1. Importing packages
- All imports go at the top of the notebook so it is clear which libraries are used.
New imports
In [ ]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
In [ ]:
import statsmodels.api as sm
In [ ]:
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm
In [ ]:
pip install ISLP
Collecting ISLP
Downloading ISLP-0.3.18-py3-none-any.whl (3.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 10.5 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.7.1 in /usr/local/lib/python3.10/dist-packages (from ISLP) (1.23.5)
Requirement already satisfied: scipy>=0.9 in /usr/local/lib/python3.10/dist-packages (from ISLP) (1.10.1)
Requirement already satisfied: matplotlib>=3.3.3 in /usr/local/lib/python3.10/dist-packages (from ISLP) (3.7.1)
Requirement already satisfied: pandas>=0.20 in /usr/local/lib/python3.10/dist-packages (from ISLP) (1.5.3)
Requirement already satisfied: statsmodels>=0.13 in /usr/local/lib/python3.10/dist-packages (from ISLP) (0.14.0)
Requirement already satisfied: scikit-learn>=1.2 in /usr/local/lib/python3.10/dist-packages (from ISLP) (1.2.2)
Collecting jupyter>=0.0 (from ISLP)
Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Requirement already satisfied: lxml>=0.0 in /usr/local/lib/python3.10/dist-packages (from ISLP) (4.9.3)
Requirement already satisfied: joblib>=0.0 in /usr/local/lib/python3.10/dist-packages (from ISLP) (1.3.2)
Collecting pygam>=0.0 (from ISLP)
Downloading pygam-0.9.0-py3-none-any.whl (522 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 522.2/522.2 kB 16.4 MB/s eta 0:00:00
Collecting lifelines>=0.0 (from ISLP)
Downloading lifelines-0.27.7-py3-none-any.whl (409 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 409.4/409.4 kB 13.3 MB/s eta 0:00:00
Requirement already satisfied: notebook in /usr/local/lib/python3.10/dist-packages (from jupyter>=0.0->ISLP) (6.4.8)
Collecting qtconsole (from jupyter>=0.0->ISLP)
Downloading qtconsole-5.4.3-py3-none-any.whl (121 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.9/121.9 kB 11.5 MB/s eta 0:00:00
Requirement already satisfied: jupyter-console in /usr/local/lib/python3.10/dist-packages (from jupyter>=0.0->ISLP) (6.1.0)
Requirement already satisfied: nbconvert in /usr/local/lib/python3.10/dist-packages (from jupyter>=0.0->ISLP) (6.5.4)
Requirement already satisfied: ipykernel in /usr/local/lib/python3.10/dist-packages (from jupyter>=0.0->ISLP) (5.5.6)
Requirement already satisfied: ipywidgets in /usr/local/lib/python3.10/dist-packages (from jupyter>=0.0->ISLP) (7.7.1)
Requirement already satisfied: autograd>=1.5 in /usr/local/lib/python3.10/dist-packages (from lifelines>=0.0->ISLP) (1.6.2)
Collecting autograd-gamma>=0.3 (from lifelines>=0.0->ISLP)
Downloading autograd-gamma-0.5.0.tar.gz (4.0 kB)
Preparing metadata (setup.py) ... done
Collecting formulaic>=0.2.2 (from lifelines>=0.0->ISLP)
Downloading formulaic-0.6.4-py3-none-any.whl (88 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.9/88.9 kB 7.6 MB/s eta 0:00:00
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=3.3.3->ISLP) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=3.3.3->ISLP) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=3.3.3->ISLP) (4.42.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=3.3.3->ISLP) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=3.3.3->ISLP) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=3.3.3->ISLP) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=3.3.3->ISLP) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=3.3.3->ISLP) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.20->ISLP) (2023.3)
Collecting numpy>=1.7.1 (from ISLP)
Downloading numpy-1.25.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 45.1 MB/s eta 0:00:00
Requirement already satisfied: progressbar2<5.0.0,>=4.2.0 in /usr/local/lib/python3.10/dist-packages (from pygam>=0.0->ISLP) (4.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.2->ISLP) (3.2.0)
Requirement already satisfied: patsy>=0.5.2 in /usr/local/lib/python3.10/dist-packages (from statsmodels>=0.13->ISLP) (0.5.3)
Requirement already satisfied: future>=0.15.2 in /usr/local/lib/python3.10/dist-packages (from autograd>=1.5->lifelines>=0.0->ISLP) (0.18.3)
Collecting astor>=0.8 (from formulaic>=0.2.2->lifelines>=0.0->ISLP)
Downloading astor-0.8.1-py2.py3-none-any.whl (27 kB)
Collecting interface-meta>=1.2.0 (from formulaic>=0.2.2->lifelines>=0.0->ISLP)
Downloading interface_meta-1.3.0-py3-none-any.whl (14 kB)
Requirement already satisfied: typing-extensions>=4.2.0 in /usr/local/lib/python3.10/dist-packages (from formulaic>=0.2.2->lifelines>=0.0->ISLP) (4.7.1)
Requirement already satisfied: wrapt>=1.0 in /usr/local/lib/python3.10/dist-packages (from formulaic>=0.2.2->lifelines>=0.0->ISLP) (1.14.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.2->statsmodels>=0.13->ISLP) (1.16.0)
Requirement already satisfied: python-utils>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from progressbar2<5.0.0,>=4.2.0->pygam>=0.0->ISLP) (3.7.0)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter>=0.0->ISLP) (0.2.0)
Requirement already satisfied: ipython>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter>=0.0->ISLP) (7.34.0)
Requirement already satisfied: traitlets>=4.1.0 in /usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter>=0.0->ISLP) (5.7.1)
Requirement already satisfied: jupyter-client in /usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter>=0.0->ISLP) (6.1.12)
Requirement already satisfied: tornado>=4.2 in /usr/local/lib/python3.10/dist-packages (from ipykernel->jupyter>=0.0->ISLP) (6.3.1)
Requirement already satisfied: widgetsnbextension~=3.6.0 in /usr/local/lib/python3.10/dist-packages (from ipywidgets->jupyter>=0.0->ISLP) (3.6.5)
Requirement already satisfied: jupyterlab-widgets>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ipywidgets->jupyter>=0.0->ISLP) (3.0.8)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from jupyter-console->jupyter>=0.0->ISLP) (3.0.39)
Requirement already satisfied: pygments in /usr/local/lib/python3.10/dist-packages (from jupyter-console->jupyter>=0.0->ISLP) (2.16.1)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (4.11.2)
Requirement already satisfied: bleach in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (6.0.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (0.7.1)
Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (0.4)
Requirement already satisfied: jinja2>=3.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (3.1.2)
Requirement already satisfied: jupyter-core>=4.7 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (5.3.1)
Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (0.2.2)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (2.1.3)
Requirement already satisfied: mistune<2,>=0.8.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (0.8.4)
Requirement already satisfied: nbclient>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (0.8.0)
Requirement already satisfied: nbformat>=5.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (5.9.2)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (1.5.0)
Requirement already satisfied: tinycss2 in /usr/local/lib/python3.10/dist-packages (from nbconvert->jupyter>=0.0->ISLP) (1.2.1)
Requirement already satisfied: pyzmq>=17 in /usr/local/lib/python3.10/dist-packages (from notebook->jupyter>=0.0->ISLP) (23.2.1)
Requirement already satisfied: argon2-cffi in /usr/local/lib/python3.10/dist-packages (from notebook->jupyter>=0.0->ISLP) (21.3.0)
Requirement already satisfied: nest-asyncio>=1.5 in /usr/local/lib/python3.10/dist-packages (from notebook->jupyter>=0.0->ISLP) (1.5.7)
Requirement already satisfied: Send2Trash>=1.8.0 in /usr/local/lib/python3.10/dist-packages (from notebook->jupyter>=0.0->ISLP) (1.8.2)
Requirement already satisfied: terminado>=0.8.3 in /usr/local/lib/python3.10/dist-packages (from notebook->jupyter>=0.0->ISLP) (0.17.1)
Requirement already satisfied: prometheus-client in /usr/local/lib/python3.10/dist-packages (from notebook->jupyter>=0.0->ISLP) (0.17.1)
Collecting qtpy>=2.0.1 (from qtconsole->jupyter>=0.0->ISLP)
Downloading QtPy-2.3.1-py3-none-any.whl (84 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.9/84.9 kB 7.7 MB/s eta 0:00:00
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.10/dist-packages (from ipython>=5.0.0->ipykernel->jupyter>=0.0->ISLP) (67.7.2)
Collecting jedi>=0.16 (from ipython>=5.0.0->ipykernel->jupyter>=0.0->ISLP)
Downloading jedi-0.19.0-py2.py3-none-any.whl (1.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 51.6 MB/s eta 0:00:00
Requirement already satisfied: decorator in /usr/local/lib/python3.10/dist-packages (from ipython>=5.0.0->ipykernel->jupyter>=0.0->ISLP) (4.4.2)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.10/dist-packages (from ipython>=5.0.0->ipykernel->jupyter>=0.0->ISLP) (0.7.5)
Requirement already satisfied: backcall in /usr/local/lib/python3.10/dist-packages (from ipython>=5.0.0->ipykernel->jupyter>=0.0->ISLP) (0.2.0)
Requirement already satisfied: matplotlib-inline in /usr/local/lib/python3.10/dist-packages (from ipython>=5.0.0->ipykernel->jupyter>=0.0->ISLP) (0.1.6)
Requirement already satisfied: pexpect>4.3 in /usr/local/lib/python3.10/dist-packages (from ipython>=5.0.0->ipykernel->jupyter>=0.0->ISLP) (4.8.0)
Requirement already satisfied: platformdirs>=2.5 in /usr/local/lib/python3.10/dist-packages (from jupyter-core>=4.7->nbconvert->jupyter>=0.0->ISLP) (3.10.0)
Requirement already satisfied: fastjsonschema in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.1->nbconvert->jupyter>=0.0->ISLP) (2.18.0)
Requirement already satisfied: jsonschema>=2.6 in /usr/local/lib/python3.10/dist-packages (from nbformat>=5.1->nbconvert->jupyter>=0.0->ISLP) (4.19.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.10/dist-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->jupyter-console->jupyter>=0.0->ISLP) (0.2.6)
Requirement already satisfied: ptyprocess in /usr/local/lib/python3.10/dist-packages (from terminado>=0.8.3->notebook->jupyter>=0.0->ISLP) (0.7.0)
Requirement already satisfied: argon2-cffi-bindings in /usr/local/lib/python3.10/dist-packages (from argon2-cffi->notebook->jupyter>=0.0->ISLP) (21.2.0)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->nbconvert->jupyter>=0.0->ISLP) (2.4.1)
Requirement already satisfied: webencodings in /usr/local/lib/python3.10/dist-packages (from bleach->nbconvert->jupyter>=0.0->ISLP) (0.5.1)
Requirement already satisfied: parso<0.9.0,>=0.8.3 in /usr/local/lib/python3.10/dist-packages (from jedi>=0.16->ipython>=5.0.0->ipykernel->jupyter>=0.0->ISLP) (0.8.3)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert->jupyter>=0.0->ISLP) (23.1.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert->jupyter>=0.0->ISLP) (2023.7.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert->jupyter>=0.0->ISLP) (0.30.2)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema>=2.6->nbformat>=5.1->nbconvert->jupyter>=0.0->ISLP) (0.9.2)
Requirement already satisfied: cffi>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from argon2-cffi-bindings->argon2-cffi->notebook->jupyter>=0.0->ISLP) (1.15.1)
Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->notebook->jupyter>=0.0->ISLP) (2.21)
Building wheels for collected packages: autograd-gamma
Building wheel for autograd-gamma (setup.py) ... done
Created wheel for autograd-gamma: filename=autograd_gamma-0.5.0-py3-none-any.whl size=4030 sha256=3c10e5cb79026447ed015f925c4d39b389354a82c619e7d829c5145ee653c1a0
Stored in directory: /root/.cache/pip/wheels/25/cc/e0/ef2969164144c899fedb22b338f6703e2b9cf46eeebf254991
Successfully built autograd-gamma
Installing collected packages: qtpy, numpy, jedi, interface-meta, astor, pygam, formulaic, autograd-gamma, qtconsole, lifelines, jupyter, ISLP
Attempting uninstall: numpy
Found existing installation: numpy 1.23.5
Uninstalling numpy-1.23.5:
Successfully uninstalled numpy-1.23.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.25.2 which is incompatible.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.25.2 which is incompatible.
Successfully installed ISLP-0.3.18 astor-0.8.1 autograd-gamma-0.5.0 formulaic-0.6.4 interface-meta-1.3.0 jedi-0.19.0 jupyter-1.0.0 lifelines-0.27.7 numpy-1.25.2 pygam-0.9.0 qtconsole-5.4.3 qtpy-2.3.1
In [ ]:
from ISLP import load_data
from ISLP.models import (ModelSpec as MS, summarize, poly)
Inspecting Objects and Namespaces
In [ ]:
dir()
Out[ ]:
['In',
'MS',
'Out',
'VIF',
'_',
'__',
'___',
'__builtin__',
'__builtins__',
'__doc__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'_dh',
'_exit_code',
'_i',
'_i1',
'_i2',
'_i3',
'_i4',
'_i5',
'_i6',
'_ih',
'_ii',
'_iii',
'_oh',
'anova_lm',
'exit',
'get_ipython',
'load_data',
'np',
'pd',
'poly',
'quit',
'sm',
'subplots',
'summarize']
In [ ]:
A = np.array([3,5,11])
dir(A)
Out[ ]:
['T',
'__abs__',
'__add__',
'__and__',
'__array__',
'__array_finalize__',
'__array_function__',
'__array_interface__',
'__array_prepare__',
'__array_priority__',
'__array_struct__',
'__array_ufunc__',
'__array_wrap__',
'__bool__',
'__class__',
'__class_getitem__',
'__complex__',
'__contains__',
'__copy__',
'__deepcopy__',
'__delattr__',
'__delitem__',
'__dir__',
'__divmod__',
'__dlpack__',
'__dlpack_device__',
'__doc__',
'__eq__',
'__float__',
'__floordiv__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__gt__',
'__hash__',
'__iadd__',
'__iand__',
'__ifloordiv__',
'__ilshift__',
'__imatmul__',
'__imod__',
'__imul__',
'__index__',
'__init__',
'__init_subclass__',
'__int__',
'__invert__',
'__ior__',
'__ipow__',
'__irshift__',
'__isub__',
'__iter__',
'__itruediv__',
'__ixor__',
'__le__',
'__len__',
'__lshift__',
'__lt__',
'__matmul__',
'__mod__',
'__mul__',
'__ne__',
'__neg__',
'__new__',
'__or__',
'__pos__',
'__pow__',
'__radd__',
'__rand__',
'__rdivmod__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rfloordiv__',
'__rlshift__',
'__rmatmul__',
'__rmod__',
'__rmul__',
'__ror__',
'__rpow__',
'__rrshift__',
'__rshift__',
'__rsub__',
'__rtruediv__',
'__rxor__',
'__setattr__',
'__setitem__',
'__setstate__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__truediv__',
'__xor__',
'all',
'any',
'argmax',
'argmin',
'argpartition',
'argsort',
'astype',
'base',
'byteswap',
'choose',
'clip',
'compress',
'conj',
'conjugate',
'copy',
'ctypes',
'cumprod',
'cumsum',
'data',
'diagonal',
'dot',
'dtype',
'dump',
'dumps',
'fill',
'flags',
'flat',
'flatten',
'getfield',
'imag',
'item',
'itemset',
'itemsize',
'max',
'mean',
'min',
'nbytes',
'ndim',
'newbyteorder',
'nonzero',
'partition',
'prod',
'ptp',
'put',
'ravel',
'real',
'repeat',
'reshape',
'resize',
'round',
'searchsorted',
'setfield',
'setflags',
'shape',
'size',
'sort',
'squeeze',
'std',
'strides',
'sum',
'swapaxes',
'take',
'tobytes',
'tofile',
'tolist',
'tostring',
'trace',
'transpose',
'var',
'view']
In [ ]:
A.sum()
Out[ ]:
19
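2. Simple Linear Regression
- The Boston data set records medv (median house value) for 506 neighborhoods around Boston.
- We predict medv from the 12 other variables, such as rm (average number of rooms per house), age (proportion of owner-occupied units built before 1940), and lstat (percentage of households with low socioeconomic status).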
In [ ]:
Boston = load_data("Boston")
Boston.columns
Out[ ]:
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
'ptratio', 'lstat', 'medv'],
dtype='object')
- The X data frame holds the explanatory (independent) variables used in the linear regression.
- 'intercept': a column of ones representing the constant term. Including it lets the model estimate the y-intercept of the regression line.
- 'lstat': the values of the 'lstat' column from the Boston data frame, used here as the single predictor.
In [ ]:
X = pd.DataFrame({'intercept': np.ones(Boston.shape[0]), 'lstat': Boston['lstat']})
X[:4]
Out[ ]:
| | intercept | lstat |
|---|---|---|
| 0 | 1.0 | 4.98 |
| 1 | 1.0 | 9.14 |
| 2 | 1.0 | 4.03 |
| 3 | 1.0 | 2.94 |
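- sm.OLS(y, X): the statsmodels class that specifies a linear regression model. Creating it does not fit anything; calling model.fit() does the fitting and returns the results object.
- y is the response (dependent or target variable) and X is the matrix of predictors (independent variables or features).
- The fit uses ordinary least squares (OLS): it picks the coefficients that minimize the sum of squared residuals over the data points.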
In [ ]:
y = Boston['medv']
model = sm.OLS(y, X)
results = model.fit()
In [ ]:
summarize(results)
Out[ ]:
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 34.5538 | 0.563 | 61.415 | 0.0 |
| lstat | -0.9500 | 0.039 | -24.528 | 0.0 |
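What least squares does can be sketched without statsmodels. The example below is a minimal illustration on synthetic data (the data, seed, and coefficient values are assumptions, not the Boston data): it builds a design matrix with a column of ones, solves the least-squares problem with NumPy, and checks that the normal equations give the same answer.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston data).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 30, size=n)
true_intercept, true_slope = 34.0, -1.0
y = true_intercept + true_slope * x + rng.normal(scale=2.0, size=n)

# Design matrix with an explicit column of ones, like the 'intercept' column.
X = np.column_stack([np.ones(n), x])

# Least-squares coefficients: minimize the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The same solution from the normal equations (X'X) beta = X'y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)
```

The two estimates agree up to floating-point error, and both land close to the intercept and slope used to simulate the data.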
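Using Transformations: Fit and Transform
- Transformations of the variables are specified before the model is fit.
- Examples: interactions between variables, or expanding a single variable into a set of columns such as polynomial terms.
- A transform object exposes two methods, fit() and transform().
- ModelSpec() (abbreviated MS here) specifies the model and constructs the model matrix.
- MS(['lstat']): creates a model specification, which records which variables enter the model and how they are combined or transformed. Here it is a single column, lstat, plus an intercept.
- fit(): takes the original data and does any initial computation the specification needs. For a centering and scaling transform, for example, it would compute the means and standard deviations; for MS(['lstat']) there is little to compute beyond checking that lstat exists in the data.
- transform(): applies the fitted transformation to a data array and returns the model matrix.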
In [ ]:
design = MS(['lstat'])
design = design.fit(Boston)
X = design.transform(Boston)
X[:4]
Out[ ]:
| | intercept | lstat |
|---|---|---|
| 0 | 1.0 | 4.98 |
| 1 | 1.0 | 9.14 |
| 2 | 1.0 | 4.03 |
| 3 | 1.0 | 2.94 |
- fit_transform(): a method that fits the specification to the data and transforms it in a single step.
In [ ]:
design = MS(['lstat'])
X = design.fit_transform(Boston)
X[:4]
Out[ ]:
| | intercept | lstat |
|---|---|---|
| 0 | 1.0 | 4.98 |
| 1 | 1.0 | 9.14 |
| 2 | 1.0 | 4.03 |
| 3 | 1.0 | 2.94 |
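The fit/transform split is easiest to see on a transform that actually has something to learn. The class below is a hypothetical `Standardizer` written for illustration (it is not part of ISLP): fit() stores the column means and standard deviations, transform() applies them, and fit_transform() chains the two.

```python
import numpy as np

class Standardizer:
    """Hypothetical transform: centers and scales each column."""

    def fit(self, data):
        # Learn the statistics from the training data only.
        data = np.asarray(data, dtype=float)
        self.mean_ = data.mean(axis=0)
        self.std_ = data.std(axis=0, ddof=1)
        return self

    def transform(self, data):
        # Apply the stored statistics to any data with the same columns.
        return (np.asarray(data, dtype=float) - self.mean_) / self.std_

    def fit_transform(self, data):
        # Convenience method: fit, then transform the same data.
        return self.fit(data).transform(data)

# The first four lstat values shown in the tables above.
train = np.array([[4.98], [9.14], [4.03], [2.94]])
new = np.array([[5.0], [10.0], [15.0]])

scaler = Standardizer()
train_scaled = scaler.fit_transform(train)
new_scaled = scaler.transform(new)  # reuses the training mean and std
```

Keeping the fitted object around is what makes it possible to transform new data consistently, which is how `design.transform(new_df)` is used later in this lab.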
In [ ]:
results.summary()
Out[ ]:
| Dep. Variable: | medv | R-squared: | 0.544 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.543 |
| Method: | Least Squares | F-statistic: | 601.6 |
| Date: | Sun, 13 Aug 2023 | Prob (F-statistic): | 5.08e-88 |
| Time: | 05:19:24 | Log-Likelihood: | -1641.5 |
| No. Observations: | 506 | AIC: | 3287. |
| Df Residuals: | 504 | BIC: | 3295. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 34.5538 | 0.563 | 61.415 | 0.000 | 33.448 | 35.659 |
| lstat | -0.9500 | 0.039 | -24.528 | 0.000 | -1.026 | -0.874 |

| Omnibus: | 137.043 | Durbin-Watson: | 0.892 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 291.373 |
| Skew: | 1.453 | Prob(JB): | 5.36e-64 |
| Kurtosis: | 5.319 | Cond. No. | 29.7 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]:
results.params
Out[ ]:
intercept 34.553841
lstat -0.950049
dtype: float64
In [ ]:
new_df = pd.DataFrame({'lstat':[5, 10, 15]})
newX = design.transform(new_df)
newX
Out[ ]:
| | intercept | lstat |
|---|---|---|
| 0 | 1.0 | 5 |
| 1 | 1.0 | 10 |
| 2 | 1.0 | 15 |
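- get_prediction(): computes predictions and can produce confidence intervals and prediction intervals for medv at the given lstat values.
- The predicted_mean attribute of the returned object holds the point predictions, that is, the estimated mean response at each new lstat value.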
In [ ]:
new_predictions = results.get_prediction(newX);
new_predictions.predicted_mean
Out[ ]:
array([29.80359411, 25.05334734, 20.30310057])
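The predicted means are just the new model matrix multiplied by the coefficient vector. A minimal check, using the coefficients printed by `results.params` (rounded to six decimals, so the result agrees with the output above only to about four decimal places):

```python
import numpy as np

# Coefficients copied from results.params above (rounded to six decimals).
params = np.array([34.553841, -0.950049])

# Model matrix for lstat = 5, 10, 15: a column of ones plus the lstat column.
newX = np.column_stack([np.ones(3), [5, 10, 15]])

# Each prediction is intercept + slope * lstat.
predicted_mean = newX @ params
print(predicted_mean)
```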
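- conf_int(alpha=0.05): computes confidence intervals for the predicted mean. alpha is the significance level, so the confidence level is 1 - alpha; alpha=0.05 gives 95% intervals. Passing obs=True returns prediction intervals for a new observation instead, which are wider.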
In [ ]:
new_predictions.conf_int(alpha=0.05)
Out[ ]:
array([[29.00741194, 30.59977628],
[24.47413202, 25.63256267],
[19.73158815, 20.87461299]])
In [ ]:
new_predictions.conf_int(obs=True, alpha=0.05)
Out[ ]:
array([[17.56567478, 42.04151344],
[12.82762635, 37.27906833],
[ 8.0777421 , 32.52845905]])
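Why the second set of intervals is so much wider can be read off the standard textbook formulas for simple linear regression. These are given here as a reference sketch, not as a description of how statsmodels computes them internally. The symbols are assumptions introduced for this note: $x_0$ is the new lstat value, $\hat{y}_0$ the predicted mean, $\hat{\sigma}$ the residual standard error, $n$ the number of observations, and $t_{1-\alpha/2,\,n-2}$ the t quantile.

Confidence interval for the mean response at $x_0$:

$$
\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}
$$

Prediction interval for a single new observation at $x_0$:

$$
\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}
$$

The extra 1 under the square root accounts for the irreducible noise in an individual observation, which is why the prediction interval stays wide even when the mean is estimated precisely.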
Defining Functions
- ax is the axis object of an existing plot, b is the intercept, and m is the slope.
In [ ]:
def abline(ax, b, m):
    "Add a line with slope m and intercept b to ax"
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim)
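- Adding *args accepts any number of unnamed (positional) arguments, and **kwargs accepts any number of named (keyword) arguments; both are passed on to ax.plot().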
In [ ]:
def abline(ax, b, m, *args, **kwargs):
    "Add a line with slope m and intercept b to ax"
    xlim = ax.get_xlim()
    ylim = [m * xlim[0] + b, m * xlim[1] + b]
    ax.plot(xlim, ylim, *args, **kwargs)
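How the extra arguments travel through such a wrapper can be seen in isolation. Both functions below are hypothetical stand-ins written for this illustration: `show_arguments` plays the role of ax.plot() and simply reports what it received.

```python
def show_arguments(*args, **kwargs):
    """Stand-in for ax.plot(): return what it was called with."""
    return args, kwargs

def forward(b, m, *args, **kwargs):
    """Hypothetical wrapper: b and m are consumed here, the rest is passed on."""
    return show_arguments(*args, **kwargs)

positional, keyword = forward(34.55, -0.95, 'r--', linewidth=3)
print(positional)  # ('r--',)
print(keyword)     # {'linewidth': 3}
```

The named parameters b and m absorb the first two values; everything after them is collected into a tuple (`args`) and a dictionary (`kwargs`) and unpacked again in the inner call.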
- Plot the relationship between lstat and medv and add the fitted regression line.
- The 'r--' argument produces a red dashed line.
In [ ]:
ax = Boston.plot.scatter('lstat', 'medv')
abline(ax, results.params.iloc[0], results.params.iloc[1], 'r--', linewidth=3)
- Find the fitted values and the residuals.
- results.fittedvalues holds the fitted (predicted) values of the regression model.
- results.resid holds the residuals.
In [ ]:
ax = subplots(figsize=(8,8))[1]
ax.scatter(results.fittedvalues, results.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--')
Out[ ]:
<matplotlib.lines.Line2D at 0x78408d983250>
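- get_influence(): a method of the results object that computes various influence measures for the regression. Its hat_matrix_diag attribute holds the leverage statistics, and np.argmax() returns the index of the observation with the largest leverage.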
In [ ]:
infl = results.get_influence()
ax = subplots(figsize=(8,8))[1]
ax.scatter(np.arange(X.shape[0]), infl.hat_matrix_diag)
ax.set_xlabel('Index')
ax.set_ylabel('Leverage')
np.argmax(infl.hat_matrix_diag)
Out[ ]:
374
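Leverage values are the diagonal of the hat matrix H = X (X'X)^(-1) X'. The sketch below computes them directly with NumPy on synthetic data (the data, seed, and the planted extreme point are assumptions for illustration) and compares them with the closed form for a single predictor.

```python
import numpy as np

# Synthetic predictor with one point placed far from the rest.
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 30, size=n)
x[10] = 60.0
X = np.column_stack([np.ones(n), x])

# Diagonal of H = X (X'X)^{-1} X', computed row by row as x_i' A x_i.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

# Closed form for simple linear regression.
centered_sq = (x - x.mean()) ** 2
closed_form = 1 / n + centered_sq / centered_sq.sum()

most_extreme = int(np.argmax(leverage))
print(most_extreme, leverage.sum())
```

The leverages sum to the number of columns in X (two here), and the observation with the most unusual predictor value has the largest leverage.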
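3. Multiple Linear Regression
- Fit a multiple linear regression model by least squares.
- The Boston data set contains 12 predictors, so typing all of them out would be cumbersome. Building the list of terms from Boston.columns avoids that.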
In [ ]:
X = MS(['lstat', 'age']).fit_transform(Boston)
model1 = sm.OLS(y, X)
results1 = model1.fit()
summarize(results1)
Out[ ]:
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 33.2228 | 0.731 | 45.458 | 0.000 |
| lstat | -1.0321 | 0.048 | -21.416 | 0.000 |
| age | 0.0345 | 0.012 | 2.826 | 0.005 |
In [ ]:
terms = Boston.columns.drop('medv')
terms
Out[ ]:
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
'ptratio', 'lstat'],
dtype='object')
In [ ]:
X = MS(terms).fit_transform(Boston)
model = sm.OLS(y, X)
results = model.fit()
summarize(results)
Out[ ]:
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 41.6173 | 4.936 | 8.431 | 0.000 |
| crim | -0.1214 | 0.033 | -3.678 | 0.000 |
| zn | 0.0470 | 0.014 | 3.384 | 0.001 |
| indus | 0.0135 | 0.062 | 0.217 | 0.829 |
| chas | 2.8400 | 0.870 | 3.264 | 0.001 |
| nox | -18.7580 | 3.851 | -4.870 | 0.000 |
| rm | 3.6581 | 0.420 | 8.705 | 0.000 |
| age | 0.0036 | 0.013 | 0.271 | 0.787 |
| dis | -1.4908 | 0.202 | -7.394 | 0.000 |
| rad | 0.2894 | 0.067 | 4.325 | 0.000 |
| tax | -0.0127 | 0.004 | -3.337 | 0.001 |
| ptratio | -0.9375 | 0.132 | -7.091 | 0.000 |
| lstat | -0.5520 | 0.051 | -10.897 | 0.000 |
- age has a high p-value, so we rerun the regression without this predictor.
In [ ]:
minus_age = Boston.columns.drop(['medv', 'age'])
Xma = MS(minus_age).fit_transform(Boston)
model1 = sm.OLS(y, Xma)
summarize(model1.fit())
Out[ ]:
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 41.5251 | 4.920 | 8.441 | 0.000 |
| crim | -0.1214 | 0.033 | -3.683 | 0.000 |
| zn | 0.0465 | 0.014 | 3.379 | 0.001 |
| indus | 0.0135 | 0.062 | 0.217 | 0.829 |
| chas | 2.8528 | 0.868 | 3.287 | 0.001 |
| nox | -18.4851 | 3.714 | -4.978 | 0.000 |
| rm | 3.6811 | 0.411 | 8.951 | 0.000 |
| dis | -1.5068 | 0.193 | -7.825 | 0.000 |
| rad | 0.2879 | 0.067 | 4.322 | 0.000 |
| tax | -0.0127 | 0.004 | -3.333 | 0.001 |
| ptratio | -0.9346 | 0.132 | -7.099 | 0.000 |
| lstat | -0.5474 | 0.048 | -11.483 | 0.000 |
4. Multivariate Goodness of Fit
- results.rsquared gives R².
- np.sqrt(results.scale) gives the RSE (residual standard error).
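Both quantities follow directly from the residuals. The sketch below computes them by hand on synthetic data (data, seed, and coefficients are assumptions for illustration): R² is 1 - RSS/TSS, and the RSE is the square root of RSS divided by the residual degrees of freedom n - p - 1, which is the quantity `results.scale` holds for an OLS fit.

```python
import numpy as np

# Synthetic data with p = 2 predictors plus an intercept.
rng = np.random.default_rng(2)
n, p = 100, 2
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])
y = 1.0 + predictors @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

rss = np.sum(resid ** 2)            # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares

r_squared = 1 - rss / tss
rse = np.sqrt(rss / (n - p - 1))
print(r_squared, rse)
```

The RSE comes out near 0.5, the noise level used in the simulation.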
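List Comprehension
- Compute the variance inflation factor (VIF) of each predictor and display the values as a data frame.
- The VIF measures multicollinearity, the problem that arises in linear regression when predictors are correlated with one another. A VIF of 1 means no collinearity; large values mean a predictor is well explained by the others.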
In [ ]:
vals = [VIF(X, i) for i in range(1, X.shape[1])]
vif = pd.DataFrame({'vif':vals}, index=X.columns[1:])
vif
Out[ ]:
| | vif |
|---|---|
| crim | 1.767486 |
| zn | 2.298459 |
| indus | 3.987181 |
| chas | 1.071168 |
| nox | 4.369093 |
| rm | 1.912532 |
| age | 3.088232 |
| dis | 3.954037 |
| rad | 7.445301 |
| tax | 9.002158 |
| ptratio | 1.797060 |
| lstat | 2.870777 |
In [ ]:
# Equivalent explicit loop for the list comprehension above
vals = []
for i in range(1, X.values.shape[1]):
    vals.append(VIF(X.values, i))
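The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The helper below, `vif_by_hand`, is a hypothetical function written for this illustration; it assumes the matrix already contains an intercept column, and it runs on synthetic data in which one column is nearly a copy of another.

```python
import numpy as np

def vif_by_hand(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the rest.

    Assumes X contains an intercept column among the other columns.
    """
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r_squared = 1 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r_squared)

rng = np.random.default_rng(3)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)            # nearly a copy of a
X = np.column_stack([np.ones(n), a, b, c])  # column 0 is the intercept

# Same pattern as above: skip the intercept and loop over the predictors.
vals = [vif_by_hand(X, j) for j in range(1, X.shape[1])]
print(vals)
```

The two nearly identical columns get very large VIFs, while the independent column stays close to 1.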
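5. Interaction Terms
- Use ModelSpec() to include interaction terms in a linear model.
- Including a tuple such as ('lstat', 'age') tells the model matrix builder to add an interaction term between lstat and age.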
In [ ]:
X = MS(['lstat', 'age',('lstat', 'age')]).fit_transform(Boston)
model2 = sm.OLS(y, X)
summarize(model2.fit())
Out[ ]:
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 36.0885 | 1.470 | 24.553 | 0.000 |
| lstat | -1.3921 | 0.167 | -8.313 | 0.000 |
| age | -0.0007 | 0.020 | -0.036 | 0.971 |
| lstat:age | 0.0042 | 0.002 | 2.244 | 0.025 |
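For two numeric predictors, an interaction column is the elementwise product of the two columns. The sketch below builds such a model matrix by hand (the lstat and age values are made up for illustration, and this is not ISLP's implementation) and then uses the coefficients from the table above to show how the slope on lstat changes with age.

```python
import numpy as np

# Made-up predictor values for illustration.
lstat = np.array([5.0, 10.0, 15.0, 20.0])
age = np.array([30.0, 60.0, 45.0, 90.0])

# Intercept, main effects, and the product column for the interaction.
columns = ['intercept', 'lstat', 'age', 'lstat:age']
X = np.column_stack([np.ones_like(lstat), lstat, age, lstat * age])

# With an interaction, the effect of lstat depends on age:
# d(medv)/d(lstat) = coef_lstat + coef_interaction * age
coef_lstat, coef_interaction = -1.3921, 0.0042  # from the table above
slope_at_age_50 = coef_lstat + coef_interaction * 50
print(slope_at_age_50)
```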
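6. Non-linear Transformations of the Predictors
- The model matrix builder accepts terms beyond plain column names and interactions.
- poly(): supplied by ISLP, it specifies that columns representing polynomial functions of its first argument are added to the model matrix. Here it adds degree-2 polynomial columns of lstat alongside the age column.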
In [ ]:
X = MS([poly('lstat', degree=2), 'age']).fit_transform(Boston)
model3 = sm.OLS(y, X)
results3 = model3.fit()
summarize(results3)
Out[ ]:
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 17.7151 | 0.781 | 22.681 | 0.0 |
| poly(lstat, degree=2)[0] | -179.2279 | 6.733 | -26.620 | 0.0 |
| poly(lstat, degree=2)[1] | 72.9908 | 5.482 | 13.315 | 0.0 |
| age | 0.0703 | 0.011 | 6.471 | 0.0 |
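The size of the polynomial coefficients suggests that poly() returns orthogonalized polynomial columns here rather than raw powers of lstat. That reading is an assumption based on the output. The generic QR-based sketch below (synthetic data, not ISLP's code) shows why the choice of basis does not matter for the fit: raw and orthogonalized columns span the same space, so the fitted values are identical even though the coefficients differ.

```python
import numpy as np

# Synthetic data with a quadratic trend (assumed values for illustration).
rng = np.random.default_rng(4)
n = 150
x = rng.uniform(1, 35, size=n)
y = 40 - 2.3 * x + 0.04 * x ** 2 + rng.normal(scale=3.0, size=n)

# Raw basis: intercept, x, x^2.
raw = np.column_stack([np.ones(n), x, x ** 2])

# Orthonormal columns spanning the same space, via a QR decomposition.
# The first column of Q is proportional to the constant, so it is replaced by ones.
Q, _ = np.linalg.qr(raw)
ortho = np.column_stack([np.ones(n), Q[:, 1:]])

def fitted(design, response):
    """Least-squares fitted values and coefficients for a given design matrix."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return design @ coef, coef

fit_raw, coef_raw = fitted(raw, y)
fit_ortho, coef_ortho = fitted(ortho, y)
print(coef_raw, coef_ortho)
```

Orthogonal columns are uncorrelated with each other and with the intercept, which keeps the standard errors of the polynomial terms from inflating the way raw powers can.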
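- anova_lm(): performs an analysis of variance (ANOVA) comparing two nested linear regression models.
- It carries out a hypothesis test comparing the two models.
- Null hypothesis: the extra term in the larger model (the quadratic term in lstat) is not needed.
- Alternative hypothesis: the larger model is superior.
- Here the F-statistic is 177.28 and the associated p-value is essentially zero (7.5e-35), so the quadratic term clearly improves the fit. Because the two models differ by one degree of freedom, the F-statistic equals the square of the t-statistic for the quadratic term in the results3 summary (13.315² ≈ 177.29, matching up to rounding).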
In [ ]:
anova_lm(results1, results3)
Out[ ]:
| | df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 503.0 | 19168.128609 | 0.0 | NaN | NaN | NaN |
| 1 | 502.0 | 14165.613251 | 1.0 | 5002.515357 | 177.278785 | 7.468491e-35 |
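The F-statistic in this table can be reproduced from the two residual sums of squares and their degrees of freedom. A minimal check, using only the numbers printed above:

```python
# Values copied from the anova_lm() table above.
rss_small, df_resid_small = 19168.128609, 503.0   # results1: lstat + age
rss_large, df_resid_large = 14165.613251, 502.0   # results3: adds the quadratic term

df_diff = df_resid_small - df_resid_large
ss_diff = rss_small - rss_large

# F = (drop in RSS per extra parameter) / (residual variance of the larger model)
f_stat = (ss_diff / df_diff) / (rss_large / df_resid_large)

# t-statistic of the quadratic term from the results3 summary (rounded).
t_quadratic = 13.315
print(f_stat, t_quadratic ** 2)
```

The squared t-statistic differs from F only in the second decimal place, because the t value in the summary table is rounded.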
In [ ]:
ax = subplots(figsize=(8,8))[1]
ax.scatter(results3.fittedvalues, results3.resid)
ax.set_xlabel('Fitted value')
ax.set_ylabel('Residual')
ax.axhline(0, c='k', ls='--')
Out[ ]:
<matplotlib.lines.Line2D at 0x78408da1fc40>
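7. Qualitative Predictors
- The Carseats data includes the qualitative predictor ShelveLoc, an indicator of the quality of the shelving location, that is, the space within a store where the car seat is displayed.
- Given a qualitative variable such as ShelveLoc, ModelSpec() generates dummy variables automatically.
- This is a one-hot encoding of the categorical variable, with one level dropped as the baseline.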
In [ ]:
Carseats = load_data('Carseats')
Carseats.columns
Out[ ]:
Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
dtype='object')
In [ ]:
allvars = list(Carseats.columns.drop('Sales'))
y = Carseats['Sales']
final = allvars + [('Income', 'Advertising'), ('Price', 'Age')]
X = MS(final).fit_transform(Carseats)
model = sm.OLS(y, X)
summarize(model.fit())
Out[ ]:
| | coef | std err | t | P>\|t\| |
|---|---|---|---|---|
| intercept | 6.5756 | 1.009 | 6.519 | 0.000 |
| CompPrice | 0.0929 | 0.004 | 22.567 | 0.000 |
| Income | 0.0109 | 0.003 | 4.183 | 0.000 |
| Advertising | 0.0702 | 0.023 | 3.107 | 0.002 |
| Population | 0.0002 | 0.000 | 0.433 | 0.665 |
| Price | -0.1008 | 0.007 | -13.549 | 0.000 |
| ShelveLoc[Good] | 4.8487 | 0.153 | 31.724 | 0.000 |
| ShelveLoc[Medium] | 1.9533 | 0.126 | 15.531 | 0.000 |
| Age | -0.0579 | 0.016 | -3.633 | 0.000 |
| Education | -0.0209 | 0.020 | -1.063 | 0.288 |
| Urban[Yes] | 0.1402 | 0.112 | 1.247 | 0.213 |
| US[Yes] | -0.1576 | 0.149 | -1.058 | 0.291 |
| Income:Advertising | 0.0008 | 0.000 | 2.698 | 0.007 |
| Price:Age | 0.0001 | 0.000 | 0.801 | 0.424 |
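- final: the list allvars with the interaction terms ('Income', 'Advertising') and ('Price', 'Age') appended.
- ShelveLoc[Good]: equals 1 if the shelving location is good and 0 otherwise.
- ShelveLoc[Medium]: equals 1 if the shelving location is medium and 0 otherwise.
- A bad shelving location corresponds to 0 for both dummy variables.
Results
- (1) The positive coefficient on ShelveLoc[Good] indicates that a good shelving location is associated with higher sales than a bad one.
- (2) ShelveLoc[Medium] has a smaller positive coefficient: a medium shelving location is associated with higher sales than a bad location but lower sales than a good one.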
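The dummy coding described above can be written out by hand. The sketch below uses a made-up list of shelf locations (an assumption for illustration, not the Carseats data) with Bad as the dropped baseline, and then reads the pairwise differences in expected sales off the fitted coefficients.

```python
import numpy as np

# Made-up shelf locations for five stores.
shelve_loc = ['Bad', 'Good', 'Medium', 'Medium', 'Bad']
levels = ['Bad', 'Good', 'Medium']
baseline = levels[0]  # the baseline level gets no column of its own

dummy_names = [f'ShelveLoc[{level}]' for level in levels if level != baseline]
dummies = np.array([[1.0 if value == level else 0.0
                     for level in levels if level != baseline]
                    for value in shelve_loc])

# Coefficients from the table above: differences relative to a bad location.
coef_good, coef_medium = 4.8487, 1.9533
good_vs_medium = coef_good - coef_medium
print(dummy_names)
print(dummies)
print(good_vs_medium)
```

Each coefficient is a difference from the baseline, so comparing two non-baseline levels means subtracting their coefficients.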