R: Multiple Linear Regression (Concepts and Examples)


얇은생각 2019. 3. 11. 12:30

Multiple linear regression

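Multiple linear regression is linear regression with two or more independent variables. Its equation has the same form as the simple linear regression equation, with one coefficient per independent variable: y = b0 + b1*x1 + b2*x2 + ... + bk*xk + error.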
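As a minimal sketch of that form, the following fits a two-variable model on simulated data. The data, the names x1, x2 and y, and the true coefficients (1, 2, -3) are assumptions made up for illustration; they are not part of the data set used in this post.

```r
# Simulated data (illustrative assumption, not the Prestige data)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)

# Multiple linear regression: two independent variables on the right-hand side
fit <- lm(y ~ x1 + x2)

# Estimated intercept and slopes; they should land near 1, 2 and -3
est <- coef(fit)
print(round(est, 2))
```

The only difference from a simple regression call is the extra term in the formula; lm estimates one coefficient per term plus the intercept.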

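Practice: a salary prediction model

Let's predict the income (treated here as salary) of an occupation from three variables: years of education, percentage of women, and prestige. The data is Prestige, which becomes available when the car package is loaded (it ships in carData).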

 library(car)
# Loading required package: carData
 
 head(Prestige)
#                     education income women prestige census type
# gov.administrators      13.11  12351 11.16     68.8   1113 prof
# general.managers        12.26  25879  4.02     69.1   1130 prof
# accountants             12.77   9271 15.70     63.4   1171 prof
# purchasing.officers     11.42   8865  9.11     56.8   1175 prof
# chemists                14.62   8403 11.68     73.5   2111 prof
# physicists              15.64  11030  5.13     77.6   2113 prof

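The data has one row per occupation (the row names), with columns for years of education, income, percentage of women, prestige score, census code, and occupation type.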

 newdata <- Prestige[,c(1:4)]
 
 plot(newdata, pch=16, col="blue",      
      main="Matrix Scatterplot")

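The code above draws a scatterplot matrix of the first four columns, which shows how income relates to each of the other variables.

[Figure: Matrix Scatterplot of education, income, women, and prestige]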

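Now let's fit the model with all three independent variables. In the output below, the Estimate column gives the estimated coefficient of each independent variable (plus the intercept), not the value of the variable itself.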

 mod1 <- lm(income ~ education + prestige +              
              women, data=newdata)
 
 summary(mod1)
# Call:
# lm(formula = income ~ education + prestige + women, data = newdata)

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -7715.3  -929.7  -231.2   689.7 14391.8 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -253.850   1086.157  -0.234    0.816    
# education    177.199    187.632   0.944    0.347    
# prestige     141.435     29.910   4.729 7.58e-06 ***
# women        -50.896      8.556  -5.948 4.19e-08 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 2575 on 98 degrees of freedom
# Multiple R-squared:  0.6432,  Adjusted R-squared:  0.6323 
# F-statistic: 58.89 on 3 and 98 DF,  p-value: < 2.2e-16

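Now let's write out the equation. If the average years of education is 9.5, the percentage of women is 20%, and the prestige score is 80, what is the expected average income?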

income = -253.850 + 177.199 * education + 141.435 * prestige - 50.896 * women

income = -253.850 + 177.199 * 9.5 + 141.435 * 80 - 50.896 * 20
       = 11726.42
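The same arithmetic can be checked in R. This is a small sketch that hard-codes the rounded coefficients printed by summary(mod1), so it runs without the data; the vector names b, x and income_hat are made up for illustration. Note that the women coefficient is negative, so its term is subtracted.

```r
# Coefficients copied (rounded) from the summary(mod1) output
b <- c(intercept = -253.850, education = 177.199,
       prestige = 141.435, women = -50.896)

# Hypothetical occupation from the question: 9.5 years, prestige 80, 20% women
x <- c(intercept = 1, education = 9.5, prestige = 80, women = 20)

# The prediction is the sum of coefficient * value over all terms
income_hat <- sum(b * x[names(b)])
print(round(income_hat, 2))   # 11726.42
```

With mod1 still in the session, predict(mod1, data.frame(education = 9.5, prestige = 80, women = 20)) should give nearly the same value, differing only by the rounding of the printed coefficients.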

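Model evaluation

The fitted model can be evaluated from the summary output above. The p-value of each coefficient shows how much that variable matters: prestige and women are highly significant, while education (p = 0.347) is not. The F-statistic (58.89, p-value < 2.2e-16) checks whether the model as a whole is valid. Multiple R-squared (0.6432) shows how much of the variation in the dependent variable the model explains, here about 64%.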
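To see how these summary numbers fit together, here is a hedged sketch that recomputes the adjusted R-squared and the F-statistic from the printed Multiple R-squared using the standard formulas. The inputs are read off the summary(mod1) output; n = 102 is inferred from the 98 residual degrees of freedom plus 4 estimated coefficients.

```r
# Values read off the summary(mod1) output
r2 <- 0.6432   # Multiple R-squared
n  <- 102      # observations: 98 residual df + 4 estimated coefficients
p  <- 3        # independent variables

# Adjusted R-squared penalizes R-squared for the number of variables
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic and its p-value on (p, n - p - 1) degrees of freedom
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_value <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 4))
print(p_value)
```

Because the input R-squared is rounded to four digits, the results match the printed 0.6323 and 58.89 only up to rounding.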

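Variable selection

When there are many independent variables, some of them explain the dependent variable better than others, and it is better to build the model from those alone. This selection can be automated. The following example uses stepAIC from the MASS package, which adds and drops variables step by step and keeps the model with the lowest AIC.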

 library(MASS)
 
 newdata2 <- Prestige[,c(1:5)]
 
 head(newdata2)
#                     education income women prestige census
# gov.administrators      13.11  12351 11.16     68.8   1113
# general.managers        12.26  25879  4.02     69.1   1130
# accountants             12.77   9271 15.70     63.4   1171
# purchasing.officers     11.42   8865  9.11     56.8   1175
# chemists                14.62   8403 11.68     73.5   2111
# physicists              15.64  11030  5.13     77.6   2113
 
 mod2 <- lm(income ~ education + prestige +              
              women + census, data= newdata2)
 
 step <- stepAIC(mod2, direction="both")
# Start:  AIC=1607.93
# income ~ education + prestige + women + census

#             Df Sum of Sq       RSS    AIC
# - census     1    639658 649654265 1606.0
# - education  1   5558323 654572930 1606.8
# <none>                   649014607 1607.9
# - prestige   1 143207106 792221712 1626.3
# - women      1 212639294 861653901 1634.8

# Step:  AIC=1606.03
# income ~ education + prestige + women

#             Df Sum of Sq       RSS    AIC
# - education  1   5912400 655566665 1605.0
# <none>                   649654265 1606.0
# + census     1    639658 649014607 1607.9
# - prestige   1 148234959 797889223 1625.0
# - women      1 234562232 884216497 1635.5

# Step:  AIC=1604.96
# income ~ prestige + women

#             Df Sum of Sq        RSS    AIC
# <none>                    655566665 1605.0
# + education  1   5912400  649654265 1606.0
# + census     1    993735  654572930 1606.8
# - women      1 234647032  890213697 1634.2
# - prestige   1 811037947 1466604612 1685.1
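The AIC values in this trace can be reproduced from the RSS column. For lm fits, the AIC used in stepwise selection is n*log(RSS/n) + 2k, up to an additive constant, where k is the number of estimated coefficients. The sketch below plugs in the RSS of the model kept at each step (the <none> rows); the vector names are made up for illustration.

```r
# RSS of the model kept at each step, copied from the <none> rows
rss <- c(all_four = 649014607, no_census = 649654265, prestige_women = 655566665)

# Number of estimated coefficients (intercept included) in each model
k <- c(5, 4, 3)

n <- 102  # rows in Prestige

# AIC for a linear model, up to an additive constant
aic <- n * log(rss / n) + 2 * k
print(round(aic, 2))

# The model with the smallest AIC is the one the search stops at
print(names(which.min(aic)))
```

Dropping census and then education raises the RSS only slightly, so the saving of 2 per removed coefficient wins and the AIC falls at each step.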

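Now let's refit the model with the selected variables, prestige and women. This process gives a smaller model that fits about as well (adjusted R-squared 0.6327, versus 0.6323 for the three-variable model), which is the kind of meaningful model we want for working with the data.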

 newdata2 <- Prestige[,c(1:5)]
 
 head(newdata2)
#                     education income women prestige census
# gov.administrators      13.11  12351 11.16     68.8   1113
# general.managers        12.26  25879  4.02     69.1   1130
# accountants             12.77   9271 15.70     63.4   1171
# purchasing.officers     11.42   8865  9.11     56.8   1175
# chemists                14.62   8403 11.68     73.5   2111
# physicists              15.64  11030  5.13     77.6   2113
 
 mod3 <- lm(income ~ prestige + women,            
            data= newdata2)

 summary(mod3)

# Call:
# lm(formula = income ~ prestige + women, data = newdata2)

# Residuals:
#     Min      1Q  Median      3Q     Max 
# -7620.9 -1008.7  -240.4   873.1 14180.0 

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  431.574    807.630   0.534    0.594    
# prestige     165.875     14.988  11.067  < 2e-16 ***
# women        -48.385      8.128  -5.953 4.02e-08 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 2573 on 99 degrees of freedom
# Multiple R-squared:   0.64,   Adjusted R-squared:  0.6327 
# F-statistic: 87.98 on 2 and 99 DF,  p-value: < 2.2e-16
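As a quick check of the reduced model, this sketch applies the rounded mod3 coefficients to the hypothetical occupation used earlier in the post (prestige 80, 20% women). Education no longer appears because it was dropped during selection. The names b, x and income_hat3 are made up for illustration.

```r
# Coefficients copied (rounded) from the summary(mod3) output
b <- c(intercept = 431.574, prestige = 165.875, women = -48.385)

# Same hypothetical occupation: prestige 80, 20% women
x <- c(intercept = 1, prestige = 80, women = 20)

# Prediction from the two-variable model
income_hat3 <- sum(b * x[names(b)])
print(round(income_hat3, 2))
```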
