#title Poisson regression
[[TableOfContents]]

방법만 보라.

--응용데이터분석, 허명회
{{{
diabetes <- read.csv("c:\\data\\diabetes.csv", header=T)
str(diabetes)
diabetes.glm <- glm(deaths ~ age + gender, offset=l_popn, family="poisson", data=diabetes)
summary(diabetes.glm)
anova(diabetes.glm)
}}}

==== 가정 ====
 * 결과 발생이 매우 적은 경우(대략 5% 미만) - 부분 유료화 게임의 구매자수 정도?
 * 도수(카운터 데이터)가 종속변수인 경우
 * poisson 가정(음수는 없고, 오른쪽으로 긴 꼬리를 가진 분포, 각 event는 독립, 시간이 지나도 확률은 일정한 것 등등)

==== 어떤 경우에 써먹나? ====
 * event의 발생 횟수나 확률
 * 도수의 모형화 -> log(u) = ax + by + c
 * 비율의 모형화 -> log(u/N) = ax + by + c

==== 예제 ====
[http://freesearch.pe.kr/archives/2555 여기]의 데이터를 이용해 보자.
{{{
tmp <- textConnection( 
  "교육수준  흡연실태  사원수
대졸	과흡연	51
고졸	과흡연	22
중졸	과흡연	43
대졸	흡연	92
고졸	흡연	21
중졸	흡연	28
대졸	비흡연	68
고졸	비흡연	9
중졸	비흡연	22") 
x <- read.table(tmp, header=TRUE)
close.connection(tmp)
head(x)
}}}

카이제곱 분석
{{{
t <- xtabs(사원수~교육수준+흡연실태, data=x)
summary(t)
}}}

카이제곱 분석 결과
{{{

> summary(t)
Call: xtabs(formula = 사원수 ~ 교육수준 + 흡연실태, data = x)
Number of cases in table: 356 
Number of factors: 2 
Test for independence of all factors:
	Chisq = 18.51, df = 4, p-value = 0.0009808
> 
}}}
결과를 보면 교육수준은 흡연에 영향을 끼치는 것을 볼 수 있다. 이것을 possion regression 해보자.

{{{
fit <- glm(사원수~교육수준+흡연실태, data=x, family=poisson)
summary(fit)
}}}

{{{
> summary(fit)

Call:
glm(formula = 사원수 ~ 교육수준 + 흡연실태, family = poisson, 
    data = x)

Deviance Residuals: 
       1         2         3         4         5         6  
-2.24478   1.17378   2.16834   0.90724   0.08884  -1.52052  
       7         8         9  
 1.18683  -1.54454  -0.77967  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)      2.8299     0.1582  17.883  < 2e-16 ***
교육수준대졸     1.4006     0.1548   9.047  < 2e-16 ***
교육수준중졸     0.5814     0.1732   3.357 0.000787 ***
흡연실태비흡연  -0.1585     0.1368  -1.158 0.246791    
흡연실태흡연     0.1952     0.1254   1.557 0.119474    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 137.880  on 8  degrees of freedom
Residual deviance:  18.663  on 4  degrees of freedom
AIC: 76.454

Number of Fisher Scoring iterations: 4

>
}}}
뭐.. 결과가 흡족하진 않다. 결과를 써먹어보자. 대졸, 과흡연하는 사람은 몇명 정도일까?

{{{
predict(fit, data.frame(교육수준=c("대졸"), 흡연실태=c("과흡연")), type="response")
#또는 exp(predict(fit, data.frame(교육수준=c("대졸"), 흡연실태=c("과흡연"))))
}}}

{{{
> predict(fit, data.frame(교육수준=c("대졸"), 흡연실태=c("과흡연")), type="response")
       1 
68.75281
}}}
69명 정도다. R말고 다른데서 써먹을라면 다음과 같이 summary() 한 내용을 써먹으면 된다. 

{{{
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)      2.8299     0.1582  17.883  < 2e-16 ***
교육수준대졸     1.4006     0.1548   9.047  < 2e-16 ***
교육수준중졸     0.5814     0.1732   3.357 0.000787 ***
흡연실태비흡연  -0.1585     0.1368  -1.158 0.246791    ---------> 여길 보면 모델이 안좋다는 것을 알 수 있다.
흡연실태흡연     0.1952     0.1254   1.557 0.119474    ---------> 여길 보면 모델이 안좋다는 것을 알 수 있다.
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
}}}
 * (Intercept)      2.8299
 * 교육수준대졸     1.4006
 * 흡연실태과흡연 -> X 없다. 
 * 결과적으로 exp(2.8299 + 1.4006) 하면 68.7516 이렇게 나온다. 

음...69와 51은 좀 차이가 있다. 예제는 poisson regression에는 적당하지 않아서 그럴테다.