星期日晚看見選舉結果,望見兩大選區(新界東/新界西)各泛民名單的得票如此相近,心想「噢,law of large number…」,然後跟據幾個模糊的概念,草草地寫了那篇
Fooled by randomness。
事實上該文很不準確。先是兆安的幾個留言提醒我對量子力學的理解有錯,然後是朋友S私底下指出這不可能是純綷隨機結果。我自己計了一次,對,如果純粹的隨機投票,五張名單的得票應該是極之接近(七成機會正負一百)。多謝賜正。關於量子力學,我的理解只是大學通識程度,不多提了。數學還有點點能力,下面是我的算法。對數學沒有興趣的,可跳至尾二段。
Given n balls and k urns. All balls are uniformly distributed among all k urns. We calculate the probability of the first urn getting not more than t balls.
The random variable is the number of balls in the first urn, it can be approximated by a Poisson Distribution.
where is the expected rate.
We can get the probability of the first urn getting not more than t balls by summing the probability of all events, since they are mutually excluded.
Begin lazy, we ask matlab for results with the following command,
poisscdf(t, u)
Suppose we have 204030 balls and 5 urns, the expected number of each urn is 40806. The probability of the first urn getting less than 99% of the expected number (i.e. 40398) is
On the other hand we can get the probability of getting more than 1% of the expected number (i.e. 41214) similarly,
In fact a more precise approximation scheme can be achieved by using Chernoff Bound, however it depends on a moment generating function that i am not familiar with. This article gives a loose approximation of how unlikely an urn to get 1% fewer(more) balls than the expected value.
Thank you S for the useful discussion in Probability.
把問題簡化,只考慮任何一張名單。上面的結果得出,假設有204030張選票和五張名單,若所以選民皆是隨機投票的話,平均每張名單應得40806張選票。然而,一張名單超過/低過平均數的百份之一的機會為百份之五;超過/低過平均數的百份之二更不足百份之零點零零零一。然而沒有一張名單的實際得票是平均正負百份之二內,意即不能以純粹隨機過程去解悉投票行為。
縱然該文錯漏百出,我仍然堅持該文的基本概念,每個人的喜惡不一,無可能準確配票,反而,隨機(於兩大選區)能容易地得出較平均的分配。
2008/09/11 於 02:54
From you equation
1 – P(X <= 41214) = 0.9783
it implies that
P(X <= 41214) = 0.0217
which is equal to P(X <= 40398) from your another result.
I think something seems to be wrong because logically you’d expect that P(X <= 41214) to be larger (actually much larger) than P(X 41214) is around 0.0217?
2008/09/11 於 02:56
For some reason my previous comment was truncated. This was what I wanted to say:
From you equation
1 – P(X <= 41214) = 0.9783
it implies that
P(X <= 41214) = 0.0217
which is equal to P(X <= 40398) from your another result.
I think something seems to be wrong because logically you’d expect that P(X <= 41214) to be larger (actually much larger) than P(X 41214) is around 0.0217?
2008/09/11 於 02:58
that was a typo. thx!
P(X <= 41214) = 0.9873.
should be 1-P(X <= 41214) = 1-0.9873 = 0.0217
2008/09/11 於 03:06
(I posted this for Alex (from http://www.alexweblog.com/), there are some strange errors that prevent his comments to display correctly)
In the last sentence,
*** P(X 41214) ***
is actually
*** P(X < 41214) ***
2008/09/11 於 03:27
it seems you don’t really need a Poisson approximation here
basically X is a binomial random variable with parameters n and p=1/k
then for n and t large, you can just use central limit theorem and approximate with normal distribution
(Poisson approximation holds for n large)
2008/09/13 於 00:56
堆數好似亂晒坑…
Pr( X <= 41214) 應該係等於 Pr( X=0 ) + Pr( X= 1) + … +P(X= 41213) ,所以0.0217粒數好似錯左.
正如cowmoo講,應該用Lindonberg-Levy ClT approximate 最方便:
mean = 204030 *.2 = 40806
variance = 204030 *.2 *.8 = 32644.8
z = (40398-40806) / sqrt(32644.8) = -2.25815 ,
P( Z N(0,1)
所以一張名單lor 到少過99% of mean 既機會大約係 0.012
P( -2.25815 < Z < 2.25815) 就係0.976064。
切勿忘記,我地假設投票人係亂投的,呢個係一個好大的assumption. 根據投票結果,我地應該能夠否定是假設。
2008/09/13 於 00:59
sorry.. truncated words
**P( z < Z = -2.25815) = 0.012 , where Z converges to standard normal. **
2008/09/13 於 01:59
Cow, 😛
The fact is I have forgotten CLT can be applied.
I got 0.0217 probably becoz poisson approx is not as accurate as clt.
>切勿忘記,我地假設投票人係亂投的,呢個係一個好大的assumption. 根據投票結果,我地應該能夠否定是假設。
this is what i am trying to say in this post.
2008/09/13 於 04:29
As a note of clarification, in fact, the Poisson approximation of binomial distribution cannot be applied in this case. The reason is the following:
Note the the moment generating fuction of Binomial distribution B(n, p) is (1-p +pe^t)^n = (1+p(e^t-1))^n
where as the MGF of poisson P(入) is exp(入(exp(t) -1)).
If we make two assumptions here — 1) 入 = np = constant mean, 2) n tends to infinity
then the MGF of B(n,p) becomes (1+ 入(e^t -1) /n ) ^n
which tends to exp(入(exp(t) -1)) , then Poisson MGF, as n tends to infinity.
Here, in this case, n = 204030 is sufficiently large, but np is increasing with n, as p is fixed at 0.2. The first assumption is not satisfied, and thus we cannot use the Poisson approximation in this case.
That can perhaps explain to a great extent the discrepancy of the results. 😛
2008/09/15 於 03:25
😛
Thank you so much for your detailed explanation!
I should pay more attention in math class!
🙂