Generate binomial sample with (pretty) exact probability

When I simulate normal data in R, I make sure that the sample have the exact mean and sd of the sampling distribution: x = scale(rnorm(n))*sd + mean . I want to do the same for binomial data, making the sample express the near-exact probability that they were generated from. Of course it can't be exact when the probability is continuous and the sample is discrete but something that gets pretty close would be nice x = rbinom(n, 18, 0.5) can potentially give samples where an MLE estimate would indicate a probability of p=0.2 or p=0.8 which is pretty far from p=0.5. Purpose: I'm building a Bayesian model where I infer a binomial rate from a small sample. To test that the model works, I'd like to simulate well-specified data, in order to diagnose whether a strange inferential result is due to chance in the simulation or in the model.

147k 89 89 gold badges 404 404 silver badges 712 712 bronze badges asked Mar 24, 2014 at 15:08 Jonas Lindeløv Jonas Lindeløv 2,478 2 2 gold badges 25 25 silver badges 31 31 bronze badges

$\begingroup$ Hmm. That's kinda a strange request but .. Binomial gives integers so with fixed n, you can't always find a set of integers that will exactly hit a mean, say Pi. Now, if that mean is from another binomial than you can obviously but perhaps not in a different way. My guess is that you may not have to do this. What's the purpose? $\endgroup$

Commented Mar 24, 2014 at 15:22

$\begingroup$ This question still seems under-specified. Why, for instance, would it not suffice to generate c(rep(1, round(n*p)), rep(0, n-round(n*p))) ? That would seem to satisfy all the requirements you have set out. $\endgroup$

Commented Mar 24, 2014 at 15:53

$\begingroup$ "When I simulate normal data in R, I make sure that the sample have the exact mean and sd of the sampling distribution" --- Except that by doing so, you guarantee it's not actually normally distributed. Is that actually what you want here? And if you do want to absolutely guarantee it's not actually a random sample from a binomial, why bring up 'binomial' at all? $\endgroup$

Commented Mar 24, 2014 at 22:40

$\begingroup$ (ctd) Though I guess the easy way to do that would be to multiply the pmf by $n$ and round, and then add or remove a few values that leave exactly $n$, by adding to values that are most under-represented or subtracting from ones that are over, in such a way as to get the desired moments close to right. $\endgroup$

Commented Mar 24, 2014 at 23:25

$\begingroup$ Jonas, the deterministic sample I specified is as close to Binomial as you possibly can get for any given $n$, so it is not evident what you would mean by "not the best approximation." (In fact, my example is a very simple illustration of the general procedure proposed by @Glen_b.) It may be worth bluntly stating that what you are asking for looks like a poor way of carrying out your intended investigation. Instead, generate random samples according to simpler procedures, without any constraints, and analyze the effects of sampling variability on your simulation. $\endgroup$

Commented May 26, 2014 at 15:00

2 Answers 2

$\begingroup$

I'm not sure whether this is actually advisable, but it should be straightforward to generate. What you are asking for, essentially, is an underdispersed binomial distribution. You can get this by sampling (with replacement, if you want more than 1 value) from a vector of the integers 0:size , where you specify a set of underdispersed probabilities. You just have to figure out what the probabilities are that you want. First, consider this figure:

enter image description here

From this you can see that the probabilities of each possible value from a binomial distribution can be matched by a normal distribution with mean $n\pi$ and variance $n\pi(1-\pi)$. Thus, you can make an underdispersed version by using appropriately scaled densities from a normal distribution with the same mean but a smaller SD. Imagine that you want the SD to be cut in half, then:

set.seed(1773) hSD.norm = dnorm(0:18, mean=9, sd=sqrt(18*.25)*.5) ud.probs = hSD.norm/sum(hSD.norm) N = 10000 vals = sample(0:18, size=N, replace=TRUE, prob=ud.probs)