I ran across this awesome site, emojitracker, the other day. If you love data like me, you might recognize (or at least guess) that emoji usage probably follows a Pareto or Zipf distribution. I downloaded a snapshot of the data, and it’s a reasonable match.
These particular non-uniform distributions are common in all kinds of applications. In our work, for example, a Zipf distribution maps fairly well to the distribution of activity across countries, across particular advertising campaigns, and even across users (that is, there is a small set of very active users and a long tail of less active ones). This can be helpful in making simulations better mirror real-world behavior; some of our integration tests take advantage of it.
You may not know that Python has some non-uniform random distribution functions built in: scroll down to the bottom of the random module’s documentation and you’ll see lots of functions that can be useful for simulations, such as expovariate and even paretovariate.
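To get a feel for these functions, here is a minimal sketch that draws samples from random.expovariate and random.paretovariate. The seed, the sample size, and the parameter values (0.5 and 1.0) are arbitrary choices for illustration, not anything from our tests.

```python
import random

# Seeded only so the sketch is repeatable; the seed value is arbitrary.
rng = random.Random(42)

# expovariate(lambd): exponential distribution with mean 1 / lambd.
exp_samples = [rng.expovariate(0.5) for _ in range(10_000)]

# paretovariate(alpha): Pareto distribution. Every sample is >= 1,
# but there is no upper bound on how large a sample can be.
pareto_samples = [rng.paretovariate(1.0) for _ in range(10_000)]

print("expovariate mean:", sum(exp_samples) / len(exp_samples))
print("paretovariate min:", min(pareto_samples))
print("paretovariate max:", max(pareto_samples))
```

The maximum of the Pareto samples is typically orders of magnitude larger than the minimum, which is the long tail at work.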
In practice, though, you may find paretovariate difficult to use, because it can return an arbitrarily large number. For simulations like the ones above, where there is a finite number of, say, emoji to choose from, you really want a finite Zipf distribution. Fortunately, it’s pretty easy to implement:
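Here is a minimal sketch of one way to do it, not necessarily the exact code we use. The function name zipf_choice and its rng parameter are made up for this example. It assumes the simplest form of the distribution, where the item at rank k (counting from 1) is picked with probability proportional to 1/k, and it recomputes the weights on every call.

```python
import bisect
import itertools
import random


def zipf_choice(items, rng=random):
    """Pick one element of `items`; rank k (1-based) has weight 1/k.

    Hypothetical helper for illustration: exponent fixed at 1,
    weights recomputed on every call.
    """
    if not items:
        raise ValueError("items must not be empty")
    # The first item is the most likely, the last the least likely.
    weights = [1.0 / rank for rank in range(1, len(items) + 1)]
    # Running totals of the weights; the last entry is the grand total.
    cumulative = list(itertools.accumulate(weights))
    # Draw a point in [0, total) and find the first running total above it.
    point = rng.random() * cumulative[-1]
    index = bisect.bisect_right(cumulative, point)
    # Guard against floating-point rounding pushing the index past the end.
    return items[min(index, len(items) - 1)]


if __name__ == "__main__":
    emoji = ["😂", "❤", "😍", "😭", "😊"]
    print([zipf_choice(emoji) for _ in range(10)])
```

Because the list is finite, the result is always one of the items you passed in, no matter how the random draw falls.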
Altering the method to cache the cumulative weights, and to take optional shape parameters (α and s), is left as an exercise to the reader ❤
Also, we’re hiring!