Mirostat: A Perplexity-Controlled Neural Text Decoding Algorithm
Abstract
Neural text decoding is important for generating high-quality texts using language models. To generate high-quality text, popular decoding algorithms like top-k, top-p (nucleus), and temperature-based sampling truncate or distort the unreliable low-probability tail of the language model. Though these methods generate high-quality text after parameter tuning, they are ad hoc. Not much is known about the control they provide over the statistics of the output, which is important since recent reports show text quality is highest for a specific range of likelihoods. Here, first we provide a theoretical analysis of perplexity in top-k, top-p, and temperature sampling, finding that cross-entropy behaves approximately linearly as a function of p in top-p sampling whereas it is a nonlinear function of k in top-k sampling, under Zipfian statistics. We use this analysis to design a feedback-based adaptive top-k text decoding algorithm called mirostat that generates text (of any length) with a predetermined value of perplexity, and thereby high-quality text without any tuning. Experiments show that for low values of k and p in top-k and top-p sampling, respectively, perplexity drops significantly with generated text length, which is also correlated with excessive repetitions in the text (the boredom trap). On the other hand, for large values of k and p, perplexity increases with generated text length, which is correlated with incoherence in the text (the confusion trap). Mirostat avoids both traps: experiments show that cross-entropy has a near-linear relation with repetition in generated text. This relation is almost independent of the sampling method but slightly dependent on the model used. Hence, for a given language model, control over perplexity also gives control over repetitions.
I Introduction
Generative language models have received recent attention due to their high-quality open-ended text generation ability for tasks such as story writing, making conversations, and question answering [1, 2]. These models are trained in an unsupervised manner over large amounts of human-written text data. Generation of texts from these models usually relies on different forms of random sampling methods. Pure sampling from trained models often leads to incoherent and low-quality texts [3]. On the other hand, greedy decoding leads to excessive repetitions resulting in low-quality texts. Hence, choosing the right decoding algorithm is important to generate high-quality texts with controlled attributes [4, 5, 6].
In this work, we introduce a neural text decoding algorithm called mirostat (from mirum, Latin for surprise, and stat, meaning control), which actively controls the generative process to maintain the perplexity of generated text at a certain desirable value. Mirostat uses an adaptive top-k sampling algorithm to tune the value of k, which helps maintain the overall perplexity of the text. Trained language models often have an unreliable tail in their probability distribution; hence a number of recently proposed sampling methods focus on suppressing this unreliable tail [3, 9, 10]. In top-k sampling [3, 9], the next word is sampled from the k most probable choices. In top-p sampling [10], also known as nucleus sampling, the next word is chosen from the top k probable choices, where k is the smallest integer such that their cumulative probability mass is at least p. While top-k sampling chooses from a fixed number of most probable choices, top-p sampling uses a dynamic number of most probable words based on a predefined value of p, and shows better performance on a number of different statistical and human-evaluated tests. For small values of k and p, these sampling methods tend to produce repetitions in the generated texts, thereby yielding poor quality. This can be handled by penalizing repetitions and using appropriate temperature values [11] or by adding diversity to the generated text [5, 12]. On the other hand, large values of k and p can lead to incoherent texts similar to pure sampling. Although choosing appropriate values of k or p can help avoid problems such as repetition and incoherence, this involves ad hoc tuning of parameters. We also observe that these methods do not give good control over the statistics of the generated text: even for a fixed value of k or p, the generated text can have varying statistical properties.
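The truncation step of top-k and top-p sampling described above can be sketched as follows. This is a minimal illustration on a toy sorted distribution, not the paper's implementation: `top_k_filter` keeps a fixed number of candidates, while `top_p_filter` keeps the smallest prefix reaching cumulative mass p.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize.
    `probs` is assumed sorted in descending order."""
    kept = probs[:k]
    total = sum(kept)
    return [q / total for q in kept]

def top_p_filter(probs, p):
    """Keep the smallest prefix whose cumulative mass reaches p, renormalize."""
    kept, cum = [], 0.0
    for q in probs:
        kept.append(q)
        cum += q
        if cum >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]

# Toy distribution over 5 tokens, sorted by probability.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(len(top_k_filter(probs, 2)))    # fixed number of candidates: 2
print(len(top_p_filter(probs, 0.7)))  # dynamic: 0.5 + 0.2 >= 0.7, so 2 here
```

Note how the candidate count of top-p depends on the shape of the distribution, while top-k's does not; this is the distinction the paragraph above draws.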
Intriguingly, we show that small values of a certain perplexity statistic of generated texts, which we call surprise (see Def. 1), are closely linked to repetitions, and large values of surprise are linked to incoherence. Consider the following example, which illustrates that for a fixed value of p, repetition positively correlates with small surprise. (This work was funded in part by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizons Network, and the National Science Foundation Grant CCF-1717530.)
Example 1.
We generate four samples of text with 200 tokens (tokens are words or subwords that are part of the vocabulary of the model) using the GPT-2 model with 117M parameters [2]. For the first two samples we use top-p sampling with p = 0.4, for the third sample we use top-p sampling with p = 1.0, and for the fourth sample we use our proposed mirostat sampling algorithm. The fifth sample is human-generated, taken from the same corpus as the context. The context used is as follows.
Context: “For two months early in 1943, Shannon came into contact with the leading British mathematician Alan Turing. Turing had been posted to Washington to share with the U.S. Navy’s cryptanalytic service the methods used by the British Government Code and Cypher School at Bletchley Park to break the ciphers used by the Kriegsmarine U-boats in the north Atlantic Ocean. He was also interested in the encipherment of speech and to this end spent time at Bell Labs."

Observed average surprise value = 1.471, p = 0.4
Generated text: “Turing’s work on the cryptography of the Kriegsmarine was based on the cipher used by the Kriegsmarine’s famous E2. The Kriegsmarine code was based on the Cryptographic Algorithm (CAT) of the Royal Navy. It was a new method of solving the problem of encryption, which was a mathematical problem which was the basis of the cryptanalysis of the Japanese naval cipher.
Turing’s work on the cipher was based on the algorithm used by the Kriegsmarine’s famous E2. The cipher was based on the algorithm used by the Kriegsmarine’s famous E2.
Turing’s work on the cipher was based on the algorithm used by the Kriegsmarine’s famous E2.
Turing’s work on the cipher was based on the algorithm used by the Kriegsmarine’s famous E2.
Turing’s work on the cipher was based on"

Observed average surprise value = 1.887, p = 0.4
Generated text: “Turing was a great mathematician, and he was not a professional cryptographer. But he was a very talented cryptographer. He was also a highly intelligent man. He was a great man who was very well-versed in the law of probability. He was also very well-versed in the mathematical methods of the United States Army. He was also very well-versed in the mathematics of the Russian Academy of Sciences. He was also very well-versed in the mathematical methods of the German Academy of Sciences.
In 1943, Shannon traveled to Moscow to meet with the Director of the NSA. He was told that the FBI had obtained information from the NSA about the German secret police. He was told that the FBI had obtained information from the NSA about the German secret police. He was told that the FBI had obtained information about the German secret police.
The NSA, on the other hand, had been unable to get information from the NSA about the German secrete"

Observed average surprise value = 6.342, p = 1.0
Generated text: “In January of 1942, in an open letter delivered to 8 persons and addressed to Sherwood, Shannon took the opportunity to address skeptics about the little out-of-print $128,000 unreported race as "cryptanalysis apparatus of extraterrestrial intelligence." His reply was also reprinted in 6 South Carolina publications (Atlanta Press, Fall Publishing, Miami, New House Press, Yorkshire Press, Sydney Press, Weatherby Press, Winter Press, and Thames & Hudson, to name a few), and eventually accepted as the final subscribed FHAT article. (Summary, were whom)
Birmingham City Police concludes that [Turing’s][new replace Robert] Symington is not guilty of breach of OKF law. Further, his suggestions for preventing his sentencing Should he reside anywhere near the city he defies extradition to Bavaria closer to dawn minimal? Lylimud Simpson (Ladies & Gentlemen, Lawrence Livermore University Press, Philadelphia): Jim Gilmore and its wife, Eva Civilsky,"

Observed average surprise value = 1.828, mirostat target surprise = 1.8
Generated text: “In April 1943, Shannon was assigned to the U.S. Naval Intelligence Center at Naval Air Station Norfolk, Virginia, where he was assigned to work on a project to develop a new cipher for the U.S. Navy’s cryptanalysis program. The project was to be conducted by the U.S. Navy’s cryptanalysis team.
In early April, the U.S. Navy was informed that Shannon had been assigned to work on an encrypted cipher for the Kriegsmarine’s U-boat. The project was to be conducted by a group of cryptographers from the U.S. Naval Intelligence Center. The project was to be conducted by a group of cryptographers from the U.S. Naval Intelligence Center in a project that could be executed by the cryptanalysts.
In early April, Shannon was ordered by the U.S. Navy to provide the cryptanalysts with information on a cipher for the U-boat’s U-boats that could"

Observed average surprise value = 5.285, human-generated
Generated text: “Shannon and Turing met at teatime in the cafeteria. Turing showed Shannon his 1936 paper that defined what is now known as the ’Universal Turing machine’. This impressed Shannon, as many of its ideas complemented his own.
In 1945, as the war was coming to an end, the NDRC was issuing a summary of technical reports as a last step prior to its eventual closing down. Inside the volume on fire control, a special essay titled Data Smoothing and Prediction in Fire-Control Systems, co-authored by Shannon, Ralph Beebe Blackman, and Hendrik Wade Bode, formally treated the problem of smoothing the data in fire-control by analogy with ’the problem of separating a signal from interfering noise in communications systems’. In other words, it modeled the problem in terms of data and signal processing and thus heralded the coming of the Information Age. Shannon’s work on cryptography was even more closely related to his later publications on communication theory. At the close of the war"
Figure 6 shows plots of surprise values against indices of tokens in each of the samples in Ex. 1. The blue plot corresponds to surprise values of each token, while the red plot corresponds to average surprise values over a window of size 10 at each token index. Note that the surprise values drop drastically in Fig. 6(a) as the repetitions increase in the first sample of Ex. 1. Similarly, in Fig. 6(b), we observe a dip in surprise values wherever there is a repetition in the second sample. Clearly, there is a correlation between small average surprise values and repetitions. Further, in Fig. 6(a) note that the generating model seems to get trapped in a small-surprise repetition region. We call this region of small surprise the boredom trap. We observe that these models tend to fall into a boredom trap for small values of p. Figure 6(c) corresponds to the third sample, where we choose p = 1.0, and illustrates that for large values of p, the average surprise value of the generated text tends to increase with the number of generated tokens, which leads to incoherence. We call this region of large surprise the confusion trap. Figure 6(d) shows surprise values corresponding to the fourth sample, which is generated using our proposed sampling algorithm, mirostat. We can observe in Fig. 6(d) that mirostat has the ability to increase the surprise value when falling into a boredom trap, thereby maintaining the average surprise value. By doing so, it not only helps generate high-quality text with a predetermined average surprise value, but also helps avoid small-surprise repetition regions and large-surprise incoherent regions. In Fig. 6(e), we show the surprise values in the human-generated text that followed this context, shown in the fifth sample. We observe that the human-generated text has an average surprise value between the values obtained using top-p sampling with p = 0.4 and p = 1.0. More importantly, human-generated text does not seem to fall into either of the traps described above.
Perplexity is a statistical metric used to evaluate the quality of neural text generation; it is closely related to the average surprise described in Ex. 1 and will be formally defined in Sec. III-C. A large-scale human experiment in [5] showed that human-based quality evaluation is closely related to the likelihood of the generated text for a fixed number of tokens. In particular, it was found that reducing perplexity increases quality up to some point, before the quality of generated text starts dropping with further decrease in perplexity. This implies that good control over the perplexity of the generated text would give us direct control over the quality of generated text (as evaluated by humans). Hence, generating texts with an appropriately chosen target perplexity value may maximize the quality of generated text. Unfortunately, existing decoding methods such as top-k or top-p sampling do not provide good control over the average surprise of generated text, as observed in Ex. 1. Our proposed mirostat algorithm, on the other hand, can generate high-quality text by directly controlling the perplexity.
Now we summarize the key contributions of this work.

In Sec. IV, we show theoretically how cross-entropy, and hence perplexity, grows in top-k and top-p sampling as functions of k and p respectively, which was previously unknown.

In Sec. V-B, we introduce mirostat sampling, which outputs text with a predetermined target perplexity value. Although perplexity may not be the best metric to measure the quality of generated text [13], much literature discusses its correlation with quality [5]. Hence, an algorithm that controls perplexity helps generate high-quality text.

In Sec. VI-A, we show experimentally that there is much fluctuation in cross-entropy rates in top-k and top-p sampling as functions of their input parameters; hence they cannot provide good control over the perplexity of the output text.

In Sec. VI-B, we observe that repetition is closely related to the perplexity of the generated text and is mostly independent of the sampling method, but slightly dependent on the model used.

In Sec. VI-C, we experimentally show that mirostat sampling avoids both the boredom and confusion traps for a wide range of target perplexity values.
II Related Work

Sampling from a distorted probability distribution: Pure sampling from language models often leads to incoherent text, while greedy decoding leads to repetitions. Distorted probability distributions obtained using top-k, top-p, or temperature sampling were found to improve the quality of generated text [3, 9, 10]. Proper tuning of parameters in these methods generates high-quality text, but these methods are ad hoc and do not provide good control over the statistics of the output. Our method uses the statistics of previously generated tokens as input to generate the next token, distorting the probability distribution so as to control the overall statistics of the generated text. The advantage of our method is control over the perplexity of the output, which is not guaranteed by previously proposed decoding methods. Combined with the relation between perplexity and human-based quality evaluation observed in [5], this provides text with better-controlled quality.

Controllable text generation: Controllable text generation has commonly focused on controlling the semantics of the output text, while our approach is purely statistical. We try to solve the statistical problems associated with pure sampling or greedy decoding by guiding the decoder along a desired statistical path. A new model with 1.63 billion parameters, CTRL, was trained to generate text based on a control word [11]. Sampling algorithms like Plug and Play Language Model (PPLM) and Constrained Sentence Generation by Metropolis-Hastings (CGMH), on the other hand, work at the inference stage on top of a pretrained language model to control certain attributes of the generated text. PPLM shows that using attribute classifiers on top of pretrained language models helps control text generation [14]. CGMH uses Metropolis-Hastings sampling to generate text with constraints such as the appearance of multiple keywords [15].

Quality-diversity tradeoff: Distorting probability distributions for decoding using top-k, top-p, or temperature sampling with low temperatures improves the quality of the text; however, it also reduces the diversity of the text. Certain applications like question answering demand high-quality generation, whereas open-ended tasks such as story generation demand diversity in addition to quality. However, there is a tradeoff between quality and diversity, as observed in [5]. Further, it was observed that diversity is closely related to entropy, while quality is maximized in a certain range of observed likelihood values for fixed-length sentences. Our algorithm provides very good control over the observed cross-entropy, which equals the observed negative log-likelihood per token of the generated text. Hence, by maintaining the observed cross-entropy of the text in a certain range, we can ensure high-quality text generation.

Repetitions: Greedy decoding from language models often leads to texts with excessive repetitions, both at the token level and at the sentence level. Several techniques have been proposed in the literature to solve this problem. Token loss dynamic reweighting (TLDR) hypothesizes that some tokens are more difficult to learn than others [16]; it reweights tokens during learning to balance out this effect and thus reduce repetitions. CTRL uses a repetition penalty in the decoding process to reduce the occurrence of repetitive tokens [11]. It was suggested in [17] that repetitions are in fact caused by a flaw in the training objective itself, and a new objective was used that assigns lower probability to unlikely sequences, including texts with excessive repetitions. Variants of beam search have been proposed in [18, 12, 19] to induce diversity in the generated text, which in turn alleviates the problem of repetition. Variants of top-k sampling and the repetition penalty of [11] were also used in [20] to reduce repetitions.
III Surprise, Entropy, and Perplexity
In this section we first briefly discuss language modeling. Later, we formally define the notions of surprise, entropy, and perplexity.
Language modeling is an unsupervised task of learning the probability distribution $P(X)$ from a set of examples of the form $X = (X_1, \dots, X_n)$, where each $X_i \in \mathcal{V}$ and $\mathcal{V}$ is a finite vocabulary. Since $X$ is a sequence, its probability can be written as a product of conditional probabilities using the chain rule as follows [21]:
$P(X) = \prod_{i=1}^{n} P(X_i \mid X^{i-1}),$ (1)
where $X^{i-1} = (X_1, \dots, X_{i-1})$.
Current state-of-the-art methods [2, 22] train a model with parameters $\theta$ by minimizing the loss function
$L(\theta) = -\sum_{i=1}^{n} \log P_\theta(X_i \mid X^{i-1})$ (2)
over the dataset.
A trained model can be used to generate the $i$th token given the previous $i-1$ tokens by sampling from the distribution $P_\theta(X_i \mid X^{i-1})$.
III-A Surprise
For a random variable $X$ distributed as $P$, the information content or surprisal associated with an instance $x$ of $X$ is defined as $-\log P(x)$ [23]. Hence, less probable instances are more surprising than more probable instances. Extending this definition to conditional random variables, we next define the surprise value associated with tokens and sentences with respect to generated text for a fixed model distribution $P_M$.
Definition 1.
We define the surprise value of a token $X_i$ with respect to the generated text $X^{i-1}$ and model distribution $P_M$ for some fixed model $M$ as
$S_M(X_i) = -\log P_M(X_i \mid X^{i-1}).$ (3)
As we will soon see, this quantity is directly related to perplexity. Now we define the average surprise for a sentence with $n$ tokens.
Definition 2.
For a sentence $X = (X_1, \dots, X_n)$ with $n$ tokens, we define the surprise rate with respect to a probability distribution $P_M$ for some model $M$ as
$S_M(X) = \frac{1}{n} \sum_{i=1}^{n} S_M(X_i).$ (4)
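Definitions 1 and 2 are straightforward to compute from the per-token model probabilities. A minimal sketch, assuming base-2 logarithms (consistent with perplexity being defined later as 2 raised to the cross-entropy):

```python
import math

def surprise(prob):
    """Surprise (in bits) of a token with conditional model probability `prob` (Def. 1)."""
    return -math.log2(prob)

def surprise_rate(token_probs):
    """Average surprise of a sentence, given each token's conditional probability (Def. 2)."""
    return sum(surprise(p) for p in token_probs) / len(token_probs)

# A token the model assigned probability 0.25 carries 2 bits of surprise.
assert surprise(0.25) == 2.0
print(surprise_rate([0.5, 0.25, 0.25]))  # (1 + 2 + 2) / 3 = 5/3 bits per token
```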
III-B Entropy
The entropy of a discrete random variable $X$ with probability distribution $P$ is given by
$H(X) = -\sum_{x} P(x) \log P(x),$ (5)
i.e., $H(X)$ is the expected value of the surprisal $-\log P(X)$ with respect to $P$. The entropy rate of a stochastic process $\mathcal{X} = \{X_i\}$ is defined as
$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \dots, X_n)$ (6)
when the limit exists. Now, if the data is sampled from $\mathcal{X}$ and $\mathcal{X}$ is a stationary ergodic source, then by the Shannon-McMillan-Breiman theorem [24, Thm. 16.8.1], we have
$-\frac{1}{n} \log P(X_1, \dots, X_n) \to H(\mathcal{X}) \text{ almost surely},$ (7)
when the limit exists. For experiments, we approximate $H(\mathcal{X})$ by the surprise rate $S_M(X)$ for a sentence $X$ of length $n$ generated by model $M$.
III-C Perplexity
A popular metric used in natural language processing to measure the quality of a generated text is perplexity [25, 26]. Perplexity is closely related to cross-entropy, and hence we define the notion of cross-entropy first.
The cross-entropy of a discrete random variable $X$ distributed as $P$ with respect to a discrete random variable $Y$ distributed as $Q$, such that $\mathrm{supp}(P) \subseteq \mathrm{supp}(Q)$, is given by
$H(P, Q) = -\sum_{x} P(x) \log Q(x).$ (8)
The cross-entropy rate of a stochastic process $\mathcal{X} = \{X_i\}$ distributed as $\mathcal{P}$ with respect to a stochastic process distributed as $\mathcal{Q}$ is defined as
$H(\mathcal{P}, \mathcal{Q}) = \lim_{n \to \infty} -\frac{1}{n} \mathbb{E}_{\mathcal{P}}\left[\log Q(X_1, \dots, X_n)\right]$ (9)
when the limit exists. Further, if $\mathcal{X}$ is sampled from $\mathcal{P}$ and $\mathcal{P}$ is a stationary ergodic source, then by the Shannon-McMillan-Breiman theorem [24, Thm. 16.8.1], we have
$-\frac{1}{n} \log Q(X_1, \dots, X_n) \to H(\mathcal{P}, \mathcal{Q}) \text{ almost surely},$ (10)
when the limit exists. Now, the perplexity corresponding to $H(\mathcal{P}, \mathcal{Q})$ is simply defined as
$\mathrm{PPL}(\mathcal{P}, \mathcal{Q}) = 2^{H(\mathcal{P}, \mathcal{Q})}.$ (11)
For experiments, when the text is generated using the model distribution $P_M$, we approximate $H(\mathcal{P}, \mathcal{Q})$ by the surprise rate $S_M(X)$ for a sentence $X$ of length $n$. Perplexity denotes how close $Q$ is to $P$: the lower the perplexity, the closer the distributions $P$ and $Q$.
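The empirical approximation described above, cross-entropy as average surprise and perplexity as its exponential, can be sketched as follows (base-2 logs, matching the $2^{H}$ convention of (11)):

```python
import math

def cross_entropy_rate(token_probs):
    """Empirical cross-entropy in bits/token: the average surprise of
    the observed tokens under the model."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity = 2 ** (empirical cross-entropy rate), per Eq. (11)."""
    return 2 ** cross_entropy_rate(token_probs)

# If every token receives probability 1/8, cross-entropy is 3 bits/token
# and perplexity is 8: the model is "as confused as" a uniform 8-way choice.
probs = [0.125] * 5
assert perplexity(probs) == 8.0
```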
IV Theoretical Analysis of Sampling Methods
In this section, we analyze the top-k and top-p sampling methods theoretically. We will see that the log of perplexity, i.e., the cross-entropy, of generated text increases nonlinearly with increasing k in top-k sampling; in particular, perplexity grows faster with k for small values of k than for large values of k. On the other hand, in top-p sampling, the log of perplexity increases essentially linearly with increasing p. Hence, it is easier to control perplexity using top-p sampling because of this linear growth in cross-entropy.
IV-A Zipf’s law
Zipf’s law states that the frequency of occurrence of any word in the vocabulary is inversely proportional to its rank in the frequency table [27, 28]. More precisely, for a vocabulary of size $N$, the frequency of the $i$th most probable word is given by
$p(i; s, N) = \frac{1}{i^s H_{N,s}},$ (12)
where $s$ is an exponent characterizing the distribution and $H_{N,s} = \sum_{n=1}^{N} \frac{1}{n^s}$ is the $N$th generalized harmonic number.
Further, it is known that words in human languages have very heavy-tailed probability distributions; hence the exponent $s$ is very close to $1$. Hence, when required, we write $s = 1 + \epsilon$, where $\epsilon$ is a small positive constant. For all the theoretical analysis in this work, we assume that the sampled words follow Zipf’s law. Proofs of Prop. 1 and all theorems in this section are provided in the appendix.
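Equation (12) is easy to instantiate numerically. A small sketch (the vocabulary size and exponent below are illustrative choices, with $s = 1 + \epsilon$ for small $\epsilon$ as in natural language):

```python
import math

def zipf_pmf(N, s):
    """Zipf probabilities p(i) = 1 / (i^s * H_{N,s}) for ranks i = 1..N (Eq. 12)."""
    H = sum(1.0 / n**s for n in range(1, N + 1))  # generalized harmonic number
    return [1.0 / (i**s * H) for i in range(1, N + 1)]

p = zipf_pmf(N=1000, s=1.1)
assert abs(sum(p) - 1.0) < 1e-9   # a proper probability distribution
assert p[0] > p[1] > p[2]         # probability decreases with rank
print(p[0] / p[1])                # ratio between ranks 1 and 2 is 2**s ≈ 2.14
```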
IV-B Surprise and cross-entropy in top-k sampling
Now we analyze top-k sampling. Thm. 1 gives the expression for the surprise $S(k)$ of the word of rank $k$ and its growth with increasing $k$.
Theorem 1.
If words are sampled from the Zipf distribution given by (12), then the surprise value of a word with rank $k$ and its rate of increase are given by
$S(k) = s \log k + \log H_{N,s},$ (13)
$\frac{d\tilde{S}(k)}{dk} = \frac{s}{k \ln 2},$ (14)
respectively, where $\tilde{S}$ is a continuous function with the same expression as $S$ but with a continuous domain.
From Fig. 9, we note that $S(k)$ is highly sensitive to changes in $k$ for small values of $k$, and its sensitivity decreases drastically as $k$ increases. Now we analyze how cross-entropy varies with $k$. Let $P$ be the model distribution. Top-k sampling truncates the tail of the distribution and samples from the $k$ most probable tokens. Let the truncated distribution be denoted by $P_k$. In Prop. 1, we provide an expression for $H(P_k, P)$.
Proposition 1.
Let $P$ be the model distribution satisfying (12) with a vocabulary of size $N$, and let $P_k$ be the distribution obtained from $P$ using top-k sampling. Then $H(P_k, P)$ is given by
$H(P_k, P) = \frac{1}{H_{k,s}} \sum_{i=1}^{k} \frac{s \log i + \log H_{N,s}}{i^s}.$ (15)
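The behavior of $H(P_k, P)$ can be checked numerically without the closed form, by computing the cross-entropy of the renormalized top-k truncation directly from the definitions. A sketch (base-2 logs; the Zipf parameters are illustrative):

```python
import math

def zipf_pmf(N, s):
    """Zipf probabilities for ranks 1..N (Eq. 12)."""
    H = sum(1.0 / n**s for n in range(1, N + 1))
    return [1.0 / (i**s * H) for i in range(1, N + 1)]

def top_k_cross_entropy(N, s, k):
    """H(P_k, P): expected surprise under the full Zipf model P of tokens
    drawn from the renormalized top-k truncation P_k."""
    p = zipf_pmf(N, s)
    mass = sum(p[:k])
    return -sum((p[i] / mass) * math.log2(p[i]) for i in range(k))

# Cross-entropy grows quickly for small k and flattens for large k.
N, s = 50000, 1.1
h = [top_k_cross_entropy(N, s, k) for k in (10, 100, 1000, 10000)]
assert h[0] < h[1] < h[2] < h[3]          # monotone increase with k
assert (h[1] - h[0]) > (h[3] - h[2])      # diminishing growth at large k
```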
It is difficult to get an intuition about the behavior of $H(P_k, P)$ directly from (15). Thus, in Thm. 2 we obtain an approximation to $H(P_k, P)$ which shows that it grows quickly for small values of $k$ and slows down for large values of $k$.
Theorem 2.
Let $P$ be the model distribution satisfying (12) with a vocabulary of size $N$, and let $P_k$ be the distribution obtained using top-k sampling. Then, for $s = 1 + \epsilon$, $H(P_k, P)$ can be approximated as
(16)
where the constants in the approximation are given in the appendix.
In Fig. 10 we plot the approximation to $H(P_k, P)$ obtained in Thm. 2 and observe that it is very close to the actual value, hence provides a good estimate of $H(P_k, P)$. Note that the value of $H(P_k, P)$ does not grow much beyond $k = 2000$, so it makes sense to tune $k$ between 1 and 2000 to get a desired cross-entropy.
IV-C Surprise and cross-entropy in top-p sampling
Now we provide a theoretical analysis of top-p sampling. In Thm. 3, we provide approximate expressions for the surprise $S(p)$ and its derivative, which show that $S(p)$ grows essentially linearly with $p$.
Theorem 3.
Suppose words are sampled from the Zipf distribution given by (12) and $\epsilon$ is a small constant. Then $S(p)$ and its rate of change with respect to $p$ are approximately given by
(17)
(18)
where the remaining constants are given in the appendix.
In Fig. (a), we plot the approximate expression for $S(p)$ obtained in Thm. 3, which is a linear function of $p$. In Fig. (b), we plot the approximate expression for $dS(p)/dp$ from Thm. 3, which is also a linear function of $p$. This tells us that even though $S(p)$ can be approximated as an essentially linear function of $p$, it has a slightly increasing slope. Further, unlike $d\tilde{S}(k)/dk$, which decreases with $k$, $dS(p)/dp$ has a positive slope.
As pointed out earlier, $S(k)$ gives the surprise value as a function of $k$ for a fixed vocabulary size, while the observations made for top-p sampling involve a varying (effective) vocabulary size. To estimate the change in average surprise with the value of $p$ in top-p sampling, Thm. 4 provides an approximate expression for $H(P_p, P)$ which shows that it grows approximately linearly with $p$. This is in contrast to top-k sampling, where the growth of average surprise is highly nonlinear.
Theorem 4.
Let $P$ be the model distribution satisfying (12) with a vocabulary of size $N$, and let $P_p$ be the distribution obtained using top-p sampling, where $k_p$ is the minimum value of $k$ satisfying $\sum_{i=1}^{k} p(i) \ge p$. Then, for $s = 1 + \epsilon$, $H(P_p, P)$ can be approximated as
(19)
Figure 14 shows that the growth of $H(P_p, P)$ with $p$ is approximately linear.
IV-D Effect of temperature on top-k and top-p sampling
Temperature is used to distort the original distribution so as to generate samples that avoid the problems associated with pure sampling. In particular, lowering the temperature makes the sampling more greedy. For a given temperature $T > 0$, the frequency of the $i$th most probable word in (12) becomes
$p_T(i) = \frac{1}{i^{s/T} H_{N, s/T}}.$ (20)
Hence the effect of temperature in our analysis can be captured simply by replacing the exponent $s$ with $s/T$.
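This substitution is easy to verify numerically: applying temperature $T$ (raising probabilities to the power $1/T$ and renormalizing) to a Zipf distribution with exponent $s$ yields exactly the Zipf distribution with exponent $s/T$. A sketch with illustrative parameters:

```python
def zipf_pmf(N, s):
    """Zipf probabilities for ranks 1..N (Eq. 12)."""
    H = sum(1.0 / n**s for n in range(1, N + 1))
    return [1.0 / (i**s * H) for i in range(1, N + 1)]

def apply_temperature(probs, T):
    """Temperature sampling distortion: rescale p -> p^(1/T), renormalize."""
    scaled = [p ** (1.0 / T) for p in probs]
    Z = sum(scaled)
    return [q / Z for q in scaled]

N, s, T = 1000, 1.2, 0.6
direct = zipf_pmf(N, s / T)                        # Zipf with exponent s/T, Eq. (20)
via_temp = apply_temperature(zipf_pmf(N, s), T)    # temperature applied to Zipf(s)
assert all(abs(a - b) < 1e-9 for a, b in zip(direct, via_temp))
```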
V Perplexity-Controlled Text Generation
In this section we propose an algorithm that provides control over the cross-entropy rate of the generated text. The algorithm works in two stages for each generated word: first it estimates the Zipf exponent $s$, assuming the words follow Zipf’s law; then it uses top-k sampling, where $k$ is a function of the estimated $s$ and of the target surprise value of the output text.
V-A Estimating $s$ for the Zipf distribution
We assume that the words follow the Zipf distribution (12). Further, we observe the probabilities produced by our language model as $\hat{p}(1) \ge \hat{p}(2) \ge \dots \ge \hat{p}(N)$, where $\hat{p}(i)$ is the probability associated with the $i$th most probable word. We use minimum mean squared error estimation to find the value of $s$. However, $s$ shows up both as an exponent of $i$ and in $H_{N,s}$, which makes the computation difficult. Hence we estimate $s$ by minimizing the mean squared error between the logarithms of ratios of subsequent probabilities, which eliminates $H_{N,s}$; i.e., we estimate $s$ as
$\hat{s} = \frac{\sum_{i=1}^{m-1} t_i b_i}{\sum_{i=1}^{m-1} t_i^2},$ (21)
where $t_i = \log\frac{i+1}{i}$ and $b_i = \log\frac{\hat{p}(i)}{\hat{p}(i+1)}$. When $N$ is large, we estimate $s$ using only the $m$ most probable tokens, for $m$ around 100, to improve time complexity; this also gives practically good estimates.
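A sketch of this estimator in the least-squares form implied by the text: since $\hat{p}(i)/\hat{p}(i+1) = ((i+1)/i)^s$ under (12), fitting $b_i \approx s\, t_i$ gives the closed-form ratio below (the function name and the exact form of the extracted equation are our reconstruction of the estimator described above):

```python
import math

def estimate_s(sorted_probs, m=100):
    """Least-squares estimate of the Zipf exponent s from the top-m sorted
    probabilities, by fitting log(p(i)/p(i+1)) ~ s * log((i+1)/i)."""
    m = min(m, len(sorted_probs))
    num = den = 0.0
    for i in range(1, m):
        t = math.log((i + 1) / i)                             # predictor t_i
        b = math.log(sorted_probs[i - 1] / sorted_probs[i])   # log-ratio b_i
        num += t * b
        den += t * t
    return num / den

# Sanity check: probabilities that exactly follow Zipf with s = 1.3
# should be recovered (up to floating point) by the estimator.
s_true = 1.3
H = sum(1.0 / n**s_true for n in range(1, 1001))
probs = [1.0 / (i**s_true * H) for i in range(1, 1001)]
assert abs(estimate_s(probs) - s_true) < 1e-9
```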
V-B Algorithm
Here we describe mirostat (code is available at https://github.com/basusourya/mirostat), which generates text with a predetermined average surprise value, as given in Alg. 1. The algorithm takes as input a target surprise value $\tau$ and initializes a variable $\mu$, the maximum allowed surprise at each step. Each word is sampled by first estimating $s$ from (21) as $\hat{s}$, then using top-k sampling, where $k$ is obtained by approximating $S(k)$ as a function of $\hat{s}$ and $\mu$ and inverting (13) to get
$k = \left( \frac{\hat{\epsilon}\, 2^{\mu}}{1 - N^{-\hat{\epsilon}}} \right)^{1/\hat{s}},$ (22)
where $\hat{\epsilon} = \hat{s} - 1$. We initialize $\mu = 2\tau$, corresponding to surprise value $2\tau$ and not $\tau$, since we are sampling from the top $k$ tokens and not computing the surprise value at $k$ itself. Here, $2\tau$ seems to give a good initialization in practice, and the rest is taken care of by the feedback, which updates the value of $\mu$. An error term $e$ is computed as the difference between the observed surprise of the sampled word and the target surprise $\tau$; $\mu$ is then updated using this error.
Note that one could instead tune $k$ in Alg. 1 by iterating through the most probable tokens and setting $k$ at a token with a suitable amount of surprise. This alternative has computational complexity that varies with the target surprise value and can be slow for large surprise values. Hence, in Alg. 1, we first estimate $s$ using a fixed number of tokens and then directly use top-k sampling, which is computationally efficient for all target surprise values and provides good control over perplexity.
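The feedback loop can be sketched as follows. This is not the paper's released implementation: the k-formula follows our reading of (22), while the learning rate and the stationary toy "model" (a fixed Zipf distribution standing in for the per-step model probabilities) are illustrative assumptions.

```python
import math
import random

def estimate_s(sorted_probs, m=100):
    """Least-squares Zipf-exponent estimate from the top-m probabilities."""
    m = min(m, len(sorted_probs))
    num = den = 0.0
    for i in range(1, m):
        t = math.log((i + 1) / i)
        b = math.log(sorted_probs[i - 1] / sorted_probs[i])
        num += t * b
        den += t * t
    return num / den

def mirostat_step(sorted_probs, mu, rng):
    """One decoding step: choose k from the estimated exponent and the current
    surprise budget mu, sample from the top k, and return the observed
    surprise (bits) of the sampled token."""
    N = len(sorted_probs)
    s_hat = estimate_s(sorted_probs)
    eps = s_hat - 1
    k = int(((eps * 2 ** mu) / (1 - N ** (-eps))) ** (1 / s_hat))
    k = max(1, min(k, N))
    top = sorted_probs[:k]
    idx = rng.choices(range(k), weights=top)[0]
    return -math.log2(sorted_probs[idx])

def mirostat_generate(sorted_probs, tau, n_tokens=200, lr=1.0, seed=0):
    """Generate n_tokens surprise values, steering their average toward tau."""
    rng = random.Random(seed)
    mu = 2 * tau  # initialize the budget at twice the target surprise
    surprises = []
    for _ in range(n_tokens):
        sur = mirostat_step(sorted_probs, mu, rng)
        mu -= lr * (sur - tau)  # feedback: shrink budget if too surprising
        surprises.append(sur)
    return surprises

# Toy stationary "model": a fixed Zipf distribution over 5000 tokens.
s = 1.1
H = sum(1.0 / n**s for n in range(1, 5001))
probs = [1.0 / (i**s * H) for i in range(1, 5001)]
out = mirostat_generate(probs, tau=3.0)
avg = sum(out) / len(out)
assert abs(avg - 3.0) < 0.5  # average surprise is steered near the target
```

The control-theoretic appeal is visible in the update: summing the feedback equation over all steps shows the average surprise deviates from $\tau$ by only $(\mu_0 - \mu_{\text{final}})/(\eta\, n)$, which shrinks as the text grows.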
VI Experimental Analysis
Here we evaluate the performance of top-k, top-p, and mirostat sampling. We use the GPT-2 language model with 117M parameters for all our experiments [2] unless mentioned otherwise. In Sec. VI-A we show how the observed cross-entropy varies with the parameters of each sampling method; this illustrates that mirostat provides direct control over the observed cross-entropy of the output text. In Sec. VI-D we demonstrate output texts for different target cross-entropy values and their observed cross-entropy.
VI-A Cross-entropy rate for different sampling methods
In Fig. 18, we plot the observed cross-entropy of generated texts against the input parameters of the different sampling methods. For each plot, we generate four output texts of 200 tokens for each value of the input parameter, with the same context in each case.
In Fig. 18(a), we plot the observed surprise values of generated texts versus $k$ in top-k sampling. Note that cross-entropy increases steeply for small values of $k$ and relatively slowly for large values of $k$. Thus, for small values of $k$, cross-entropy is very sensitive to changes in $k$, but for large values of $k$, cross-entropy hardly changes. Hence, even though cross-entropy clearly increases with $k$, it is difficult to control cross-entropy using top-k sampling.
In Fig. 18(b), we plot the observed surprise values of generated text versus $p$ in top-p sampling. We observe that cross-entropy grows essentially linearly with $p$, unlike in top-k sampling.
In Fig. 18(c), we plot the observed cross-entropy of generated texts versus the target cross-entropy in mirostat sampling (Alg. 1). We observe that mirostat gives very good control over the observed surprise value, with low variance, for target surprise values less than five. For higher targets, Alg. 1 does not provide good control over the observed surprise value, since the algorithm controls surprise by truncating low-probability words, and without any truncation the observed surprise is around five. To control larger observed surprise values, we would have to truncate some of the more probable words as well, which would reduce the quality of the generated text; hence we do not consider this here.
The different growth rates of surprise values in top-k and top-p sampling in Fig. 18 are not very intuitive on their own. Our theoretical analysis in Sec. IV explains the nonlinear growth of the cross-entropy rate in top-k sampling and the essentially linear growth in top-p sampling. Note that the analysis in Sec. IV deals with cross-entropy while our experiments deal with the cross-entropy rate; for practical purposes, however, cross-entropy gives good intuition about the cross-entropy rate of the different sampling methods. Further, the fluctuation of the cross-entropy rate in Fig. 18(a) and Fig. 18(b) is likely because 200 tokens is not sufficient for convergence in (9). There is little fluctuation in the cross-entropy rate in Fig. 18(c) because we use feedback to control it more accurately, which gives accurate results even for a small number of tokens.
VI-B Perplexity and repetitions
Here, we present experimental observations on the percentage of repeated tokens across different sampling methods and language models. In Fig. 3, we generate texts with 200 tokens using different sampling methods and models, varying the relevant input parameters: k, p, or the target surprise value. We also consider the percentage of n-gram repetitions for different values of n for a fixed sampling method. We define percentage n-gram repetition as
percentage n-gram repetition = 100 × (number of n-grams − number of distinct n-grams) / (number of n-grams),   (23)
where an n-gram simply means a concatenation of n contiguous tokens. Hence, for n = 1, n-gram repetitions capture word-level repetitions, whereas larger values of n capture sentence-level repetitions. For n = 1, we refer to percentage n-gram repetition simply as percentage repetition.
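The metric can be computed directly from a token sequence; a small sketch (the helper name and example tokens are hypothetical, following the definition of distinct versus total n-grams above):

```python
def pct_ngram_repetition(tokens, n):
    """Percentage of n-grams in the sequence that repeat an earlier n-gram:
    100 * (total n-grams - distinct n-grams) / total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    repeated = len(ngrams) - len(set(ngrams))  # total minus distinct
    return 100.0 * repeated / len(ngrams)

tokens = ["the", "cat", "sat", "on", "the", "cat"]
word_level = pct_ngram_repetition(tokens, 1)      # 1-gram (word-level) repetition
bigram_level = pct_ngram_repetition(tokens, 2)    # 2-gram repetition
```

For this toy sequence, 2 of the 6 unigrams and 1 of the 5 bigrams are repeats, so the two values are about 33.3% and 20%, respectively.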
In Fig. 3(a), we fix the language model to GPT-2 with 117M parameters and observe that percentage repetition decreases with increasing cross-entropy and, more importantly, that for a fixed model this relation is independent of the sampling method.
In Fig. 3(b), we fix the language model to GPT-2 with 117M parameters and use top-k sampling with varying temperature values. We observe that repetitions for different values of temperature and k follow the same curve as in Fig. 3(a). This implies that cross-entropy controls the percentage of repetitions in generated texts: once the model and the cross-entropy are fixed, percentage repetition is not affected by the choice among the considered sampling methods.
In Fig. 3(c), we fix the language model to GPT-2 with 117M parameters and measure n-gram repetitions for varying cross-entropy rate and different values of n. We note that for small values of n, which capture word-level repetitions, the percentage of n-gram repetitions drops almost linearly with increasing cross-entropy, whereas for larger values of n, which capture sentence-level repetitions, the percentage of n-gram repetitions is very close to zero for cross-entropy greater than roughly 2.5. This indicates that sentence-level repetitions disappear beyond a cross-entropy threshold, whereas word-level repetitions continue to appear at larger values of cross-entropy. Note that human-generated text contains common pronouns and conjunctions that are essential and often repeated, so we do not expect a good sampling algorithm to have absolutely zero 1-gram repetitions. We do, however, expect a good sampling algorithm to produce minimal sentence-level repetitions, which all the sampling methods seem to achieve beyond a cross-entropy threshold of around 2.5 for the GPT-2 language model with 117M parameters.
In Fig. 3(d), we plot percentage repetition versus cross-entropy for different language models using top-p sampling with varying values of p. We observe that larger language models, such as GPT-2 XL with 1558M parameters (for more information on the various GPT-2 models, see https://huggingface.co/transformers/pretrained_models.html), have slightly fewer repetitions at a fixed value of cross-entropy than smaller models, such as GPT-2 with 117M parameters.
From these observations, we conclude that to control percentage repetition in generated text, we need to control the cross-entropy of the output text. Hence, mirostat provides good control over repetitions in generated texts.
VI-C Boredom and confusion traps
Here we show that, due to the lack of control over output statistics in top-k and top-p sampling, these methods get trapped into generating low-quality text when the number of generated tokens is high, for a wide range of k and p. For each of the plots in Fig. 4, we generated 10 samples of text on the same context, each 900 tokens long, and averaged their observed cross-entropy at various points of generation; the exception is the human-generated text, for which only one sample was used (the tokens following the context in the corpus).
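The quantity tracked in these plots, the observed cross-entropy as a function of generated length, is a running per-token average of surprise. A minimal sketch (the helper name and probabilities are hypothetical; in the experiments the probabilities come from the language model):

```python
import numpy as np

def running_cross_entropy(token_probs):
    """Observed cross-entropy rate (bits/token) after each generated token:
    the running mean of the per-token surprise values -log2 p."""
    s = -np.log2(np.asarray(token_probs, dtype=float))   # per-token surprise
    return np.cumsum(s) / np.arange(1, len(s) + 1)

# hypothetical model probabilities assigned to a short generated sequence
curve = running_cross_entropy([0.5, 0.25, 0.5, 0.125])
```

A curve that drifts downward with token index corresponds to the boredom trap, and one that drifts upward to the confusion trap.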
In Fig. 4(a) we illustrate the boredom trap. We observe that for small values of k and p, both top-k and top-p sampling fall into these low cross-entropy regions—boredom traps—which results in increasing repetition as the length of the generated text grows, as illustrated in Sec. VI-B. Hence, the lack of control over output statistics in these methods degrades the quality of longer generated texts.
In Fig. 4(b) we illustrate the confusion trap. Here we observe that for high values of k and p in top-k and top-p sampling, respectively, the observed cross-entropy of the generated texts increases with text length. This leads to increasing incoherence as the token index grows.
In Fig. 4(c) we choose certain values of k and p in an ad hoc manner and generate texts using top-k and top-p sampling, respectively. For these values, the generated texts tend to have cross-entropy that appears to converge to some limiting value as text length increases, falling into neither the boredom nor the confusion trap. We also show how the observed cross-entropy varies with text length for the human-generated text corresponding to the tokens following the context used in these experiments; it too converges to a limiting value of cross-entropy when the text is long enough and falls into neither trap.
Finally, in Fig. 4(d) we show that the cross-entropy of texts generated using Alg. 1 converges to the target cross-entropy value within a few tokens and maintains the desired value for long generated texts.
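The feedback mechanism behind this behavior can be sketched as follows. This is a simplified illustration, not Alg. 1 itself: here truncation is a plain threshold on token surprise rather than the paper's Zipf-based adaptive choice of k, and the toy Zipfian vocabulary is hypothetical.

```python
import numpy as np

def mirostat_step(probs, mu, tau, eta, rng):
    """One decoding step of a mirostat-style feedback loop: truncate tokens
    whose surprise exceeds mu, sample from the rest, then nudge mu so the
    observed surprise tracks the target tau."""
    surprise = -np.log2(probs)                # per-token surprise in bits
    keep = surprise <= mu
    if not keep.any():                        # always keep the most probable token
        keep = surprise == surprise.min()
    idx = np.flatnonzero(keep)
    p = probs[idx] / probs[idx].sum()         # renormalize over kept tokens
    tok = int(idx[rng.choice(len(idx), p=p)])
    mu = mu - eta * (surprise[tok] - tau)     # feedback: observed minus target error
    return tok, mu

# toy Zipfian next-token distribution over 100 token ids
ranks = np.arange(1, 101)
probs = (1.0 / ranks) / np.sum(1.0 / ranks)
```

In use, mu is initialized (e.g., to 2*tau) and updated every step; the negative feedback drives the average observed surprise toward tau regardless of text length, which is the behavior seen in the plot.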
VI-D Text generation and compression
Here we examine texts generated for various target surprise values using mirostat sampling with GPT-2 (117M parameters). We also observe the well-known relation between cross-entropy and data compression [24, Ch. 5], [29]. In particular, when the actual probability distribution of the generated text is not known, the minimum achievable lossless compression rate equals the cross-entropy of the assumed distribution (here, the language model) with respect to the actual unknown distribution (here, obtained from adaptive top-k sampling).
We generate 200 tokens for different target surprise values using the GPT-2 model with 117M parameters to illustrate the quality of text generated by Alg. 1 at each target. We also report the compression rates obtained using arithmetic coding [30, 31] with the language model as the probability distribution. In this way, mirostat can generate text with a predetermined minimum lossless compression rate for a given model.
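The relation being used can be checked numerically: the average surprise of a sequence under the model, in bits per token, is the rate an ideal arithmetic coder built on that model approaches (to within a small per-sequence overhead). A sketch; the helper and the per-token probabilities are hypothetical, and the arithmetic coder itself is omitted:

```python
import numpy as np

def cross_entropy_rate(token_probs):
    """Average surprise, in bits per token, of a sequence whose tokens were
    assigned the given model probabilities; an arithmetic coder driven by the
    same model achieves a rate approaching this value."""
    return float(np.mean(-np.log2(np.asarray(token_probs, dtype=float))))

# model probabilities assigned to each generated token (hypothetical values)
seq_probs = [0.5, 0.25, 0.25, 0.125]
rate = cross_entropy_rate(seq_probs)  # 2.0 bits/token for this toy sequence
```

Fixing the target surprise in mirostat therefore fixes this rate, which is why the compressed sizes reported below track the target surprise values so closely.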
Example 2.
Context:“For two months early in 1943, Shannon came into contact with the leading British mathematician Alan Turing. Turing had been posted to Washington to share with the U.S. Navy’s cryptanalytic service the methods used by the British Government Code and Cypher School at Bletchley Park to break the ciphers used by the Kriegsmarine Uboats in the north Atlantic Ocean. He was also interested in the encipherment of speech and to this end spent time at Bell Labs."
Target average surprise value = 1.0; Observed cross-entropy rate = 0.759; Average compressed size (per token) = 0.863; Percentage compression = 97.66%
Generated text: “In April 1943, Shannon was assigned to the U.S. Navy’s cryptanalytic service, the Naval Intelligence Service. He was assigned to the cryptanalytic service’s cryptanalysis unit, the Naval Intelligence Laboratory. He was also assigned to the cryptanalytic unit’s cryptanalysis unit, the Naval Intelligence Laboratory.
In May 1943, Shannon was assigned to the cryptanalytic unit’s cryptanalysis unit, the Naval Intelligence Laboratory. He was also assigned to the cryptanalytic unit’s cryptanalysis unit, the Naval Intelligence Laboratory.
In June 1943, Shannon was assigned to the cryptanalytic unit’s cryptanalysis unit. He was assigned to the cryptanalytic unit’s cryptanalysis unit, the Naval Intelligence Laboratory.
In July 1943, Shannon was assigned to the cryptanalytic unit’s cryptanalytic unit. He was assigned to the cryptanalytic unit’s cryptanalytic unit, the Naval Intelligence Laboratory.
In August 1943, Shannon was assigned"
Target average surprise value = 2.0; Observed cross-entropy rate = 1.987; Average compressed size (per token) = 2.07; Percentage compression = 95.13%
Generated text: “He was a member of the British Intelligence Service, and was also a member of the British Intelligence Service’s Intelligence Advisory Board.
In early 1944, Shannon was assigned to the British Intelligence Service’s Intelligence Advisory Board. He was also assigned the task of deciphering the ciphertext of the British Government Code. He was also assigned to the British Intelligence Service’s Cryptanalytic Service. He was assigned to the British Intelligence Service’s Intelligence Advisory Board at the end of 1944. He was assigned to the British Intelligence Service’s Secret Intelligence Service at the end of 1944.
He was confirmed to work as the Director of the British Intelligence Service’s Information Service at the end of 1944. He was the first British employee to be appointed as an Intelligence Advisory Officer at the end of the war. He was also confirmed as an Intelligence Advisory Officer to the United Kingdom’s Intelligence Advisory Committee at the end of the war. (The United Kingdom is a member of the European Union.)
He was also appointed"
Target average surprise value = 3.0; Observed cross-entropy rate = 3.002; Average compressed size (per token) = 3.141; Percentage compression = 91.84%
Generated text: “After a short stay at Bell Labs, Shannon was commissioned to report back to the Naval Intelligence Services (NISC) to learn more about his research. He made his way over to the Royal Naval Institute. He had already been working with a number of other cryptanalytic researchers, including the mathematician Alan Turing, for over eight years at Bell Labs, but this research was not yet done. He spent the next two months studying the cipher schemes of the various British cryptanalytic systems. The first step was to find out the actual cipher scheme for the cipher used by the HMS Queen Elizabeth on a submarine.
By the time Shannon arrived at the Royal Naval Institute he had been working for Patagonia Scientific and Technology, the British cryptanalytic company, for eight years. Patagonia received a call at 9 A.D. on the evening of September 4, 1943, asking to check the cipher system for a cipher used by the HMS Queen Elizabeth, which is written as:"
Target average surprise value = 4.0; Observed cross-entropy rate = 3.943; Average compressed size (per token) = 4.06; Percentage compression = 89.98%
Generated text: “He also began writing code to help with deciphering encrypted communications. During 1943, Shannon kept a diary to help decipher his instructions to the cryptanalytic service. Hearing of the story of Alan Turing sent Shannon excitedly into a panic of his own. He that he had only discovered Turing had made it clear to him that the great deception he had witnessed was due to the German government’s efforts to conceal it. I believe he intended to reveal it.
In the early days, the NSA was working with the National Security Agency (NSA) at different times against the Soviet Union in the early 90s and as recently as three years ago against the USSR in the early 1990s. It is believed that the four Russian spies were at the CIA while also an intermediary in the process of doing a disinformation campaign to manipulate the American public to support Bush Sr. by calling for American military intervention in Afghanistan against the Bush, Bush, and Cheney (and the Bush family) and by trying to persuade the public to"
Target average surprise value = 5.0; Observed cross-entropy rate = 4.929; Average compressed size (per token) = 5.116; Percentage compression = 85.41%
Generated text: “Naval Contractors hired Shannon as a shortage specialist at the Navy. So she worked alone with Turing to decipher the British cipher for the northern German submarines. Shannon undertook the work initially on the S.G.F. dual cipher. Shannon spent the whole working days at the Bell Labs lab at Bletchley Park.
After weeks of interrogations Shannon was able to break the great ElFran jujito with a German accent. Shannon then calculated that this hydrocephalic consciousness alone would allow her to think the same words but without the huge amount of writing required to produce such a thing, and with millions of hours on board she was able to write some of the same phrases over a twentyeight hours workweek.
Output: Individualist Hypothesis.
Miranda (alias Charlotte Reagmire, aka "The Lady in the Lake") (19452049) (wife of B.G. Lloyd) (Philadelphia, PA)"
Target average surprise value = 6.0; Observed cross-entropy rate = 5.735; Average compressed size (per token) = 5.834; Percentage compression = 85.55%
Generated text: “The CIA trained a small band of cryptanalysts to do the maths again, this time using a UK number generator. A few days after the wars Belgium introduced Bermuda rugby as the appropriate sport for the National Guard. Notwithstanding there being no convention around using English for Rugby at the time, there would be no possible instance in the history of fencing in Europe. Flags for the Hurricanes had evolved recently using a Dutch Italian design called the Crazy Flag. These flag designs come largely of British origin and the date published of its introduction by the Royal Armouries of Lameucers is from 1638. The camouflage was recently added to the new Irish power orange flag. The design is based on the weapons bao mouèret Standard and has two coloured pouches connected to the rifle barrel by two checks along the top of the barrel with protection straps around the barrel to protect the cutouts. NATO hired a team of physicists to do the reconstruction. Readers who want to know more about this new"
In Ex. 2 we see that low surprise values result in repetitions and high surprise values result in incoherent generated text. Moderate surprise values result in good-quality, coherent text with no repetition. Also note that the control does not work well when the target surprise value is greater than 5. This is because, without any truncation, the average surprise of purely sampled text is around 5.4; thus, to attain higher values of average surprise, we would need to truncate from both sides of the distribution.
VII Conclusion
We provide a theoretical understanding of how perplexity varies as a function of the input parameters in the popular top-k and top-p neural text decoding algorithms. In particular, we show that the log of perplexity varies nearly linearly as a function of p in top-p sampling, whereas it is highly nonlinear as a function of k in top-k sampling. Building on this analysis, we present mirostat, a neural text decoding algorithm that provides good control over the perplexity of the generated text. Mirostat offers several advantages over other sampling algorithms. While top-k and top-p sampling do not provide good control over the statistics of the output, mirostat can maintain the perplexity of generated text over a wide range of text lengths. For longer texts and certain ranges of input parameters, top-k and top-p sampling fall into boredom and confusion traps, which lead to low-quality texts; mirostat avoids both traps. Further, recent large-scale human evaluation of neurally generated text suggests that human judgment of text quality is maximized for a certain range of output perplexity. Since mirostat provides direct control over the perplexity of the output, it helps generate high-quality text. As a takeaway, we find that our proposed sampling algorithm, mirostat, with a target surprise value around 3.0 produces high-quality texts of varying lengths with minimal repetitions.
Additionally, we analyze the relation between perplexity and repetitions in text. We find that, for a fixed model, repetitions vary linearly with perplexity and are independent of the sampling method used. We also find that larger models have fewer repetitions at any fixed perplexity. Future work includes theoretically analyzing repetitions, the boredom and confusion traps, and the convergence properties of mirostat.
Proof of Prop. 1.
Proof of Thm. 2.
From Prop. 1 we have
First note that x^(-s) is a decreasing function of x for s > 0. Thus, for s > 0, we have the following inequalities
(24) 
Solving the above integrals, we get
(25) 
where
Now, we bound as follows. Note that is a decreasing function in for and , hence, we have
(26)  
(27) 
We empirically observed that it can be approximated well as
(28) 
which lies between the bounds found in (27). Moreover, we approximate using the upper bound obtained in (24) to get
(29)  
(30)  
(31) 
where (31) follows by writing (30) as an infinite series, simplifying the expression, and writing the infinite series back as a fraction. ∎
Proof of Thm. 3.
The cumulative probability for Zipf’s distribution is given by . Using the approximation in (28), we have
(32) 
where .
Now, writing as a function of , we get
(33) 
Using (33) in the equation from Thm. 1, we get
(34) 
Further, taking small enough, we can approximate
(35) 
Now, can be directly computed from (34) as
(36) 
For small enough, we can use the approximation
(37) 
∎
References
 [1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. HerbertVoss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are fewshot learners,” arXiv:2005.14165v3 [cs.CL]., Jun. 2020.
 [2] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” Unpublished manuscript, Feb. 2019. [Online]. Available: https://d4mucfpksywv.cloudfront.net/betterlanguagemodels/language_models_are_unsupervised_multitask_learners.pdf
 [3] A. Holtzman, J. Buys, M. Forbes, A. Bosselut, D. Golub, and Y. Choi, “Learning to write with cooperative discriminators,” in Proc. Assoc. Comput. Linguist. Annu. Meet. (ACL 2018), Jul. 2018, pp. 1638–1649. [Online]. Available: https://www.aclweb.org/anthology/P181152
 [4] D. Ippolito, D. Duckworth, C. CallisonBurch, and D. Eck, “Automatic detection of generated text is easiest when humans are fooled,” in Proc. Assoc. Comput. Linguist. Annu. Meet. (ACL 2020), Jul. 2020, pp. 1808–1822.
 [5] H. Zhang, D. Duckworth, D. Ippolito, and A. Neelakantan, “Trading off diversity and quality in natural language generation,” arXiv:2004.10450v1 [cs.CL]., Apr. 2020.
 [6] D. Ippolito, R. Kriz, M. Kustikova, J. Sedoc, and C. CallisonBurch, “Comparison of diverse decoding methods from conditional language models,” in Proc. Assoc. Comput. Linguist. Annu. Meet. (ACL 2019), Jul. 2019, pp. 3752–3762.
 [7] L. Itti and P. F. Baldi, “Bayesian surprise attracts human attention,” in Proc. 20th Annu. Conf. Neural Inf. Process. Syst. (NIPS), May 2006, pp. 547–554.
 [8] L. R. Varshney, “Mathematical limit theorems for computational creativity,” IBM J. Res. Develop., vol. 63, no. 1, pp. 2:1–2:12, Jan. 2019.
 [9] A. Fan, M. Lewis, and Y. Dauphin, “Hierarchical neural story generation,” in Proc. Assoc. Comput. Linguist. Annu. Meet. (ACL 2018), Jul. 2018, pp. 889–898. [Online]. Available: https://www.aclweb.org/anthology/P181082
 [10] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in Proc. 9th Int. Conf. Learn. Represent. (ICLR), May 2020.
 [11] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher, “CTRL: A conditional transformer language model for controllable generation,” arXiv:1909.05858v2 [cs.CL]., Sep. 2019.
 [12] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra, “Diverse beam search for improved description of complex scenes,” in Proc. 32nd AAAI Conf. Artif. Intell., Apr. 2018.
 [13] T. Hashimoto, H. Zhang, and P. Liang, “Unifying human and statistical evaluation for natural language generation,” in Proc. NAACLHLT 2019, Jun. 2019, pp. 1689–1701.
 [14] S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu, “Plug and play language models: a simple approach to controlled text generation,” in Proc. 9th Int. Conf. Learn. Represent. (ICLR), Apr. 2020.
 [15] N. Miao, H. Zhou, L. Mou, R. Yan, and L. Li, “CGMH: Constrained sentence generation by metropolishastings sampling,” in Proc. 33rd AAAI Conf. Artif. Intell., Jul. 2019.
 [16] S. Jiang, T. Wolf, C. Monz, and M. de Rijke, “TLDR: Token loss dynamic reweighting for reducing repetitive utterance generation,” arXiv:2003.11963 [cs.CL]., Apr. 2020.
 [17] S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston, “Neural text generation with unlikelihood training,” in Proc. 9th Int. Conf. Learn. Represent. (ICLR), May 2020.
 [18] J. Li, W. Monroe, and D. Jurafsky, “A simple, fast diverse decoding algorithm for neural generation,” arXiv:1611.08562 [cs.CL]., Nov. 2016.
 [19] I. Kulikov, A. Miller, K. Cho, and J. Weston, “Importance of search and evaluation strategies in neural dialogue modeling,” in Proc. 12th Int. Conf. Natural Language Generation (ICNLG 2019), Oct. 2019, pp. 76–87.
 [20] M. E. Foster and M. White, “Avoiding repetition in generated text,” in Proc. Conf. Eleventh European Workshop Natural Language Generation. (ENLG’07), Jun. 2007, pp. 33–40. [Online]. Available: https://www.aclweb.org/anthology/W072305
 [21] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” J. Mach. Learn. Res., no. 3, pp. 1137–1155, Feb. 2003.
 [22] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov, “TransformerXL: Attentive language models beyond a fixedlength context,” in Proc. Assoc. Comput. Linguist. Annu. Meet. (ACL 2019), Jul. 2019, pp. 2978–2988. [Online]. Available: https://www.aclweb.org/anthology/P191285
 [23] T. S. Han and K. Kobayashi, Mathematics of Information and Coding, ser. Translations of Mathematical Monographs, vol. 203. Providence, RI: American Mathematical Society, 2002.
 [24] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: John Wiley & Sons, 2006.
 [25] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, J. C. Lai, and R. L. Mercer, “An estimate of an upper bound for the entropy of English,” Comput. Linguist., vol. 18, no. 1, pp. 31–40, 1992.
 [26] L. R. Varshney, N. S. Keskar, and R. Socher, “Limits of detecting text generated by largescale language models,” in Proc. 2020 Inf. Theory Appl. Workshop, Feb. 2020.
 [27] G. K. Zipf, Human Behavior and the Principle of Least Effort. Cambridge, MA: AddisonWesley, 1949.
 [28] D. M. W. Powers, “Applications and explanations of Zipf’s law,” in New Meth. Language Process. and Comp. Natural Language Learning, 1998, pp. 151–160.
 [29] E. N. Gilbert, “Codes based on inaccurate source probabilities,” IEEE Trans. Inf. Theory, vol. IT17, no. 3, pp. 304–314, May 1971.
 [30] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Commun. ACM, vol. 30, no. 6, pp. 520–540, Jun. 1987.
 [31] J. J. Rissanen and G. G. Langdon, “Arithmetic coding,” IBM J. Res. Develop., vol. 23, no. 2, pp. 149–162, Mar. 1979.