A POPULATION DURING AN INFECTIOUS DISEASE OUTBREAK ∗

Tajima’s D measures the difference between two estimates of genetic diversity in a given set of nucleic acid sequences. Here we show how Tajima’s D can be calculated/estimated by developing an inductive algorithm for calculating the site-specific nucleotide frequencies from a standard multistrain susceptible-infective-removed (SIR) model (both deterministic and stochastic). We show that these frequencies are fully determined by the mutation rate and the initial condition of the frequencies. We prove that the sign of Tajima’s D is independent of the disease population dynamics and that the negative sign does not imply an expansion of the infected population in the deterministic model. Using individual-based simulations, we also show that dependence of Tajima’s D on the disease transmission and evolution dynamics is a result of the stochasticity of those dynamics. The same is true for the dependence related to genetic diversity of a pathogen.

1. Introduction.The rapid development of nucleic acid sequencing methodologies and technologies has contributed to the accumulation of enormous amounts of sequence data.This large quantity of data makes it easier to predict the impact of each mutation on the fitness of a pathogen via estimation of the selection pressure on the mutation [14].For infectious disease epidemiology, the estimation of a mutation contributing to the fitness of a pathogen can be used to provide insight into mutations, e.g., discovering mutations determining pathogenicity, virulence, and host specificity.Rapid sequencing technologies can also capture detailed information about the time evolution of pathogen sequences.Such time series data are useful not only for the detection of pathogen evolution but also for capturing the pathogen population dynamics.Coalescent theory in neutral evolution, i.e., no natural selection and constant population size, can be used to estimate population size from the genetic diversity of sampled sequences [13].The extension of basic coalescent theory to discrete changes in population over time allows us to estimate the time series of such changes [16,17], which can then be applied to estimate the reproduction number of a pathogen [19].
Several methods have been proposed to measure the natural selection of a mutation at a specific site.Tajima's test is one of the most popular statistical tests of evolution neutrality at the sequence level.Tajima's D measures the difference between two estimates of genetic diversity [21].These estimates are equal when evolution is neutral (i.e., there is no natural selection and a constant population).When this happens, Tajima's D = 0. Therefore, a nonzero Tajima's D implies that at least one condition for neutral evolution is violated.It is known that when Tajima's D is nonzero, the sign of Tajima's D gives us the interpretation of natural selection: balancing selection can result in positive Tajima's D, and positive selection can result in negative Tajima's D [1].Not only natural selection but also population dynamics determines the sign of Tajima's D, and it is believed that negative Tajima's D results from an increase in population size and positive Tajima's D results from a decrease in population size [8,18,20,11].When it comes to the nonendemic situation of infectious disease transmission, if we assume that the number of pathogens is proportional to the number of infected hosts, i.e., the population dynamics of the pathogen is proportional to the disease dynamics, the population of the pathogen is always changing during an outbreak.Therefore, Tajima's D can change over time.
To detect the natural selection of an infectious disease in a nonendemic situation, we need to understand the impact of disease dynamics on Tajima's D. This is the main objective of our study.
To focus on the effect of population dynamics alone, we assume that no mutation contributes to a change in the pathogen phenotypes.In section 2.1, we will introduce the concept of Tajima's D and discuss key components (strain-specific nucleotide frequencies) of Tajima's D. Then in section 2.2, we link these to the population infection dynamics of a pathogen through a simple multistrain SIR (susceptible-infectedremoved) model to describe the population dynamics of the pathogen.We show in section 3.1 how Tajima's D can be calculated by developing an inductive algorithm for calculating the site-specific nucleotide frequencies from the multistrain SIR model.Then in section 3.2, we formulate the stochastic analogue of the multistrain SIR model and perform Monte Carlo simulations to compare the effects of disease dynamics described by both deterministic and stochastic SIR models on Tajima's D values.We observed the dependence of Tajima's D on the stochasticity of transmission dynamics during an outbreak.In the final section, we discuss the implication and the application of our results in evolutionary and epidemiological analyses.
2. Site-specific frequency of nucleotide and models.

Tajima's D.
A nucleic acid sequence is a sequence of nucleotides consisting of four values.The nucleotides are A, T, G, or C for DNA sequences, and A, U, G, or C for RNA sequences.For the sake of simplicity, we refer to A, T (U), G, and C as 1, 2, 3, and 4. Here, we consider a set of sampled sequences with length L. The possible number of sequences is 4 L .We denote by x a sequence of nucleotides, with x l representing the nucleotide at the lth site, that is, Figure 1 gives the sequence data of 3 sequences of length 6.To define Tajima's D, we define π i,j in a pair of sequences (ith sequence and jth sequence) as the number of unmatched sites.We call a site a segregating site when there are at least two different nucleotides among the sampled sequences.In Figure 1(a), we see π 1,2 = 2 and π 2,3 = 0, whereas in Figure 1(b), there are two segregating sites.For a given n sampled sequences, Tajima's D describes the variation in sequences among these samples.This is a function of the mean pairwise distance π and the number of In [21], the denominator of Tajima's D is given by the function of σ, where To calculate Tajima's D, we require only (i) the site-specific frequency of each nucleotide among sampled sequences f , and (ii) the number of sampled sequences n.In the next subsection, we start with the calculation of Tajima's D for a specific site and then expand the calculation to include multiple sites.
2.1.1.Tajima's D for a specific site.In this subsection, we focus on a specific site, the lth site, in the sequences to calculate Tajima's D. Let f k l be the relative frequency of nucleotide k in the lth site among sampled sequences, where The numerator in the right-hand side of (2) is the sum of pairwise distances with respect to the lth site among all pairs of existing sequences and can be written as f k l .The sum of pairwise distances is the product of (i) the number of pairs classified by the nucleotide, e.g., 1 and 1, 1 and 2,. .., 4, and 4, and (ii) the pairwise distance for the pair classified by the nucleotide.The number of pairs classified by the nucleotide except self-pairing is given by The pairwise distance for the pair classified by the nucleotide is given by Therefore, the total pairwise distance per each pair classified by the nucleotide is written as Note that this matrix is a symmetry matrix, and the sum of all elements of this matrix counts each nucleotide pair twice.Then (3) With respect to the number of segregating sites for the l 1 th site, σ l , we have from its definition the following: Assuming that the sequence sampling process is random and the sampling process of the nucleotide in the lth site follows a multinomial process, the maximum likelihood estimate of the frequency of nucleotide k in the lth site among the population is equal to f k l .Therefore, the sampling probability of σ l among n sampled sequences is given by The expected value of σ l is Meanwhile, Tajima's D at the lth site D l is given by ( 5) .
Assuming σ l = E(σ l ), from (3), (4), and (5) we notice that D l is determined only by l , and n. 2.1.2.Tajima's D for L sites.Assuming no linkage between sites, the mean pairwise distance among L sites of n sampled sequences is simply given by the sum of π l , ( 6) σ is given by If the site-specific frequency of nucleotides is assumed to be independent, i.e., the sampling process of site-specific nucleotides follows a multinomial process (E(f k l ) = f k l ), then the expected number of segregating sites σ among n sampled sequences is Substituting ( 6) and ( 7) or ( 8) into (1) yields immediately that Tajima's D is determined only by f k l and n. 2.2.Disease dynamics model.To explore the relationship between Tajima's D and disease dynamics, we construct a multistrain SIR model [3,7,15].Here, the classification of strain is such that two sequences are considered the same strain if and only if they have identical sequences.We also assume that the number of sequences in the host population is proportional to the number of infected individuals corresponding to the strain.To focus only on the impact of disease dynamics, we assume that the evolution of the pathogen is neutral and has no effect on the phenotype.Then the transmission rate and recovery rate are identical among all sequences, and the established immunity against a strain carrying one sequence protects hosts against infections, with all strains carrying any other sequences.The population dynamics of the hosts infected with the strain carrying sequences x, I x , can be described by ( 9) where β denotes the transmission rate, γ denotes the recovery rate, and µ x→y denotes the mutation rate from sequence x to sequence y.Here we assume that every mutation in a given host replaces the dominant genotype of the pathogen.Suppose the epidemic duration is short compared to the life span of the host; in this case the host population size is constant: 2.2.1.1-site model.We start from the simplest 1-site model.We assume that only one site (the lth site) in the genetic sequences can mutate and that the nucleotides in other sites are the same among all sequences.Therefore, only four sequences can exist.Hereafter we refer to them as x1, x2, x3, and x4.The nucleotides in the lth site of sequences x1, x2, x3, and x4 are 1, 2, 3, and 4. Mutation can occur (I x can be I y by mutation, where I x denotes the number of infected individuals with strain x).In population genetics, many models for mutation have been proposed.For example, Jukes and Cantor assumed that all site-specific mutation rates between nucleotides are the same [10], and Kimura assumed two mutation rates: a constant rate for the mutation between A and G and between C and T (U), and a different rate for all other mutations [12].If we follow the simplest mutation model and use the Jukes-Cantor assumption, (9) can be written as ( 10) where µ denotes the mutation rate among sequences x1, x2, x3, and x4 (i.e., a constant mutation rate among sequences x1, x2, x3, and x4 by the Jukes-Cantor assumption).Suppose that only one sequence, x1, exists at the beginning of an outbreak, the number of infected individuals is small, and all other individuals are susceptible; then we have (for sure 1) ( 11) 2.2.2.L-site model.We now expand the 1-site model into an L-site model, assuming that the number of mutable sites is L. In this model, 4 L sequences can exist.We denote the infected hosts and recovered hosts with the strain carrying sequence x = {x 1 , x 2 , . . ., x L } by I x1,x2,...,x L and R x1,x2,...,x L .The maximum number of mutations for each time step is assumed to be one for each sequence.Therefore, the number of unmatched sites between parent sequence and child sequence is always one.For the sake of simplicity, we use the Jukes-Cantor assumption for mutation, which is a constant rate for the mutations in a site; however, the mutation rates among different sites can vary, so µ l denotes the mutation rate among nucleotides at the lth site.Following these assumptions, (9) can be written as ( 12) Suppose that only one sequence, {1, 1, . . ., 1}, exists at the beginning of an outbreak, the number of infected individuals is small, and all other individuals are initially susceptible.Then ( 13) 3. Main results.
3.1.Deterministic model.We start the analysis of the genetic variation of pathogens from the above deterministic SIR model.For a given site l 1 ∈ {1, . . ., L} and a given nucleotide k 1 ∈ {1, 2, 3, 4}, we define to be the set of all sequences with the nucleotide k 1 in the l i site.Let I k1 l1 (t) be the number of infectives whose sequences belong to G k1 l1 , and let f k1 l1 (t) = I k1 l1 (t)/I(t) be the frequency of nucleotide k 1 in the l 1 site.Here I(t) denotes the total number of infected individuals among all sequences: We start with the 1-site model as shown in (10) Substituting (11) and ( 15) into ( 14), we have Recalling f k1 l1 (t) = I k1 l1 (t)/I(t), we obtain the site-specific nucleotide frequency, f k1 l1 (t) as follows: (t) Therefore, f k1 l1 (t) is completely determined by the mutation rate µ.For a more general L-site model, from (12) the dynamics of I k1 l1 can be written as ( 17) Here the term ( m 4µ m ) I k1 l1 (t) describes the new strains from sequences ∈ G k1 l1 , including the mutations from G k1 l1 to G k1 l1 .The term ( m =l1 4µ m )I k1 l1 (t) describes the new strain from G k1 l1 to G k1 l1 , which should result from the mutation at sites other than the l 1 site.The term µ l1 I(t) describes the new strains, which become elements of G k1 l1 by the mutation at the l 1 th site, including the mutation from G k1 l1 to G k1 l1 .Therefore, from which we conclude that f k1 l1 (t) is completely determined by the initial condition f k1 l1 (0) and the mutation rate µ l1 .Assuming that the sequence existing at the beginning of an outbreak has nucleotide 1 at the lth site as shown in (13), we obtain .
Recalling that Tajima's D is determined only by f k1 l1 and n, we have the following.Lemma (induction lemma on Tajima's D).Tajima's D is completely determined by the sample size n and site-specific mutation rate µ l1 .
In population genetics, the sign of Tajima's D is believed to be affected by the population dynamics [5].However, we have observed here that from the deterministic disease dynamics, Tajima's D is independent of the disease dynamics (precisely, independent of the parameters characterizing the disease dynamics, i.e., β and γ).This is true even if the assumption of mutation follows Kimura's assumption [12]; see subsection 5.1.Figure 2 illustrates the dependence of the sign of Tajima's D with mutation rate µ and sample size n.Tajima's D tends to be negative at the beginning of an outbreak and becomes positive as time passes, except when n is small.Tajima's D can be positive from the beginning to the end of an outbreak when n is small.The time point at which Tajima's D = 0 becomes earlier with increasing µ.Tajima's D becomes 0 at the latest time point when the sample size n is intermediate.
The frequencies f k1,...,k L l1,...,l L describe the genetic diversity of the pathogen.Thus, we observe from the deterministic model that the genetic diversity of the pathogen under neutral evolution (where mutation has no effect on the pathogens' phenotypes, i.e., β and γ) is independent of the disease dynamics.

Stochastic model. What happens if we expand the deterministic 1-site model into a stochastic 1-site model following a continuous time Markov process?
From (21), the time differentiation of the expected number of infected hosts with the strain carrying the sequence whose nucleotide k 1 at the specific site l 1 , E(I k1 l1 ), and the expected number of the total number of infected hosts, E (I), can be written as where Cov (x, y) denotes covariance of x and y and dτ .
Unlike the deterministic model, the expected value of the site-specific nucleotide frequency in the stochastic model can be affected by disease dynamics.This is also true when we expand the stochastic 1-site model into a stochastic L-site model with varying mutation rates among sites as follows: (24) where 3.2.1.Monte Carlo simulations.We now show that the impact of disease dynamics on Tajima's D is hidden in the stochasticity of the disease dynamics.The term in Tajima's D that includes the effect of stochasticity of disease dynamics, as shown in (24), cannot be expressed in a closed form [9]. To explore the behavior of Tajima's D in a stochastic model, we employ an individual-based Monte Carlo (IBM) simulation.The infection state of each host is recorded as 0, 1, or 2 in the IBM simulation.The infection states 0, 1, and 2 indicate that the host is susceptible (S), currently infected (I), or recovered (R), respectively, assuming each infected  individual has only one nucleic acid sequence of the pathogen, and each infected individual and nucleic acid sequence is recorded.Secondary cases have the same sequence as the primary case when transmission occurs.During infection the mutation replaces the dominant sequence of the pathogen by a constant rate µ.
We set the parameters to their baseline values as follows: N = 200,000, L = 500, γ = 0.3, ∆t = 1, n = 200.We assume initially that one host is infected with a strain carrying a sequence whose nucleotides at all sites are 1, and all other hosts are susceptible for all possible strains.The IBM simulations run until there are no infected individuals.Figure 3 shows the impact of disease dynamics on the temporal evolution of Tajima's D with varied basic reproduction number R 0 = β/γ.The difference between the deterministic model and the stochastic model increases as R 0 decreases.The duration of epidemics in the IBM simulation also increases when R 0 decreases.
As discussed in the previous sections, Tajima's D is a function of the site-specific nucleotide frequency f k1 l1 .The frequency of nucleotide 1, which is identical in all sequences at the beginning of an outbreak, f 11 l1 , is calculated.As we assume the site-specific mutation rate is identical among different sites, the expected value of f k1 l1 is also identical among different nucleotides and different sites.We can then calculate the average of f 11 l1 among all sites, f 1 .Figure 4 shows the impact of disease dynamics on the time evolution of f 1 with varied basic reproduction number R 0 .The deterministic model predicts f 1 in the corresponding stochastic model, and the impact The dotted lines show the theoretical π, which is given by (6) using deterministic f 1 , f 2 , f 3 , and f 4 given by (19).The dashed lines show the theoretical is given by (6) using the 1000 iteration average of the time series of f 1 , f 2 , f 3 , and f 4 with IBM.At each time step of the IBM simulation, σ/a was calculated when I(t) ≥ 200.The dotted lines show the theoretical σ, which is given by (8) using the deterministic f 1 , f 2 , f 3 , and f 4 given by (19).The dashed lines show the theoretical σ, which is given by (8) using the 1000 iteration average of the time series of f 1 , f 2 , f 3 , and f 4 with IBM.
on R 0 is small for f 1 .
As shown in (1), Tajima's D can be decomposed into two functions of f 1 , the mean pairwise distance between sampled sequences π and the number of segregating sites among sampled sequences σ.Specifically, the relationship between π and σ/a determines the sign of Tajima's D. Therefore, we explore the difference between π and σ/a for the deterministic model and the stochastic model.Figure 5 shows the difference in the time evolution of π with varied basic reproduction number R 0 .π is the function of f 1 , f 2 , f 3 , and f 4 , and there is a gap between theoretical and simulated π even if f 1 , f 2 , f 3 , and f 4 are all fixed.This gap also depends on R 0 : the smaller R 0 , the larger the gap.
Figure 6 shows the difference in the time evolution of σ with the varied basic reproduction number R 0 .This gap also depends on R 0 : again, the smaller R 0 , the larger the gap.The gap between theoretical and simulated Tajima's D is large at the early stage of an epidemic.Comparing π and σ/a, σ/a shows a large gap between the deterministic and stochastic models.

Discussion.
In this paper we investigated the time evolution of Tajima's D in a nonendemic disease outbreak situation, and we conclude that (i) Tajima's D in a deterministic SIR model is completely determined by the mutation rate and sample size, and (ii) the time evolution of genetic diversity of an infectious disease pathogen in a deterministic SIR model is completely determined by the mutation rate.We observed that Tajima's D is independent of disease dynamics, and we found that the sign interpretation of Tajima's D (i.e., negative (positive) Tajima's D shows population expansion (contraction)) is not always true for disease transmission dynamics.Employing a stochastic model, we demonstrated that the dependence of Tajima's D on the disease transmission and evolution dynamics is a result of the stochasticity of those dynamics.The same is true for the dependence related to genetic diversity.Interestingly, the period when Tajima's D remains negative is longer than the period when the infected population increases, as shown in Figure 3.
Our observation that Tajima's D is dependent on stochasticity in disease transmission during an outbreak can be useful for the estimation of insightful epidemiological and evolutional parameters.If the sequence data reflect stochastic transmission and evolution dynamics, e.g., sequences of pathogens sampled from a small outbreak in a limited host population, then Tajima's D depends on both the mutation rate and R 0 .Therefore, a joint estimation of the mutation rate and R 0 from Tajima's D is possible.If the disease dynamics can be approximated by a deterministic SIR model, then Tajima's D is not biased by disease dynamics and can be determined by the mutation rate alone.
Tajima's D depends largely on the sample size, as shown in Figure 2. In particular, with a small sample size, small changes in the sample size causes significant bias in Tajima's D. At a practical level, the sampling probability during an outbreak changes over time due to reporting bias, so careful study If we can assume that sampling probability, p, is constant over time, and that the sample size is proportional to the number of infected individuals, n = pI(t), then Tajima's D depends on the disease dynamics even in the deterministic model.Even in such a case, the sign of Tajima's D is not directly related to the sign of the time-derivative of the pathogen population I (t), as seen in the study with monotonic population growth [18,20].Tajima's D is determined not only by R 0 , but also by the mutation rate and sampling probability.In this case we can estimate R 0 from deterministic Tajima's D. However, this estimate of R 0 is equivalent to the estimate of R 0 fitting an SIR model, with the time series of the sample size as the time series of the number of infective individuals.
Previous studies show that a sudden change in the sign of Tajima's D implies the replacement of the dominant strain under strong natural selection [6].If R 0 and the mutation rate are given, the confidence interval of the site-specific Tajima's D, D l , can be obtained from our stochastic model.This confidence interval gives the confidence interval for neutral mutation.Using this confidence interval, we can discover the significant mutations in terms of their impact on phenotypes.The estimation of R 0 using epidemiological data, e.g., the time series of the number of infected hosts [2] and the estimation of the mutation rate using sequence data [4] have already been developed, and the estimation of significant mutations is possible if both epidemiological and sequence data are available.
Sampling time series of sequences of pathogens during an outbreak has become a common practice for infectious disease surveillance.Methodologies analyzing evolution and disease dynamics from such sequence data are in great demand.Many sequence datasets of infectious disease are sampled from nonendemic situations, as described by the SIR model.In the SIR model, the fitness of a pathogen and effective reproduction number are changing over time, even if there is no mutation.In such nonequilibrium population dynamics, the interpretation of the existing methods for evolutionary analysis may be different from that in simple population dynamics, e.g., neutral evolution or exponential growth.Theoretical analysis of the impact of population dynamics on common evolutionary analysis can help interpret results of such analysis.

Appendix.
5.1.The disease dynamics with Kimura's assumption.If we follow the mutation model of Kimura's assumption [12], (10)  The initial condition of the system shown in (26) is the same as that of (11)  Therefore, we obtain the site-specific nucleotide frequency, f k1 l1 (t), as follows; f 11 l1 (t) = where µ l1,1 (µ l1,2 ) denotes the mutation rate µ 1 (µ 2 ) in the 1-site model at the l 1 th site.

Site 1 Fig. 1 .
Fig. 1.A given set of three sampled sequences.(a) Pairwise distance.The pairwise distance is defined as the number of unmatched sites.(b) Segregating sites.A segregating site is a site where there is at least two different nucleotides among the sequence dataset.

Fig. 2 .
Fig. 2. The dependence of the sign of Tajima's D with mutation rate µ (a) and sample size n (b) when the number of mutable sites L = 500 and the mutation rate is constant among sites.The black area denotes negative Tajima's D and the gray area denotes positive Tajima's D.
c 2017 SIAM.Published by SIAM under the terms of the Creative Commons 4.0 license Downloaded 06/20/18 to 133.50.96.34.Redistribution subject to CCBY license With manipulations similar to those for the deterministic 1-site model, we obtain (23)

Fig. 3 .
Fig. 3.The impact of disease dynamics on the temporal evolution of Tajima's D with varied basic reproduction number R 0 = β/γ.The solid lines show the 1000-iteration average of the time evolution of Tajima's D when R 0 = 2.0 (a), 5.0 (b), and 10.0 (c), respectively.At each time step of the IBM simulation, Tajima's D was calculated when I(t) ≥ 200.The dashed lines show Tajima's D in the deterministic model.

1 Fig. 4 .
Fig.4.The impact of disease dynamics on the time evolution of f 1 with varied basic reproduction number R 0 .The solid lines show the 1000-iteration average of the time evolution of f 1 when R 0 = 2.0 (a), 5.0 (b), and 10.0 (c), respectively.At each time step of the IBM simulation, f 1 was calculated when I(t) ≥ 200.The dashed lines show f 1 in the deterministic model, which is given by(19).

10 Fig. 5 .
Fig.5.The time evolution of π with varied basic reproduction number R 0 .The solid lines show the 1000-iteration average of the time evolution of π.At each time step of the IBM simulation, π was calculated when I(t) ≥ 200.The dotted lines show the theoretical π, which is given by (6) using deterministic f 1 , f 2 , f 3 , and f 4 given by(19).The dashed lines show the theoretical is given by (6) using the 1000 iteration average of the time series of f 1 , f 2 , f 3 , and f 4 with IBM.

Fig. 6 .
Fig.6.The time evolution of σ/a with varied basic reproduction number R 0 .σ denotes the number of segregating site, and a denotes n−1 i=1 1/i.The solid lines show the 1000-iteration average of the time evolution of σ when R 0 = 2.0 (a), 5.0 (b), and 10.0 (c).At each time step of the IBM simulation, σ/a was calculated when I(t) ≥ 200.The dotted lines show the theoretical σ, which is given by (8) using the deterministic f 1 , f 2 , f 3 , and f 4 given by(19).The dashed lines show the theoretical σ, which is given by (8) using the 1000 iteration average of the time series of f 1 , f 2 , f 3 , and f 4 with IBM.