How many infected? – Valleriani's Num63r5

Are the curfew measures working? How can we test this with the available data?

Many of those who look at the numbers want to help explain what is behind the spread of the coronavirus. There are plenty of webpages dedicated to collecting the numbers, and many others that analyze them in more or less depth.

Here we look for facts that indicate how things are going. The facts and the analysis reported below neither complement nor substitute for the analysis done by others.

The issue with the number of infected

The total number of infected, shown in the cumulative plots we are used to looking at, is both impressive and misleading. The plots impress us because they show the unavoidable spread of the virus in all countries. They are misleading because the numbers shown there depend on several factors, such as the number of tests: the more tests a country performs, the more infected people are found.

The number of infected day by day in several countries shows similar patterns. The most meaningful way to look at these numbers would be if the number of tests were the same in proportion to the population of each country. Without information on the number of tests, the number of infected may mean nothing. The first point in each curve is the day on which the 100th infected was found. Data taken from the JHU database. Data elaboration is original.
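The alignment used for these curves, starting each country on the day of its 100th case, can be sketched in a few lines of Python. This is only an illustration: the function name `align_at_threshold` and the numbers are made up, and no real JHU data are loaded here.

```python
from typing import Dict, List

def align_at_threshold(cumulative: List[int], threshold: int = 100) -> List[int]:
    """Return the part of a cumulative series starting on the first day
    on which the count reaches `threshold` (empty if it never does)."""
    for day, count in enumerate(cumulative):
        if count >= threshold:
            return cumulative[day:]
    return []

# Hypothetical cumulative counts, NOT real data.
series: Dict[str, List[int]] = {
    "country_a": [20, 60, 110, 200, 380],
    "country_b": [5, 30, 90, 150, 260, 400],
}
aligned = {name: align_at_threshold(values) for name, values in series.items()}
```

After this step every curve starts at "day 0 since the 100th case", which makes the shapes comparable, though not the absolute levels, for the reasons given in the text.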

Comparing the cumulative number of infected country by country leads to misleading conclusions. Countries test at different rates, and the number of tests is not constant over time. There is no unified database where we can check how each country tests. Even the criteria to determine whether a patient is healthy differ between countries. Therefore, day-to-day variations in the numbers have little meaning, whereas a trend in the time series of one country is useful if we assume that this country tests at the same rate over time.

Positive vs total number of tests

For Italy there are a number of pages that summarize the number of tests and the number of positive tests day by day. In Italy it is even possible to get these numbers for every region and town. Here I have analyzed only the national data.

To know how many infectious people are circulating outside their homes, one would need a systematic random testing procedure. This procedure does not exist, but we can use the available data to look for trends in the proportion of positive tests.

The available data are the number of positives $x_t$ on day $t$ and the number of tests $z_t$ on that same day $t$. Naively, the proportion of positives is given by $x_t/z_t$. However, this calculation does not take into account that the results of some tests performed on day $t$ will be delivered later. I assume here that the results of the tests are delivered either on the same day or on the next day.

Number of performed tests and number of positive tests in Italy in the last 10 days. The ratio between these numbers could deliver an estimate of the proportion of infected people if the tests were performed randomly. Based on this data, 19% of the tests are positive. Data source: IlSole24Ore.

In the model calculation, we then create the variable

$w_t\, =\, qz_{t-1}+(1-q)z_t\, ,$

which gives the adjusted number of tests on day $t$, to be compared with the number of positives $x_t$ reported on that day. The parameter $q$ is the fraction of results delivered one day late and has to be inferred from the data, with a method explained below.

Once $q$ is known, the proportion of positives on day $t$ is given by

$\alpha_t\, =\, x_t/w_t\, .$
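As a minimal sketch of these two formulas, assuming the daily counts are available as plain lists, the adjusted number of tests and the daily proportions could be computed as follows. The function names and the numbers are illustrative only, not real data.

```python
from typing import List

def adjusted_tests(z: List[float], q: float) -> List[float]:
    """w_t = q * z_{t-1} + (1 - q) * z_t for t = 1 .. len(z) - 1.
    The first day is dropped because z_{t-1} is not available for it."""
    return [q * z[t - 1] + (1 - q) * z[t] for t in range(1, len(z))]

def daily_proportions(x: List[float], z: List[float], q: float) -> List[float]:
    """alpha_t = x_t / w_t, aligned with the output of adjusted_tests."""
    w = adjusted_tests(z, q)
    return [x_t / w_t for x_t, w_t in zip(x[1:], w)]

# Hypothetical numbers, NOT real data.
z_example = [1000.0, 2000.0, 3000.0]   # tests per day
x_example = [200.0, 300.0, 500.0]      # positives per day
w_example = adjusted_tests(z_example, q=0.5)
alpha_example = daily_proportions(x_example, z_example, q=0.5)
```

With $q = 0$ the function reduces to the naive ratio $x_t/z_t$, and with $q = 1$ all positives are attributed to the tests of the previous day.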

The parameter $\alpha_t$ will also change from one day to the next in a random fashion because of fluctuations in the number of tests, because of larger or smaller amounts of delayed test reports, etc. It is therefore better to choose the single value of $\alpha$ that best explains the last 10 days of diffusion. This value of $\alpha$ is obviously not independent of the value of $q$ introduced above. But we can fix both of them by minimizing the quantity

$S\, =\, \sum_t \left(\alpha w_t \, -\, x_t\right)^2$

with respect to both $\alpha$ and $q$. The minimization can be performed analytically (details available on request).
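One possible way to carry out this minimization in closed form, not necessarily the one used for the numbers reported here, is to set $a = \alpha q$ and $b = \alpha(1-q)$. The model then reads $x_t \approx a z_{t-1} + b z_t$, which is an ordinary linear least-squares problem with two unknowns, and afterwards $\alpha = a + b$ and $q = a/(a+b)$. The Python sketch below solves the corresponding normal equations. The function name and the data are made up for illustration, and the constraint $0 \le q \le 1$ is not enforced.

```python
from typing import List, Tuple

def fit_alpha_q(x: List[float], z: List[float]) -> Tuple[float, float]:
    """Least-squares fit of x_t ~ alpha * (q * z_{t-1} + (1 - q) * z_t).

    With a = alpha * q and b = alpha * (1 - q) the model is linear,
    x_t ~ a * z_{t-1} + b * z_t, so the 2x2 normal equations can be
    solved in closed form. No constraint 0 <= q <= 1 is imposed here.
    """
    u, v, y = z[:-1], z[1:], x[1:]  # z_{t-1}, z_t and x_t for t >= 1
    s_uu = sum(ui * ui for ui in u)
    s_vv = sum(vi * vi for vi in v)
    s_uv = sum(ui * vi for ui, vi in zip(u, v))
    s_uy = sum(ui * yi for ui, yi in zip(u, y))
    s_vy = sum(vi * yi for vi, yi in zip(v, y))
    det = s_uu * s_vv - s_uv * s_uv
    if det == 0:
        raise ValueError("z_{t-1} and z_t are collinear; q is not identifiable")
    a = (s_uy * s_vv - s_vy * s_uv) / det
    b = (s_vy * s_uu - s_uy * s_uv) / det
    alpha = a + b
    return alpha, a / alpha

# Synthetic check, NOT real data: build x from known alpha and q, then recover them.
z = [1000.0, 1500.0, 1200.0, 2000.0, 1800.0, 2500.0]
true_alpha, true_q = 0.19, 0.29
x = [0.0] + [true_alpha * (true_q * z[t - 1] + (1 - true_q) * z[t])
             for t in range(1, len(z))]
alpha_hat, q_hat = fit_alpha_q(x, z)
```

On noise-free synthetic data the fit returns the values used to generate it. On real data, $q$ is only well determined if the number of tests varies enough from day to day; if $z_t$ is nearly constant, any $q$ gives almost the same $w_t$.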

After the regression to minimize $S$, we obtain $\alpha=0.19$ and $q=0.29$. This means that, if the tests were performed at random, the share of infected in the population would be estimated at 19%. The percentage of positives is the same as the one obtained by assuming no delay in the delivery of the results.

At present, the percentage of positives in the population in Italy seems to be around 20%. This is very likely a rough overestimate because the testing is not random but rather performed on people who have or had symptoms. In some regions, one in every ten people tested is positive. Maybe a realistic estimate of the number of infected in Italy is roughly one tenth of the population, namely about 6 million people.

In the plot above, the red line indicates the present status of the proportion of infected in Italy. A decrease of this line towards smaller values is a sign of global relaxation for the country. In the plot we see that the proportions on the left (past) are larger than those on the right, which is potentially encouraging. The fluctuations below the red line are more difficult to interpret. They may mean that testing is focused on people who are still in the hospitals.

A percentage calculation that is not useful

Recently, some analysts in Italy have concentrated on the proportion $F_t$ of new infections $x_t$, defined above, compared to the total $X_{t-1} = \sum_{i \le t-1}x_i$:

$F_t = \frac{x_t}{X_{t-1}}\, .$

A decrease of this number over time has been taken as a proxy for a flattening of the epidemic curve. However, this variable and the conclusions drawn from it are misleading.

In fact, it is elementary to prove that if we approximate $x_t$ and $X_t$ as continuous, non-negative, deterministic processes, then $F_t$ stays constant only if $x_t$ is exponential. For any growth slower than exponential, the quantity $F_t$ has to decrease with time. Its decrease therefore says nothing about the capacity of the measures taken to contain the spread of the virus.
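To see why, note that in the continuous approximation $F = x/X = \frac{d}{dt}\ln X$, which is constant only when $X$ grows exponentially. The short Python sketch below illustrates the point with made-up numbers: the daily number of new cases is kept constant at 100, so nothing is improving in absolute terms, and yet $F_t$ falls steadily. The function name is hypothetical.

```python
from itertools import accumulate
from typing import List

def growth_fraction(x: List[float]) -> List[float]:
    """F_t = x_t / X_{t-1}, with X_{t-1} the cumulative count up to day t-1.
    The first day has no previous total, so the result has len(x) - 1 entries."""
    cumulative = list(accumulate(x))
    return [x[t] / cumulative[t - 1] for t in range(1, len(x))]

# Hypothetical series, NOT real data: 100 new cases every day (linear growth of X_t).
constant_daily = [100.0] * 6
f_linear = growth_fraction(constant_daily)
# f_linear is 1, 1/2, 1/3, 1/4, 1/5: it decreases although x_t never goes down.
```

A falling $F_t$ is thus what one expects from almost any non-exponential curve, including ones where the daily number of new cases is not decreasing at all.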

In fact, the virus spreads exponentially only at the beginning of the infection, when social hopping is still the dominating part of the spreading. Afterwards, at least from the time at which social hopping is stopped, the spread cannot be exponential, but it can still be too rapid for the national health system to stabilize.

Useful information

Any analysis benefits from knowing the number of tests; especially useful are the number of random tests and their results. Obviously, the analysis would be more precise if we did not have to estimate the delay $q$. This would be possible if the number of positives and the number of tests were organized by day of report only. In the calculation shown here for Italy, nevertheless, the calculation with and without the factor $q$ delivers the same 19% of positive tests.

The cumulative curve $X_t$ is useful for looking at the trend within one country, but it might be useless for comparing countries because each country follows a different policy concerning the rate of testing.

A clear sign of relief is when $X_t$ grows linearly with time. At present this is the trend in only a few countries. Why this is a good sign and where this sign is clearly evident from the data will be the topic of the next posts.

Mortality rates

It has been made clear by many epidemiologists that the death rate is not an easy quantity to compute. The main reason is that we don’t know how many people are infected.

Most statistics show the age distribution of those who died of the COVID-19 disease. Formally, the statistics currently available amount to computing the following conditional probability

$\Pr\{ \text{person's age is in a certain range} \mid \text{person is dead}\}\, ,$

that can be translated in words as follows: given that a person died, what is the probability that this person’s age was in a certain range (e.g. between 80 and 90 years old).

While the above statistics can make it into a newspaper article, there are two other quantities that may be more interesting for policy makers and for the analysis of the effect of the sickness.

The first quantity is

$\Pr\{\text{person is sick} \mid \text{person's age is in a certain range}\}\, .$

In words, this quantity tells us the probability that a person of a certain age is infected, that is, whether a person of a certain age is more or less likely to catch the virus. If this probability differs between areas, it may mean that certain policy measures to avoid the spreading are more or less effective.

The second quantity is

$\Pr\{\text{person died} \mid \text{person's age is in a certain range \& person was sick}\}\, .$

In words, this quantity tells us the mortality rate of people of a certain age after having become infected. Big differences in this mortality rate across different areas may indicate some form of chronic problem either in the national or local health system or in the environmental conditions in the areas with high mortality. Small (non-significant) differences in this probability would instead indicate that mortality in that age range is the same across areas.

One can easily recognize that the latter conditional probability has some similarity with the conditional probability at the beginning of this section but it is not quite the same. In fact, these two probabilities are related but are very different. One day, the epidemiologists will compute the second quantity from data and will tell us if the high death toll in Italy is due to the large number of infected or to some chronic problem.
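The relation between the two probabilities can be written out with Bayes' rule. As a sketch, assume that "dead" means died of COVID-19, so that every counted death was also sick, and write $A$ for "age is in a certain range", $S$ for "sick" and $D$ for "died" (these symbols are introduced here only for illustration). Then

$\Pr\{A \mid D\}\, =\, \frac{\Pr\{D \mid A\, \&\, S\}\, \Pr\{S \mid A\}\, \Pr\{A\}}{\Pr\{D\}}\, .$

The age distribution of the dead therefore mixes three ingredients: the mortality of the infected in that age range, the probability of being infected at that age, and the share of that age range in the population. Under this assumption, a large value of $\Pr\{A \mid D\}$ for the elderly can come from any of the three, which is why it cannot replace a direct estimate of $\Pr\{D \mid A\, \&\, S\}$.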

Thank you for reading and commenting.

Stay healthy.