Statistics is a discipline whose goal is the quantitative and qualitative study of a collective phenomenon under conditions of uncertainty or non-determinism, that is, with incomplete knowledge of the phenomenon or of a part of it.
Starting from the collection of a large number of data relating to the phenomena under consideration, and from hypotheses more or less directly suggested by experience or by analogy with phenomena already known, statistics applies mathematical methods based on the calculus of probability to formulate the laws of averages governing such phenomena, called statistical laws. Often the collection of data is limited to a smaller sample, suitably predetermined so as to represent the general characteristics faithfully. Initially conceived as the description of certain social facts, and in particular as an administrative activity of the State, statistics has gradually expanded its boundaries until it became a true ‘science of the collective’: a discipline whose purposes are not only descriptive of social and natural phenomena, but also oriented to research in various scientific fields.
Statistics is closely related to the theory of probability, as both belong to the theory of random phenomena. While probability theory is concerned with providing theoretical probabilistic models, i.e. probability distributions adaptable to the various real random phenomena once the parameters of the random variable in question are defined, statistics starts from a random sample either to describe its statistical properties or to infer the underlying probabilistic model and estimate its parameters (mean, variance, standard deviation, mode, median).
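As an illustration of these parameters, a small Python sketch (with made-up sample values, assumed only for the example) computes them with the standard library:

```python
import statistics

# A made-up sample of 8 observations (illustrative values only)
sample = [4, 7, 7, 5, 9, 6, 7, 3]

mean = statistics.mean(sample)          # arithmetic mean
variance = statistics.variance(sample)  # sample variance (n - 1 denominator)
stdev = statistics.stdev(sample)        # sample standard deviation
mode = statistics.mode(sample)          # most frequent value
median = statistics.median(sample)      # central value of the sorted data

print(mean, variance, stdev, mode, median)
```

Note that `statistics.variance` uses the sample (n − 1) denominator; `statistics.pvariance` would treat the data as a whole population.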
Statistical science is commonly divided into two main branches:
- descriptive statistics: the branch of statistics that studies the criteria for the collection, classification, synthesis and representation of data obtained from the study of a population or of a part of it (called a sample). The results obtained in descriptive statistics can be regarded as certain, apart from measurement errors due to chance, which are on average zero. In this respect it differs from inferential statistics, whose results carry an associated error of evaluation.
- inferential statistics: the process by which characteristics of a population are induced from the observation of a portion of it (called a “sample”), usually selected by a random experiment. From a philosophical point of view, these are mathematical techniques for quantifying the process of learning from experience. In other words, inferential statistics aims to establish characteristics of the data and the behavior of the measured quantities (statistical variables) with a predetermined probability of error. Inferences may also concern the theoretical nature (the probabilistic law) of the phenomenon being observed. We will mainly consider simple random samples of size \(n > 1\), which can be interpreted as \(n\) independent realizations of a basic experiment performed under the same conditions. Since a random experiment is involved, the calculus of probability comes into play. In statistical inference there is, in a sense, a reversal of point of view with respect to the calculus of probability: in the latter, the process generating the experimental data (the probabilistic model) is known, and we are able to evaluate the probability of the different possible results of an experiment; in statistics, the process generating the experimental data is not fully known (that process is, ultimately, the object of the investigation), and statistical techniques aim to induce the characteristics of the process from observation of the experimental data it generates. Knowledge of this kind then allows predictions to be made (the statement “inflation next year will be of a certain magnitude”, for example, derives from a model of the inflation trend obtained by inferential techniques). Inferential statistics is therefore strongly related to probability theory.
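The reversal of viewpoint described above can be illustrated with a minimal Python simulation: the data-generating process (a Bernoulli trial with a parameter we pretend not to know) is fixed in the code, and the sample proportion is used to induce that parameter back from the generated data. All numbers are assumptions for the example:

```python
import random

random.seed(0)  # for reproducibility of the example

# The "unknown" data-generating process: a Bernoulli trial with p = 0.3.
# In real inference this parameter is precisely what we try to learn.
TRUE_P = 0.3

# n independent realizations of the basic experiment (a simple random sample)
n = 10_000
sample = [1 if random.random() < TRUE_P else 0 for _ in range(n)]

# The sample proportion is the natural point estimate of the unknown p
p_hat = sum(sample) / n
print(p_hat)  # close to 0.3 for large n (law of large numbers)
```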
From this point of view, describing in probabilistic or statistical terms a phenomenon that is random in time, and therefore characterizable by a random variable, means describing it in terms of its probability density distribution and of its parameters, mean and variance. Inferential statistics is in turn subdivided into further chapters, the most important of which are estimation theory (point estimation and interval estimation) and hypothesis testing.
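As a sketch of the distinction between a point estimate and an interval estimate, the following Python fragment computes both for the mean of a made-up sample, using the normal approximation (an assumption; for a sample this small a Student's t quantile would be more accurate):

```python
import math
import statistics

# Illustrative sample of made-up measurements
sample = [9.8, 10.2, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.1, 9.9]
n = len(sample)

x_bar = statistics.mean(sample)  # point estimate of the population mean
s = statistics.stdev(sample)     # sample standard deviation

# Approximate 95% interval estimate (normal quantile z = 1.96)
half_width = 1.96 * s / math.sqrt(n)
interval = (x_bar - half_width, x_bar + half_width)
print(x_bar, interval)
```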
Applications of statistics
Statistics is useful wherever one of the following is required:
- an orderly collection, a comprehensible presentation and a processing of the most varied data;
- the discovery of possible laws regulating the data, often only apparently disordered, and their comparison;
- the definition of a reference variable that assumes different values within a certain range of variation.
Statistical methods and techniques, typically theoretical, assume fundamental importance in many other fields of study, such as physics (statistical physics), when, because of the evident complexity of the analysis, deterministic information about complex physical systems, or systems with many degrees of freedom, must be given up in favor of a statistical description.
Among these disciplines are also economics, which relies heavily on statistics (economic statistics, business statistics and econometrics, as well as game and decision theory) for the qualitative (time series) and quantitative (statistical models) description of socio-economic phenomena occurring within the economic system; and psychology, which relies on statistics in the study of the characteristics and attitudes of individuals and of their differences (psychometrics).
Statistics is an essential tool in medical research. Biostatistics provides the tools to translate clinical and laboratory experience into quantitative expressions aimed at identifying whether, and to what extent, a treatment or procedure has had an effect on a group of patients. Another extremely common application in society is that of opinion polls, market analysis and in general any analysis of sample data.
Over the years a large number of definitions of statistics have been given (as early as 1934 W. F. Willcox listed 115), each of which highlights a particular methodological or interpretative function and reflects the evolution of its content throughout history. Among them, it is worth mentioning those of the scholars who contributed most to the development of statistics. For G. Yule it is a method for the exposition and interpretation of quantitative data influenced by a multiplicity of causes; for R. A. Fisher it is essentially a branch of applied mathematics and can be considered as mathematics applied to observational data; for M. Boldrini it is the empirical history of natural science; for M. G. Kendall it is a branch of scientific method concerned with the treatment of data obtained by enumerating and measuring the properties of natural populations; for A. Wald it is the general theory of decisions.
Modern statistics can be fully defined as “a synthetic expression including, in its widest meaning, theory, which is an organic set of principles and logical procedures defining statistical models and theoretical schemes; methodology, which is a set of criteria and methods deriving from theory and constituting an ineliminable phase of empirical research and of scientific and operative research; and applied statistics, which offers criteria for the use of statistical methodology in different fields of observation” (S. Vianelli). On the other hand, the term statistics is usually used to designate only statistical methodology and, in everyday language, to indicate a set of quantitative data.
Applied statistics takes on names of its own according to the different fields of application (demography, psychometrics, biometrics, anthropometrics, economic statistics and econometrics, business statistics, etc.). The term statistics seems to have been used for the first time in 1589 by the Italian Gerolamo Ghilini (or Ghislini) to indicate the science of describing the qualities that characterize, and the elements that make up, a state.
The origin of statistics is however older than its name, since traces of statistical surveys, i.e. enumerations and measurements (such as censuses) can be found in the most remote social organizations, where they had the task of satisfying administrative and fiscal needs. It is traditionally believed that statistics as an autonomous science originated in the 17th century from that particular discipline known as university statistics or Staatenkunde and developed from the fusion of this discipline with political arithmetic and the calculus of probabilities.
The founder of university statistics was the German jurist and physician H. Conring. Conring was the first to teach a course on the systematic description of facts affecting the life of the state, which began in 1660 and was called notitia rerum publicarum. The diffusion of the new discipline was the work of G. Achenwall who, in addition to using the precise term Statistik, took care to limit its task to the description of the most remarkable things of the state. University statistics, however, was limited to a mainly qualitative description (Achenwall and the other exponents of the Staatenkunde supported, among other things, a heated polemic against the defenders of the use of statistical tables, i.e. the political arithmeticians) so that its contribution to modern statistics can basically be reduced to the name.
In reality, methodological and investigative statistics was born with political arithmetic, whose greatest exponents were the Englishmen J. Graunt and W. Petty. Political arithmeticians did not limit themselves to describing and measuring single phenomena (mainly demographic ones), but also tried to identify their characteristics and regularities, as well as possible interrelations. The so-called encyclopaedic-mathematical approach and, within it, the calculus of probabilities allowed statistics to evolve further, providing it with both more refined instruments of investigation and rational bases for the application of those instruments.
From the work of A. Quételet to the 20th century
From this last direction came the work of A. Quételet, considered by many the true founder of statistics. As the historian of statistics A. Dell’Oro points out, before Quételet statistics had been used in separate fields (particularly in demography), and interest in the phenomenon studied had always prevailed over interest in the means by which it was studied. The only notable exception to this tradition is found in the research on the theory of errors (begun by A. De Moivre in the first decades of the 18th century and then developed by P.-S. de Laplace and, especially, by K. F. Gauss a century later), research that also led to the formulation of the principle of least squares (the merit of A.-M. Legendre and H. Ellis). Quételet not only collected and elaborated masses of data relating to different phenomena but above all, as Dell’Oro notes, “was able to distinguish the formal model from the phenomenon; convinced that the same model could also be used with other types of phenomena, he personally found, by statistical analysis, new formal models”.
Among these models, the binomial law of human characters must be remembered for its importance (in a homogeneous population, the curve of the values of a character common to its individuals has ordinates proportional to the successive terms of the expansion of I. Newton’s binomial). The demonstration of this law “was on the one hand a continuation of the use of Gauss’s curve of errors in anthropometry, on the other the best proof of the principle of least squares and, at the same time, confirmation that the calculus of probabilities could be used for interpretations of human characters”. The ideal continuators of Quételet’s work were the Englishmen F. Galton and K. Pearson, the fathers of regression and correlation analysis.
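Galton and Pearson's tools can be sketched in a few lines of Python: a least-squares fit of a straight line and the Pearson correlation coefficient, computed from made-up paired data (the numbers are assumptions for the example):

```python
import statistics

# Made-up paired observations, Galton-style (e.g. parent vs child trait)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

mx, my = statistics.mean(x), statistics.mean(y)

# Least-squares slope b and intercept a for the line y = a + b*x
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
b = sxy / sxx
a = my - b * mx

# Pearson correlation coefficient r (strength of the linear relation)
syy = sum((yi - my) ** 2 for yi in y)
r = sxy / (sxx * syy) ** 0.5
print(b, a, r)
```

Minimizing the sum of squared vertical deviations is exactly the least-squares principle mentioned above; the closed-form slope and intercept are its solution for a straight line.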
To Pearson is also due the analysis of the characteristics of a distribution curve and the theory (later developed in particular by R. A. Fisher) aimed at verifying in probabilistic terms the adherence between the actual frequencies, relating to a group of independent observations, and the theoretical frequencies determined on the basis of a given hypothesis (an adherence measurable with the χ² test). The systematic studies on sampling theory, of which Fisher was the outstanding interpreter, date from the beginning of the twentieth century. Fisher himself provided statistical methodology with two other fundamental tools: the analysis of variance and the design of experiments.
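The χ² adherence test mentioned above can be sketched as follows; the observed die counts are made up, and the 5% critical value for 5 degrees of freedom (about 11.07) is quoted rather than computed:

```python
# Chi-square goodness-of-fit statistic for a die, computed by hand.
# Under the hypothesis of a fair die, each face of 600 made-up tosses
# has an expected count of 600 / 6 = 100.
observed = [95, 108, 92, 101, 110, 94]
expected = [100] * 6

# Sum of squared deviations between actual and theoretical frequencies,
# each scaled by the theoretical frequency
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)

# With 5 degrees of freedom, the 5% critical value is about 11.07;
# here chi2 falls well below it, so the fair-die hypothesis is not rejected.
```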
The most recent evolution of theoretical and methodological statistics, favoured by the development of mathematics and of the calculus of probability (disciplines of which statistics makes use, but in which it does not exhaust itself, expressing itself with its own theoretical concepts and logical schemes), is mainly linked to two complex orders of problems: that concerning the variability of phenomena, and that deriving from operating on sample observations. For the first order of problems the theory of stochastic processes has been progressively developed; for the second, the theory of statistical inference.
The first task of modern statistics, whose historical process of formation has been briefly analyzed, is the collection of data provided by observation on a set of individual cases, in order to highlight the characteristics of a collective phenomenon (structural characteristics or relationships of dependence or interdependence). The observation data, collected through a complete survey or a sample survey, are then summarized in statistical data, either by enumeration and summation of the quantitative modalities related to the observed data, or by classification of the observed data according to the qualitative or quantitative modalities of the characteristics that differentiate them.
The statistical data are then processed in a more or less refined form depending on the purpose to be achieved. If one wishes to know the structure of the collective fact one can, for example, order the data into frequency distributions or describe them synthetically by means of mean values, indices of variability and concentration. In order to identify and measure the possible relationship of dependence or interdependence between the various ways in which the different characters appear in the observed units or, more generally, to identify the factors that influence a certain observed phenomenon, statistics offers various methods among which are regression and correlation analysis, variance analysis, factorial analysis and contingency analysis.
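A minimal sketch of the first of these operations, ordering data into a frequency distribution and describing them synthetically, using made-up data:

```python
from collections import Counter
import statistics

# Made-up observed units classified by a quantitative character
data = [2, 3, 3, 1, 4, 3, 2, 2, 3, 5]

# Frequency distribution: modality -> absolute frequency
freq = Counter(data)
print(sorted(freq.items()))

# Synthetic description: a mean value and an index of variability
print(statistics.mean(data), statistics.pstdev(data))
```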
There are also statistical methods aimed at eliminating irregularities of various kinds that perturb the observed data (e.g. smoothing) and methods that allow one to calculate values intermediate between those observed (e.g. interpolation). Finally, statistical methodology makes it possible both to evaluate a sample estimate, defining the risk involved in referring that estimate to the population from which the sample was drawn, and to decide, again in probabilistic terms, whether a given hypothesis should be accepted or rejected, i.e. whether the difference between two values (two sample values, or a theoretical expected value and a sample value) is due to chance or is significant. The set of techniques for solving this group of problems is called statistical inference.
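Interpolation in its simplest (linear) form can be sketched as follows; the census figures are invented for the example:

```python
# Linear interpolation: estimating a value intermediate between two
# observed points (x0, y0) and (x1, y1).
def interpolate(x0, y0, x1, y1, x):
    """Return the linearly interpolated y at x, for x0 <= x <= x1."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Made-up observed values: population (thousands) in two census years;
# we estimate an intermediate, unobserved value for 2015.
print(interpolate(2010, 120.0, 2020, 140.0, 2015))
```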
Statistical hypothesis testing
Hypothesis testing is the branch of statistics that evaluates the reliability of a hypothesis on the basis of the results of an experiment or of a series of observations. The hypotheses considered can be of the most varied kinds and in the most disparate fields: one may try to ascertain by experiment the validity of a natural law (in physics, in biology, etc.), or to check, through observations of an economic or social phenomenon, whether its trend follows certain rules.
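A minimal sketch of such a test in Python: checking whether an observed proportion of heads is compatible with the hypothesis of a fair coin, using the normal approximation (all numbers are made up for the example):

```python
import math

# Hypothesis under test: the coin is fair (p = 0.5).
# Made-up experimental result: 5400 heads in 10000 tosses.
n, heads = 10_000, 5_400
p0 = 0.5

p_hat = heads / n
# Standardized test statistic under the null hypothesis
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
print(z)

# |z| = 8 far exceeds the two-sided 5% critical value 1.96, so the
# observed difference is judged significant rather than due to chance.
```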