Descriptive Statistics
We use descriptive statistics to describe the behavior of a characteristic. Starting from the mass of data obtained by observing it in the population, we carry out a series of operations such as:
- Reducing the mass of data by building frequency tables and drawing some graphs
- In the case of quantitative variables, computing measures that characterize the behavior of the variable. To do this we calculate statistics of position, dispersion and shape
With all this, we can describe the behavior of our variable in full
Description and organization of data
When using computer programs it is common to give variables descriptive names so that there is no confusion about their content. In statistics, however, especially when stating general results, the usual convention is to name statistical variables with capital letters, preferably the last ones of the alphabet: X, Y, Z, \cdots. The different values taken by a variable are named with the same letter in lowercase: x_1, x_2, x_3, \cdots
We will use this notation to give the following definitions:
Absolute frequency of a certain value x_i of the variable (represented by n_i): the number of times the value x_i occurs
Relative frequency of a certain value x_i of the variable (represented by f_i): the proportion of times that value appears in the set of observations, calculated as the ratio of its absolute frequency (n_i) to the total number of data (N)
That is: f_i = \frac{n_i}{N}
Cumulative absolute frequency of a certain value x_i of the variable (represented by N_i): the sum of the absolute frequencies of all values of the variable less than or equal to x_i
That is: N_i=n_1+\cdots+n_i=\sum\limits_{j=1}^{i} n_j; N_k=N, where k is the number of distinct values
Cumulative relative frequency of a certain value x_i of the variable (represented by F_i): the sum of the relative frequencies of all values of the variable less than or equal to x_i
That is: F_i=f_1+\cdots+f_i=\sum\limits_{j=1}^{i} f_j=\frac{N_i}{N}; F_k=1
Cumulative frequencies only make sense if the scale is ordinal or quantitative. When the observed values of a variable are sorted and the repeated values are grouped (determining the frequency of each value), a statistical frequency distribution table is obtained
This set of operations is called tabulation
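The tabulation just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name frequency_table and the sample data are assumptions made for the example, not part of the original text. For each distinct value x_i it computes n_i, f_i, N_i and F_i as defined above, using exact fractions so that F_k = 1 holds without rounding error.

```python
from collections import Counter
from fractions import Fraction


def frequency_table(observations):
    """Return one row (x_i, n_i, f_i, N_i, F_i) per distinct value, in sorted order."""
    counts = Counter(observations)   # absolute frequency n_i of each distinct value
    total = len(observations)        # total number of data, N
    rows = []
    cumulative = 0                   # running cumulative absolute frequency N_i
    for value in sorted(counts):
        n_i = counts[value]
        cumulative += n_i
        f_i = Fraction(n_i, total)          # relative frequency f_i = n_i / N
        F_i = Fraction(cumulative, total)   # cumulative relative frequency F_i = N_i / N
        rows.append((value, n_i, f_i, cumulative, F_i))
    return rows


# Hypothetical sample of N = 10 observations (illustrative data only).
data = [2, 3, 1, 2, 2, 4, 3, 1, 2, 3]
table = frequency_table(data)

print("x_i  n_i  f_i   N_i  F_i")
for x_i, n_i, f_i, N_i, F_i in table:
    print(f"{x_i:<4} {n_i:<4} {str(f_i):<5} {N_i:<4} {F_i}")
```

The last row always has N_k = N and F_k = 1, which is a quick way to check a table built by hand.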
When a variable takes many different values, the observed values are sometimes grouped into intervals before the analysis (although this is not usually recommended)
In these cases, we define the intervals (which may or may not have constant amplitude) and then calculate the frequency of the values of the variable that fall in each interval. That is, the frequencies no longer represent the number or proportion of times a single value appears, but how many times (or in what proportion) values of the variable fall in each interval
Each interval is fully determined by its limits: for the i-th interval, l_{i-1} is the lower limit and l_i is the upper limit
The amplitude of the interval, a_i, is the distance between the two limits: a_i = l_i - l_{i-1}
To make intervals easier to handle mathematically, a specific value of the variable is taken as the representative of each interval. It is called the class mark and is denoted by x_i. The midpoint of the interval is usually taken as the class mark, although care must be taken, as it is not always the best representative of the interval
When the intervals have different amplitudes, a useful value to consider is the frequency density, which is the number of observations of the variable per unit length
That is: h_i = \frac{n_i}{a_i}
By analogy with the density function (which will be discussed later), in some cases the relative frequency density is used instead, which is simply the proportion of observations per unit length
That is: h'_i = \frac{f_i}{a_i}
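As a rough illustration of the quantities defined for grouped data, the sketch below computes a_i, x_i, n_i, f_i, h_i and h'_i for a set of intervals. The function name grouped_table, the sample data and the interval limits are hypothetical. The text does not say which end of each interval is closed, so the sketch assumes intervals of the form [l_{i-1}, l_i), with the last one also including its upper limit.

```python
def grouped_table(observations, limits):
    """Tabulate observations into intervals given by limits = [l_0, l_1, ..., l_k].

    Assumption (not stated in the text): each interval is [l_{i-1}, l_i),
    except the last one, which also includes its upper limit l_k.
    """
    total = len(observations)   # total number of data, N
    k = len(limits) - 1         # number of intervals
    rows = []
    for i in range(1, k + 1):
        lower, upper = limits[i - 1], limits[i]
        is_last = (i == k)
        n_i = sum(
            1 for x in observations
            if lower <= x < upper or (is_last and x == upper)
        )
        a_i = upper - lower           # amplitude a_i = l_i - l_{i-1}
        x_i = (lower + upper) / 2     # class mark, taken here as the midpoint
        f_i = n_i / total             # relative frequency
        rows.append({
            "interval": (lower, upper),
            "x_i": x_i,
            "a_i": a_i,
            "n_i": n_i,
            "f_i": f_i,
            "h_i": n_i / a_i,         # frequency density h_i = n_i / a_i
            "h_rel_i": f_i / a_i,     # relative frequency density h'_i = f_i / a_i
        })
    return rows


# Hypothetical data and limits, chosen so that the amplitudes differ.
values = [150, 152, 155, 158, 160, 161, 163, 165, 170, 178]
limits = [150, 155, 160, 170, 180]
rows = grouped_table(values, limits)

for row in rows:
    print(row)
```

In this example the third interval holds twice as many observations as the first, but it is also twice as wide, so both have the same frequency density. This is why densities, rather than raw frequencies, are compared when amplitudes differ.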
Statistical measures
Statistical measures summarize, as numerical values, the most important traits of frequency distributions. They are classified into the following groups according to what they try to measure:
\text{Measures}\left\{\begin{matrix}\text{of position}&\left\{\begin{matrix}\text{central}\\\text{non-central}\end{matrix}\right.\\\text{of dispersion}&\left\{\begin{matrix}\text{absolute}\\\text{relative}\end{matrix}\right.\\\text{of shape}&\left\{\begin{matrix}\text{of asymmetry}\\\text{of kurtosis}\end{matrix}\right.\\\text{of concentration}&\end{matrix}\right.
Graphics
To summarize the information it is also very common to use charts. Let's look at some of the simplest:
- Bar diagram: Used for variables not grouped into intervals. On a system of coordinate axes, the values of the variable are placed on the abscissa axis and the absolute frequencies on the ordinate axis. Then, over each value of the variable, a bar is drawn whose height equals its absolute frequency
If relative frequencies are used instead of absolute frequencies, the resulting graph has the same shape, but every bar height is divided by N
A bar diagram is also typically used to display the observed values of a variable
- Sector diagram: Generally used for variables not grouped into intervals. It consists of dividing the area of a circle into sectors proportional to the frequencies (absolute or relative). The degrees covered by each sector are obtained by a simple rule of three, taking into account that the total number of data (N) corresponds to 360^\circ
- Frequency histogram: Used for variables grouped into intervals. It is constructed by drawing over each interval, represented on the abscissa axis, a rectangle whose area is proportional to the frequency (absolute or relative) of that interval. In general, the height of the rectangle of the i-th interval is proportional to the frequency density. In particular, if all intervals have the same amplitude, the frequencies themselves can be taken as the heights of the rectangles
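The rule of three for the sector diagram and the rectangle heights for the histogram can be sketched numerically as follows. The function names and the frequencies are hypothetical, chosen only for illustration, and no plotting library is assumed: the sketch computes only the angles and heights that a drawing would use.

```python
def sector_angles(frequencies):
    """Degrees covered by each sector: N corresponds to 360, so n_i corresponds to 360 * n_i / N."""
    total = sum(frequencies)
    return [360 * n / total for n in frequencies]


def histogram_heights(frequencies, limits):
    """Height of each rectangle, taken as the frequency density n_i / a_i,
    so that the area of each rectangle (height times amplitude) equals n_i."""
    return [
        n / (limits[i + 1] - limits[i])
        for i, n in enumerate(frequencies)
    ]


# Hypothetical absolute frequencies of four ungrouped values (N = 10).
angles = sector_angles([2, 4, 3, 1])
print(angles)    # [72.0, 144.0, 108.0, 36.0]

# Hypothetical frequencies of four intervals with unequal amplitudes.
bar_heights = histogram_heights([2, 2, 4, 2], [150, 155, 160, 170, 180])
print(bar_heights)
```

The angles always add up to 360 degrees. For the histogram, multiplying each height by its amplitude recovers the frequency of the interval, which is the sense in which the area is proportional to the frequency.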