Descriptive statistics is the mathematical technique that obtains, organizes, presents and describes a set of data with the purpose of facilitating its use, usually with the support of tables, numerical measures or graphs. In addition, it calculates statistical parameters such as measures of centralization and dispersion that describe the set studied

Descriptive Statistics

We use descriptive statistics to describe the behavior of a characteristic. Starting from the mass of data obtained by observing it in the population, we carry out a series of operations such as:

  • Reducing the mass of data by constructing frequency tables and drawing some graphs
  • In the case of quantitative variables, computing measures that characterize the behavior of the variable, such as measures of position, dispersion and shape

With all this, we can perfectly describe the behavior of our variable

Description and organization of data

When using computer programs it is common to give variables descriptive names so that their content is unambiguous. In statistics, however, especially when stating general results, the convention is to name statistical variables with capital letters, preferably the last ones of the alphabet: X, Y, Z, \cdots, and to name the different values taken by a variable with the same letter in lowercase: x_1, x_2, x_3, \cdots

We will use this notation to give the following definitions:

Absolute frequency of a value x_i of the variable (denoted n_i): the number of times the value x_i occurs

Relative frequency of a value x_i of the variable (denoted f_i): the proportion of times the value appears in the set of observations, calculated as the ratio of its absolute frequency (n_i) to the total number of data (N)

That is: f_i=\frac{n_i}{N}

Cumulative absolute frequency of a value x_i of the variable (denoted N_i): the sum of the absolute frequencies of all values of the variable less than or equal to x_i

That is: N_i=n_1+\cdots+n_i=\sum\limits_{j=1}^{i} n_j; N_k=N, where k is the number of distinct values

Cumulative relative frequency of a value x_i of the variable (denoted F_i): the sum of the relative frequencies of all values of the variable less than or equal to x_i

That is: F_i=f_1+\cdots+f_i=\sum\limits_{j=1}^{i} f_j=\frac{N_i}{N}; F_k=1

Cumulative frequencies only make sense if the scale is ordinal or quantitative. When the observed values of a variable are sorted and the repeated values are grouped (determining the frequency of each value), a statistical frequency distribution table is obtained

This set of operations is called tabulation
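As an illustration of tabulation, the following minimal Python sketch builds the four frequency columns defined above from a list of observations. The function name `frequency_table` and the sample data are hypothetical, chosen only for this example

```python
from collections import Counter


def frequency_table(values):
    """Return one row (x_i, n_i, f_i, N_i, F_i) per distinct value, sorted by x_i."""
    counts = Counter(values)   # absolute frequency n_i of each value
    total = len(values)        # total number of data N
    rows = []
    cumulative = 0             # running cumulative absolute frequency N_i
    for x in sorted(counts):
        n = counts[x]
        cumulative += n
        rows.append((x, n, n / total, cumulative, cumulative / total))
    return rows


# Hypothetical sample of 10 observations
sample = [2, 0, 1, 2, 2, 1, 3, 0, 2, 1]
for row in frequency_table(sample):
    print(row)
```

The last row always has N_i equal to N and F_i equal to 1, which is a quick way to check a table built by hand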

When a variable has many different values, the observed values are sometimes grouped into intervals before the analysis (although this is not usually recommended)

In these cases, the intervals are defined (with constant amplitude or not) and then the frequency is calculated for the values of the variable that fall in each interval. That is, the frequencies no longer represent the number or proportion of times a value appears, but the number (or proportion) of observed values that fall in each interval

Each interval is completely determined by its limits, so that for the i-th interval l_{i-1} is the lower limit and l_i is the upper limit

The amplitude of the interval a_i is the distance between the two limits: a_i = l_i - l_{i-1}

To facilitate the mathematical handling of intervals, a specific value of the variable is taken as the representative of each interval. It is called the class mark and is denoted by x_i. The class mark is usually the midpoint of the interval, although care must be taken because the midpoint is not always the best representative of the interval

When the intervals have different amplitudes, a value to consider is the frequency density, which is the number of observations of the variable per unit length

That is: h_i = \frac{n_i}{a_i}

By analogy with the density function (which will be discussed later), in some cases the relative frequency density is used instead, which is simply the proportion of observations per unit length

That is: h'_i = \frac{f_i}{a_i}
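The quantities defined for grouped data (amplitude, class mark and the two densities) can be sketched in Python as follows. The interval limits and frequencies are hypothetical, the function name `grouped_table` is an assumption of this example, and the class mark is taken as the midpoint of each interval

```python
def grouped_table(limits, abs_freq):
    """Return one tuple per interval: (l_{i-1}, l_i, a_i, x_i, n_i, h_i, h'_i)."""
    if len(limits) != len(abs_freq) + 1:
        raise ValueError("there must be exactly one more limit than intervals")
    total = sum(abs_freq)                       # total number of data N
    rows = []
    for i, n in enumerate(abs_freq):
        lower, upper = limits[i], limits[i + 1]
        amplitude = upper - lower               # a_i = l_i - l_{i-1}
        class_mark = (lower + upper) / 2        # midpoint used as class mark x_i
        density = n / amplitude                 # h_i = n_i / a_i
        rel_density = (n / total) / amplitude   # h'_i = f_i / a_i
        rows.append((lower, upper, amplitude, class_mark, n, density, rel_density))
    return rows


# Hypothetical intervals of unequal amplitude: [0,10), [10,20), [20,40), [40,80)
limits = [0, 10, 20, 40, 80]
abs_freq = [5, 15, 20, 10]
for row in grouped_table(limits, abs_freq):
    print(row)
```

In this sample the interval from 20 to 40 has the highest absolute frequency, but the interval from 10 to 20 has the highest density, which is why densities are preferred when amplitudes differ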

Statistical measures

Statistical measures are numerical values that summarize the most important traits of frequency distributions. They are classified into the following groups according to what they try to measure:

\text{Measurements}\left\{\begin{matrix}\text{of position}& \left\{\begin{matrix}\text{central}& \\\text{not central}\end{matrix}\right.& \\ \text{of dispersion}& \left\{\begin{matrix}\text{absolute}& \\\text{relative}\end{matrix}\right.& \\\text{of shape}& \left\{\begin{matrix}\text{of asymmetry}& \\\text{of kurtosis}\end{matrix}\right.& \\\text{of concentration}\end{matrix}\right.

Graphics

To summarize the information it is also very common to use charts. Let's look at some of the simplest:

  • Bar diagram: Used for variables not grouped into intervals. On a system of coordinate axes, the values of the variable are placed on the abscissa axis and the absolute frequencies on the ordinate axis. Then, over each value of the variable, a bar is raised whose height is equal to its absolute frequency

    If relative frequencies are used instead of absolute frequencies, the resulting graph is analogous, with every bar N times shorter

    It is also typically used to display the observed values of a variable

  • Sector diagram: It is generally used for variables not grouped into intervals and consists of dividing the area of a circle into sectors proportional to the frequencies (absolute or relative). The degrees covered by each sector are obtained by a simple rule of three, taking into account that the total number of data (N) corresponds to 360^\circ
  • Frequency histogram: Used for variables grouped into intervals. It is constructed by raising over each interval, represented on the abscissa axis, a rectangle whose area is proportional to the frequency (absolute or relative) of that interval. In general, the height of the rectangle of the i-th interval is proportional to the frequency density. In particular, if all intervals have the same amplitude, the frequencies themselves can be taken as the heights of the rectangles
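The rule of three used for the sector diagram can be sketched in Python as follows. The function name `sector_angles` and the frequencies are hypothetical, used only to illustrate the proportion

```python
def sector_angles(abs_freq):
    """Degrees covered by each sector: n_i is to N as the angle is to 360."""
    total = sum(abs_freq)   # total number of data N, which corresponds to 360 degrees
    return [360 * n / total for n in abs_freq]


# Hypothetical absolute frequencies of four categories
abs_freq = [5, 15, 20, 10]
angles = sector_angles(abs_freq)
print(angles)
print(sum(angles))
```

The angles always add up to 360 degrees, and the same result is obtained if relative frequencies are passed instead of absolute ones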

Measures of position

Measures of position provide information about the data series that we are analyzing

An important element in describing a data set is its location within the range of possible values

Once the basics of the frequency distribution of a variable have been defined, we will study the different ways of summarizing these distributions using measures of position (or centralization), while keeping in mind, through the corresponding measures of dispersion, the error made in the summary

The aim is to find measures that synthesize frequency distributions. Instead of handling all the data about the variables, a task that can be heavy, we can characterize their frequency distribution by a few numerical values, choosing as a summary of the data a value around which the values of the variable are distributed

Measures of central position

Measures of central position, or averages, are values around which the values of the variable are grouped and that summarize the position of the distribution on the horizontal axis. They can also help us synthesize the information provided by the values of the variable

Of the measures of central position, the most commonly used are the arithmetic mean, the median and the mode. In some specific cases, the harmonic mean or the geometric mean is used

Arithmetic mean

The arithmetic mean, \overline{x}, is defined as the sum of all observed values divided by the total number of observations:

That is: \overline{x}=\frac{x_1\cdot n_1+\cdots+x_k\cdot n_k}{N}=\frac{\sum\limits_{i=1}^{k} (x_i\cdot n_i)}{N}

This is the most commonly used average in practice, for the following advantages:

  • Takes into account all the observed values
  • It is easy to calculate and has a clear statistical significance
  • It is unique

However, it has the disadvantage of the influence exerted by the extreme values of the distribution on it

The trimmed mean is obtained by calculating the mean of the observed values after a certain percentage of the extreme values (the same percentage on both sides) has been removed

It is often used to calculate the mean of a variable in which we know, or suspect, that there are extreme values, as these can distort the mean
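A minimal Python sketch of the trimmed (cropped) mean described above follows. The function name `trimmed_mean`, the sample data and the choice of truncating the number of removed observations to a whole number are assumptions of this example, since other conventions exist

```python
def trimmed_mean(values, proportion):
    """Mean after removing `proportion` of the observations from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)      # observations removed on each side
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value (40)
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 40]
print(sum(sample) / len(sample))    # ordinary arithmetic mean
print(trimmed_mean(sample, 0.10))   # 10% removed on each side
```

With this sample the ordinary mean is 7.6, pulled upward by the value 40, while the 10% trimmed mean is 4.25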

Properties of the arithmetic mean

  1. The sum of the deviations (differences with the corresponding sign) of the variable values, relative to their arithmetic mean, is equal to zero

    \sum\limits_{i=1}^{k} (x_i-\overline{x})\cdot n_i=\sum\limits_{i=1}^{k} (x_i\cdot n_i)-\overline{x}\cdot \sum\limits_{i=1}^{k} n_i=N\cdot\overline{x}-N\cdot\overline{x}=0

  2. The mean is affected by changes of origin and scale. If we define u_i=a+b\cdot x_i, where a and b are any values with b nonzero (which is equivalent to a change of origin and scale), the arithmetic mean of the transformed variable is: \overline{u}=a+b\cdot\overline{x}

    The proof is straightforward:

    \overline{u}=\frac{\sum\limits_{i=1}^{k} (u_i\cdot n_i)}{N}=\frac{\sum\limits_{i=1}^{k} (a+b\cdot x_i)\cdot n_i}{N}=\frac{a}{N}\cdot \sum\limits_{i=1}^{k} n_i+\frac{b}{N}\cdot \sum\limits_{i=1}^{k} (x_i\cdot n_i)=\frac{a\cdot N}{N}+\frac{b}{N}\cdot \sum\limits_{i=1}^{k} (x_i\cdot n_i)=a+b\cdot\overline{x}

    This property, conveniently choosing the values a and b, is very useful in many cases, to simplify the calculation of the arithmetic mean
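Both properties can be checked numerically. The following Python sketch uses a small hypothetical frequency distribution, and the names `weighted_mean`, `a` and `b` are chosen only for this example

```python
import math


def weighted_mean(values, freqs):
    """Arithmetic mean of a frequency distribution: sum(x_i * n_i) / N."""
    return sum(x * n for x, n in zip(values, freqs)) / sum(freqs)


# Hypothetical distribution: values x_i with absolute frequencies n_i
values = [1, 2, 4]
freqs = [2, 5, 3]
x_bar = weighted_mean(values, freqs)

# Property 1: the deviations from the mean, weighted by n_i, add up to zero
deviation_sum = sum((x - x_bar) * n for x, n in zip(values, freqs))
print(math.isclose(deviation_sum, 0.0, abs_tol=1e-12))

# Property 2: change of origin and scale, u_i = a + b * x_i
a, b = 10, 3
transformed = [a + b * x for x in values]
u_bar = weighted_mean(transformed, freqs)
print(math.isclose(u_bar, a + b * x_bar))
```

The comparisons use `math.isclose` because floating point arithmetic may leave a tiny rounding error instead of an exact zero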

Example of arithmetic mean

In a vaccination campaign, the number of people vaccinated per hour over the course of 50 hours has been:

0, 3, 2, 2, 1, 4, 5, 2, 3, 2, 1, 0, 4, 3, 5, 3, 1, 4, 6, 1, 2, 3, 0, 4, 4, 5, 3, 1, 4, 2, 3, 1, 0, 6, 3, 2, 5, 3, 2, 3, 6, 2, 2, 5, 7, 4, 2, 7, 4, 2

We want to calculate the average number of people vaccinated in those 50 hours

Before we start calculating the mean, we group the results into a frequency table:

x_i n_i f_i N_i F_i
0 4 0.08 4 0.08
1 6 0.12 10 0.2
2 12 0.24 22 0.44
3 10 0.2 32 0.64
4 8 0.16 40 0.8
5 5 0.1 45 0.9
6 3 0.06 48 0.96
7 2 0.04 50 1

We calculate the arithmetic mean:

\overline{x}=\frac{\sum\limits_{i=1}^{k} (x_i\cdot n_i)}{N}=\frac{0 \cdot 4 + 1 \cdot 6 + 2 \cdot 12 + 3 \cdot 10 + 4 \cdot 8 + 5 \cdot 5 + 6 \cdot 3 + 7 \cdot 2}{50}=\frac{149}{50}=2.98\simeq 3

Therefore, the average number of people vaccinated per hour over those 50 hours was 2.98, that is, approximately 3 when rounded to the nearest whole number
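The same calculation can be reproduced in Python directly from the 50 observations. This is only a sketch: the variable names are chosen for this example, and `collections.Counter` is used to rebuild the absolute frequencies of the table

```python
from collections import Counter

# Number of people vaccinated in each of the 50 hours (data from the example)
data = [
    0, 3, 2, 2, 1, 4, 5, 2, 3, 2,
    1, 0, 4, 3, 5, 3, 1, 4, 6, 1,
    2, 3, 0, 4, 4, 5, 3, 1, 4, 2,
    3, 1, 0, 6, 3, 2, 5, 3, 2, 3,
    6, 2, 2, 5, 7, 4, 2, 7, 4, 2,
]

counts = Counter(data)                                  # x_i -> n_i
N = sum(counts.values())                                # total number of data
weighted_sum = sum(x * n for x, n in counts.items())    # sum of x_i * n_i
mean = weighted_sum / N

print(sorted(counts.items()))
print(weighted_sum, N, mean)
```

The counts match the n_i column of the table, the weighted sum is 149 and the mean is 2.98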

Median

The median is defined as that value of the variable that divides the distribution into two parts with the same number of observations, when they are sorted from lowest to highest

This measure has the advantage, over the mean, that it is less sensitive to extreme values

Example of median

Following the example of the vaccination campaign, we now want to calculate its median

We check the previous frequency table and see that we have N=50 data. To locate the central position we add 1 to the number of data and divide by 2. If N is odd, the result is a whole number and the median is the value in that position. If N is even, as here, the result falls between two positions and the median is the average of the values in those two positions

\frac{50+1}{2}=25.5

Since the result lies between 25 and 26, we take the 2 central positions: 25 and 26

We look in the column of cumulative absolute frequencies for the first value that reaches these positions. The cumulative frequency is 22 for the value 2 and 32 for the value 3, so positions 23 to 32 are all occupied by the value 3, and the values in positions 25 and 26 are both 3

Now we calculate the median value: Me=\frac{3+3}{2}=3

Therefore, in half of the 50 hours 3 or fewer people were vaccinated, and in the other half 3 or more
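The search through the cumulative frequencies can be sketched in Python as follows. The function names `value_at_position` and `median_from_table` are hypothetical, and the values and frequencies are those of the vaccination table

```python
def value_at_position(values, freqs, position):
    """Value occupying the given 1-based position in the sorted data."""
    cumulative = 0
    for x, n in zip(values, freqs):
        cumulative += n                  # cumulative absolute frequency N_i
        if position <= cumulative:
            return x
    raise ValueError("position exceeds the number of observations")


def median_from_table(values, freqs):
    """Median of a frequency distribution whose values are sorted in ascending order."""
    total = sum(freqs)
    if total % 2 == 1:
        # Odd N: a single central position, (N + 1) / 2
        return value_at_position(values, freqs, (total + 1) // 2)
    # Even N: average of the values in positions N/2 and N/2 + 1
    lower = value_at_position(values, freqs, total // 2)
    upper = value_at_position(values, freqs, total // 2 + 1)
    return (lower + upper) / 2


values = [0, 1, 2, 3, 4, 5, 6, 7]
freqs = [4, 6, 12, 10, 8, 5, 3, 2]
print(median_from_table(values, freqs))
```

For the vaccination data the positions examined are 25 and 26, both occupied by the value 3, so the function returns 3.0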

Mode

The mode is defined as the value of the variable whose frequency is not exceeded by that of any other value

It may be the case that the maximum frequency corresponds to 2 or more values of the variable. In that case, the distribution is said to be bimodal or multimodal

Example of mode

Following the example of the vaccination campaign, we now want to calculate its mode

We look in the column of absolute frequencies and see that the largest is 12, which corresponds to the value 2

Therefore, the most frequent number of people vaccinated per hour over those 50 hours was 2
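A short Python sketch of this lookup follows. It returns every value that reaches the maximum frequency, so it also covers the bimodal and multimodal cases. The function name `modes_from_table` is hypothetical

```python
def modes_from_table(values, freqs):
    """All values whose absolute frequency equals the maximum frequency."""
    highest = max(freqs)
    return [x for x, n in zip(values, freqs) if n == highest]


# Vaccination example: values x_i and absolute frequencies n_i from the table
values = [0, 1, 2, 3, 4, 5, 6, 7]
freqs = [4, 6, 12, 10, 8, 5, 3, 2]
print(modes_from_table(values, freqs))

# Hypothetical bimodal distribution: two values share the maximum frequency
print(modes_from_table([1, 2, 3], [4, 4, 1]))
```

For raw, untabulated data, the standard library function `statistics.multimode` gives the same kind of result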

Harmonic mean

The harmonic mean is defined as: Ma(X)=\frac{N}{\frac{n_1}{x_1}+\cdots+\frac{n_k}{x_k}}=\frac{N}{\sum\limits_{i=1}^{k} \frac{n_i}{x_i}}

The advantages of this average are:

  • It is unique
  • Uses all the observed values of the variable

It has the disadvantage that it is strongly influenced by the values of the variable close to zero

This average is used in variables that measure speeds, yields, and, in general, for variables that are the ratio of two magnitudes

Example of harmonic mean

A cyclist completes a training session consisting of 12 series of 1 km, each at constant speed. The data collected during the session are shown in the following table:

Series Speed (km/h)
1 54
2 47
3 46
4 50
5 52
6 47
7 51
8 52
9 49
10 51
11 47
12 50

We want to calculate the average speed of the cyclist during the training session

The arithmetic mean is not appropriate here because speed is the ratio of two magnitudes (V=\frac{e}{t}) and every series covers the same distance: the average speed is the total distance divided by the total time, which is the harmonic mean of the speeds

Taking each series as one observation (n_i=1):

Ma(X)=\frac{N}{\sum\limits_{i=1}^{k} \frac{n_i}{x_i}}=\frac{12}{\frac{1}{54}+\frac{1}{47}+\frac{1}{46}+\frac{1}{50}+\frac{1}{52}+\frac{1}{47}+\frac{1}{51}+\frac{1}{52}+\frac{1}{49}+\frac{1}{51}+\frac{1}{47}+\frac{1}{50}}=49.55139

Therefore, the average speed of the cyclist over the 12 series was 49.55139 km/h
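The reasoning behind the choice of the harmonic mean can be made explicit in Python by computing the average speed as total distance divided by total time. This is a sketch with variable names chosen for this example. `statistics.harmonic_mean` and `statistics.mean` are standard library functions used here for comparison

```python
from statistics import harmonic_mean, mean

# Speed in km/h of each of the 12 series of 1 km (data from the table)
speeds = [54, 47, 46, 50, 52, 47, 51, 52, 49, 51, 47, 50]

total_distance = len(speeds) * 1.0            # km: 12 series of 1 km each
total_time = sum(1.0 / v for v in speeds)     # hours: each km takes 1/v hours
average_speed = total_distance / total_time   # N / sum(1 / x_i)

print(round(average_speed, 5))
print(round(harmonic_mean(speeds), 5))
print(round(mean(speeds), 5))
```

The first two printed values coincide, and the arithmetic mean is slightly higher because it does not account for the extra time spent on the slower series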

Geometric mean

The geometric mean is defined as: Mg(X)=\sqrt[N]{x_1^{n_1}\cdots x_k^{n_k}}=\sqrt[N]{\prod\limits_{i=1}^{k} x_i^{n_i}}

It has the advantage, that in its calculation all observed values of the variable are used

It has the disadvantage of the influence exerted by values close to zero and negative values if N is even

This average is used in variables that measure percentages, rates, or index numbers

In any set of positive observations for which they can be calculated, it always holds that: Ma(X)\leq Mg(X)\leq\overline{x}, with equality only when all the observations are equal

Example of geometric mean

We have the price of a certain product and we know that in the last 3 years its price has risen by 10%, 20% and 30%

We want to know the average annual rise

That is, we want to know by what percentage the price would have had to rise each year (the same percentage every year) to reach the same price after three years

Since the increases are applied multiplicatively, we cannot use the arithmetic mean. We must use the geometric mean of the growth factors

Mg(X)=\sqrt[N]{\prod\limits_{i=1}^{k} x_i^{n_i}}=\sqrt[3]{(1+\frac{10}{100})\cdot(1+\frac{20}{100})\cdot(1+\frac{30}{100})}=\sqrt[3]{1.1\cdot 1.2\cdot 1.3}=1.19721577

Now we convert the result to a percentage by subtracting 1 and multiplying by 100: (1.19721577-1)\cdot 100 =19.721577\%

Thus, the average annual increase over the past 3 years has been approximately 19.72%
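The calculation can be checked with a short Python sketch. The geometric mean is taken over the growth factors (1.1, 1.2 and 1.3), and 1 is subtracted from the result before it is expressed as a percentage. The variable names are chosen only for this example, and `math.prod` requires Python 3.8 or later

```python
import math

increases = [10, 20, 30]                      # annual price rises in percent
factors = [1 + p / 100 for p in increases]    # growth factors 1.1, 1.2, 1.3

total_growth = math.prod(factors)             # price after 3 years / initial price
average_factor = total_growth ** (1 / len(factors))   # geometric mean of the factors
average_rate = (average_factor - 1) * 100     # average annual rise in percent

print(round(average_factor, 8))
print(round(average_rate, 4))

# Check: applying the same average factor for 3 years gives the same final price
print(math.isclose(average_factor ** 3, total_growth))
```

The resulting rate, about 19.72%, is slightly below the arithmetic mean of the three percentages (20%), as expected from the relationship between the geometric and arithmetic means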