Introduction¶

In this notebook we will:

Download the Olympics Dataset, which we will utilize to provide practical code examples.
Define structured and unstructured data.
Delve into the fundamental classifications of structured data, namely numerical and categorical types.
Look at the preferred methods for visualizing structured data.

In [15]:

#Import Libraries

import pandas as pd
import numpy as np

Download Olympics Dataset¶

Image 1 (Source: Kaggle)

Install the opendatasets library

Use the opendatasets.download helper function. To get your Kaggle credentials:

Download the kaggle.json file from your Kaggle account.
Enter your Kaggle username and API key when prompted, or store the kaggle.json file in the same directory as the Jupyter notebook.
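
Before calling the download helper, it can be useful to check whether a credentials file is already in place. The sketch below is only an illustration: `has_kaggle_credentials` is a hypothetical helper (it is not part of opendatasets), and it assumes the token file is named kaggle.json and holds a `username` and a `key` field.

```python
import json
from pathlib import Path


def has_kaggle_credentials(directory="."):
    """Return True if `directory` holds a kaggle.json with a username and a key.

    Hypothetical helper for illustration; not part of opendatasets.
    """
    path = Path(directory) / "kaggle.json"
    if not path.is_file():
        return False
    try:
        credentials = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    # Both fields must be present and non-empty.
    return (
        isinstance(credentials, dict)
        and bool(credentials.get("username"))
        and bool(credentials.get("key"))
    )


if has_kaggle_credentials():
    print("kaggle.json found in the current directory")
else:
    print("kaggle.json not found; expect to be asked for a username and key")
```

The check only inspects the file; it does not contact Kaggle or validate the key.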

List the contents of the directory where the dataset was downloaded, using the os module.

In [16]:

# Use opendatasets library to download olympics data from kaggle

import opendatasets as od
download_url = 'https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results'
od.download(download_url)

Skipping, found downloaded files in ".\120-years-of-olympic-history-athletes-and-results" (use force=True to force download)

In [17]:

#Use OS to access the data dir

import os
data_dir = "./120-years-of-olympic-history-athletes-and-results"
os.listdir(data_dir)

Out[17]:

['athlete_events.csv', 'noc_regions.csv']

In [18]:

data_filename = data_dir + "/athlete_events.csv"

Image 2: Reading and Writing Tabular Data (Source: Pandas Documentation)

In [19]:

# Read the dataset

olympics_df = pd.read_csv(data_filename)

Variables come in different types, each capturing a different kind of data. The type of a variable determines what can, and what cannot, be learned from it, so it is essential to understand the different data types.

Image 3 (Source: Big Data Framework)

The differentiation between quantitative and qualitative data represents the fundamental basis for categorizing data types.

Structured data refers to data that has a well-defined internal structure. Such data is organized so that it can be easily understood and interpreted. It is commonly stored in tables made up of rows and columns, with the values in each column sharing the same meaning.

Unstructured data refers to data that has no predetermined internal structure, so individual items exist independently of one another. It encompasses the many forms of data that cannot be stored in a structured database format. Although unstructured data may have an inherent structure, that structure is not explicitly predefined. Examples include text files such as PDF and DOC documents, as well as media files such as audio, video, and images.
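
To make the contrast concrete, here is a minimal sketch that holds the same facts in both forms. The two records reuse values from the first rows of the Olympics dataset; the free-text sentence is written purely for illustration.

```python
import pandas as pd

# Structured: every row has the same named fields, so it fits a table.
records = [
    {"Name": "A Dijiang", "Sport": "Basketball", "Year": 1992},
    {"Name": "A Lamusi", "Sport": "Judo", "Year": 2012},
]
structured = pd.DataFrame(records)

# Unstructured: the same facts as free text, with no predefined fields.
unstructured = (
    "A Dijiang played basketball in 1992. A Lamusi competed in judo in 2012."
)

# A column can be addressed by name; the text would need parsing first.
print(structured["Year"].tolist())
print(len(unstructured.split()))
```

The table answers "which years?" with a column lookup, while the sentence needs some form of text processing before the same question can be answered.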

Image 5: Representation of data in tabular form with rows and columns
Image source: Embeddings for Tabular Data: A Survey

In [20]:

olympics_df.head(10)

Out[20]:

	ID	Name	Sex	Age	Height	Weight	Team	NOC	Games	Year	Season	City	Sport	Event	Medal
0	1	A Dijiang	M	24.0	180.0	80.0	China	CHN	1992 Summer	1992	Summer	Barcelona	Basketball	Basketball Men's Basketball	NaN
1	2	A Lamusi	M	23.0	170.0	60.0	China	CHN	2012 Summer	2012	Summer	London	Judo	Judo Men's Extra-Lightweight	NaN
2	3	Gunnar Nielsen Aaby	M	24.0	NaN	NaN	Denmark	DEN	1920 Summer	1920	Summer	Antwerpen	Football	Football Men's Football	NaN
3	4	Edgar Lindenau Aabye	M	34.0	NaN	NaN	Denmark/Sweden	DEN	1900 Summer	1900	Summer	Paris	Tug-Of-War	Tug-Of-War Men's Tug-Of-War	Gold
4	5	Christine Jacoba Aaftink	F	21.0	185.0	82.0	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 500 metres	NaN
5	5	Christine Jacoba Aaftink	F	21.0	185.0	82.0	Netherlands	NED	1988 Winter	1988	Winter	Calgary	Speed Skating	Speed Skating Women's 1,000 metres	NaN
6	5	Christine Jacoba Aaftink	F	25.0	185.0	82.0	Netherlands	NED	1992 Winter	1992	Winter	Albertville	Speed Skating	Speed Skating Women's 500 metres	NaN
7	5	Christine Jacoba Aaftink	F	25.0	185.0	82.0	Netherlands	NED	1992 Winter	1992	Winter	Albertville	Speed Skating	Speed Skating Women's 1,000 metres	NaN
8	5	Christine Jacoba Aaftink	F	27.0	185.0	82.0	Netherlands	NED	1994 Winter	1994	Winter	Lillehammer	Speed Skating	Speed Skating Women's 500 metres	NaN
9	5	Christine Jacoba Aaftink	F	27.0	185.0	82.0	Netherlands	NED	1994 Winter	1994	Winter	Lillehammer	Speed Skating	Speed Skating Women's 1,000 metres	NaN

Structured data can be categorized into two basic types: numerical and categorical.

In [21]:

#All columns in the dataframe

olympics_df.columns

Out[21]:

Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal'],
      dtype='object')

Image 6: Pandas Data Table (Source: Pandas Documentation)

In [22]:

# Concise summary of a DataFrame to explore the basic data types in a pandas data frame
olympics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB

In [23]:

# Prints a summary of columns count and its dtypes but not per column information:

olympics_df.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Columns: 15 entries, ID to Medal
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
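
The dtype summary that info() prints can also be assembled by hand, which makes it easier to see which columns fall under each dtype. The sketch below uses a small stand-in frame named `sample` (an assumption, so the block runs without the downloaded file); the same loop should work on `olympics_df`.

```python
import pandas as pd

# Small stand-in for the Olympics frame, using values from its first rows.
sample = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "Sex": ["M", "M", "M"],
        "Age": [24.0, 23.0, 24.0],
        "Height": [180.0, 170.0, None],
        "Year": [1992, 2012, 1920],
    }
)

# Map each dtype name to the list of columns stored with that dtype.
columns_by_dtype = {}
for column, dtype in sample.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

print(columns_by_dtype)
```

Note that Height stays float64 even though one value is missing, which is also why Age, Height, and Weight appear as floats in the full dataset.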

Numerical Data¶

Quantitative data, also known as numerical data, is recorded as numbers and serves as an objective measurement or count. It encompasses various factors such as temperature, weight, and the number of transactions.

In [24]:

# Get numeric columns only

olympics_df_numerics =  olympics_df.select_dtypes(include=np.number)

In [25]:

olympics_df_numerics.head(10)

Out[25]:

	ID	Age	Height	Weight	Year
0	1	24.0	180.0	80.0	1992
1	2	23.0	170.0	60.0	2012
2	3	24.0	NaN	NaN	1920
3	4	34.0	NaN	NaN	1900
4	5	21.0	185.0	82.0	1988
5	5	21.0	185.0	82.0	1988
6	5	25.0	185.0	82.0	1992
7	5	25.0	185.0	82.0	1992
8	5	27.0	185.0	82.0	1994
9	5	27.0	185.0	82.0	1994

Numeric data can be classified into two types: continuous and discrete.

Continuous¶

Continuous data is data that can take on any value within a given range, such as wind speed or time duration.

Continuous data can be divided into two types according to the scale of measurement used: interval and ratio.

Interval scales:¶

These scales lack a true zero point. The Celsius scale, for example, has a zero value, but 0 °C does not indicate the absence of temperature. The absence of a true zero matters when interpreting statistics: differences between values are meaningful, but ratios are not.
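
A short numeric sketch can show why the missing true zero matters. It compares two temperatures in Celsius and in Kelvin; Kelvin is brought in here only as an illustration of a temperature scale whose zero is a true zero, and the two temperatures are arbitrary.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin, a scale with a true zero."""
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on an interval scale: both scales agree.
difference_c = warm_c - cold_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are not: the Celsius ratio depends on where zero was placed.
ratio_c = warm_c / cold_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"difference: {difference_c:.1f} in Celsius, {difference_k:.1f} in Kelvin")
print(f"ratio: {ratio_c:.2f} in Celsius, {ratio_k:.3f} in Kelvin")
```

The difference is the same on both scales, while the apparent "twice as warm" in Celsius shrinks to a ratio of roughly 1.035 in Kelvin.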

Ratio scales:¶

These scales possess a true zero, which signifies the absence of the property being measured. To illustrate, zero kilograms indicates the absence of weight. As a result, ratios of measurements are meaningful on these scales. For instance, 30 kg is three times the weight of 10 kg. Furthermore, you can perform addition, subtraction, multiplication, and division with values on a ratio scale.
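
One way to check that a ratio is meaningful is to see whether it survives a change of units. The sketch below converts the 10 kg and 30 kg example to pounds; the conversion factor is approximate and the variable names are illustrative.

```python
KG_TO_LB = 2.20462  # approximate kilograms-to-pounds factor

light_kg, heavy_kg = 10.0, 30.0

# Ratio in the original unit.
ratio_kg = heavy_kg / light_kg

# Ratio after converting both values to pounds.
ratio_lb = (heavy_kg * KG_TO_LB) / (light_kg * KG_TO_LB)

print(f"ratio in kg: {ratio_kg:.1f}, ratio in lb: {ratio_lb:.1f}")
```

Because both units share the same true zero, the ratio stays at three whichever unit is used.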

Continuous variables allow for the assessment of various properties, including but not limited to the mean, median, distribution, range, and standard deviation. These measures provide valuable insights into the characteristics of the data.
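
As a minimal sketch of these measures, the block below summarizes a handful of Height values copied from the first rows of the dataset, including the two missing entries. The unit is assumed to be centimetres, and the series name `heights` is illustrative.

```python
import pandas as pd

# Heights (assumed centimetres) from the first five rows; None marks missing values.
heights = pd.Series([180.0, 170.0, None, None, 185.0], name="Height")

summary = {
    "mean": heights.mean(),                  # missing values are skipped by default
    "median": heights.median(),
    "range": heights.max() - heights.min(),
    "std": heights.std(),                    # sample standard deviation (ddof=1)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On the full frame, `olympics_df["Height"].describe()` reports several of these measures in one call.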

Discrete¶

Discrete data can only take on certain values, such as the count of occurrences of an event. An example is the number of children in a family.

These counts are non-negative whole numbers that cannot be further divided into smaller increments.

When dealing with discrete variables, you can compute the frequency of each value or summarize the counts with measures such as the mean, sum, and standard deviation. As an illustration, in 2014 the average number of vehicles in U.S. households was 2.
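
The following sketch computes a frequency table and summary measures for a discrete variable. The household values are made up for illustration and do not come from the Olympics dataset.

```python
import pandas as pd

# Hypothetical number of children in eight households.
children = pd.Series([0, 2, 1, 2, 3, 0, 2, 1], name="children")

# How often each count occurs, ordered by the count itself.
frequency = children.value_counts().sort_index()

total = children.sum()
mean = children.mean()

print(frequency.to_dict())
print(f"total: {total}, mean: {mean}")
```

Each observation is a whole number, yet the mean does not have to be, which is worth remembering when reporting averages of counts.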

Nevertheless, some variables recorded as numbers, like area codes, are not quantitative variables because their values do not measure a quantity. To illustrate, a bank may want to know the average loan size provided to its customers, but calculating an "average" area code would be nonsensical.
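
The ID column in the Olympics data is a similar case: it is stored as int64, yet it is a label rather than a quantity. The sketch below, on a small stand-in frame named `sample` (an assumption so the block runs on its own), shows one possible way to keep such a column out of numeric summaries by storing it as text.

```python
import numpy as np
import pandas as pd

# Stand-in frame with an identifier column and a genuine quantity.
sample = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Age": [24.0, 23.0, 24.0, 34.0, 21.0],
    }
)

# pandas will average an identifier without complaint, but the result means nothing.
meaningless_mean = sample["ID"].mean()

# Storing the identifier as text keeps it out of numeric selections.
sample["ID"] = sample["ID"].astype(str)
numeric_columns = sample.select_dtypes(include=np.number).columns.tolist()

print(meaningless_mean, numeric_columns)
```

Whether to convert identifiers like this is a design choice; the point is only that a numeric dtype does not by itself make a variable quantitative.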

When examining quantitative variables, it is essential to focus on two significant aspects: the central tendency and the variability (often referred to as "spread") of the data. For example, one might inquire about the average annual precipitation and the extent of variation observed from year to year.

Categorical Data¶

A variable is referred to as categorical when each observation is assigned to one of a predefined set of categories.

For example, gender is divided into two categories: male and female. Religious affiliation encompasses various categories, including Catholic, Jewish, Muslim, Protestant, Other, and None.

When dealing with categorical variables, it is crucial to analyze the number of observations across different categories. One essential aspect to consider is the proportion or percentage of individuals falling into each category. For instance, one might be interested in determining the percentage of students who identify as Democrats within a particular college.
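
Counts and proportions per category can be computed with `value_counts`. The sketch below uses a made-up handful of Medal values, with None standing in for rows without a medal; the series name `medals` is illustrative.

```python
import pandas as pd

# Hypothetical Medal values for eight rows; None marks "no medal".
medals = pd.Series(
    ["Gold", None, "Silver", None, "Gold", "Bronze", None, None],
    name="Medal",
)

# dropna=False keeps the missing values as their own category.
counts = medals.value_counts(dropna=False)
proportions = medals.value_counts(normalize=True, dropna=False)

print(counts)
print(proportions)
```

Leaving `dropna` at its default would instead report each medal's share among medal-winning rows only, so the choice depends on the question being asked.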

In [26]:

# Method 1 : Get categorical columns only

olympics_df_categorical =  olympics_df.select_dtypes(include=np.object_)

In [27]:

olympics_df_categorical.head(10)

Out[27]:

	Name	Sex	Team	NOC	Games	Season	City	Sport	Event	Medal
0	A Dijiang	M	China	CHN	1992 Summer	Summer	Barcelona	Basketball	Basketball Men's Basketball	NaN
1	A Lamusi	M	China	CHN	2012 Summer	Summer	London	Judo	Judo Men's Extra-Lightweight	NaN
2	Gunnar Nielsen Aaby	M	Denmark	DEN	1920 Summer	Summer	Antwerpen	Football	Football Men's Football	NaN
3	Edgar Lindenau Aabye	M	Denmark/Sweden	DEN	1900 Summer	Summer	Paris	Tug-Of-War	Tug-Of-War Men's Tug-Of-War	Gold
4	Christine Jacoba Aaftink	F	Netherlands	NED	1988 Winter	Winter	Calgary	Speed Skating	Speed Skating Women's 500 metres	NaN
5	Christine Jacoba Aaftink	F	Netherlands	NED	1988 Winter	Winter	Calgary	Speed Skating	Speed Skating Women's 1,000 metres	NaN
6	Christine Jacoba Aaftink	F	Netherlands	NED	1992 Winter	Winter	Albertville	Speed Skating	Speed Skating Women's 500 metres	NaN
7	Christine Jacoba Aaftink	F	Netherlands	NED	1992 Winter	Winter	Albertville	Speed Skating	Speed Skating Women's 1,000 metres	NaN
8	Christine Jacoba Aaftink	F	Netherlands	NED	1994 Winter	Winter	Lillehammer	Speed Skating	Speed Skating Women's 500 metres	NaN
9	Christine Jacoba Aaftink	F	Netherlands	NED	1994 Winter	Winter	Lillehammer	Speed Skating	Speed Skating Women's 1,000 metres	NaN

Visualization of structured data¶

Histograms are a highly effective method for visually representing continuous variables, as they vividly illustrate the distribution of values.
Scatterplots serve as excellent tools for visually representing the connection between two continuous variables.
Bar charts are a commonly used method for visually representing discrete variables.
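
The three chart types above can be sketched with matplotlib as follows. This is a minimal illustration on a few rows copied from the dataset preview, not a finished analysis: the frame name `sample`, the axis labels, and the units (centimetres and kilograms) are assumptions, and the bar chart plots the discrete count of rows per sport.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# A few rows taken from the dataset preview shown earlier.
sample = pd.DataFrame(
    {
        "Height": [180.0, 170.0, 185.0, 185.0, 185.0, 185.0],
        "Weight": [80.0, 60.0, 82.0, 82.0, 82.0, 82.0],
        "Sport": ["Basketball", "Judo"] + ["Speed Skating"] * 4,
    }
)

fig, (ax_hist, ax_scatter, ax_bar) = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: distribution of one continuous variable.
ax_hist.hist(sample["Height"], bins=5)
ax_hist.set(title="Histogram", xlabel="Height (cm, assumed)", ylabel="Rows")

# Scatterplot: relationship between two continuous variables.
ax_scatter.scatter(sample["Height"], sample["Weight"])
ax_scatter.set(title="Scatterplot", xlabel="Height (cm, assumed)", ylabel="Weight (kg, assumed)")

# Bar chart: a discrete count for each sport.
sport_counts = sample["Sport"].value_counts()
ax_bar.bar(sport_counts.index, sport_counts.values)
ax_bar.set(title="Bar chart", xlabel="Sport", ylabel="Rows")

fig.tight_layout()
```

With the full dataset loaded, the same calls could be pointed at `olympics_df` after dropping rows with missing Height or Weight.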

Reference and Further Reading¶

Embeddings for Tabular Data: A Survey by Rajat Singh and Srikanta Bedathur
Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, and Peter Gedeck
Introduction to Statistics: An Intuitive Guide by Jim Frost
Pandas Documentation