Introduction¶
In this notebook we will:
- Download the Olympics Dataset, which we will utilize to provide practical code examples.
- Define structured and unstructured data.
- Delve into the fundamental classifications of structured data, namely numerical and categorical types.
- Look at the preferred methods for visualizing structured data.
#Import Libraries
import pandas as pd
import numpy as np
Download Olympics Dataset¶
- Install the opendatasets library.
- Use the opendatasets.download helper function.
- Get Kaggle credentials.
  - Download the kaggle.json file.
  - Enter your username and Kaggle API key, or store the kaggle.json file in the same directory as the Jupyter notebook.
- Query the directory where the dataset has been downloaded using the os module.
# Use opendatasets library to download olympics data from kaggle
import opendatasets as od
download_url = 'https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results'
od.download(download_url)
Skipping, found downloaded files in ".\120-years-of-olympic-history-athletes-and-results" (use force=True to force download)
#Use OS to access the data dir
import os
data_dir = "./120-years-of-olympic-history-athletes-and-results"
os.listdir(data_dir)
['athlete_events.csv', 'noc_regions.csv']
data_filename = data_dir + "/athlete_events.csv"
Image 2: Reading and Writing Tabular Data (Source: Pandas Documentation)
# Read the dataset
olympics_df = pd.read_csv(data_filename)
Variables come in various types, each capturing a different kind of data. The type of data determines what can and cannot be learned from it, so it is essential to understand the different data types.
Image 3 (Source: Big Data Framework)
The differentiation between quantitative and qualitative data represents the fundamental basis for categorizing data types.
Structured data refers to data that possesses a well-defined internal framework. Such data is meticulously organized and can be easily understood and interpreted. It is commonly stored in table formats, comprising rows and columns, with the data in each column sharing a consistent semantic significance.
Unstructured data refers to data that has no predetermined internal structure. It encompasses various forms that cannot be stored in a structured database format. Although unstructured data may possess inherent structure, that structure is not explicitly predefined. Examples include text files such as PDF and DOC documents, as well as media files such as audio, video, and images.
Image 5: Representation of data in tabular form with rows and columns
Image Source: Embeddings for Tabular Data: A Survey
olympics_df.head(10)
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 5 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
| 6 | 5 | Christine Jacoba Aaftink | F | 25.0 | 185.0 | 82.0 | Netherlands | NED | 1992 Winter | 1992 | Winter | Albertville | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 7 | 5 | Christine Jacoba Aaftink | F | 25.0 | 185.0 | 82.0 | Netherlands | NED | 1992 Winter | 1992 | Winter | Albertville | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
| 8 | 5 | Christine Jacoba Aaftink | F | 27.0 | 185.0 | 82.0 | Netherlands | NED | 1994 Winter | 1994 | Winter | Lillehammer | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 9 | 5 | Christine Jacoba Aaftink | F | 27.0 | 185.0 | 82.0 | Netherlands | NED | 1994 Winter | 1994 | Winter | Lillehammer | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
Structured data can be categorized into two basic types: numerical and categorical.
#All columns in the dataframe
olympics_df.columns
Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
'Year', 'Season', 'City', 'Sport', 'Event', 'Medal'],
dtype='object')
Image 6: Pandas Data Table (Source: Pandas Documentation)
# Concise summary of a DataFrame to explore the basic data types in a pandas data frame
olympics_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   ID      271116 non-null  int64
 1   Name    271116 non-null  object
 2   Sex     271116 non-null  object
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object
 7   NOC     271116 non-null  object
 8   Games   271116 non-null  object
 9   Year    271116 non-null  int64
 10  Season  271116 non-null  object
 11  City    271116 non-null  object
 12  Sport   271116 non-null  object
 13  Event   271116 non-null  object
 14  Medal   39783 non-null   object
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
# Print a summary of the column count and dtypes, without per-column information
olympics_df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Columns: 15 entries, ID to Medal
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
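The non-null counts reported by info() can also be obtained column by column. The following is a minimal sketch, not part of the original notebook: sample_df is a hypothetical stand-in built from the first four rows shown by olympics_df.head(10), so that the example runs without downloading the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative sample: a few columns from the first four rows of olympics_df.head(10)
sample_df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Age": [24.0, 23.0, 24.0, 34.0],
    "Height": [180.0, 170.0, np.nan, np.nan],
    "Weight": [80.0, 60.0, np.nan, np.nan],
    "Medal": [np.nan, np.nan, np.nan, "Gold"],
})

# Data type of each column
print(sample_df.dtypes)

# Number of missing values per column (row count minus the non-null count)
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

On the full dataset, the same two calls on olympics_df should agree with the Non-Null Count and Dtype columns of the info() output.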
Numerical Data¶
Quantitative data, also known as numerical data, is recorded as numbers and serves as an objective measurement or count. It encompasses various factors such as temperature, weight, and the number of transactions.
# Get numeric columns only
olympics_df_numerics = olympics_df.select_dtypes(include=np.number)
olympics_df_numerics.head(10)
| | ID | Age | Height | Weight | Year |
|---|---|---|---|---|---|
| 0 | 1 | 24.0 | 180.0 | 80.0 | 1992 |
| 1 | 2 | 23.0 | 170.0 | 60.0 | 2012 |
| 2 | 3 | 24.0 | NaN | NaN | 1920 |
| 3 | 4 | 34.0 | NaN | NaN | 1900 |
| 4 | 5 | 21.0 | 185.0 | 82.0 | 1988 |
| 5 | 5 | 21.0 | 185.0 | 82.0 | 1988 |
| 6 | 5 | 25.0 | 185.0 | 82.0 | 1992 |
| 7 | 5 | 25.0 | 185.0 | 82.0 | 1992 |
| 8 | 5 | 27.0 | 185.0 | 82.0 | 1994 |
| 9 | 5 | 27.0 | 185.0 | 82.0 | 1994 |
Numeric data can be classified into two types:
Continuous¶
- Continuous data is data that can take on any value within a given range, such as wind speed or time duration.
- Continuous data can be divided into two types based on the scale of measurement used.
Interval scales:¶
- These scales lack a true zero point. The Celsius scale, for example, has a zero value, but 0 °C does not indicate the absence of temperature. The absence of a true zero matters when interpreting statistics, because ratios between values are not meaningful.
Ratio scales:¶
- These scales possess a true zero, which signifies the absence of the property being measured. To illustrate, zero kilograms indicates the absence of weight. As a result, ratios of measurements are meaningful on these scales. For instance, 30 kg is three times the weight of 10 kg. Consequently, addition, subtraction, multiplication, and division are all meaningful for values on a ratio scale.
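The difference between the two scales can be checked with a little arithmetic. The sketch below reuses the 30 kg and 10 kg example and the Celsius scale from the text. The conversion to kelvin is an addition for illustration: kelvin is a temperature scale with a true zero, obtained by adding 273.15 to a Celsius value.

```python
# Ratio scale: zero kilograms means "no weight", so ratios are meaningful
weight_ratio = 30.0 / 10.0   # 30 kg is three times 10 kg

# Interval scale: 0 °C is an arbitrary reference point, not "no temperature",
# so this ratio is a number without physical meaning
celsius_ratio = 30.0 / 10.0

# Converting to kelvin (true zero) gives the physically meaningful ratio
kelvin_ratio = (30.0 + 273.15) / (10.0 + 273.15)

# Differences are meaningful on both kinds of scale
celsius_diff = 30.0 - 10.0
kelvin_diff = (30.0 + 273.15) - (10.0 + 273.15)

print(weight_ratio, celsius_ratio, round(kelvin_ratio, 3))
print(celsius_diff, round(kelvin_diff, 2))
```

The Celsius ratio comes out as 3, yet the kelvin ratio is only about 1.07, which is why "30 °C is three times as hot as 10 °C" is not a valid statement, while the 20-degree difference holds on both scales.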
Continuous variables allow for the assessment of various properties, including but not limited to the mean, median, distribution, range, and standard deviation. These measures provide valuable insights into the characteristics of the data.
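As a rough illustration of these measures, the sketch below computes them for the Height values of the ten rows shown by olympics_df.head(10). The Series named height is a hypothetical stand-in for olympics_df["Height"], and because six of the ten rows belong to the same athlete, the numbers only demonstrate the calls and say nothing about the full dataset.

```python
import numpy as np
import pandas as pd

# Height values from the ten rows shown by olympics_df.head(10)
height = pd.Series(
    [180.0, 170.0, np.nan, np.nan, 185.0, 185.0, 185.0, 185.0, 185.0, 185.0],
    name="Height",
)

summary = {
    "mean": height.mean(),                 # NaN values are skipped by default
    "median": height.median(),
    "std": height.std(),                   # sample standard deviation (ddof=1)
    "range": height.max() - height.min(),
}
print(summary)
```

For a quick overview of several of these measures at once, pandas also offers height.describe().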
Discrete¶
- Discrete data is data that can only take on certain values, such as the count of occurrences of an event. An example is the number of children.
- These counts are non-negative whole numbers that cannot be further divided into smaller increments.
- When dealing with discrete variables, it is possible to compute and evaluate the frequency or a summary of the count, such as the mean, sum, and standard deviation. As an illustration, in the year 2014, the average number of vehicles in U.S. households was 2.
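A minimal sketch of such count summaries, assuming that each row of the dataset is one event entry for an athlete. The Series named ids is a hypothetical stand-in holding the ID values of the ten rows shown by olympics_df.head(10).

```python
import pandas as pd

# Athlete IDs from the ten rows shown by olympics_df.head(10)
ids = pd.Series([1, 2, 3, 4, 5, 5, 5, 5, 5, 5], name="ID")

# Number of rows per athlete: a discrete, whole-number count
entries_per_athlete = ids.value_counts().sort_index()
print(entries_per_athlete)

# Summaries of the counts; the mean of whole numbers need not be whole
total_entries = entries_per_athlete.sum()
mean_entries = entries_per_athlete.mean()
print(total_entries, mean_entries)
```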
However, some variables recorded as numbers, such as area codes, are not quantitative, because their values are labels rather than measured quantities. To illustrate, a bank may want to know the average loan size it provides to its customers, but an "average" area code would be meaningless.
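The ID column of the Olympics data appears to be a similar case: it is stored as int64, so a dtype-based selection treats it as numeric, although it identifies an athlete rather than measuring anything. The sketch below shows one possible way to handle this, casting the identifier to a string. Both sample_df and this choice of cast are assumptions for illustration, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Illustrative sample: three columns from the first five rows of olympics_df.head(10)
sample_df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [24.0, 23.0, 24.0, 34.0, 21.0],
    "Year": [1992, 2012, 1920, 1900, 1988],
})

# ID is stored as an integer, so it is selected as a numeric column
numeric_before = list(sample_df.select_dtypes(include=np.number).columns)

# Casting the identifier to string marks it as a label rather than a quantity
relabeled_df = sample_df.astype({"ID": str})
numeric_after = list(relabeled_df.select_dtypes(include=np.number).columns)

print(numeric_before)
print(numeric_after)
```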
When examining quantitative variables, it is essential to focus on two significant aspects: the central tendency and the variability (often referred to as "spread") of the data. For example, one might inquire about the average annual precipitation and the extent of variation observed from year to year.
Categorical Data¶
A variable is referred to as categorical when each observation is assigned to one of a predefined set of categories.
- Gender is divided into two categories: male and female. Religious affiliation encompasses various categories, including Catholic, Jewish, Muslim, Protestant, Other, and None.
- When dealing with categorical variables, it is crucial to analyze the number of observations across different categories. One essential aspect to consider is the proportion or percentage of individuals falling into each category. For instance, one might be interested in determining the percentage of students who identify as Democrats within a particular college.
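Counts and proportions per category can be computed with value_counts. The sketch below uses the Season values of the ten rows shown by olympics_df.head(10); the Series named season is a hypothetical stand-in for olympics_df["Season"].

```python
import pandas as pd

# Season values from the ten rows shown by olympics_df.head(10)
season = pd.Series(["Summer"] * 4 + ["Winter"] * 6, name="Season")

# Number of observations in each category
counts = season.value_counts()

# Share of observations in each category (fractions that sum to 1)
proportions = season.value_counts(normalize=True)

print(counts)
print(proportions)
```

Multiplying proportions by 100 gives percentages.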
# Method 1 : Get categorical columns only
olympics_df_categorical = olympics_df.select_dtypes(include=np.object_)
olympics_df_categorical.head(10)
| | Name | Sex | Team | NOC | Games | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A Dijiang | M | China | CHN | 1992 Summer | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | A Lamusi | M | China | CHN | 2012 Summer | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | Gunnar Nielsen Aaby | M | Denmark | DEN | 1920 Summer | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | Edgar Lindenau Aabye | M | Denmark/Sweden | DEN | 1900 Summer | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | Christine Jacoba Aaftink | F | Netherlands | NED | 1988 Winter | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 5 | Christine Jacoba Aaftink | F | Netherlands | NED | 1988 Winter | Winter | Calgary | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
| 6 | Christine Jacoba Aaftink | F | Netherlands | NED | 1992 Winter | Winter | Albertville | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 7 | Christine Jacoba Aaftink | F | Netherlands | NED | 1992 Winter | Winter | Albertville | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
| 8 | Christine Jacoba Aaftink | F | Netherlands | NED | 1994 Winter | Winter | Lillehammer | Speed Skating | Speed Skating Women's 500 metres | NaN |
| 9 | Christine Jacoba Aaftink | F | Netherlands | NED | 1994 Winter | Winter | Lillehammer | Speed Skating | Speed Skating Women's 1,000 metres | NaN |
Visualization of structured data¶
Histograms are a highly effective method for visually representing continuous variables, as they vividly illustrate the distribution of values.
Scatterplots serve as excellent tools for visually representing the connection between two continuous variables.
Bar charts are a commonly used method for visually representing discrete variables.
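The three chart types can be sketched with matplotlib, which is an assumption here since the notebook does not name a plotting library. sample_df is a hypothetical stand-in built from the ten rows shown by olympics_df.head(10), and the bar chart plots the number of rows per sport, a discrete count.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative sample: the ten rows shown by olympics_df.head(10)
sample_df = pd.DataFrame({
    "Age": [24.0, 23.0, 24.0, 34.0, 21.0, 21.0, 25.0, 25.0, 27.0, 27.0],
    "Height": [180.0, 170.0, np.nan, np.nan, 185.0, 185.0, 185.0, 185.0, 185.0, 185.0],
    "Weight": [80.0, 60.0, np.nan, np.nan, 82.0, 82.0, 82.0, 82.0, 82.0, 82.0],
    "Sport": ["Basketball", "Judo", "Football", "Tug-Of-War"] + ["Speed Skating"] * 6,
})

fig, (ax_hist, ax_scatter, ax_bar) = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: distribution of a single continuous variable
ax_hist.hist(sample_df["Age"].dropna(), bins=5)
ax_hist.set(title="Age distribution", xlabel="Age", ylabel="Rows")

# Scatterplot: relationship between two continuous variables
body = sample_df[["Height", "Weight"]].dropna()
ax_scatter.scatter(body["Height"], body["Weight"])
ax_scatter.set(title="Height vs. weight", xlabel="Height", ylabel="Weight")

# Bar chart: number of rows in each category (a discrete count)
sport_counts = sample_df["Sport"].value_counts()
ax_bar.bar(list(sport_counts.index), sport_counts.to_numpy())
ax_bar.set(title="Rows per sport", ylabel="Rows")
ax_bar.tick_params(axis="x", rotation=45)

fig.tight_layout()
```

To plot the full dataset, replace sample_df with olympics_df; with 271,116 rows, a larger bins value for the histogram is likely to be more informative.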
Reference and Further Reading¶
- Embeddings for Tabular Data: A Survey by Rajat Singh and Srikanta Bedathur
- Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, and Peter Gedeck
- Introduction to Statistics: An Intuitive Guide by Jim Frost
- Pandas Documentation