Data Frames¶
A DataFrame is a structured representation of tabular data in the form of a rectangular table. It is an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). Its layout resembles a spreadsheet (see Image 1 below) or a database table.
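As a minimal sketch of this idea, the snippet below builds a small DataFrame by hand. The column names and values are invented for illustration and do not come from any dataset discussed here.

```python
import pandas as pd

# Hypothetical toy data: each key becomes a column,
# and each list holds that column's values.
songs = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C"],  # string column
        "duration_s": [215, 187, 242],            # integer column
        "tempo": [120.5, 98.0, 131.2],            # floating-point column
        "explicit": [False, True, False],         # boolean column
    }
)

print(songs)
print(songs.dtypes)  # one data type per column
print(songs.shape)   # (number of rows, number of columns)
```

Each column has a single data type, while different columns are free to differ from one another.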
Importance of DataFrames¶
Understanding DataFrames, or rectangular data, is important for data analysts because it is the foundation for handling structured data in many analytical tasks.
DataFrames are a fundamental data structure in several programming languages and libraries; Python's pandas library and R are notable examples.
They are commonly used to store and manipulate data in tabular form, much like a spreadsheet or a database table.
- A table is a grid of rows and columns that stores data. Each row holds a collection of columns, and each column contains data of a specified type.
- Understanding tables is fundamental to understanding the data in your database.
Image 1: Representation of data in tabular form with rows and columns.
Source: Embeddings for Tabular Data: A Survey
The DataFrame (see Image 2 below) comprises one row for each record and one column for each variable.
Image 2: A DataFrame. Source: Pandas Documentation
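The record/variable layout can be sketched with a few made-up records. The field names (`track`, `artist`, `ms_played`) and values below are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical records: each dict is one record (row),
# and each key is one variable (column).
records = [
    {"track": "Song A", "artist": "Artist X", "ms_played": 191772},
    {"track": "Song B", "artist": "Artist Y", "ms_played": 25058},
    {"track": "Song C", "artist": "Artist X", "ms_played": 145610},
]
df = pd.DataFrame(records)

n_records, n_variables = df.shape
print(n_records, n_variables)

first_record = df.iloc[0]    # one row: every variable for a single record
ms_played = df["ms_played"]  # one column: a single variable across all records
print(first_record)
print(ms_played)
```

Selecting by position with `iloc` returns a record, while selecting by column name returns a variable.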
The About section of the Spotify Song Attributes Dataset is a good example of where these names are used.
Unstructured data must be processed and manipulated to transform it into a structured format, represented as a set of features within rectangular data. For instance, web scraping is a common method used for this purpose.
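One way to picture this transformation is to parse free-text lines into named features. The sketch below is illustrative only: the input lines, the regular expression, and the feature names are all assumptions, standing in for whatever raw text a scraper might return.

```python
import re
import pandas as pd

# Hypothetical unstructured input, such as lines scraped from a web page.
raw_lines = [
    "Song A by Artist X (3:35)",
    "Song B by Artist Y (3:07)",
    "Song C by Artist X (4:02)",
]

# One named group per feature we want to extract.
pattern = re.compile(
    r"^(?P<track>.+) by (?P<artist>.+) \((?P<minutes>\d+):(?P<seconds>\d{2})\)$"
)

rows = []
for line in raw_lines:
    match = pattern.match(line)
    if match is None:
        continue  # skip lines that do not fit the expected shape
    rows.append(
        {
            "track": match["track"],
            "artist": match["artist"],
            "duration_s": int(match["minutes"]) * 60 + int(match["seconds"]),
        }
    )

# Each parsed line is now one record; each extracted field is one feature.
features = pd.DataFrame(rows)
print(features)
```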
Download the Spotify Song Attributes Dataset¶
We will use this dataset to demonstrate a pandas DataFrame.
# Import pandas and the os module
import pandas as pd
import os
# Use the opendatasets library to download the data from Kaggle
import opendatasets as od
download_url = 'https://www.kaggle.com/datasets/byomokeshsenapati/spotify-song-attributes'
od.download(download_url)
Skipping, found downloaded files in ".\spotify-song-attributes" (use force=True to force download)
# Use the os module to list the files in the data directory
import os
data_dir = "./spotify-song-attributes"
os.listdir(data_dir)
['Spotify_Song_Attributes.csv']
# Build the path to the CSV file inside the data directory
data_filename = os.path.join(data_dir, 'Spotify_Song_Attributes.csv')
# Read the dataset CSV file into a DataFrame
spotify_df = pd.read_csv(data_filename)
# Display the DataFrame
spotify_df
| | trackName | artistName | msPlayed | genre | danceability | energy | key | loudness | mode | speechiness | ... | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | "Honest" | Nico Collins | 191772 | NaN | 0.476 | 0.799 | 4.0 | -4.939 | 0.0 | 0.2120 | ... | 0.2570 | 0.577 | 162.139 | audio_features | 7dTxqsaFGHOXwtzHINjfHv | spotify:track:7dTxqsaFGHOXwtzHINjfHv | https://api.spotify.com/v1/tracks/7dTxqsaFGHOX... | https://api.spotify.com/v1/audio-analysis/7dTx... | 191948.0 | 4.0 |
| 1 | "In The Hall Of The Mountain King" from Peer G... | London Symphony Orchestra | 1806234 | british orchestra | 0.475 | 0.130 | 7.0 | -17.719 | 1.0 | 0.0510 | ... | 0.1010 | 0.122 | 112.241 | audio_features | 14Qcrx6Dfjvcj0H8oV8oUW | spotify:track:14Qcrx6Dfjvcj0H8oV8oUW | https://api.spotify.com/v1/tracks/14Qcrx6Dfjvc... | https://api.spotify.com/v1/audio-analysis/14Qc... | 150827.0 | 4.0 |
| 2 | #BrooklynBloodPop! | SyKo | 145610 | glitchcore | 0.691 | 0.814 | 1.0 | -3.788 | 0.0 | 0.1170 | ... | 0.3660 | 0.509 | 132.012 | audio_features | 7K9Z3yFNNLv5kwTjQYGjnu | spotify:track:7K9Z3yFNNLv5kwTjQYGjnu | https://api.spotify.com/v1/tracks/7K9Z3yFNNLv5... | https://api.spotify.com/v1/audio-analysis/7K9Z... | 145611.0 | 4.0 |
| 3 | $10 | Good Morning | 25058 | experimental pop | 0.624 | 0.596 | 4.0 | -9.804 | 1.0 | 0.0314 | ... | 0.1190 | 0.896 | 120.969 | audio_features | 3koAwrM1RO0TGMeQJ3qt9J | spotify:track:3koAwrM1RO0TGMeQJ3qt9J | https://api.spotify.com/v1/tracks/3koAwrM1RO0T... | https://api.spotify.com/v1/audio-analysis/3koA... | 89509.0 | 4.0 |
| 4 | (I Just) Died In Your Arms | Cutting Crew | 5504949 | album rock | 0.625 | 0.726 | 11.0 | -11.402 | 0.0 | 0.0444 | ... | 0.0625 | 0.507 | 124.945 | audio_features | 4ByEFOBuLXpCqvO1kw8Wdm | spotify:track:4ByEFOBuLXpCqvO1kw8Wdm | https://api.spotify.com/v1/tracks/4ByEFOBuLXpC... | https://api.spotify.com/v1/audio-analysis/4ByE... | 280400.0 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10075 | Younger with Time. | Ben Zaidi | 668478 | folk-pop | 0.537 | 0.143 | 2.0 | -16.992 | 1.0 | 0.0331 | ... | 0.1100 | 0.245 | 131.118 | audio_features | 6o8pM5reLgjd5i8gDY3Irt | spotify:track:6o8pM5reLgjd5i8gDY3Irt | https://api.spotify.com/v1/tracks/6o8pM5reLgjd... | https://api.spotify.com/v1/audio-analysis/6o8p... | 222827.0 | 3.0 |
| 10076 | Your Latest Trick - Remastered 1996 | Dire Straits | 304382 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10077 | Your Love Is My Drug (8 Bit Slowed) | just valery | 97600 | sad lo-fi | 0.282 | 0.158 | 6.0 | -7.783 | 1.0 | 0.0311 | ... | 0.4740 | 0.248 | 65.152 | audio_features | 1EoThnDm6kQfB2idIfR30n | spotify:track:1EoThnDm6kQfB2idIfR30n | https://api.spotify.com/v1/tracks/1EoThnDm6kQf... | https://api.spotify.com/v1/audio-analysis/1EoT... | 112582.0 | 4.0 |
| 10078 | Your Power | Billie Eilish | 988224 | art pop | 0.632 | 0.284 | 9.0 | -14.025 | 0.0 | 0.0801 | ... | 0.2330 | 0.208 | 129.642 | audio_features | 042Sl6Mn83JHyLEqdK7uI0 | spotify:track:042Sl6Mn83JHyLEqdK7uI0 | https://api.spotify.com/v1/tracks/042Sl6Mn83JH... | https://api.spotify.com/v1/audio-analysis/042S... | 245897.0 | 4.0 |
| 10079 | Your Voice / Bethel, NY | Jaden | 213626 | pop rap | 0.560 | 0.344 | 3.0 | -12.283 | 1.0 | 0.0306 | ... | 0.1110 | 0.428 | 115.393 | audio_features | 3BcN2Pcy0kTG1zm8Tz9MsB | spotify:track:3BcN2Pcy0kTG1zm8Tz9MsB | https://api.spotify.com/v1/tracks/3BcN2Pcy0kTG... | https://api.spotify.com/v1/audio-analysis/3BcN... | 213627.0 | 3.0 |
10080 rows × 22 columns
# Concise summary of a DataFrame
spotify_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10080 entries, 0 to 10079
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   trackName         10080 non-null  object
 1   artistName        10080 non-null  object
 2   msPlayed          10080 non-null  int64
 3   genre             8580 non-null   object
 4   danceability      9530 non-null   float64
 5   energy            9530 non-null   float64
 6   key               9530 non-null   float64
 7   loudness          9530 non-null   float64
 8   mode              9530 non-null   float64
 9   speechiness       9530 non-null   float64
 10  acousticness      9530 non-null   float64
 11  instrumentalness  9530 non-null   float64
 12  liveness          9530 non-null   float64
 13  valence           9530 non-null   float64
 14  tempo             9530 non-null   float64
 15  type              9530 non-null   object
 16  id                9530 non-null   object
 17  uri               9530 non-null   object
 18  track_href        9530 non-null   object
 19  analysis_url      9530 non-null   object
 20  duration_ms       9530 non-null   float64
 21  time_signature    9530 non-null   float64
dtypes: float64(13), int64(1), object(8)
memory usage: 1.7+ MB
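In the summary above, several columns report fewer non-null values than the 10080 entries in the index (8580 for genre, 9530 for the audio features), which means those columns contain missing values. A minimal sketch of how such gaps can be counted is shown below. It uses a small made-up frame as a stand-in for `spotify_df`, so the values are assumptions and the printed counts will differ from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for spotify_df, with a few deliberately missing values.
df = pd.DataFrame(
    {
        "trackName": ["Song A", "Song B", "Song C", "Song D"],
        "genre": ["art pop", None, "pop rap", None],
        "energy": [0.799, 0.130, np.nan, 0.596],
    }
)

missing_per_column = df.isna().sum()    # number of missing values in each column
non_null_per_column = df.notna().sum()  # the "Non-Null Count" that info() reports
print(missing_per_column)
print(non_null_per_column)

# Keep only the rows that have a value in every column.
complete_rows = df.dropna()
print(len(complete_rows))
```

The same calls can be applied to `spotify_df` directly once the CSV file has been loaded.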
References and Further Reading¶
- Embeddings for Tabular Data: A Survey by Rajat Singh and Srikanta Bedathur
- Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, and Peter Gedeck
- Pandas Documentation
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney
- Think Stats: Exploratory Data Analysis in Python by Allen B. Downey (Green Tea Press)