Data Wrangling with Python + Pandas
1. Introduction
What’s up, everyone! We’re going to go over how to use pandas
, a super nifty package for data wrangling in Python. And, no, unfortunately the package name has nothing to do with the adorable herbivorous bear: it’s a portmanteau of “panel” + “data”—the econometric equivalent of “longitudinal data” in psychological research. Eh.
To be clear, this is not an exhaustive guide on the package. For that, you would need to reference the official documentation itself which, thankfully, is actually incredibly well organized and user-friendly.
The purpose of this tutorial, rather, is to capture the most important functionalities of the package to get you on your feet and running: to provide an intermediate level of proficiency that will likely suffice as a useful guide for most tasks you will encounter.
1.1 What is (are?) pandas?
pandas
(always all lowercase) is an industry standard for basically everything you could possibly want to do in data science. It’s open-source, well-maintained through non-profit sponsorship, and one of the most widely used packages—all factors which ensure it will continue to be around for a long time!
According to their official website:
- “
pandas
aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.” - “Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.”
Indeed, it lives up to this reputation as a very useful tool that will save you lots of time once you work past the initial learning curve. Since the way pandas
works is intentionally quite analogous to the functionality provided by R, having some R chops beforehand will certainly help (although this is not necessary).
1.2 Objectives
Before we get started, I’d like you to consider the matter of why one would choose to use pandas
. Why not just use a GUI like Microsoft Excel to open a tabular datafile, run the required transformations, and then close it all without writing and compiling any scripts?
At least anecdotally, it can certainly be tempting to use them for data preprocessing instead of learning the more complex syntax. While I don’t mean to discourage your measured use of them—they can be powerful and efficient in their own right–one ought not become reliant on them in unwarranted circumstances.
My point, then, is to prove to you here that pandas
is just as effective as industry standard GUIs (and R!) in getting the same types of tasks done.
Pros | Cons | |
---|---|---|
Microsoft Excel |
|
|
pandas |
|
|
2. How to Be a Power User
It’s kind of a tongue-in-cheek humorous term, but “power user” is basically what you’d call someone who prioritizes trying to optimize the (not-that-useful) functionalities of whatever software or code tickles their fancy.
I’d like to present several useful tips and tricks you can implement in vanilla (i.e., regular old) Python, pandas
, and other packages!
2.1 Anaconda Distribution
If you’re a research scientist working in Python, I strongly advise that you install the Anaconda Distribution before you do anything else. It’s an exceptional tool that provides bundles together package management and virtual environment support while also pre-installing many useful data science packages for you.
2.2 IDE vs. Text Editor
When programming in Python, you can choose between two umbrella categories of tools, integrated development environments (IDEs) and text editors. These interfaces almost always represent standalone software and need to be installed on your system separately from Python itself. Python comes bundled with the code editor IDLE, but this GUI is, in my view, woefully inadequate compared to the suite of freely available alternatives.
There’s not really a right answer here, but it’s important to understand the differences between these tools. An IDE is always going to be a beefed up version of a text editor, giving you high-powered tools that allow for debugging, compiling, building and deploying, and more. A text editor is, at its core, simply a space for scripting code that you’ll run sometime somewhere. There’s a spectrum between these, and the advantage of greater functionality generally comes at the cost of greater resource allocation (CPU and RAM). More complex by design, IDEs can also sometimes spontaneously crash.
Personally, I use the following tools for web development and data science:
2.3 Choosing a Typeface
A good typeface goes a very long way for readability and reduced eyestrain when programming projects drag on. By a longshot, my personal favorite monospace font has got to be JetBrains Mono. Not only is it open-source, but it also supports 100+ ligatures:
2.4 Keyboard Shortcuts
Keyboard shortcuts/hotkeys are probably a divisive topic. While there are commonly used preset key mappings, you know the saying:
“Different (key)strokes for different folks.” Also, it’s just burdensome to go through a long, comprehensive list to identify the ones you’ll actually end up using. Still, for the majority of users, I believe there are a couple of need-to-know practices you may have overlooked. In the table below, I’ve included my picks of the bunch. If you do end up making it a habit of using these shortcuts, I can guarantee you they will do wonders for your productivity.
Windows | Mac | |
---|---|---|
Switch between open windows |
Alt + Tab |
⌘ + Tab |
Switch between open windows (in reverse) |
Alt + Shift + Tab |
⌘ + Shift + Tab |
Switch between open tabs |
Ctrl + Tab |
|
Switch between open tabs (reverse) |
Ctrl + Shift + Tab |
|
Comment out highlighted code block or inline code at cursor position |
Ctrl + / |
⌘ + / |
Move cursor left/right by word | Ctrl + ← Ctrl + → |
⌥ + ← ⌥ + → |
Select by moving cursor left/right by character |
Shift + ← Shift + → |
|
Select by moving cursor left/right by word |
Ctrl + Shift + ← Ctrl + Shift + → |
⌥ + Shift + ← ⌥ + Shift + → |
2.5 Vectorization
We always want to optimize our code for memory efficiency and speed to yield a quick compile. Because dataframe objects are already quite expensive in storage, a reliable rule of thumb is to always attempt to vectorize any operations one performs on them.
Vectorization is a superior alternative to writing explicit for
and while
loops because it accelerates compile time astronomically and is almost always more concise. We’ll go into this in more detail later on, but I’d like to introduce the idea before we move on of applying an operation to an entire array at one time instead of the constituent components piecemeal.
On a mechanistic level, vectorization transfers the burden of looping to the precompiled C code on which Python is built.
2.6 PEP8
The last thing I must mention is PEP8: this is to Python what APA7 is to psychology researchers. It’s a style guide that exhaustively covers all the formatting rules and best practices for programming in the language.
PEP stands for Python Enhancement Proposals, and the style guide is only one of many hundreds of meticulously curated documents addressing every last detail of how to best write and distribute code. I would only recommend that you reference it on an as-needed basis because it’s basically an unlimited well of information. Still, it’s a powerful resource that every Python programmer should be know.
3. Getting Started
Let’s get started! As you’ll begin to see if you haven’t already, pandas
and numpy
(Numerical Python) go together like bread and butter. Where you see one, you’ll surely see the other. We will always be importing both as “aliases” or abbreviations (like many, but not all, other packages!) because:
“All things being equal, the simplest [readable] solution tends to be the best one [holding memory efficiency and compile time relatively equal].”
—Occam's CS Razor [citation needed]
1
2
3
4
5
6
7
from google.colab import drive # let's initialize colab
import pandas as pd # this is standard import aliasing, always use it
import numpy as np # this is standard import aliasing, always use it
drive.mount('/content/drive', force_remount=True)
data_dir = "drive/MyDrive/2020-10-21-dragon-data.csv"
1
Mounted at /content/drive
3.1 Reading in Data
Alright, so to begin, we’re going to start with an example. Looking at dragons again! Yay! 🐉
Let’s say we have a .csv
file—you know, that handy file type that almost always ends up being what holds all of your good good experiment data. That’s not to say, however, that you can’t open up tabular datafiles of other types. Other common ones are .tsv
(tab-separated) and .xlsx
(Microsoft Excel), but the list goes on:
- SPSS
- SQL
- Pickle (yes, this is a real type of data storage—and an efficient one as well! 🥒)
- …and many, many more!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
dragons_df = pd.read_csv(data_dir) # read
print(dragons_df) # let's see what it looks like!
# Let's see how Python classifies this new object and its columns
print('Object type: ' + str(type(dragons_df)) + '\n')
print('Object column type(s): \n' + str(dragons_df.dtypes))
# ----------------------------------------------------- #
# If data_dir were a .tsv file, you'd have two options: #
# ----------------------------------------------------- #
# dragons_df = pd.read_table(data_dir)
# dragons_df = pd.read_csv(data_dir, sep='\t')
# ------------------------------- #
# If data_dir were an .xlsx file: #
# ------------------------------- #
# dragons_df = pd.read_excel(data_dir)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
testScore bodyLength mountainRange color diet breathesFire
0 0.000000 175.512214 Bavarian Blue Carnivore 1
1 0.742914 190.640959 Bavarian Blue Carnivore 1
2 2.501825 169.708778 Bavarian Blue Carnivore 1
3 3.380430 188.847204 Bavarian Blue Carnivore 1
4 4.582095 174.221700 Bavarian Blue Carnivore 0
.. ... ... ... ... ... ...
475 41.806995 186.132578 Southern Yellow Vegetarian 0
476 42.650015 192.272043 Southern Yellow Vegetarian 0
477 44.582490 184.038015 Southern Yellow Vegetarian 1
478 47.795914 189.814408 Southern Yellow Vegetarian 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
[480 rows x 6 columns]
Object type: <class 'pandas.core.frame.DataFrame'>
Object column type(s):
testScore float64
bodyLength float64
mountainRange object
color object
diet object
breathesFire int64
dtype: object
Woohoo! Our data has been successfully loaded into Python as a DataFrame object. Plus, we confirmed this by calling the base type()
function and then identified the type of data in each column by accessing this stored information via .dtypes
.
As you can see, Python does not print out the entire dataframe by default. That being said, it still gives you a decent sense of your data with dimensions, header names, and the first/last couple rows. If we had many more columns in this dataset, the ...
would also appear in a pseudo-column to represent the columns not displayed between the first/last couple columns.
If, however, you would like to inspect a greater proportion of your data visually within Python, you can call the .option_context()
command and customize the number of rows to be displayed:
1
2
3
# feel free to play around with the numbers in the below statement!
with pd.option_context('display.max_rows', 8, 'display.max_columns', 8):
print(dragons_df)
1
2
3
4
5
6
7
8
9
10
11
12
testScore bodyLength mountainRange color diet breathesFire
0 0.000000 175.512214 Bavarian Blue Carnivore 1
1 0.742914 190.640959 Bavarian Blue Carnivore 1
2 2.501825 169.708778 Bavarian Blue Carnivore 1
3 3.380430 188.847204 Bavarian Blue Carnivore 1
.. ... ... ... ... ... ...
476 42.650015 192.272043 Southern Yellow Vegetarian 0
477 44.582490 184.038015 Southern Yellow Vegetarian 1
478 47.795914 189.814408 Southern Yellow Vegetarian 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
[480 rows x 6 columns]
3.1.1 Heads, Tails, and Duplicates
A nifty option you have is to ask pandas
to only display the top or bottom few rows by calling: df.head()
or df.tail()
Another particularly useful method will remove duplicate rows (according to specified column values):
df.drop_duplicates()
1
2
3
4
5
print(dragons_df.head(14)) # prints the first (leading) n rows
print(dragons_df.tail(2)) # prints the last (trailing) n rows
print(dragons_df.drop_duplicates('testScore'))
# drops all rows with the same test score
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
testScore bodyLength mountainRange color diet breathesFire
0 0.000000 175.512214 Bavarian Blue Carnivore 1
1 0.742914 190.640959 Bavarian Blue Carnivore 1
2 2.501825 169.708778 Bavarian Blue Carnivore 1
3 3.380430 188.847204 Bavarian Blue Carnivore 1
4 4.582095 174.221700 Bavarian Blue Carnivore 0
5 12.453635 183.081931 Bavarian Blue Carnivore 1
6 16.176758 174.839935 Bavarian Blue Carnivore 1
7 16.249164 182.876708 Bavarian Blue Carnivore 1
8 16.852420 177.820550 Bavarian Blue Omnivore 1
9 18.452767 176.740267 Bavarian Blue Carnivore 1
10 20.484229 182.921515 Bavarian Blue Carnivore 1
11 20.489804 170.686850 Bavarian Blue Carnivore 1
12 22.563014 179.574357 Bavarian Blue Carnivore 1
13 24.148017 180.063446 Bavarian Blue Carnivore 1
testScore bodyLength mountainRange color diet breathesFire
478 47.795914 189.814408 Southern Yellow Vegetarian 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
testScore bodyLength mountainRange color diet breathesFire
0 0.000000 175.512214 Bavarian Blue Carnivore 1
1 0.742914 190.640959 Bavarian Blue Carnivore 1
2 2.501825 169.708778 Bavarian Blue Carnivore 1
3 3.380430 188.847204 Bavarian Blue Carnivore 1
4 4.582095 174.221700 Bavarian Blue Carnivore 0
.. ... ... ... ... ... ...
475 41.806995 186.132578 Southern Yellow Vegetarian 0
476 42.650015 192.272043 Southern Yellow Vegetarian 0
477 44.582490 184.038015 Southern Yellow Vegetarian 1
478 47.795914 189.814408 Southern Yellow Vegetarian 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
[470 rows x 6 columns]
Now, that’s all fine and dandy. We did what we set out to do, and, for all intents and purposes, that’s basically all we’re going to have to write out to get the data sitting in a tabular datafile into Python.
But, it’s important to note that you have a much greater amount of flexibility in deciding how you want to read in your data. In fact, the options are honestly overwhelming. Just check out the list of parameters you could technically tweak to your heart’s desire for one function call: everything here represents default parameters, which means they’re like nonsignificant zeroes: you only have to explicitly state these if you’re deviating from their preset values.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
pandas.read_csv(filepath_or_buffer,
sep=<object object>,
delimiter=None,
header='infer',
names=None,
index_col=None,
usecols=None,
squeeze=False,
prefix=None,
mangle_dupe_cols=True,
dtype=None,
engine=None,
converters=None,
true_values=None,
false_values=None,
skipinitialspace=False,
skiprows=None,
skipfooter=0,
nrows=None,
na_values=None,
keep_default_na=True,
na_filter=True,
verbose=False,
skip_blank_lines=True,
parse_dates=False,
infer_datetime_format=False,
keep_date_col=False,
date_parser=None,
dayfirst=False,
cache_dates=True,
iterator=False,
chunksize=None,
compression='infer',
thousands=None,
decimal='.',
lineterminator=None,
quotechar='"',
quoting=0,
doublequote=True,
escapechar=None,
comment=None,
encoding=None,
dialect=None,
error_bad_lines=True,
warn_bad_lines=True,
delim_whitespace=False,
low_memory=True,
memory_map=False,
float_precision=None,
storage_options=None)
Visually a little much, eh? This is why being told to “just go read the documentation” can often be so discouraging—if you’ve just stepped in the shallow end, how do you even know where to start?
3.2 Selecting Data
3.2.1 Slicing and Dicing
We select (or subset) our data of interest from the master dataframe either by name (label) or by index (position). Because a pandas
dataframe is a 2D object, we have to specify two arguments: one for the rows and the other for the columns we want. This is just like R syntax!
- To select by label:
df.loc[row_names, col_names]
- In this case, both start and stop rows/columns are included in the slice!
- To select by position:
df.iloc[row_indices, col_indices]
- In this case, just like in vanilla Python, the end stop is not included in the slice.
1
2
3
4
5
6
7
8
# Selecting by label:
print(dragons_df.loc[:,'bodyLength':'breathesFire']) # all rows, some columns
# Selecting by position:
print(dragons_df.iloc[:, 0]) # all rows, first column
print(dragons_df.iloc[:, 0:2]) # all rows, first two columns
print(dragons_df.iloc[0:2, :]) # first two rows, all columns
print(dragons_df.iloc[0:2, 0:2]) # first two rows, first and second columns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
bodyLength mountainRange color diet breathesFire
0 175.512214 Bavarian Blue Carnivore 1
1 190.640959 Bavarian Blue Carnivore 1
2 169.708778 Bavarian Blue Carnivore 1
3 188.847204 Bavarian Blue Carnivore 1
4 174.221700 Bavarian Blue Carnivore 0
.. ... ... ... ... ...
475 186.132578 Southern Yellow Vegetarian 0
476 192.272043 Southern Yellow Vegetarian 0
477 184.038015 Southern Yellow Vegetarian 1
478 189.814408 Southern Yellow Vegetarian 0
479 189.661989 Southern Yellow Carnivore 0
[480 rows x 5 columns]
0 0.000000
1 0.742914
2 2.501825
3 3.380430
4 4.582095
...
475 41.806995
476 42.650015
477 44.582490
478 47.795914
479 55.486722
Name: testScore, Length: 480, dtype: float64
testScore bodyLength
0 0.000000 175.512214
1 0.742914 190.640959
2 2.501825 169.708778
3 3.380430 188.847204
4 4.582095 174.221700
.. ... ...
475 41.806995 186.132578
476 42.650015 192.272043
477 44.582490 184.038015
478 47.795914 189.814408
479 55.486722 189.661989
[480 rows x 2 columns]
testScore bodyLength mountainRange color diet breathesFire
0 0.000000 175.512214 Bavarian Blue Carnivore 1
1 0.742914 190.640959 Bavarian Blue Carnivore 1
testScore bodyLength
0 0.000000 175.512214
1 0.742914 190.640959
3.2.2 Single Column Extraction
If you’re just looking to access one column, you have two options:
- Index operator,
df[col_name]
- Attribute operator,
df.colname
- Notice, however, that these two only work for labels, not for indices.
- You can think of these as offering the same functionality as
df$colname
in R.
1
2
3
4
5
6
7
# Selecting a single column by label:
print(dragons_df['bodyLength']) # index operator []
print(dragons_df.bodyLength) # attribute operator . and does not include " "
# Attempting unsuccesfully to select a single column by index
# print(dragons_df.0)
# print(dragons_df[0])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
0 175.512214
1 190.640959
2 169.708778
3 188.847204
4 174.221700
...
475 186.132578
476 192.272043
477 184.038015
478 189.814408
479 189.661989
Name: bodyLength, Length: 480, dtype: float64
0 175.512214
1 190.640959
2 169.708778
3 188.847204
4 174.221700
...
475 186.132578
476 192.272043
477 184.038015
478 189.814408
479 189.661989
Name: bodyLength, Length: 480, dtype: float64
There is one caveat, however, to implementing an index operator with indices: if you use a slice, you’ll end up getting the corresponding rows. This happens because pandas
is only recognizing the slice in one direction—by rows. If you’re looking to slice columns, you’ll need to use the .iloc
format.
1
print(dragons_df[0:1]) # this does work, but only for sliced rows
1
2
testScore bodyLength mountainRange color diet breathesFire
0 0.0 175.512214 Bavarian Blue Carnivore 1
3.2.3 Boolean Indexing
We can subset using conditional truth statements using boolean operators:
- Greater than:
>
- Less than:
<
- Greater than or equal to:
>=
- Less than or equal to:
<=
- Equal to:
==
- Unequal to:
!=
To filter by row values, we use the .isin()
method that accepts a list-like object.
1
2
3
4
5
# Subset only the rows for the dragons that received a passing score
print(dragons_df[dragons_df['testScore'] > 65])
# Filter out and keep only the rows for the dragons who cannot breathe fire
print(dragons_df[dragons_df['breathesFire'].isin([0])]) # must enclose in list
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
testScore bodyLength mountainRange color diet breathesFire
33 67.382720 212.229895 Central Blue Omnivore 1
34 71.080284 214.752198 Central Blue Omnivore 1
35 73.119070 213.582062 Central Blue Omnivore 1
36 75.757683 204.952743 Central Blue Vegetarian 1
37 76.930157 215.294360 Central Blue Omnivore 1
.. ... ... ... ... ... ...
455 76.515129 212.282806 Sarntal Yellow Omnivore 0
456 77.152326 215.961638 Sarntal Yellow Omnivore 0
457 81.883610 211.153312 Sarntal Yellow Omnivore 0
458 87.413974 205.694126 Sarntal Yellow Omnivore 0
459 94.106096 216.727091 Sarntal Yellow Omnivore 0
[136 rows x 6 columns]
testScore bodyLength mountainRange color diet breathesFire
4 4.582095 174.221700 Bavarian Blue Carnivore 0
14 25.201276 182.353952 Bavarian Blue Carnivore 0
25 51.768306 210.940255 Central Blue Vegetarian 0
32 59.540570 215.965209 Central Blue Vegetarian 0
42 24.651235 206.677342 Emmental Blue Carnivore 0
.. ... ... ... ... ... ...
474 41.045542 190.352825 Southern Yellow Vegetarian 0
475 41.806995 186.132578 Southern Yellow Vegetarian 0
476 42.650015 192.272043 Southern Yellow Vegetarian 0
478 47.795914 189.814408 Southern Yellow Vegetarian 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
[229 rows x 6 columns]
3.2.4 Scalar Value Selection
There are actually quicker versions of the aforementioned .loc()
and .iloc()
methods that you can use when you’re only interested in retrieving a scalar value, and they work the same way:
- Scalar access by label:
.at()
- Scalar access by position:
.iat()
1
2
3
4
5
# Setting by label:
print(dragons_df.at[0,'bodyLength']) # first row, first column
# Setting by position:
print(dragons_df.iat[0, 0]) # first row, first column
1
2
175.5122137
0.0
3.2.5 Ordered and Random Selection
Whether for ease of data review or because you’re setting up a subsequent calculation, you can also order a subset of rows in either ascending or descending order:
- Select and sort greatest to least n entries:
df.nlargest(n, 'colnames')
- Select and sort least to greatest n entries:
df.nsmallest(n, 'colnames')
If, instead of an ordered sample, you’d like to obtain a random sample of n rows, you could use the following: df.sample(n)
1
2
3
4
print(dragons_df.nlargest(50, 'testScore'))
print(dragons_df.nsmallest(50, 'testScore'))
print(dragons_df.sample(100))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
testScore bodyLength mountainRange color diet breathesFire
79 100.000000 209.490674 Julian Blue Omnivore 1
119 100.000000 198.874616 Maritime Blue Omnivore 1
299 100.000000 193.423645 Sarntal Red Omnivore 1
359 100.000000 200.182169 Central Yellow Omnivore 0
436 100.000000 191.248929 Maritime Yellow Omnivore 0
437 100.000000 192.917439 Maritime Yellow Omnivore 0
438 100.000000 196.546624 Maritime Yellow Omnivore 0
439 100.000000 200.272381 Maritime Yellow Omnivore 0
435 99.605642 199.100041 Maritime Yellow Omnivore 0
78 98.499816 215.733178 Julian Blue Omnivore 1
239 97.647935 216.177762 Julian Red Omnivore 1
279 95.106090 204.671957 Maritime Red Vegetarian 0
298 94.545010 195.590678 Sarntal Red Omnivore 1
77 94.116335 230.650417 Julian Blue Omnivore 1
459 94.106096 216.727091 Sarntal Yellow Omnivore 0
118 92.760852 202.011059 Maritime Blue Omnivore 1
238 92.632400 214.918855 Julian Red Omnivore 1
434 91.094122 195.293005 Maritime Yellow Omnivore 0
199 90.031156 209.131404 Central Red Carnivore 1
433 89.961323 208.371043 Maritime Yellow Omnivore 1
358 89.915472 214.406738 Central Yellow Omnivore 0
117 89.815497 202.899173 Maritime Blue Omnivore 1
399 89.042531 232.513680 Julian Yellow Omnivore 0
116 88.550200 192.770727 Maritime Blue Omnivore 1
458 87.413974 205.694126 Sarntal Yellow Omnivore 0
432 87.371212 204.092492 Maritime Yellow Carnivore 0
278 87.201463 209.072897 Maritime Red Omnivore 1
297 86.984696 193.414687 Sarntal Red Omnivore 1
76 85.499126 220.173495 Julian Blue Omnivore 1
398 85.072554 227.324594 Julian Yellow Omnivore 0
431 84.767627 193.269179 Maritime Yellow Vegetarian 0
59 84.272785 216.582753 Emmental Blue Omnivore 1
296 84.269416 187.589529 Sarntal Red Omnivore 1
237 84.192600 210.784914 Julian Red Omnivore 1
277 83.969242 194.153313 Maritime Red Omnivore 1
115 83.941677 200.733784 Maritime Blue Omnivore 1
236 83.497061 202.689625 Julian Red Omnivore 0
235 83.168146 209.188844 Julian Red Omnivore 1
295 82.853902 195.131278 Sarntal Red Omnivore 0
276 82.251778 199.007122 Maritime Red Omnivore 0
259 82.127343 204.082620 Ligurian Red Omnivore 0
294 82.012468 187.591883 Sarntal Red Omnivore 1
457 81.883610 211.153312 Sarntal Yellow Omnivore 0
39 81.471137 220.156269 Central Blue Omnivore 1
114 81.470769 202.537720 Maritime Blue Omnivore 1
38 81.138248 201.321573 Central Blue Omnivore 1
113 80.892279 196.908408 Maritime Blue Omnivore 1
75 80.448654 219.178612 Julian Blue Omnivore 0
74 80.412169 217.214241 Julian Blue Vegetarian 1
139 80.292789 201.469309 Sarntal Blue Omnivore 1
testScore bodyLength mountainRange color diet breathesFire
0 0.000000 175.512214 Bavarian Blue Carnivore 1
140 0.000000 173.532588 Southern Blue Carnivore 1
200 0.000000 189.605685 Emmental Red Carnivore 1
300 0.000000 171.050730 Southern Red Carnivore 1
1 0.742914 190.640959 Bavarian Blue Carnivore 1
2 2.501825 169.708778 Bavarian Blue Carnivore 1
160 2.557890 169.619382 Bavarian Red Carnivore 0
460 2.599976 184.135925 Southern Yellow Carnivore 0
161 3.282949 179.000774 Bavarian Red Carnivore 1
3 3.380430 188.847204 Bavarian Blue Carnivore 1
162 3.597530 167.510430 Bavarian Red Carnivore 1
163 3.875730 164.416273 Bavarian Red Carnivore 0
4 4.582095 174.221700 Bavarian Blue Carnivore 0
164 6.038333 165.882952 Bavarian Red Carnivore 1
400 6.411029 224.721753 Ligurian Yellow Carnivore 1
165 7.358565 179.606068 Bavarian Red Carnivore 1
320 10.825361 191.932215 Bavarian Yellow Carnivore 0
5 12.453635 183.081931 Bavarian Blue Carnivore 1
120 13.080916 199.999325 Sarntal Blue Carnivore 1
321 14.013776 193.244146 Bavarian Yellow Carnivore 0
141 14.381638 175.438152 Southern Blue Carnivore 1
461 14.488204 196.060398 Southern Yellow Vegetarian 0
80 14.565634 206.029133 Ligurian Blue Carnivore 1
360 14.916917 220.749686 Emmental Yellow Carnivore 0
462 14.999489 193.616367 Southern Yellow Carnivore 0
81 15.042811 214.375613 Ligurian Blue Carnivore 1
40 15.399484 211.185196 Emmental Blue Carnivore 1
401 15.606387 225.451337 Ligurian Yellow Carnivore 0
463 15.923599 188.993337 Southern Yellow Carnivore 0
166 16.147309 165.548520 Bavarian Red Carnivore 1
6 16.176758 174.839935 Bavarian Blue Carnivore 1
7 16.249164 182.876708 Bavarian Blue Carnivore 1
167 16.285131 170.819319 Bavarian Red Carnivore 1
361 16.829491 222.505211 Emmental Yellow Carnivore 0
8 16.852420 177.820550 Bavarian Blue Omnivore 1
142 17.633001 176.474214 Southern Blue Carnivore 1
168 18.442801 175.970409 Bavarian Red Carnivore 1
9 18.452767 176.740267 Bavarian Blue Carnivore 1
169 18.467754 169.770832 Bavarian Red Carnivore 1
301 18.578797 169.845711 Southern Red Carnivore 1
170 18.838821 167.685525 Bavarian Red Carnivore 1
302 18.961916 168.670439 Southern Red Carnivore 1
322 19.095881 192.895791 Bavarian Yellow Omnivore 0
464 20.010852 187.867885 Southern Yellow Carnivore 0
201 20.017659 196.343091 Emmental Red Vegetarian 0
465 20.030181 191.637876 Southern Yellow Carnivore 0
171 20.108537 170.053440 Bavarian Red Carnivore 1
362 20.122821 217.988338 Emmental Yellow Carnivore 0
172 20.242590 171.093272 Bavarian Red Carnivore 1
323 20.462621 185.663441 Bavarian Yellow Carnivore 0
testScore bodyLength mountainRange color diet breathesFire
189 42.823690 213.113375 Central Red Vegetarian 1
61 53.393965 222.517254 Julian Blue Vegetarian 1
186 40.080020 216.481476 Central Red Vegetarian 1
134 71.095966 200.127441 Sarntal Blue Omnivore 1
11 20.489804 170.686850 Bavarian Blue Carnivore 1
.. ... ... ... ... ... ...
258 58.694412 205.190403 Ligurian Red Vegetarian 0
197 67.121844 201.348759 Central Red Omnivore 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
93 47.656285 210.898075 Ligurian Blue Vegetarian 1
292 71.861195 187.583061 Sarntal Red Omnivore 1
[100 rows x 6 columns]
3.2.6 Compound Selection
We can also subset data that fits multiple conditions:
- Conjunctive (“and”) logic:
&
- Disjunctive (“or”) logic:
|
- Exclusive or (“xor”) logic:
^
-
If this is confusing, you can decompose it into the following:
a ^ b = (a | b) & (~a | ~b)
a ^ b = (a & ~b) | (~a & b)
- Not:
~
- Make sure to wrap individual conditions in
()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Subset rows for which both conditions are true
print(dragons_df[(dragons_df['testScore'] > 65) &
(dragons_df['breathesFire'].isin([0]))])
# Subset rows for which at least one condition is true
print(dragons_df[(dragons_df['testScore'] > 65) |
(dragons_df['breathesFire'].isin([0]))])
# Subset rows for which the first condition is true and the second is false
print(dragons_df[(dragons_df['testScore'] > 65) &
~(dragons_df['breathesFire'].isin([0]))])
# Subset rows for which there is a mismatch in truth values of conditions
print(dragons_df[(dragons_df['testScore'] > 65) ^
(dragons_df['breathesFire'].isin([0]))])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
testScore bodyLength mountainRange color diet breathesFire
75 80.448654 219.178612 Julian Blue Omnivore 0
110 70.153055 204.432935 Maritime Blue Omnivore 0
196 66.766388 201.759702 Central Red Omnivore 0
197 67.121844 201.348759 Central Red Omnivore 0
227 66.689827 207.957300 Julian Red Omnivore 0
.. ... ... ... ... ... ...
455 76.515129 212.282806 Sarntal Yellow Omnivore 0
456 77.152326 215.961638 Sarntal Yellow Omnivore 0
457 81.883610 211.153312 Sarntal Yellow Omnivore 0
458 87.413974 205.694126 Sarntal Yellow Omnivore 0
459 94.106096 216.727091 Sarntal Yellow Omnivore 0
[71 rows x 6 columns]
testScore bodyLength mountainRange color diet breathesFire
4 4.582095 174.221700 Bavarian Blue Carnivore 0
14 25.201276 182.353952 Bavarian Blue Carnivore 0
25 51.768306 210.940255 Central Blue Vegetarian 0
32 59.540570 215.965209 Central Blue Vegetarian 0
33 67.382720 212.229895 Central Blue Omnivore 1
.. ... ... ... ... ... ...
474 41.045542 190.352825 Southern Yellow Vegetarian 0
475 41.806995 186.132578 Southern Yellow Vegetarian 0
476 42.650015 192.272043 Southern Yellow Vegetarian 0
478 47.795914 189.814408 Southern Yellow Vegetarian 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
[294 rows x 6 columns]
testScore bodyLength mountainRange color diet breathesFire
33 67.382720 212.229895 Central Blue Omnivore 1
34 71.080284 214.752198 Central Blue Omnivore 1
35 73.119070 213.582062 Central Blue Omnivore 1
36 75.757683 204.952743 Central Blue Vegetarian 1
37 76.930157 215.294360 Central Blue Omnivore 1
.. ... ... ... ... ... ...
297 86.984696 193.414687 Sarntal Red Omnivore 1
298 94.545010 195.590678 Sarntal Red Omnivore 1
299 100.000000 193.423645 Sarntal Red Omnivore 1
433 89.961323 208.371043 Maritime Yellow Omnivore 1
452 73.617034 209.306230 Sarntal Yellow Omnivore 1
[65 rows x 6 columns]
testScore bodyLength mountainRange color diet breathesFire
4 4.582095 174.221700 Bavarian Blue Carnivore 0
14 25.201276 182.353952 Bavarian Blue Carnivore 0
25 51.768306 210.940255 Central Blue Vegetarian 0
32 59.540570 215.965209 Central Blue Vegetarian 0
33 67.382720 212.229895 Central Blue Omnivore 1
.. ... ... ... ... ... ...
474 41.045542 190.352825 Southern Yellow Vegetarian 0
475 41.806995 186.132578 Southern Yellow Vegetarian 0
476 42.650015 192.272043 Southern Yellow Vegetarian 0
478 47.795914 189.814408 Southern Yellow Vegetarian 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
[223 rows x 6 columns]
3.2.7 Complex Boolean Logic
For more advanced logical operations designed specifically for dataframes, you also have the following two complementary options:
- Check if at least one value in each dataframe column is
True
: df.any()
- Check if all the values in each dataframe column are
True
: df.all()
Any given pandas
element will be True
by default. The only time they will be identified as False
is when the element is equal to zero or missing.
1
2
print(dragons_df.any())
print(dragons_df.all())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
testScore True
bodyLength True
mountainRange True
color True
diet True
breathesFire True
dtype: bool
testScore False
bodyLength True
mountainRange True
color True
diet True
breathesFire False
dtype: bool
3.3 Data Types
There are two object types that pandas
works with: DataFrame
and Series
.
Now, I don’t know about you, but the un-R-esque capital letters annoy me. Unfortunately, that’s just what we have to work with. So how are they different? A Series
is 1D while the DataFrame
is 2D and built out of a collection of Series
. However, a Series
is more specifically a labeled 1D array, which means you can think of it as a single column with corresponding row labels preserved.
If we check the type of a single extracted column, we see that it is a Series
. However, if we take a bunch of columns together, they’re still grouped together as one DataFrame
object. However, it’s also totally possible to have a single-column DataFrame
object.
1
2
3
print(type(dragons_df)) # type of the entire dataframe
print(type(dragons_df['bodyLength'])) # type of a single column
print(type(dragons_df.iloc[:, 0:5])) # type of a slice of several columns
1
2
3
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
3.3.1 Type Conversion
Just like with vanilla Python type conversion functions (list()
, str()
, int()
, etc.), we can directly convert any existing array-like object to a Series
or a DataFrame
by using corresponding constructors: pd.Series()
and pd.DataFrame()
.
1
2
3
4
5
6
7
8
9
10
11
lst = [1, 2, 3, 'hello world']
tup = (1, 2, 3, 'hello world')
ser = pd.Series(lst) # convert list to a pandas Series object
print(type(ser))
print(ser)
df = pd.DataFrame(ser) # convert Series to a pandas DataFrame object
print(type(df))
print(df)
1
2
3
4
5
6
7
8
9
10
11
12
<class 'pandas.core.series.Series'>
0 1
1 2
2 3
3 hello world
dtype: object
<class 'pandas.core.frame.DataFrame'>
0
0 1
1 2
2 3
3 hello world
As you can see in the above compile, Python automatically assigns indices (starting from 0) if you do not provide them. This happens for the 1D Series
as well as the 2D DataFrame
. However, you can also manually them, which we’ll see in the next section.
3.4 Making pandas Objects from Scratch
We now know how to view and subset preformatted tabular data that we loaded! But what about when we need to create a Series
or a DataFrame
from scratch in Python? We have options, and I’ll explain what I think are the most useful ones below.
In some cases, the input format will by nature already furnish column header labels. However, in other cases, you will have to manually rename the column header labels from the indices they default to, and the same is true for rows.
- To set column names:
1
pd.DataFrame(data, columns = ['colname_1', 'colname_2', ..., 'colname_n'])
- To set row (index) names:
1
pd.DataFrame(data, index = ['colname_1', 'colname_2', ..., 'colname_n'])
3.4.2 Dictionary → DataFrame
1
2
3
4
5
6
7
my_exotic_fruits = {
'fruit': ['cassabanana', 'rambutan', 'mangosteen', 'kiwano', 'atemoya'],
'outside_color': ['burgundy', 'red', 'purple', 'yellow', 'green'],
'inside_color': ['yellow', 'white', 'white', 'green', 'white']}
print(pd.DataFrame(my_exotic_fruits,
index = ['fruit0', 'fruit1', 'fruit2', 'fruit3', 'fruit4']))
1
2
3
4
5
6
fruit outside_color inside_color
fruit0 cassabanana burgundy yellow
fruit1 rambutan red white
fruit2 mangosteen purple white
fruit3 kiwano yellow green
fruit4 atemoya green white
3.4.1 Lists of Lists → DataFrame
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
my_exotic_fruits = [['cassabanana', 'burgundy', 'yellow'],
['rambutan', 'red', 'white'],
['mangosteen', 'purple', 'white'],
['kiwano', 'yellow', 'green'],
['atemoya', 'green', 'white']] # each nested list is a row
no_labels_df = pd.DataFrame(my_exotic_fruits)
all_labels_df = pd.DataFrame(my_exotic_fruits,
columns = ['fruit', 'outside_color', 'inside_color'],
index = ['fruit0',
'fruit1',
'fruit2',
'fruit3',
'fruit4'])
print(no_labels_df) # defaults to indices (0, 1, 2, ..., n)
print(all_labels_df) # aligns with supplied column and row labels
1
2
3
4
5
6
7
8
9
10
11
12
0 1 2
0 cassabanana burgundy yellow
1 rambutan red white
2 mangosteen purple white
3 kiwano yellow green
4 atemoya green white
fruit outside_color inside_color
fruit0 cassabanana burgundy yellow
fruit1 rambutan red white
fruit2 mangosteen purple white
fruit3 kiwano yellow green
fruit4 atemoya green white
3.4.3 List of Dictionaries → DataFrame
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
my_exotic_fruits = [{'fruit': 'cassabanana',
'outside_color': 'burgundy',
'inside_color': 'yellow'},
{'fruit': 'rambutan',
'outside_color': 'red',
'inside_color': 'white'},
{'fruit': 'mangosteen',
'outside_color': 'purple',
'inside_color': 'white'},
{'fruit': 'kiwano',
'outside_color': 'yellow',
'inside_color': 'green'},
{'fruit': 'atemoya',
'outside_color': 'green',
'inside_color': 'white'}]
# each nested dictionary is a row (with column headers)
print(pd.DataFrame(my_exotic_fruits,
index = ['fruit0', 'fruit1', 'fruit2', 'fruit3', 'fruit4']))
1
2
3
4
5
6
fruit outside_color inside_color
fruit0 cassabanana burgundy yellow
fruit1 rambutan red white
fruit2 mangosteen purple white
fruit3 kiwano yellow green
fruit4 atemoya green white
4. Data Cleaning
4.1 Setting Data
We’ve talked about pulling out columns and rows, but what about adding them in or replacing existing values? In both cases, you just (re)assign a value to that table cell or group of cells.
4.1.1 Adding Data
Let’s say that there’s a Goldilocks effect where dragons with a body length closer to 190 feet are nicer than those which are either too long or too short. If we assign this formula to a new column label, voilà, the new column is added for us!
1
2
3
4
print(dragons_df)
dragons_df['friendliness'] = (190 - abs(dragons_df['bodyLength'] - 190))/190
print(dragons_df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
testScore bodyLength mountainRange color diet breathesFire
0 0.000000 175.512214 Bavarian Blue Carnivore 1
1 0.742914 190.640959 Bavarian Blue Carnivore 1
2 2.501825 169.708778 Bavarian Blue Carnivore 1
3 3.380430 188.847204 Bavarian Blue Carnivore 1
4 4.582095 174.221700 Bavarian Blue Carnivore 0
.. ... ... ... ... ... ...
475 41.806995 186.132578 Southern Yellow Vegetarian 0
476 42.650015 192.272043 Southern Yellow Vegetarian 0
477 44.582490 184.038015 Southern Yellow Vegetarian 1
478 47.795914 189.814408 Southern Yellow Vegetarian 0
479 55.486722 189.661989 Southern Yellow Carnivore 0
[480 rows x 6 columns]
testScore bodyLength ... breathesFire friendliness
0 0.000000 175.512214 ... 1 0.923748
1 0.742914 190.640959 ... 1 0.996627
2 2.501825 169.708778 ... 1 0.893204
3 3.380430 188.847204 ... 1 0.993933
4 4.582095 174.221700 ... 0 0.916956
.. ... ... ... ... ...
475 41.806995 186.132578 ... 0 0.979645
476 42.650015 192.272043 ... 0 0.988042
477 44.582490 184.038015 ... 1 0.968621
478 47.795914 189.814408 ... 0 0.999023
479 55.486722 189.661989 ... 0 0.998221
[480 rows x 7 columns]
4.1.2 Replacing Data
Uh oh…more research has shown that dragons with a body length of 180 feet are actually the most friendly—looks like we messed up! If we change the formula for the Goldilocks number but still assign it to the same column label, we get our correct values.
1
2
3
print(dragons_df)
dragons_df['friendliness'] = (180 - abs(dragons_df['bodyLength'] - 180))/180
print(dragons_df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
testScore bodyLength ... breathesFire friendliness
0 0.000000 175.512214 ... 1 0.923748
1 0.742914 190.640959 ... 1 0.996627
2 2.501825 169.708778 ... 1 0.893204
3 3.380430 188.847204 ... 1 0.993933
4 4.582095 174.221700 ... 0 0.916956
.. ... ... ... ... ...
475 41.806995 186.132578 ... 0 0.979645
476 42.650015 192.272043 ... 0 0.988042
477 44.582490 184.038015 ... 1 0.968621
478 47.795914 189.814408 ... 0 0.999023
479 55.486722 189.661989 ... 0 0.998221
[480 rows x 7 columns]
testScore bodyLength ... breathesFire friendliness
0 0.000000 175.512214 ... 1 0.975068
1 0.742914 190.640959 ... 1 0.940884
2 2.501825 169.708778 ... 1 0.942827
3 3.380430 188.847204 ... 1 0.950849
4 4.582095 174.221700 ... 0 0.967898
.. ... ... ... ... ...
475 41.806995 186.132578 ... 0 0.965930
476 42.650015 192.272043 ... 0 0.931822
477 44.582490 184.038015 ... 1 0.977567
478 47.795914 189.814408 ... 0 0.945476
479 55.486722 189.661989 ... 0 0.946322
[480 rows x 7 columns]
4.1.3 Renaming Columns
We will very often need to rename columns to enhance readability. To do this, we use the following method:
df.rename(columns={'old_name': 'new_name'})
1
2
dragons_df = dragons_df.rename(columns={'friendliness': 'passiveness'})
print(dragons_df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
testScore bodyLength mountainRange ... diet breathesFire passiveness
0 0.000000 175.512214 Bavarian ... Carnivore 1 0.975068
1 0.742914 190.640959 Bavarian ... Carnivore 1 0.940884
2 2.501825 169.708778 Bavarian ... Carnivore 1 0.942827
3 3.380430 188.847204 Bavarian ... Carnivore 1 0.950849
4 4.582095 174.221700 Bavarian ... Carnivore 0 0.967898
.. ... ... ... ... ... ... ...
475 41.806995 186.132578 Southern ... Vegetarian 0 0.965930
476 42.650015 192.272043 Southern ... Vegetarian 0 0.931822
477 44.582490 184.038015 Southern ... Vegetarian 1 0.977567
478 47.795914 189.814408 Southern ... Vegetarian 0 0.945476
479 55.486722 189.661989 Southern ... Carnivore 0 0.946322
[480 rows x 7 columns]
4.2 Missing Data
Missing data is automatically filled in as np.nan
: numpy “Not a Number” (NaN
) values. While R identifies two distinct values for missing and empty data (the difference is quite nuanced), NA and NULL, respectively, Python collapses these two into one.
Let’s pull up that table of exotic fruits from a while ago and plug in a missing value to play around.
1
2
3
4
my_exotic_fruits_df = pd.DataFrame(my_exotic_fruits)
my_exotic_fruits_df.loc[0, 'fruit'] = np.nan
print(my_exotic_fruits_df)
1
2
3
4
5
6
fruit outside_color inside_color
0 NaN burgundy yellow
1 rambutan red white
2 mangosteen purple white
3 kiwano yellow green
4 atemoya green white
4.2.1 Dropping Missing Data
A common problem in research is getting rid of bad or missing data. To quickly clear your dataset of such occurrences, we use the following command: df.dropna(how="any")
1
print(my_exotic_fruits_df.dropna(how='any'))
1
2
3
4
5
fruit outside_color inside_color
1 rambutan red white
2 mangosteen purple white
3 kiwano yellow green
4 atemoya green white
4.2.2 Renaming Missing Data
On the contrary, what if we wanted to keep our missing data but assign a different placeholder name? We can do this by using the following command: df.fillna(value='foo')
(by the way, ‘foo’ is a common placeholder you’ll see)
1
print(my_exotic_fruits_df.fillna(value='unknown_fruit_data'))
1
2
3
4
5
6
fruit outside_color inside_color
0 unknown_fruit_data burgundy yellow
1 rambutan red white
2 mangosteen purple white
3 kiwano yellow green
4 atemoya green white
4.2.3 Locating Missing Data
Another crucial function when dealing with missing data is finding where the holes are actually located. We have three options:
pd.isna(df)
pd.isnull(obj)
pd.notna(obj)
pd.notnull(obj)
- Note that these are two functionally identical pairs with the vestigial syntactical difference carrying over from R
For each of these functions, Python will return what’s called a “boolean mask,” a version of the dataframe or object we inputted that has all its values masked to reflect whether the value is missing or not.
1
2
3
4
5
print(pd.isna(my_exotic_fruits_df))
print(pd.isnull(my_exotic_fruits_df))
print(pd.notna(my_exotic_fruits_df))
print(pd.notnull(my_exotic_fruits_df))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
fruit outside_color inside_color
0 True False False
1 False False False
2 False False False
3 False False False
4 False False False
fruit outside_color inside_color
0 True False False
1 False False False
2 False False False
3 False False False
4 False False False
fruit outside_color inside_color
0 False True True
1 True True True
2 True True True
3 True True True
4 True True True
fruit outside_color inside_color
0 False True True
1 True True True
2 True True True
3 True True True
4 True True True
5. Statistical Calculations
5.1 Descriptives
pandas
offers a relatively comprehensive suite of functions for pulling together descriptive statistics:
.sum()
.count()
.max()
.min()
.median()
.quantile([0.25, 0.75)
.mean()
.var()
.std()
.value_counts()
.nunique()
Each of the functions/methods above returns a single value. Alternatively, you could run df.describe
to provide a full output in Series
format.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
bodl = dragons_df['bodyLength']
print(bodl.sum())
print(bodl.count())
print(bodl.max())
print(bodl.min())
print(bodl.median())
print(bodl.quantile([0.25, 0.75]))
print(bodl.mean())
print(bodl.var())
print(bodl.std())
print('\n')
print(bodl.value_counts()) # frequency table
print('\n')
print(bodl.nunique()) # count of total unique values
print('\n')
print(bodl.describe()) # full(er) descriptives table
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
96631.9015289
480
236.36245290000002
162.3266403
202.91298255
0.25 191.762141
0.75 213.136002
Name: bodyLength, dtype: float64
201.3164615185418
262.83612105632886
16.212221348610093
195.131278 1
187.226044 1
198.559975 1
204.092492 1
170.686850 1
..
208.055977 1
234.915029 1
172.626085 1
200.156521 1
209.373511 1
Name: bodyLength, Length: 480, dtype: int64
480
count 480.000000
mean 201.316462
std 16.212221
min 162.326640
25% 191.762141
50% 202.912983
75% 213.136002
max 236.362453
Name: bodyLength, dtype: float64
5.1 Custom Functions
Depending on the complexity of your data manipulations, you may very well like to apply far more intricate functions onto your dataframe columns. For this, we have the .apply()
method.
While this method might initially appear to be a panacea, it should never be your first choice because it is fundamentally non-vectorized. What this means is that whatever function you apply to your dataframe column will be run on each row iteratively: in other words, highly inefficiently.
1
2
3
4
dragons_df['diet'] = dragons_df['diet'].apply(lambda diet : 'Pesce-Pollotarian'
if diet == 'Omnivore' else 'Vegan'
)
print(dragons_df['diet'].value_counts())
1
2
3
Vegan 313
Pesce-Pollotarian 167
Name: diet, dtype: int64
6. Data Transformations
We’re now going to talk about really awesome ways you can quickly combine and pivot your data!
6.1 Combining Data
When you’re looking to add data to an existing dataframe, you can either to choose to append more rows or more columns. The method is the same for both transformations, and all you need to do is specify the axis of interest:
- Add more rows:
pd.concat([df1,df2])
- Add more columns:
pd.concat([df1,df2], axis=1)
The axis parameter always defaults to axis=0
, so it is unnecessary to declare it unless you’re working on the separate axis (sideways). Note that the concatenation function requires the input dataframes to be enclosed in an array-like object.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
my_new_exotic_fruits_df = pd.DataFrame(
{'fruit':
['ice cream bean', 'caviar lime', 'granadilla', 'tamarillo', 'sapodilla'],
'outside_color':
['green', 'green', 'yellow', 'red', 'tan'],
'inside_color':
['white', 'clear', 'grey', 'orange', 'tan']
}
)
print(pd.concat([my_exotic_fruits_df, my_new_exotic_fruits_df],
ignore_index=True))
# if we set ignore_index to True, the index incrementer does not get all wonky
print(pd.concat([my_exotic_fruits_df, my_new_exotic_fruits_df], axis=1,
ignore_index=True))
# if we set ignore_index to True, the index incrementer does not get all wonky
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
fruit outside_color inside_color
0 NaN burgundy yellow
1 rambutan red white
2 mangosteen purple white
3 kiwano yellow green
4 atemoya green white
5 ice cream bean green white
6 caviar lime green clear
7 granadilla yellow grey
8 tamarillo red orange
9 sapodilla tan tan
0 1 2 3 4 5
0 NaN burgundy yellow ice cream bean green white
1 rambutan red white caviar lime green clear
2 mangosteen purple white granadilla yellow grey
3 kiwano yellow green tamarillo red orange
4 atemoya green white sapodilla tan tan
6.2 Pivoting Data
To pivot data long to wide (increase columns, decrease rows):
pd.pivot(index='idx', columns='var', values='val')
To pivot data wide to long (increase rows, decrease columns):
-
pd.melt(df)
-
Note that these are inverse operations, but they will not necessarily cancel each other out
1
2
3
4
5
6
7
8
9
10
11
my_exotic_fruits_df['rarity'] = \
['rare', 'common', 'uncommon', 'uncommon', 'rare']
print(my_exotic_fruits_df)
my_exotic_fruits_pvt_df = my_exotic_fruits_df.pivot(
index='rarity', columns='fruit', values='inside_color')
print(my_exotic_fruits_pvt_df)
my_exotic_fruits_pvt_df = pd.melt(
my_exotic_fruits_pvt_df)
print(my_exotic_fruits_pvt_df)
6.3 Stacking Data
We can also aggregate data into groups by using df.stack()
. This is reversible by df.unstack()
.
1
2
print(my_exotic_fruits_df.stack())
print(my_exotic_fruits_df.stack().unstack())
6.4 Indices
Okay, I’ve admittedly beaten around the bush for this one: what even are the row labels, the so-called “indices?”
They are an immutable object that stores axis data.
6.3.1 Re(setting) Indices
1
2
3
4
5
6
print(my_exotic_fruits_df)
my_exotic_fruits_df = my_exotic_fruits_df.set_index('fruit')
print(my_exotic_fruits_df) # new set of indices along which to reference
my_exotic_fruits_df = my_exotic_fruits_df.reset_index()
print(my_exotic_fruits_df) # indices returned to default
6.3.2 Multi-Indexing
It’s actually also possible to have a multi-index since you can set an index to be pairs of data rather than single values.
1
2
3
4
5
6
index = pd.MultiIndex.from_tuples([('exotic', 'tropical'),
('exotic', 'desert'),
('non-exotic', 'deciduous')],
names=['n','v'])
print(index)
1
2
3
4
MultiIndex([( 'exotic', 'tropical'),
( 'exotic', 'desert'),
('non-exotic', 'deciduous')],
names=['n', 'v'])
6.4 Merging Data
While we’ve already covered methods for combining data, we haven’t yet addressed a more nuanced alternative to forcing data on top of or next to each other.
This is where the adaptable pd.merge()
method comes into the picture:
pd.merge(df1, df2, how='direction', on='colname')
1
2
3
4
5
6
7
8
9
10
11
print(pd.merge(my_exotic_fruits_df, my_new_exotic_fruits_df,
how='left', on='fruit')) # Join corresponding rows by first df
print(pd.merge(my_exotic_fruits_df, my_new_exotic_fruits_df,
how='right', on='fruit')) # Join corresponding rows by second df
print(pd.merge(my_exotic_fruits_df, my_new_exotic_fruits_df,
how='inner', on='fruit')) # Keep rows shared across both sets
print(pd.merge(my_exotic_fruits_df, my_new_exotic_fruits_df,
how='outer', on='fruit')) # Keep all rows and values
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
fruit outside_color_x inside_color_x outside_color_y inside_color_y
0 NaN burgundy yellow NaN NaN
1 rambutan red white NaN NaN
2 mangosteen purple white NaN NaN
3 kiwano yellow green NaN NaN
4 atemoya green white NaN NaN
fruit outside_color_x inside_color_x outside_color_y inside_color_y
0 ice cream bean NaN NaN green white
1 caviar lime NaN NaN green clear
2 granadilla NaN NaN yellow grey
3 tamarillo NaN NaN red orange
4 sapodilla NaN NaN tan tan
Empty DataFrame
Columns: [fruit, outside_color_x, inside_color_x, outside_color_y, inside_color_y]
Index: []
fruit outside_color_x inside_color_x outside_color_y inside_color_y
0 NaN burgundy yellow NaN NaN
1 rambutan red white NaN NaN
2 mangosteen purple white NaN NaN
3 kiwano yellow green NaN NaN
4 atemoya green white NaN NaN
5 ice cream bean NaN NaN green white
6 caviar lime NaN NaN green clear
7 granadilla NaN NaN yellow grey
8 tamarillo NaN NaN red orange
9 sapodilla NaN NaN tan tan
6.5 Pipelines
A common occurrence in tidyverse-powered R scripts is the useful maggritr
pipeline symbol, %>%
. Thankfully, this functionality can be recreated in Python with ease by simply chaining together statements with a newline break after each successive function:
1
2
3
4
5
df = pd.melt(df)
.rename(columns={'variable' : 'var',
'value' : 'val'})
.rename(columns={'var1' : 'var2',
'value2' : 'va2'})
7. Conclusion and Additional Resources
Phew—that was a lot! As I mentioned previously, this tutorial only goes so far in covering the full spectrum of functionality afforded by pandas
. Once you’re familiar with the above techniques, you can certainly dive deeper into the documentation at your leisure.
Below, I’ve included links to several resources that I’ve found helpful and referenced for this walkthrough:
- Official Website
- 10 Minutes to pandas
- pandas Cookbook
- pandas Cheatsheet
- Learndatasci.com
- pandas: A complete introduction