Nathan Liang
Nathan Liang Lab manager at Princeton University working with Diana Tamir. Previously, a research assistant studying moral psychology with Felipe De Brigard, Walter Sinnott-Armstrong, and Paul Seli at Duke University.

Data Wrangling with Python + Pandas

1. Introduction

What’s up, everyone! We’re going to go over how to use pandas, a super nifty package for data wrangling in Python. And, no, unfortunately the package name has nothing to do with the adorable herbivorous bear: it’s a portmanteau of “panel” + “data”—the econometric equivalent of “longitudinal data” in psychological research. Eh.

To be clear, this is not an exhaustive guide to the package. For that, you would need to reference the official documentation itself, which, thankfully, is incredibly well organized and user-friendly.

The purpose of this tutorial, rather, is to capture the most important functionalities of the package to get you on your feet and running: to provide an intermediate level of proficiency that will likely suffice as a useful guide for most tasks you will encounter.

1.1 What is (are?) pandas?

pandas (always all lowercase) is an industry standard for basically everything you could possibly want to do in data science. It’s open-source, well-maintained through non-profit sponsorship, and one of the most widely used packages—all factors which ensure it will continue to be around for a long time!

According to their official website:

  • “pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.”
  • “Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.”

Indeed, it lives up to this reputation as a very useful tool that will save you lots of time once you work past the initial learning curve. Since the way pandas works is intentionally quite analogous to the functionality provided by R, having some R chops beforehand will certainly help (although this is not necessary).

1.2 Objectives

Before we get started, I’d like you to consider the matter of why one would choose to use pandas. Why not just use a GUI like Microsoft Excel to open a tabular datafile, run the required transformations, and then close it all without writing and compiling any scripts?

At least anecdotally, it can certainly be tempting to rely on GUI tools for data preprocessing instead of learning the more complex syntax. While I don’t mean to discourage your measured use of them (they can be powerful and efficient in their own right), one ought not become reliant on them in unwarranted circumstances.

My point, then, is to prove to you here that pandas is just as effective as industry standard GUIs (and R!) in getting the same types of tasks done.

Microsoft Excel
  Pros:
  • GUI: easy to click and drag selections
  • Customizable full visual display of dataset
  Cons:
  • Increased chance for human error
  • Cannot handle large datasets
  • Slower performance

pandas
  Pros:
  • Can assemble a reproducible pipeline for data analysis
  • Can automate vectorized operations on a massive scale
  • Reduced chance for human error
  Cons:
  • Takes time to learn and apply full functionality

2. How to Be a Power User

It’s kind of a tongue-in-cheek term, but a “power user” is basically someone who makes a point of squeezing every last drop of functionality (useful or not) out of whatever software or code tickles their fancy.

I’d like to present several useful tips and tricks you can implement in vanilla (i.e., regular old) Python, pandas, and other packages!

2.1 Anaconda Distribution

If you’re a research scientist working in Python, I strongly advise that you install the Anaconda Distribution before you do anything else. It’s an exceptional tool that bundles together package management and virtual environment support while also pre-installing many useful data science packages for you.

2.2 IDE vs. Text Editor

When programming in Python, you can choose between two umbrella categories of tools, integrated development environments (IDEs) and text editors. These interfaces almost always represent standalone software and need to be installed on your system separately from Python itself. Python comes bundled with the code editor IDLE, but this GUI is, in my view, woefully inadequate compared to the suite of freely available alternatives.

There’s not really a right answer here, but it’s important to understand the differences between these tools. An IDE is always going to be a beefed up version of a text editor, giving you high-powered tools that allow for debugging, compiling, building and deploying, and more. A text editor is, at its core, simply a space for scripting code that you’ll run sometime somewhere. There’s a spectrum between these, and the advantage of greater functionality generally comes at the cost of greater resource allocation (CPU and RAM). More complex by design, IDEs can also sometimes spontaneously crash.

Personally, I use the following tools for web development and data science:

2.3 Choosing a Typeface

A good typeface goes a very long way for readability and reduced eyestrain when programming projects drag on. By a longshot, my personal favorite monospace font has got to be JetBrains Mono. Not only is it open-source, but it also supports 100+ ligatures:


[Figure: sample of JetBrains Mono ligatures]

2.4 Keyboard Shortcuts

Keyboard shortcuts/hotkeys are probably a divisive topic. While there are commonly used preset key mappings, you know the saying:
“Different (key)strokes for different folks.” Also, it’s just burdensome to comb through a long, comprehensive list to identify the ones you’ll actually end up using. Still, for the majority of users, I believe there are a couple of need-to-know shortcuts you may have overlooked. In the table below, I’ve included my picks of the bunch. If you make a habit of using them, I can guarantee they will do wonders for your productivity.

  Action                                      Windows                Mac
  Switch between open windows                 Alt + Tab              ⌘ + Tab
  Switch between open windows (in reverse)    Alt + Shift + Tab      ⌘ + Shift + Tab
  Switch between open tabs                    Ctrl + Tab
  Switch between open tabs (in reverse)       Ctrl + Shift + Tab
  Comment out highlighted code block
    or inline code at cursor position         Ctrl + /               ⌘ + /
  Move cursor left/right by word              Ctrl + ← / →           ⌥ + ← / →
  Select by moving cursor
    left/right by character                   Shift + ← / →
  Select by moving cursor
    left/right by word                        Ctrl + Shift + ← / →   ⌥ + Shift + ← / →

2.5 Vectorization

We always want to optimize our code for memory efficiency and speed. Because dataframe objects are already quite expensive to store, a reliable rule of thumb is to vectorize any operations you perform on them whenever possible.

Vectorization is a superior alternative to writing explicit for and while loops because it cuts execution time dramatically and is almost always more concise. We’ll go into this in more detail later on, but I’d like to introduce the idea before we move on: apply an operation to an entire array at once rather than to its constituent components piecemeal.

On a mechanistic level, vectorization transfers the burden of looping from the Python interpreter to the optimized, precompiled C code underlying NumPy and pandas.
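To make the idea concrete, here is a minimal sketch (a toy array, not the dragons dataset) comparing an explicit Python loop against the equivalent vectorized NumPy expression. Both produce identical results, but the vectorized form delegates the loop to C:

```python
import numpy as np

# Toy array standing in for a dataframe column
scores = np.random.default_rng(0).uniform(0, 100, size=10_000)

# Explicit loop: Python itself executes 10,000 iterations
squared_loop = np.empty_like(scores)
for i in range(len(scores)):
    squared_loop[i] = scores[i] ** 2

# Vectorized: one expression; the looping happens in precompiled C
squared_vec = scores ** 2

print(np.allclose(squared_loop, squared_vec))  # True
```

On arrays of this size the vectorized version is typically orders of magnitude faster; try timing both with %timeit if you’re curious.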

2.6 PEP8

The last thing I must mention is PEP8: this is to Python what APA7 is to psychology researchers. It’s a style guide that exhaustively covers all the formatting rules and best practices for programming in the language.

PEP stands for Python Enhancement Proposals, and the style guide is only one of many hundreds of meticulously curated documents addressing every last detail of how best to write and distribute code. I would recommend referencing it on an as-needed basis because it’s basically a bottomless well of information. Still, it’s a powerful resource that every Python programmer should know.
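By way of a tiny made-up example, here is what a few core PEP8 conventions look like in practice: snake_case names, four-space indentation, spaces around binary operators, and a docstring:

```python
def mean_test_score(scores):
    """Return the arithmetic mean of a sequence of scores."""
    # snake_case names, 4-space indents, spaces around operators (PEP8)
    total = sum(scores)
    return total / len(scores)


print(mean_test_score([80, 90, 100]))  # 90.0
```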

3. Getting Started

Let’s get started! As you’ll begin to see if you haven’t already, pandas and numpy (Numerical Python) go together like bread and butter. Where you see one, you’ll surely see the other. We will always be importing both as “aliases” or abbreviations (like many, but not all, other packages!) because:

“All things being equal, the simplest [readable] solution tends to be the best one [holding memory efficiency and compile time relatively equal].”

—Occam's CS Razor [citation needed]

from google.colab import drive  # let's initialize colab
import pandas as pd  # this is standard import aliasing, always use it
import numpy as np  # this is standard import aliasing, always use it

drive.mount('/content/drive', force_remount=True)

data_dir = "drive/MyDrive/2020-10-21-dragon-data.csv"
Mounted at /content/drive

3.1 Reading in Data

Alright, so to begin, we’re going to start with an example. Looking at dragons again! Yay! 🐉

Let’s say we have a .csv file—you know, that handy file type that almost always ends up being what holds all of your good good experiment data. That’s not to say, however, that you can’t open up tabular datafiles of other types. Other common ones are .tsv (tab-separated) and .xlsx (Microsoft Excel), but the list goes on:

  • SPSS
  • SQL
  • Pickle (yes, this is a real type of data storage—and an efficient one as well! 🥒)
  • …and many, many more!
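Each of these formats has a corresponding pd.read_* function (pd.read_spss, pd.read_sql, pd.read_pickle, and so on). As a minimal sketch, here is a toy dataframe (invented for illustration) round-tripped through a pickle file:

```python
import os
import tempfile
import pandas as pd

# Toy dataframe; the column names are made up for this example
toy_df = pd.DataFrame({'testScore': [88.0, 92.5], 'color': ['Blue', 'Red']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl')
    toy_df.to_pickle(path)           # write to disk
    restored = pd.read_pickle(path)  # read it right back

print(restored.equals(toy_df))  # True
```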
dragons_df = pd.read_csv(data_dir)  # read
print(dragons_df)  # let's see what it looks like!

# Let's see how Python classifies this new object and its columns
print('Object type: ' + str(type(dragons_df)) + '\n')
print('Object column type(s): \n' + str(dragons_df.dtypes))

# ----------------------------------------------------- #
# If data_dir were a .tsv file, you'd have two options: #
# ----------------------------------------------------- #
# dragons_df = pd.read_table(data_dir) 
# dragons_df = pd.read_csv(data_dir, sep='\t')

# ------------------------------- #
# If data_dir were an .xlsx file: #
# ------------------------------- #
# dragons_df = pd.read_excel(data_dir)
     testScore  bodyLength mountainRange   color        diet  breathesFire
0     0.000000  175.512214      Bavarian    Blue   Carnivore             1
1     0.742914  190.640959      Bavarian    Blue   Carnivore             1
2     2.501825  169.708778      Bavarian    Blue   Carnivore             1
3     3.380430  188.847204      Bavarian    Blue   Carnivore             1
4     4.582095  174.221700      Bavarian    Blue   Carnivore             0
..         ...         ...           ...     ...         ...           ...
475  41.806995  186.132578      Southern  Yellow  Vegetarian             0
476  42.650015  192.272043      Southern  Yellow  Vegetarian             0
477  44.582490  184.038015      Southern  Yellow  Vegetarian             1
478  47.795914  189.814408      Southern  Yellow  Vegetarian             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0

[480 rows x 6 columns]
Object type: <class 'pandas.core.frame.DataFrame'>

Object column type(s): 
testScore        float64
bodyLength       float64
mountainRange     object
color             object
diet              object
breathesFire       int64
dtype: object

Woohoo! Our data has been successfully loaded into Python as a DataFrame object. Plus, we confirmed this by calling the base type() function and then identified the type of data in each column by accessing this stored information via .dtypes.

As you can see, Python does not print out the entire dataframe by default. That being said, it still gives you a decent sense of your data with dimensions, header names, and the first/last couple rows. If we had many more columns in this dataset, the ... would also appear in a pseudo-column to represent the columns not displayed between the first/last couple columns.

If, however, you would like to inspect a greater proportion of your data visually within Python, you can use the pd.option_context() context manager and customize the number of rows and columns to be displayed:

# feel free to play around with the numbers in the below statement!
with pd.option_context('display.max_rows', 8, 'display.max_columns', 8):
  print(dragons_df)
     testScore  bodyLength mountainRange   color        diet  breathesFire
0     0.000000  175.512214      Bavarian    Blue   Carnivore             1
1     0.742914  190.640959      Bavarian    Blue   Carnivore             1
2     2.501825  169.708778      Bavarian    Blue   Carnivore             1
3     3.380430  188.847204      Bavarian    Blue   Carnivore             1
..         ...         ...           ...     ...         ...           ...
476  42.650015  192.272043      Southern  Yellow  Vegetarian             0
477  44.582490  184.038015      Southern  Yellow  Vegetarian             1
478  47.795914  189.814408      Southern  Yellow  Vegetarian             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0

[480 rows x 6 columns]

3.1.1 Heads, Tails, and Duplicates

A nifty option you have is to ask pandas to only display the top or bottom few rows by calling: df.head() or df.tail()

Another particularly useful method will remove duplicate rows (according to specified column values): df.drop_duplicates()
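A note on semantics: by default, .drop_duplicates() keeps the first occurrence of each duplicated value; its keep parameter ('first', 'last', or False) controls which rows survive. A toy sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'testScore': [50.0, 50.0, 75.0],
                    'color': ['Blue', 'Red', 'Blue']})

# Which row indices survive under each keep policy?
print(list(toy.drop_duplicates('testScore').index))               # [0, 2]
print(list(toy.drop_duplicates('testScore', keep='last').index))  # [1, 2]
print(list(toy.drop_duplicates('testScore', keep=False).index))   # [2]
```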

print(dragons_df.head(14))  # prints the first (leading) n rows
print(dragons_df.tail(2))  # prints the last (trailing) n rows

print(dragons_df.drop_duplicates('testScore'))
# keeps only the first row for each unique test score
    testScore  bodyLength mountainRange color       diet  breathesFire
0    0.000000  175.512214      Bavarian  Blue  Carnivore             1
1    0.742914  190.640959      Bavarian  Blue  Carnivore             1
2    2.501825  169.708778      Bavarian  Blue  Carnivore             1
3    3.380430  188.847204      Bavarian  Blue  Carnivore             1
4    4.582095  174.221700      Bavarian  Blue  Carnivore             0
5   12.453635  183.081931      Bavarian  Blue  Carnivore             1
6   16.176758  174.839935      Bavarian  Blue  Carnivore             1
7   16.249164  182.876708      Bavarian  Blue  Carnivore             1
8   16.852420  177.820550      Bavarian  Blue   Omnivore             1
9   18.452767  176.740267      Bavarian  Blue  Carnivore             1
10  20.484229  182.921515      Bavarian  Blue  Carnivore             1
11  20.489804  170.686850      Bavarian  Blue  Carnivore             1
12  22.563014  179.574357      Bavarian  Blue  Carnivore             1
13  24.148017  180.063446      Bavarian  Blue  Carnivore             1
     testScore  bodyLength mountainRange   color        diet  breathesFire
478  47.795914  189.814408      Southern  Yellow  Vegetarian             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0
     testScore  bodyLength mountainRange   color        diet  breathesFire
0     0.000000  175.512214      Bavarian    Blue   Carnivore             1
1     0.742914  190.640959      Bavarian    Blue   Carnivore             1
2     2.501825  169.708778      Bavarian    Blue   Carnivore             1
3     3.380430  188.847204      Bavarian    Blue   Carnivore             1
4     4.582095  174.221700      Bavarian    Blue   Carnivore             0
..         ...         ...           ...     ...         ...           ...
475  41.806995  186.132578      Southern  Yellow  Vegetarian             0
476  42.650015  192.272043      Southern  Yellow  Vegetarian             0
477  44.582490  184.038015      Southern  Yellow  Vegetarian             1
478  47.795914  189.814408      Southern  Yellow  Vegetarian             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0

[470 rows x 6 columns]

Now, that’s all fine and dandy. We did what we set out to do, and, for all intents and purposes, that’s basically all we’re going to have to write out to get the data sitting in a tabular datafile into Python.

But it’s important to note that you have far more flexibility in deciding how to read in your data. In fact, the options are honestly overwhelming. Just check out the list of parameters you could technically tweak to your heart’s desire in a single function call. Everything below shows the default values, which are like nonsignificant zeros: you only need to state them explicitly when deviating from the presets.

pandas.read_csv(filepath_or_buffer, 
                sep=<object object>, 
                delimiter=None, 
                header='infer', 
                names=None, 
                index_col=None, 
                usecols=None, 
                squeeze=False, 
                prefix=None, 
                mangle_dupe_cols=True, 
                dtype=None, 
                engine=None,
                converters=None, 
                true_values=None, 
                false_values=None, 
                skipinitialspace=False, 
                skiprows=None, 
                skipfooter=0, 
                nrows=None, 
                na_values=None, 
                keep_default_na=True, 
                na_filter=True, 
                verbose=False, 
                skip_blank_lines=True, 
                parse_dates=False, 
                infer_datetime_format=False, 
                keep_date_col=False, 
                date_parser=None, 
                dayfirst=False, 
                cache_dates=True,
                iterator=False, 
                chunksize=None, 
                compression='infer', 
                thousands=None, 
                decimal='.', 
                lineterminator=None, 
                quotechar='"', 
                quoting=0, 
                doublequote=True, 
                escapechar=None, 
                comment=None, 
                encoding=None, 
                dialect=None, 
                error_bad_lines=True, 
                warn_bad_lines=True, 
                delim_whitespace=False, 
                low_memory=True, 
                memory_map=False, 
                float_precision=None, 
                storage_options=None)

Visually a little much, eh? This is why being told to “just go read the documentation” can often be so discouraging—if you’ve just stepped in the shallow end, how do you even know where to start?
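The upside is that, in practice, you only ever override the handful of defaults that matter for your particular file. A hedged sketch (using io.StringIO to stand in for a file on disk; the contents are invented):

```python
import io
import pandas as pd

raw = """# exported by some hypothetical instrument
id;score
1;88.5
2;NA
3;91.0
"""

df = pd.read_csv(io.StringIO(raw),
                 sep=';',           # semicolon-delimited rather than comma
                 comment='#',       # ignore the commented first line
                 na_values=['NA'])  # treat the string 'NA' as missing

print(len(df), df['score'].isna().sum())  # 3 1
```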

3.2 Selecting Data

3.2.1 Slicing and Dicing

We select (or subset) our data of interest from the master dataframe either by name (label) or by index (position). Because a pandas dataframe is a 2D object, we have to specify two arguments: one for the rows and the other for the columns we want. This is just like R syntax!

  • To select by label: df.loc[row_names, col_names]
  • In this case, both the start and stop rows/columns are included in the slice!
  • To select by position: df.iloc[row_indices, col_indices]
  • In this case, just like in vanilla Python, the stop index is not included in the slice.
# Selecting by label:
print(dragons_df.loc[:,'bodyLength':'breathesFire'])  # all rows, some columns

# Selecting by position:
print(dragons_df.iloc[:, 0])  # all rows, first column
print(dragons_df.iloc[:, 0:2])  # all rows, first two columns
print(dragons_df.iloc[0:2, :])  # first two rows, all columns
print(dragons_df.iloc[0:2, 0:2])  # first two rows, first and second columns
     bodyLength mountainRange   color        diet  breathesFire
0    175.512214      Bavarian    Blue   Carnivore             1
1    190.640959      Bavarian    Blue   Carnivore             1
2    169.708778      Bavarian    Blue   Carnivore             1
3    188.847204      Bavarian    Blue   Carnivore             1
4    174.221700      Bavarian    Blue   Carnivore             0
..          ...           ...     ...         ...           ...
475  186.132578      Southern  Yellow  Vegetarian             0
476  192.272043      Southern  Yellow  Vegetarian             0
477  184.038015      Southern  Yellow  Vegetarian             1
478  189.814408      Southern  Yellow  Vegetarian             0
479  189.661989      Southern  Yellow   Carnivore             0

[480 rows x 5 columns]
0       0.000000
1       0.742914
2       2.501825
3       3.380430
4       4.582095
         ...    
475    41.806995
476    42.650015
477    44.582490
478    47.795914
479    55.486722
Name: testScore, Length: 480, dtype: float64
     testScore  bodyLength
0     0.000000  175.512214
1     0.742914  190.640959
2     2.501825  169.708778
3     3.380430  188.847204
4     4.582095  174.221700
..         ...         ...
475  41.806995  186.132578
476  42.650015  192.272043
477  44.582490  184.038015
478  47.795914  189.814408
479  55.486722  189.661989

[480 rows x 2 columns]
   testScore  bodyLength mountainRange color       diet  breathesFire
0   0.000000  175.512214      Bavarian  Blue  Carnivore             1
1   0.742914  190.640959      Bavarian  Blue  Carnivore             1
   testScore  bodyLength
0   0.000000  175.512214
1   0.742914  190.640959

3.2.2 Single Column Extraction

If you’re just looking to access one column, you have two options:

  • Index operator, df[col_name]
  • Attribute operator, df.colname
  • Notice, however, that these two only work for labels, not for indices.
  • You can think of these as offering the same functionality as df$colname in R.
# Selecting a single column by label:
print(dragons_df['bodyLength'])  # index operator []
print(dragons_df.bodyLength)  # attribute operator: no brackets or quotes

# Attempting (unsuccessfully) to select a single column by index:
# print(dragons_df.0)
# print(dragons_df[0])
0      175.512214
1      190.640959
2      169.708778
3      188.847204
4      174.221700
          ...    
475    186.132578
476    192.272043
477    184.038015
478    189.814408
479    189.661989
Name: bodyLength, Length: 480, dtype: float64
0      175.512214
1      190.640959
2      169.708778
3      188.847204
4      174.221700
          ...    
475    186.132578
476    192.272043
477    184.038015
478    189.814408
479    189.661989
Name: bodyLength, Length: 480, dtype: float64

There is one caveat, however, to using the index operator with integer positions: if you pass a slice, you’ll get the corresponding rows back. This happens because pandas interprets a bare slice along only one axis: the rows. If you’re looking to slice columns, you’ll need to use .iloc.

print(dragons_df[0:1])  # this does work, but only for sliced rows
   testScore  bodyLength mountainRange color       diet  breathesFire
0        0.0  175.512214      Bavarian  Blue  Carnivore             1

3.2.3 Boolean Indexing

We can subset using conditional truth statements using boolean operators:

  • Greater than: >
  • Less than: <
  • Greater than or equal to: >=
  • Less than or equal to: <=
  • Equal to: ==
  • Unequal to: !=

To filter by row values, we use the .isin() method that accepts a list-like object.
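You can also combine boolean conditions with & (and), | (or), and ~ (not); note that each condition needs its own parentheses because of Python’s operator precedence. A toy sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({'testScore': [40.0, 70.0, 90.0],
                    'breathesFire': [1, 0, 1]})

# Passing scorers that also breathe fire: parentheses are required
mask = (toy['testScore'] > 65) & (toy['breathesFire'] == 1)
print(list(toy[mask].index))   # [2]

# Everything that fails at least one of the two conditions
print(list(toy[~mask].index))  # [0, 1]
```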

# Subset only the rows for the dragons that received a passing score
print(dragons_df[dragons_df['testScore'] > 65])

# Filter out and keep only the rows for the dragons who cannot breathe fire
print(dragons_df[dragons_df['breathesFire'].isin([0])])  # must enclose in list
     testScore  bodyLength mountainRange   color        diet  breathesFire
33   67.382720  212.229895       Central    Blue    Omnivore             1
34   71.080284  214.752198       Central    Blue    Omnivore             1
35   73.119070  213.582062       Central    Blue    Omnivore             1
36   75.757683  204.952743       Central    Blue  Vegetarian             1
37   76.930157  215.294360       Central    Blue    Omnivore             1
..         ...         ...           ...     ...         ...           ...
455  76.515129  212.282806       Sarntal  Yellow    Omnivore             0
456  77.152326  215.961638       Sarntal  Yellow    Omnivore             0
457  81.883610  211.153312       Sarntal  Yellow    Omnivore             0
458  87.413974  205.694126       Sarntal  Yellow    Omnivore             0
459  94.106096  216.727091       Sarntal  Yellow    Omnivore             0

[136 rows x 6 columns]
     testScore  bodyLength mountainRange   color        diet  breathesFire
4     4.582095  174.221700      Bavarian    Blue   Carnivore             0
14   25.201276  182.353952      Bavarian    Blue   Carnivore             0
25   51.768306  210.940255       Central    Blue  Vegetarian             0
32   59.540570  215.965209       Central    Blue  Vegetarian             0
42   24.651235  206.677342      Emmental    Blue   Carnivore             0
..         ...         ...           ...     ...         ...           ...
474  41.045542  190.352825      Southern  Yellow  Vegetarian             0
475  41.806995  186.132578      Southern  Yellow  Vegetarian             0
476  42.650015  192.272043      Southern  Yellow  Vegetarian             0
478  47.795914  189.814408      Southern  Yellow  Vegetarian             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0

[229 rows x 6 columns]

3.2.4 Scalar Value Selection

There are actually quicker versions of the aforementioned .loc[] and .iloc[] indexers that you can use when you’re only interested in retrieving a single scalar value, and they work the same way:

  • Scalar access by label: .at[]
  • Scalar access by position: .iat[]
# Selecting by label:
print(dragons_df.at[0, 'bodyLength'])  # first row, 'bodyLength' column

# Selecting by position:
print(dragons_df.iat[0, 0])  # first row, first column
175.5122137
0.0

3.2.5 Ordered and Random Selection

Whether for ease of data review or because you’re setting up a subsequent calculation, you can also order a subset of rows in either ascending or descending order:

  • Select and sort the n largest entries: df.nlargest(n, 'colname')
  • Select and sort the n smallest entries: df.nsmallest(n, 'colname')

If, instead of an ordered sample, you’d like to obtain a random sample of n rows, you could use the following: df.sample(n)
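One small tip: .sample() draws a fresh random sample on every call, so if you need the draw to be reproducible (say, in an analysis pipeline), pass a random_state seed. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({'testScore': range(10)})

# Same seed, same rows, every run
s1 = toy.sample(3, random_state=42)
s2 = toy.sample(3, random_state=42)
print(s1.equals(s2))  # True
```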

print(dragons_df.nlargest(50, 'testScore'))
print(dragons_df.nsmallest(50, 'testScore'))

print(dragons_df.sample(100))
      testScore  bodyLength mountainRange   color        diet  breathesFire
79   100.000000  209.490674        Julian    Blue    Omnivore             1
119  100.000000  198.874616      Maritime    Blue    Omnivore             1
299  100.000000  193.423645       Sarntal     Red    Omnivore             1
359  100.000000  200.182169       Central  Yellow    Omnivore             0
436  100.000000  191.248929      Maritime  Yellow    Omnivore             0
437  100.000000  192.917439      Maritime  Yellow    Omnivore             0
438  100.000000  196.546624      Maritime  Yellow    Omnivore             0
439  100.000000  200.272381      Maritime  Yellow    Omnivore             0
435   99.605642  199.100041      Maritime  Yellow    Omnivore             0
78    98.499816  215.733178        Julian    Blue    Omnivore             1
239   97.647935  216.177762        Julian     Red    Omnivore             1
279   95.106090  204.671957      Maritime     Red  Vegetarian             0
298   94.545010  195.590678       Sarntal     Red    Omnivore             1
77    94.116335  230.650417        Julian    Blue    Omnivore             1
459   94.106096  216.727091       Sarntal  Yellow    Omnivore             0
118   92.760852  202.011059      Maritime    Blue    Omnivore             1
238   92.632400  214.918855        Julian     Red    Omnivore             1
434   91.094122  195.293005      Maritime  Yellow    Omnivore             0
199   90.031156  209.131404       Central     Red   Carnivore             1
433   89.961323  208.371043      Maritime  Yellow    Omnivore             1
358   89.915472  214.406738       Central  Yellow    Omnivore             0
117   89.815497  202.899173      Maritime    Blue    Omnivore             1
399   89.042531  232.513680        Julian  Yellow    Omnivore             0
116   88.550200  192.770727      Maritime    Blue    Omnivore             1
458   87.413974  205.694126       Sarntal  Yellow    Omnivore             0
432   87.371212  204.092492      Maritime  Yellow   Carnivore             0
278   87.201463  209.072897      Maritime     Red    Omnivore             1
297   86.984696  193.414687       Sarntal     Red    Omnivore             1
76    85.499126  220.173495        Julian    Blue    Omnivore             1
398   85.072554  227.324594        Julian  Yellow    Omnivore             0
431   84.767627  193.269179      Maritime  Yellow  Vegetarian             0
59    84.272785  216.582753      Emmental    Blue    Omnivore             1
296   84.269416  187.589529       Sarntal     Red    Omnivore             1
237   84.192600  210.784914        Julian     Red    Omnivore             1
277   83.969242  194.153313      Maritime     Red    Omnivore             1
115   83.941677  200.733784      Maritime    Blue    Omnivore             1
236   83.497061  202.689625        Julian     Red    Omnivore             0
235   83.168146  209.188844        Julian     Red    Omnivore             1
295   82.853902  195.131278       Sarntal     Red    Omnivore             0
276   82.251778  199.007122      Maritime     Red    Omnivore             0
259   82.127343  204.082620      Ligurian     Red    Omnivore             0
294   82.012468  187.591883       Sarntal     Red    Omnivore             1
457   81.883610  211.153312       Sarntal  Yellow    Omnivore             0
39    81.471137  220.156269       Central    Blue    Omnivore             1
114   81.470769  202.537720      Maritime    Blue    Omnivore             1
38    81.138248  201.321573       Central    Blue    Omnivore             1
113   80.892279  196.908408      Maritime    Blue    Omnivore             1
75    80.448654  219.178612        Julian    Blue    Omnivore             0
74    80.412169  217.214241        Julian    Blue  Vegetarian             1
139   80.292789  201.469309       Sarntal    Blue    Omnivore             1
     testScore  bodyLength mountainRange   color        diet  breathesFire
0     0.000000  175.512214      Bavarian    Blue   Carnivore             1
140   0.000000  173.532588      Southern    Blue   Carnivore             1
200   0.000000  189.605685      Emmental     Red   Carnivore             1
300   0.000000  171.050730      Southern     Red   Carnivore             1
1     0.742914  190.640959      Bavarian    Blue   Carnivore             1
2     2.501825  169.708778      Bavarian    Blue   Carnivore             1
160   2.557890  169.619382      Bavarian     Red   Carnivore             0
460   2.599976  184.135925      Southern  Yellow   Carnivore             0
161   3.282949  179.000774      Bavarian     Red   Carnivore             1
3     3.380430  188.847204      Bavarian    Blue   Carnivore             1
162   3.597530  167.510430      Bavarian     Red   Carnivore             1
163   3.875730  164.416273      Bavarian     Red   Carnivore             0
4     4.582095  174.221700      Bavarian    Blue   Carnivore             0
164   6.038333  165.882952      Bavarian     Red   Carnivore             1
400   6.411029  224.721753      Ligurian  Yellow   Carnivore             1
165   7.358565  179.606068      Bavarian     Red   Carnivore             1
320  10.825361  191.932215      Bavarian  Yellow   Carnivore             0
5    12.453635  183.081931      Bavarian    Blue   Carnivore             1
120  13.080916  199.999325       Sarntal    Blue   Carnivore             1
321  14.013776  193.244146      Bavarian  Yellow   Carnivore             0
141  14.381638  175.438152      Southern    Blue   Carnivore             1
461  14.488204  196.060398      Southern  Yellow  Vegetarian             0
80   14.565634  206.029133      Ligurian    Blue   Carnivore             1
360  14.916917  220.749686      Emmental  Yellow   Carnivore             0
462  14.999489  193.616367      Southern  Yellow   Carnivore             0
81   15.042811  214.375613      Ligurian    Blue   Carnivore             1
40   15.399484  211.185196      Emmental    Blue   Carnivore             1
401  15.606387  225.451337      Ligurian  Yellow   Carnivore             0
463  15.923599  188.993337      Southern  Yellow   Carnivore             0
166  16.147309  165.548520      Bavarian     Red   Carnivore             1
6    16.176758  174.839935      Bavarian    Blue   Carnivore             1
7    16.249164  182.876708      Bavarian    Blue   Carnivore             1
167  16.285131  170.819319      Bavarian     Red   Carnivore             1
361  16.829491  222.505211      Emmental  Yellow   Carnivore             0
8    16.852420  177.820550      Bavarian    Blue    Omnivore             1
142  17.633001  176.474214      Southern    Blue   Carnivore             1
168  18.442801  175.970409      Bavarian     Red   Carnivore             1
9    18.452767  176.740267      Bavarian    Blue   Carnivore             1
169  18.467754  169.770832      Bavarian     Red   Carnivore             1
301  18.578797  169.845711      Southern     Red   Carnivore             1
170  18.838821  167.685525      Bavarian     Red   Carnivore             1
302  18.961916  168.670439      Southern     Red   Carnivore             1
322  19.095881  192.895791      Bavarian  Yellow    Omnivore             0
464  20.010852  187.867885      Southern  Yellow   Carnivore             0
201  20.017659  196.343091      Emmental     Red  Vegetarian             0
465  20.030181  191.637876      Southern  Yellow   Carnivore             0
171  20.108537  170.053440      Bavarian     Red   Carnivore             1
362  20.122821  217.988338      Emmental  Yellow   Carnivore             0
172  20.242590  171.093272      Bavarian     Red   Carnivore             1
323  20.462621  185.663441      Bavarian  Yellow   Carnivore             0
     testScore  bodyLength mountainRange   color        diet  breathesFire
189  42.823690  213.113375       Central     Red  Vegetarian             1
61   53.393965  222.517254        Julian    Blue  Vegetarian             1
186  40.080020  216.481476       Central     Red  Vegetarian             1
134  71.095966  200.127441       Sarntal    Blue    Omnivore             1
11   20.489804  170.686850      Bavarian    Blue   Carnivore             1
..         ...         ...           ...     ...         ...           ...
258  58.694412  205.190403      Ligurian     Red  Vegetarian             0
197  67.121844  201.348759       Central     Red    Omnivore             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0
93   47.656285  210.898075      Ligurian    Blue  Vegetarian             1
292  71.861195  187.583061       Sarntal     Red    Omnivore             1

[100 rows x 6 columns]

3.2.6 Compound Selection

We can also subset data that fits multiple conditions:

  • Conjunctive (“and”) logic: &
  • Disjunctive (“or”) logic: |
  • Exclusive or (“xor”) logic: ^
  • If this is confusing, you can decompose it into the following:

    a ^ b = (a | b) & (~a | ~b)

    a ^ b = (a & ~b) | (~a & b)

  • Not: ~
  • Make sure to wrap individual conditions in ()
# Subset rows for which both conditions are true
print(dragons_df[(dragons_df['testScore'] > 65) & 
                 (dragons_df['breathesFire'].isin([0]))])

# Subset rows for which at least one condition is true
print(dragons_df[(dragons_df['testScore'] > 65) | 
                 (dragons_df['breathesFire'].isin([0]))])

# Subset rows for which the first condition is true and the second is false
print(dragons_df[(dragons_df['testScore'] > 65) & 
                 ~(dragons_df['breathesFire'].isin([0]))])

# Subset rows for which there is a mismatch in truth values of conditions
print(dragons_df[(dragons_df['testScore'] > 65) ^ 
                 (dragons_df['breathesFire'].isin([0]))])
     testScore  bodyLength mountainRange   color      diet  breathesFire
75   80.448654  219.178612        Julian    Blue  Omnivore             0
110  70.153055  204.432935      Maritime    Blue  Omnivore             0
196  66.766388  201.759702       Central     Red  Omnivore             0
197  67.121844  201.348759       Central     Red  Omnivore             0
227  66.689827  207.957300        Julian     Red  Omnivore             0
..         ...         ...           ...     ...       ...           ...
455  76.515129  212.282806       Sarntal  Yellow  Omnivore             0
456  77.152326  215.961638       Sarntal  Yellow  Omnivore             0
457  81.883610  211.153312       Sarntal  Yellow  Omnivore             0
458  87.413974  205.694126       Sarntal  Yellow  Omnivore             0
459  94.106096  216.727091       Sarntal  Yellow  Omnivore             0

[71 rows x 6 columns]
     testScore  bodyLength mountainRange   color        diet  breathesFire
4     4.582095  174.221700      Bavarian    Blue   Carnivore             0
14   25.201276  182.353952      Bavarian    Blue   Carnivore             0
25   51.768306  210.940255       Central    Blue  Vegetarian             0
32   59.540570  215.965209       Central    Blue  Vegetarian             0
33   67.382720  212.229895       Central    Blue    Omnivore             1
..         ...         ...           ...     ...         ...           ...
474  41.045542  190.352825      Southern  Yellow  Vegetarian             0
475  41.806995  186.132578      Southern  Yellow  Vegetarian             0
476  42.650015  192.272043      Southern  Yellow  Vegetarian             0
478  47.795914  189.814408      Southern  Yellow  Vegetarian             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0

[294 rows x 6 columns]
      testScore  bodyLength mountainRange   color        diet  breathesFire
33    67.382720  212.229895       Central    Blue    Omnivore             1
34    71.080284  214.752198       Central    Blue    Omnivore             1
35    73.119070  213.582062       Central    Blue    Omnivore             1
36    75.757683  204.952743       Central    Blue  Vegetarian             1
37    76.930157  215.294360       Central    Blue    Omnivore             1
..          ...         ...           ...     ...         ...           ...
297   86.984696  193.414687       Sarntal     Red    Omnivore             1
298   94.545010  195.590678       Sarntal     Red    Omnivore             1
299  100.000000  193.423645       Sarntal     Red    Omnivore             1
433   89.961323  208.371043      Maritime  Yellow    Omnivore             1
452   73.617034  209.306230       Sarntal  Yellow    Omnivore             1

[65 rows x 6 columns]
     testScore  bodyLength mountainRange   color        diet  breathesFire
4     4.582095  174.221700      Bavarian    Blue   Carnivore             0
14   25.201276  182.353952      Bavarian    Blue   Carnivore             0
25   51.768306  210.940255       Central    Blue  Vegetarian             0
32   59.540570  215.965209       Central    Blue  Vegetarian             0
33   67.382720  212.229895       Central    Blue    Omnivore             1
..         ...         ...           ...     ...         ...           ...
474  41.045542  190.352825      Southern  Yellow  Vegetarian             0
475  41.806995  186.132578      Southern  Yellow  Vegetarian             0
476  42.650015  192.272043      Southern  Yellow  Vegetarian             0
478  47.795914  189.814408      Southern  Yellow  Vegetarian             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0

[223 rows x 6 columns]
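As an aside, the same compound conditions can often be written more readably with df.query(), which accepts the whole logical expression as a single string. Here is a minimal sketch on a toy stand-in for dragons_df (the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for dragons_df
df = pd.DataFrame({'testScore': [70, 50, 90],
                   'breathesFire': [0, 1, 0]})

# Equivalent to df[(df['testScore'] > 65) & (df['breathesFire'] == 0)]
subset = df.query('testScore > 65 and breathesFire == 0')
print(subset)
```

Inside a query string, `and`/`or`/`not` replace `&`/`|`/`~`, and no extra parentheses are needed around individual conditions.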

3.2.7 Complex Boolean Logic

For more advanced logical operations designed specifically for dataframes, you also have the following two complementary options:

  • Check if at least one value in each dataframe column is True:
  • df.any()
  • Check if all the values in each dataframe column are True:
  • df.all()

Any given pandas element is treated as True by default. An element only registers as False when it is zero, empty, or missing.

print(dragons_df.any())
print(dragons_df.all())
testScore        True
bodyLength       True
mountainRange    True
color            True
diet             True
breathesFire     True
dtype: bool
testScore        False
bodyLength        True
mountainRange     True
color             True
diet              True
breathesFire     False
dtype: bool

3.3 Data Types

There are two object types that pandas works with: DataFrame and Series.

Now, I don’t know about you, but the un-R-esque capital letters annoy me. Unfortunately, that’s just what we have to work with. So how are they different? A Series is 1D while the DataFrame is 2D and built out of a collection of Series. However, a Series is more specifically a labeled 1D array, which means you can think of it as a single column with corresponding row labels preserved.

If we check the type of a single extracted column, we see that it is a Series. If we instead take a bunch of columns together, they remain grouped as one DataFrame object. That said, it's also totally possible to have a single-column DataFrame object.

print(type(dragons_df))  # type of the entire dataframe
print(type(dragons_df['bodyLength']))  # type of a single column
print(type(dragons_df.iloc[:, 0:5]))  # type of a slice of several columns
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>

3.3.1 Type Conversion

Just like with vanilla Python type conversion functions (list(), str(), int(), etc.), we can directly convert any existing array-like object to a Series or a DataFrame by using corresponding constructors: pd.Series() and pd.DataFrame().

lst = [1, 2, 3, 'hello world']
tup = (1, 2, 3, 'hello world')

ser = pd.Series(lst)  # convert list to a pandas Series object
print(type(ser))
print(ser)

df = pd.DataFrame(ser)  # convert Series to a pandas DataFrame object
print(type(df))
print(df)

<class 'pandas.core.series.Series'>
0              1
1              2
2              3
3    hello world
dtype: object
<class 'pandas.core.frame.DataFrame'>
             0
0            1
1            2
2            3
3  hello world

As you can see in the output above, pandas automatically assigns integer indices (starting from 0) if you do not provide them. This happens for the 1D Series as well as the 2D DataFrame. However, you can also assign them manually, which we'll see in the next section.

3.4 Making pandas Objects from Scratch

We now know how to view and subset preformatted tabular data that we loaded! But what about when we need to create a Series or a DataFrame from scratch in Python? We have options, and I’ll explain what I think are the most useful ones below.

In some cases, the input format will by nature already furnish column header labels. However, in other cases, you will have to manually rename the column header labels from the indices they default to, and the same is true for rows.

  • To set column names:
    
    pd.DataFrame(data, columns = ['colname_1', 'colname_2', ..., 'colname_n'])
    
  • To set row (index) names:
    
    pd.DataFrame(data, index = ['rowname_1', 'rowname_2', ..., 'rowname_n'])
    

3.4.1 Dictionary → DataFrame

my_exotic_fruits = {
    'fruit': ['cassabanana', 'rambutan', 'mangosteen', 'kiwano', 'atemoya'], 
    'outside_color': ['burgundy', 'red', 'purple', 'yellow', 'green'],
    'inside_color': ['yellow', 'white', 'white', 'green', 'white']}
             
print(pd.DataFrame(my_exotic_fruits,
                   index = ['fruit0', 'fruit1', 'fruit2', 'fruit3', 'fruit4']))
              fruit outside_color inside_color
fruit0  cassabanana      burgundy       yellow
fruit1     rambutan           red        white
fruit2   mangosteen        purple        white
fruit3       kiwano        yellow        green
fruit4      atemoya         green        white

3.4.2 Lists of Lists → DataFrame

my_exotic_fruits = [['cassabanana', 'burgundy', 'yellow'], 
             ['rambutan', 'red', 'white'], 
             ['mangosteen', 'purple', 'white'], 
             ['kiwano', 'yellow', 'green'],
             ['atemoya', 'green', 'white']]  # each nested list is a row

no_labels_df = pd.DataFrame(my_exotic_fruits)
all_labels_df = pd.DataFrame(my_exotic_fruits, 
                       columns = ['fruit', 'outside_color', 'inside_color'],
                       index = ['fruit0', 
                                'fruit1',
                                'fruit2', 
                                'fruit3', 
                                'fruit4'])

print(no_labels_df)  # defaults to indices (0, 1, 2, ..., n)
print(all_labels_df)  # aligns with supplied column and row labels
             0         1       2
0  cassabanana  burgundy  yellow
1     rambutan       red   white
2   mangosteen    purple   white
3       kiwano    yellow   green
4      atemoya     green   white
              fruit outside_color inside_color
fruit0  cassabanana      burgundy       yellow
fruit1     rambutan           red        white
fruit2   mangosteen        purple        white
fruit3       kiwano        yellow        green
fruit4      atemoya         green        white

3.4.3 List of Dictionaries → DataFrame

my_exotic_fruits = [{'fruit': 'cassabanana', 
                     'outside_color': 'burgundy', 
                     'inside_color': 'yellow'},
                    {'fruit': 'rambutan', 
                     'outside_color': 'red', 
                     'inside_color': 'white'},
                    {'fruit': 'mangosteen', 
                     'outside_color': 'purple', 
                     'inside_color': 'white'},
                    {'fruit': 'kiwano', 
                     'outside_color': 'yellow', 
                     'inside_color': 'green'},
                    {'fruit': 'atemoya', 
                     'outside_color': 'green', 
                     'inside_color': 'white'}]  
                     # each nested dictionary is a row (with column headers)

print(pd.DataFrame(my_exotic_fruits,
                   index = ['fruit0', 'fruit1', 'fruit2', 'fruit3', 'fruit4']))
              fruit outside_color inside_color
fruit0  cassabanana      burgundy       yellow
fruit1     rambutan           red        white
fruit2   mangosteen        purple        white
fruit3       kiwano        yellow        green
fruit4      atemoya         green        white

4. Data Cleaning

4.1 Setting Data

We’ve talked about pulling out columns and rows, but what about adding them in or replacing existing values? In both cases, you just (re)assign a value to that table cell or group of cells.

4.1.1 Adding Data

Let’s say that there’s a Goldilocks effect where dragons with a body length closer to 190 feet are nicer than those which are either too long or too short. If we assign this formula to a new column label, voilà, the new column is added for us!

print(dragons_df)
dragons_df['friendliness'] = (190 - abs(dragons_df['bodyLength'] - 190))/190
print(dragons_df)

     testScore  bodyLength mountainRange   color        diet  breathesFire
0     0.000000  175.512214      Bavarian    Blue   Carnivore             1
1     0.742914  190.640959      Bavarian    Blue   Carnivore             1
2     2.501825  169.708778      Bavarian    Blue   Carnivore             1
3     3.380430  188.847204      Bavarian    Blue   Carnivore             1
4     4.582095  174.221700      Bavarian    Blue   Carnivore             0
..         ...         ...           ...     ...         ...           ...
475  41.806995  186.132578      Southern  Yellow  Vegetarian             0
476  42.650015  192.272043      Southern  Yellow  Vegetarian             0
477  44.582490  184.038015      Southern  Yellow  Vegetarian             1
478  47.795914  189.814408      Southern  Yellow  Vegetarian             0
479  55.486722  189.661989      Southern  Yellow   Carnivore             0

[480 rows x 6 columns]
     testScore  bodyLength  ... breathesFire friendliness
0     0.000000  175.512214  ...            1     0.923748
1     0.742914  190.640959  ...            1     0.996627
2     2.501825  169.708778  ...            1     0.893204
3     3.380430  188.847204  ...            1     0.993933
4     4.582095  174.221700  ...            0     0.916956
..         ...         ...  ...          ...          ...
475  41.806995  186.132578  ...            0     0.979645
476  42.650015  192.272043  ...            0     0.988042
477  44.582490  184.038015  ...            1     0.968621
478  47.795914  189.814408  ...            0     0.999023
479  55.486722  189.661989  ...            0     0.998221

[480 rows x 7 columns]

4.1.2 Replacing Data

Uh oh…more research has shown that dragons with a body length of 180 feet are actually the most friendly—looks like we messed up! If we change the formula for the Goldilocks number but still assign it to the same column label, we get our correct values.

print(dragons_df)
dragons_df['friendliness'] = (180 - abs(dragons_df['bodyLength'] - 180))/180
print(dragons_df)
     testScore  bodyLength  ... breathesFire friendliness
0     0.000000  175.512214  ...            1     0.923748
1     0.742914  190.640959  ...            1     0.996627
2     2.501825  169.708778  ...            1     0.893204
3     3.380430  188.847204  ...            1     0.993933
4     4.582095  174.221700  ...            0     0.916956
..         ...         ...  ...          ...          ...
475  41.806995  186.132578  ...            0     0.979645
476  42.650015  192.272043  ...            0     0.988042
477  44.582490  184.038015  ...            1     0.968621
478  47.795914  189.814408  ...            0     0.999023
479  55.486722  189.661989  ...            0     0.998221

[480 rows x 7 columns]
     testScore  bodyLength  ... breathesFire friendliness
0     0.000000  175.512214  ...            1     0.975068
1     0.742914  190.640959  ...            1     0.940884
2     2.501825  169.708778  ...            1     0.942827
3     3.380430  188.847204  ...            1     0.950849
4     4.582095  174.221700  ...            0     0.967898
..         ...         ...  ...          ...          ...
475  41.806995  186.132578  ...            0     0.965930
476  42.650015  192.272043  ...            0     0.931822
477  44.582490  184.038015  ...            1     0.977567
478  47.795914  189.814408  ...            0     0.945476
479  55.486722  189.661989  ...            0     0.946322

[480 rows x 7 columns]

4.1.3 Renaming Columns

We will very often need to rename columns to enhance readability. To do this, we use the following method:

  • df.rename(columns={'old_name': 'new_name'})
dragons_df = dragons_df.rename(columns={'friendliness': 'passiveness'})
print(dragons_df)
     testScore  bodyLength mountainRange  ...        diet breathesFire  passiveness
0     0.000000  175.512214      Bavarian  ...   Carnivore            1     0.975068
1     0.742914  190.640959      Bavarian  ...   Carnivore            1     0.940884
2     2.501825  169.708778      Bavarian  ...   Carnivore            1     0.942827
3     3.380430  188.847204      Bavarian  ...   Carnivore            1     0.950849
4     4.582095  174.221700      Bavarian  ...   Carnivore            0     0.967898
..         ...         ...           ...  ...         ...          ...          ...
475  41.806995  186.132578      Southern  ...  Vegetarian            0     0.965930
476  42.650015  192.272043      Southern  ...  Vegetarian            0     0.931822
477  44.582490  184.038015      Southern  ...  Vegetarian            1     0.977567
478  47.795914  189.814408      Southern  ...  Vegetarian            0     0.945476
479  55.486722  189.661989      Southern  ...   Carnivore            0     0.946322

[480 rows x 7 columns]

4.2 Missing Data

Missing data is automatically represented as np.nan: numpy "Not a Number" (NaN) values. While R distinguishes between missing and empty data (the difference is quite nuanced) with NA and NULL, respectively, pandas collapses these two into one.

Let’s pull up that table of exotic fruits from a while ago and plug in a missing value to play around.

my_exotic_fruits_df = pd.DataFrame(my_exotic_fruits)
my_exotic_fruits_df.loc[0, 'fruit'] = np.nan

print(my_exotic_fruits_df)
        fruit outside_color inside_color
0         NaN      burgundy       yellow
1    rambutan           red        white
2  mangosteen        purple        white
3      kiwano        yellow        green
4     atemoya         green        white

4.2.1 Dropping Missing Data

A common problem in research is getting rid of bad or missing data. To quickly clear your dataset of such rows, use the following method: df.dropna(how='any')

print(my_exotic_fruits_df.dropna(how='any'))
        fruit outside_color inside_color
1    rambutan           red        white
2  mangosteen        purple        white
3      kiwano        yellow        green
4     atemoya         green        white
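For completeness, how='all' drops only rows in which every value is missing. The two settings can be compared on a standalone toy dataframe (values invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1, np.nan],
                   'b': [np.nan, 2, 3]})

print(df.dropna(how='any'))  # drops every row containing at least one NaN
print(df.dropna(how='all'))  # drops only the row in which all values are NaN
```

There is also a `subset` parameter for restricting the check to particular columns.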

4.2.2 Renaming Missing Data

Conversely, what if we wanted to keep our missing data but assign a different placeholder value? We can do this with the following method: df.fillna(value='foo') (by the way, 'foo' is a common placeholder you'll see)

print(my_exotic_fruits_df.fillna(value='unknown_fruit_data'))
                fruit outside_color inside_color
0  unknown_fruit_data      burgundy       yellow
1            rambutan           red        white
2          mangosteen        purple        white
3              kiwano        yellow        green
4             atemoya         green        white

4.2.3 Locating Missing Data

Another crucial task when dealing with missing data is finding where the holes are actually located. We have four options:

  • pd.isna(obj)
  • pd.isnull(obj)
  • pd.notna(obj)
  • pd.notnull(obj)
  • Note that these are two functionally identical pairs, with the vestigial syntactic difference carrying over from R

For each of these functions, Python will return what’s called a “boolean mask,” a version of the dataframe or object we inputted that has all its values masked to reflect whether the value is missing or not.

print(pd.isna(my_exotic_fruits_df))
print(pd.isnull(my_exotic_fruits_df))

print(pd.notna(my_exotic_fruits_df))
print(pd.notnull(my_exotic_fruits_df))
   fruit  outside_color  inside_color
0   True          False         False
1  False          False         False
2  False          False         False
3  False          False         False
4  False          False         False
   fruit  outside_color  inside_color
0   True          False         False
1  False          False         False
2  False          False         False
3  False          False         False
4  False          False         False
   fruit  outside_color  inside_color
0  False           True          True
1   True           True          True
2   True           True          True
3   True           True          True
4   True           True          True
   fruit  outside_color  inside_color
0  False           True          True
1   True           True          True
2   True           True          True
3   True           True          True
4   True           True          True
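One handy follow-up trick (shown here on a standalone toy dataframe, not the outputs above): since True sums as 1, chaining .sum() onto the boolean mask tallies the missing values in each column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'fruit': [np.nan, 'rambutan', 'mangosteen'],
                   'outside_color': ['burgundy', 'red', np.nan]})

# Sum each column of the boolean mask: each True counts as 1
missing_per_column = df.isna().sum()
print(missing_per_column)
```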

5. Statistical Calculations

5.1 Descriptives

pandas offers a relatively comprehensive suite of functions for pulling together descriptive statistics:

  • .sum()
  • .count()
  • .max()
  • .min()
  • .median()
  • .quantile([0.25, 0.75])
  • .mean()
  • .var()
  • .std()
  • .value_counts()
  • .nunique()

Most of the methods above return a single value (.quantile() and .value_counts() return a Series). Alternatively, you can run df.describe() to get a fuller set of descriptives in Series format.

bodl = dragons_df['bodyLength']

print(bodl.sum())
print(bodl.count())
print(bodl.max())
print(bodl.min())
print(bodl.median())
print(bodl.quantile([0.25, 0.75]))
print(bodl.mean())
print(bodl.var())
print(bodl.std())
print('\n')

print(bodl.value_counts())  # frequency table
print('\n')

print(bodl.nunique())  # count of total unique values
print('\n')

print(bodl.describe())  # full(er) descriptives table
96631.9015289
480
236.36245290000002
162.3266403
202.91298255
0.25    191.762141
0.75    213.136002
Name: bodyLength, dtype: float64
201.3164615185418
262.83612105632886
16.212221348610093


195.131278    1
187.226044    1
198.559975    1
204.092492    1
170.686850    1
             ..
208.055977    1
234.915029    1
172.626085    1
200.156521    1
209.373511    1
Name: bodyLength, Length: 480, dtype: int64


480


count    480.000000
mean     201.316462
std       16.212221
min      162.326640
25%      191.762141
50%      202.912983
75%      213.136002
max      236.362453
Name: bodyLength, dtype: float64

5.2 Custom Functions

Depending on the complexity of your data manipulations, you may very well like to apply far more intricate functions onto your dataframe columns. For this, we have the .apply() method.

While this method might initially appear to be a panacea, it should never be your first choice because it is fundamentally non-vectorized. What this means is that whatever function you apply to your dataframe column will be run on each row iteratively: in other words, highly inefficiently.

dragons_df['diet'] = dragons_df['diet'].apply(lambda diet : 'Pesce-Pollotarian' 
                                              if diet == 'Omnivore' else 'Vegan'
                                              )
print(dragons_df['diet'].value_counts()) 
Vegan                313
Pesce-Pollotarian    167
Name: diet, dtype: int64
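For a simple relabeling like the one above, a vectorized alternative such as .map() typically does the same job much faster than .apply(). Here is a sketch on a standalone toy column (values invented for illustration):

```python
import pandas as pd

diet = pd.Series(['Omnivore', 'Carnivore', 'Omnivore', 'Vegetarian'])

# .map() with a dictionary, plus .fillna() as the "else" branch,
# replaces the row-by-row lambda from the .apply() example
relabeled = diet.map({'Omnivore': 'Pesce-Pollotarian'}).fillna('Vegan')
print(relabeled.value_counts())
```

As a rule of thumb, reach for built-in vectorized methods (.map(), arithmetic on whole columns, np.where()) first, and fall back to .apply() only when no vectorized equivalent exists.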

6. Data Transformations

We’re now going to talk about really awesome ways you can quickly combine and pivot your data!

6.1 Combining Data

When you’re looking to add data to an existing dataframe, you can choose to append either more rows or more columns. The method is the same for both transformations; all you need to do is specify the axis of interest:

  • Add more rows:
  • pd.concat([df1,df2])
  • Add more columns:
  • pd.concat([df1,df2], axis=1)

The axis parameter defaults to axis=0, so it is unnecessary to declare it unless you're concatenating along the other axis (sideways, i.e., adding columns). Note that the concatenation function requires the input dataframes to be enclosed in an array-like object.

my_new_exotic_fruits_df = pd.DataFrame(
    {'fruit': 
     ['ice cream bean', 'caviar lime', 'granadilla', 'tamarillo', 'sapodilla'],
     'outside_color': 
     ['green', 'green', 'yellow', 'red', 'tan'],
      'inside_color': 
     ['white', 'clear', 'grey', 'orange', 'tan']
    }
)

print(pd.concat([my_exotic_fruits_df, my_new_exotic_fruits_df], 
                ignore_index=True))  
# if we set ignore_index to True, the index incrementer does not get all wonky

print(pd.concat([my_exotic_fruits_df, my_new_exotic_fruits_df], axis=1, 
                ignore_index=True))  
# with axis=1, ignore_index=True instead resets the column labels to 0, 1, 2, ...

            fruit outside_color inside_color
0             NaN      burgundy       yellow
1        rambutan           red        white
2      mangosteen        purple        white
3          kiwano        yellow        green
4         atemoya         green        white
5  ice cream bean         green        white
6     caviar lime         green        clear
7      granadilla        yellow         grey
8       tamarillo           red       orange
9       sapodilla           tan          tan
            0         1       2               3       4       5
0         NaN  burgundy  yellow  ice cream bean   green   white
1    rambutan       red   white     caviar lime   green   clear
2  mangosteen    purple   white      granadilla  yellow    grey
3      kiwano    yellow   green       tamarillo     red  orange
4     atemoya     green   white       sapodilla     tan     tan

6.2 Pivoting Data

To pivot data long to wide (increase columns, decrease rows):

  • df.pivot(index='idx', columns='var', values='val')

To pivot data wide to long (increase rows, decrease columns):

  • pd.melt(df)

  • Note that these are inverse operations, but they will not necessarily cancel each other out

my_exotic_fruits_df['rarity'] = \
  ['rare', 'common', 'uncommon', 'uncommon', 'rare']
print(my_exotic_fruits_df)

my_exotic_fruits_pvt_df = my_exotic_fruits_df.pivot(
    index='rarity', columns='fruit', values='inside_color')
print(my_exotic_fruits_pvt_df)

my_exotic_fruits_pvt_df = pd.melt(
    my_exotic_fruits_pvt_df)
print(my_exotic_fruits_pvt_df)

6.3 Stacking Data

We can also compress a dataframe's columns into a deeper row index by using df.stack(), which pivots the column labels into the innermost level of the row labels. This is reversible by df.unstack().

print(my_exotic_fruits_df.stack())
print(my_exotic_fruits_df.stack().unstack())

6.4 Indices

Okay, I’ve admittedly beaten around the bush for this one: what even are the row labels, the so-called “indices?”

They are immutable objects that store the axis labels of a Series or DataFrame.
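To make the immutability concrete, here is a tiny self-contained sketch (a toy dataframe, not dragons_df) showing that element assignment on an index fails:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
print(df.index)  # the default RangeIndex

# Indices are immutable: assigning to an element raises a TypeError
try:
    df.index[0] = 99
    mutated = True
except TypeError:
    mutated = False
print('index mutated?', mutated)
```

To change row labels you instead replace the whole index, e.g. with df.set_index() or df.reset_index() as shown in the next subsection.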

6.4.1 (Re)setting Indices

print(my_exotic_fruits_df)
my_exotic_fruits_df = my_exotic_fruits_df.set_index('fruit')
print(my_exotic_fruits_df)  # new set of indices along which to reference

my_exotic_fruits_df = my_exotic_fruits_df.reset_index()
print(my_exotic_fruits_df)  # indices returned to default

6.4.2 Multi-Indexing

It’s actually also possible to have a multi-index since you can set an index to be pairs of data rather than single values.

index = pd.MultiIndex.from_tuples([('exotic', 'tropical'),
                                   ('exotic', 'desert'),
                                   ('non-exotic', 'deciduous')],
                                  names=['n','v'])

print(index)
MultiIndex([(    'exotic',  'tropical'),
            (    'exotic',    'desert'),
            ('non-exotic', 'deciduous')],
           names=['n', 'v'])
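To see a multi-index in action, we can attach the one above to a small Series (the counts are invented for illustration) and then select whole groups via .loc:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('exotic', 'tropical'),
                                   ('exotic', 'desert'),
                                   ('non-exotic', 'deciduous')],
                                  names=['n', 'v'])

counts = pd.Series([5, 2, 7], index=index)
print(counts)

# Selecting on the outer level returns all matching inner rows;
# a full tuple pins down a single element
print(counts.loc['exotic'])
print(counts.loc[('exotic', 'desert')])
```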

6.5 Merging Data

While we’ve already covered methods for combining data, we haven’t yet addressed a more nuanced alternative to forcing data on top of or next to each other.

This is where the adaptable pd.merge() method comes into the picture:

  • pd.merge(df1, df2, how='direction', on='colname')
print(pd.merge(my_exotic_fruits_df, my_new_exotic_fruits_df,
         how='left', on='fruit'))  # Keep all rows of the first df

print(pd.merge(my_exotic_fruits_df, my_new_exotic_fruits_df,
         how='right', on='fruit'))  # Keep all rows of the second df

print(pd.merge(my_exotic_fruits_df, my_new_exotic_fruits_df,
         how='inner', on='fruit'))  # Keep only rows whose key appears in both

print(pd.merge(my_exotic_fruits_df, my_new_exotic_fruits_df,
         how='outer', on='fruit'))  # Keep all rows from both dfs
        fruit outside_color_x inside_color_x outside_color_y inside_color_y
0         NaN        burgundy         yellow             NaN            NaN
1    rambutan             red          white             NaN            NaN
2  mangosteen          purple          white             NaN            NaN
3      kiwano          yellow          green             NaN            NaN
4     atemoya           green          white             NaN            NaN
            fruit outside_color_x inside_color_x outside_color_y inside_color_y
0  ice cream bean             NaN            NaN           green          white
1     caviar lime             NaN            NaN           green          clear
2      granadilla             NaN            NaN          yellow           grey
3       tamarillo             NaN            NaN             red         orange
4       sapodilla             NaN            NaN             tan            tan
Empty DataFrame
Columns: [fruit, outside_color_x, inside_color_x, outside_color_y, inside_color_y]
Index: []
            fruit outside_color_x inside_color_x outside_color_y inside_color_y
0             NaN        burgundy         yellow             NaN            NaN
1        rambutan             red          white             NaN            NaN
2      mangosteen          purple          white             NaN            NaN
3          kiwano          yellow          green             NaN            NaN
4         atemoya           green          white             NaN            NaN
5  ice cream bean             NaN            NaN           green          white
6     caviar lime             NaN            NaN           green          clear
7      granadilla             NaN            NaN          yellow           grey
8       tamarillo             NaN            NaN             red         orange
9       sapodilla             NaN            NaN             tan            tan
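If you need to audit where each merged row came from, pd.merge() also accepts indicator=True, which appends a _merge column flagging each row as left_only, right_only, or both. A small sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'fruit': ['rambutan', 'kiwano'],
                     'color': ['red', 'yellow']})
right = pd.DataFrame({'fruit': ['kiwano', 'sapodilla'],
                      'taste': ['tart', 'sweet']})

# indicator=True adds a '_merge' column recording each row's provenance
merged = pd.merge(left, right, how='outer', on='fruit', indicator=True)
print(merged)
```

This is especially handy for sanity-checking an outer merge, where unmatched keys from either side survive with NaN fill.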

6.6 Pipelines

A common occurrence in tidyverse-powered R scripts is the useful magrittr pipeline symbol, %>%. Thankfully, this functionality can be approximated in Python by chaining methods, wrapping the whole expression in parentheses so that each successive call can sit on its own line:

df = (pd.melt(df)
        .rename(columns={'variable': 'var',
                         'value': 'val'})
        .rename(columns={'var': 'var2',
                         'val': 'val2'}))
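pandas also offers df.pipe(), which is arguably the closest analogue to %>%: it passes the DataFrame as the first argument to an arbitrary function, so your own helper steps can slot into a method chain. A minimal sketch with a hypothetical drop_rare helper:

```python
import pandas as pd

def drop_rare(df, col='rarity'):
    """Hypothetical helper: keep only non-rare rows."""
    return df[df[col] != 'rare']

df = pd.DataFrame({'fruit': ['rambutan', 'atemoya', 'kiwano'],
                   'rarity': ['rare', 'rare', 'uncommon']})

result = (df.pipe(drop_rare)                     # custom step in the chain
            .rename(columns={'fruit': 'name'}))  # followed by built-in methods
print(result)
```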

7. Conclusion and Additional Resources

Phew—that was a lot! As I mentioned previously, this tutorial only goes so far in covering the full spectrum of functionality afforded by pandas. Once you’re familiar with the above techniques, you can certainly dive deeper into the documentation at your leisure.

Below, I’ve included links to several resources that I’ve found helpful and referenced for this walkthrough: