Notes on the Book "Effective Pandas"

Book: Effective Pandas

Harrison, M. (2021) Effective Pandas: Patterns for Data Manipulation. Independently published. Pandas is a Python library for data analysis and visualization, …

The book I introduced in the last post has turned out to be so good that I have decided to take notes on it. This is a living document; I started from Chapter 21 and am working forward from there. I will return to the earlier chapters if I still have the time and energy. The author recently appeared on the Real Python Podcast: Becoming More Effective at Manipulating Data with Pandas and talked about the book.

Introduction

Installation

Data Structures

Series Introduction

Series Deep Dive

Operators (& Dunder Methods)

Aggregate Methods

Conversion Methods

Manipulation Methods

Indexing Operations

String Manipulation

Date and Time Manipulation

Plotting with a Series

Dates in the Index

Categorical Manipulation

Dataframes

Similarities with Series and DataFrame

Math Methods in DataFrames

Looping and Aggregation

Column Types, .assign, and Memory Usage

Creating and Updating Columns

p. 179. I think the 8th line of the code, .replace({'Yes': True, 'No': False, np.nan: False}), could be written more concisely as .eq('Yes').
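Here is a minimal sketch of that comparison, assuming the column holds only 'Yes', 'No', and missing values. The answers Series is made up for illustration and is not the book's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a Yes/No survey column with a missing answer
answers = pd.Series(['Yes', 'No', np.nan, 'Yes'], dtype=object)

# The mapping used in the book
via_replace = answers.replace({'Yes': True, 'No': False, np.nan: False})

# The shorter form: True only where the value equals 'Yes'
via_eq = answers.eq('Yes')

# Cast to bool before comparing, since .replace may keep the object dtype
same = via_replace.astype(bool).equals(via_eq)
print(same)
```

The two forms agree only under that assumption: .eq('Yes') turns every other value into False, whereas .replace leaves values that are not in the mapping untouched.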

Dealing with Missing and Duplicated Data

Sorting Columns and Indexes

p. 194. This is the code in the book that sorts a list of American presidents by their last names.

>>> (pres
 .sort_values(
     by='President',
     key=lambda name_ser: name_ser
          .str.split()
          .apply(lambda val: val[-1]))
)

   Seq  President          Party                  ...
2    2  John Adams         Federalist             ...
6    6  John Quincy Adams  Democratic-Republican  ...

The example above uses .apply with a second lambda function to pick the last word of each name. Indexing the split result directly with [-1] in a single lambda function does not work, because [-1] is then applied to the Series as a whole rather than to each list of words.

>>> (pres
 .sort_values(
     by='President',
     key=lambda name_ser: name_ser
          .str.split()[-1])
)

ValueError
    ....
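For what it's worth, the .str accessor also supports element indexing, so .str.split().str[-1] appears to do the same job in a single lambda function. This is a sketch of an alternative, not the book's code, and the pres dataframe below is a made-up miniature of the book's data.

```python
import pandas as pd

# Hypothetical miniature of the book's `pres` dataframe
pres = pd.DataFrame({
    'Seq': [1, 2, 6],
    'President': ['George Washington', 'John Adams', 'John Quincy Adams'],
})

by_last_name = pres.sort_values(
    by='President',
    # .str[-1] takes the last element of each list produced by .str.split()
    key=lambda name_ser: name_ser.str.split().str[-1],
)
print(by_last_name)
```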

Filtering and Indexing Operations

pp. 199-200. I don't see the point of first setting the 'President' column as the index and then resetting the index, instead of resetting it directly.

(pres
 .set_index('President')
 .reset_index()
)

p. 201. Note. Now I know why '&' in df.loc[...] sometimes requires parentheses around its operands! The book says that "the & operator has higher precedence than >=" and the like. As a result, Python parses the double condition pres.Average_rank < 10 & pres.Party == 'Republican' as pres.Average_rank < (10 & pres.Party) == 'Republican', when what you meant was (pres.Average_rank < 10) & (pres.Party == 'Republican'). The author recommends that "you should always put parentheses around multiple conditions in index operations." I will keep that in mind.
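A small sketch of the difference, using a made-up three-row pres dataframe (the values are invented, not the book's). The exact exception raised by the unparenthesized form may vary by pandas version, so the sketch only records its type.

```python
import pandas as pd

# Hypothetical miniature of the book's `pres` dataframe
pres = pd.DataFrame({
    'Average_rank': [3, 12, 7],
    'Party': ['Republican', 'Republican', 'Democratic'],
})

# Parenthesized: each comparison is evaluated first, then combined with &
mask = (pres.Average_rank < 10) & (pres.Party == 'Republican')
top_republicans = pres.loc[mask]

# Unparenthesized: Python evaluates 10 & pres.Party first, which fails
try:
    pres.Average_rank < 10 & pres.Party == 'Republican'
    error_name = None
except Exception as err:
    error_name = type(err).__name__

print(top_republicans)
print(error_name)
```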

p. 201. The '@' prefix in the query method was totally new to me. It appears to let the query string refer to variables defined outside it.

>>> Roosevelts = ['Theodore Roosevelt', 'Franklin D. Roosevelt']
>>> (pres
 .query('President.isin(@Roosevelts)')
 [['President','Party']]
)

    President              Party
25  Theodore Roosevelt     Republican
31  Franklin D. Roosevelt  Democratic
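As a quick check of how '@' behaves, the sketch below compares a query that references a local variable with the equivalent boolean mask. The dataframe and the max_rank variable are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the book's `pres` dataframe
pres = pd.DataFrame({
    'President': ['Theodore Roosevelt', 'Franklin D. Roosevelt', 'Warren G. Harding'],
    'Average_rank': [4, 2, 39],
})

max_rank = 10  # ordinary Python variable defined outside the query string

# '@max_rank' is looked up in the surrounding Python scope
via_query = pres.query('Average_rank < @max_rank')

# The same selection written as a boolean mask
via_mask = pres[pres.Average_rank < max_rank]

print(via_query.equals(via_mask))
```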

Plotting with Dataframes

p. 224

(pres
 .set_index('President')
 .loc[:,'Background':'Average_rank']
 .iloc[:9] # This is synonymous with .head(9)
 .T
)

Reshaping Dataframes with Dummies

pp. 231-2. The line .filter(like=r'jb.role.*t') in the code doesn't seem to make sense, since like matches a literal substring rather than a regular expression. I suppose that it should be .filter(like='jb.role'). The r prefix does no harm, but it is unnecessary.
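The sketch below illustrates how the like and regex parameters of .filter treat the same pattern. The column names are hypothetical and only chosen to resemble the pattern quoted above; they are not the book's actual columns.

```python
import pandas as pd

# Hypothetical column names, not the ones in the book's dataset
df = pd.DataFrame(
    [[1, 0, 30]],
    columns=['jb.role.analyst', 'jb.role.dba', 'age'],
)

# like= is a plain substring test, so 'jb.role' matches both role columns
like_plain = df.filter(like='jb.role').columns.tolist()

# like= with regex syntax looks for the literal text 'jb.role.*t'
like_pattern = df.filter(like=r'jb.role.*t').columns.tolist()

# regex= interprets the same pattern as a regular expression
regex_pattern = df.filter(regex=r'jb.role.*t').columns.tolist()

print(like_plain, like_pattern, regex_pattern)
```

If a pattern match was intended, regex= seems to be the parameter to use; with like=, the shorter literal 'jb.role' is enough.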

I don't believe I have grasped this chapter well.

Reshaping by Pivoting and Grouping

p. 238. jb2 as defined on p. 181 does not work here; the age column raises a TypeError, seemingly because its dtype is Int64. Converting it to float resolves the issue.

>>> (jb2
 # .astype({'age': float})  # This conversion is needed to avoid the error below.
 .pivot_table(index='country_live', columns='employment_status', values='age', aggfunc='mean')
)

TypeError: Int64

pd.crosstab(
    index=jb2.country_live,
    columns=jb2.employment_status,
    values=jb2
        .age
        .astype(float),  # So is this conversion.
    aggfunc="mean",
)

Interestingly, the groupby method doesn't require a conversion of the age column into float numbers.
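The workaround can be sketched on a tiny made-up jb2 with an Int64 age column (the rows are invented). Whether the TypeError appears without the conversion may depend on the pandas version, so the sketch always converts before pivoting and compares the result with groupby.

```python
import pandas as pd

# Hypothetical miniature of jb2 with a nullable-integer age column
jb2 = pd.DataFrame({
    'country_live': ['US', 'US', 'India', 'India'],
    'employment_status': ['Full-time', 'Full-time', 'Full-time', 'Part-time'],
    'age': pd.array([20, 30, 40, 50], dtype='Int64'),
})

# Convert age to float before pivoting, as in the note above
table = (jb2
 .astype({'age': float})
 .pivot_table(index='country_live', columns='employment_status',
              values='age', aggfunc='mean')
)

# groupby on the unconverted Int64 column, for comparison
by_groupby = jb2.groupby(['country_live', 'employment_status']).age.mean()

print(table)
print(by_groupby)
```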

p. 248. How useful pivot_table is! The order of min and max is somehow reversed.

>>> (jb2
 .pivot_table(
     index='country_live',
     values=['age', 'company_size'],
     aggfunc=({'age': ('min', 'max'), 'company_size': 'mean'})
    )
)

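p. 249. This inconsistency in Pandas is annoying; mean requires quotation marks, while min and max work with or without them. (The bare names min and max resolve to Python's built-in functions, and there is no built-in mean.)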

(jb2
 .groupby('country_live')
 .agg(age_min=('age',min),
      age_max=('age','max'),
      team_size_mean=('team_size','mean')
     )
)
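One way to sidestep the inconsistency is to pass every aggregation as a string. The sketch below does that with named aggregation on an invented three-row jb2; the column names follow the snippet above, and the values are made up.

```python
import pandas as pd

# Hypothetical miniature of jb2
jb2 = pd.DataFrame({
    'country_live': ['US', 'US', 'India'],
    'age': [20, 30, 40],
    'team_size': [5, 7, 3],
})

summary = (jb2
 .groupby('country_live')
 .agg(age_min=('age', 'min'),       # string names work for every aggregation
      age_max=('age', 'max'),
      team_size_mean=('team_size', 'mean')
     )
)
print(summary)
```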

pp. 252-253. It's wonderful that pivot_table and groupby accept not only column names but also functions.

def even_grouper(idx):
    return 'not multiples of 3' if idx % 3 else 'multiples of 3'

jb2.pivot_table(index=even_grouper,aggfunc='size')
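To see what such a function receives, here is a small sketch with groupby on an invented dataframe with the default index 0 to 6. The function is called once per index label and its return value becomes the group key. The name multiple_of_3_grouper is hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with the default integer index 0..6
df = pd.DataFrame({'value': range(7)})

def multiple_of_3_grouper(idx):
    # idx is a single index label; idx % 3 is 0 (falsy) for multiples of 3
    return 'not multiples of 3' if idx % 3 else 'multiples of 3'

# Count the rows that fall into each group
counts = df.groupby(multiple_of_3_grouper).size()
print(counts)
```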

More Aggregations

The .transform method, which is new to me, assigns aggregated information of groups such as size and mean to each member.
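A minimal sketch of that behavior on an invented three-row jb2: the group mean is computed per country and then broadcast back so that every row carries its own group's value.

```python
import pandas as pd

# Hypothetical miniature of jb2
jb2 = pd.DataFrame({
    'country_live': ['US', 'US', 'India'],
    'age': [20, 30, 40],
})

# .transform returns a Series aligned with the original rows
with_mean = jb2.assign(
    country_age_mean=jb2.groupby('country_live').age.transform('mean')
)
print(with_mean)
```

Unlike .agg, which returns one row per group, the result here has the same length as jb2, which is why it can be assigned as a new column.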

#28.4 Exercises

#1
(jb2
 .assign(pvmsum=(jb2
                 .groupby('python3_version_most')
                 .age
                 .transform('sum')
                )
        )
)

#2
(jb2
 .groupby('country_live')
 .filter(lambda g: g.age.size >= 3)
)

Cross-tabulation Deep Dive

Melting, Transposing, and Stacking Data

Working with Time Series

Joining Dataframes

Exporting Data

Styling Dataframes

Debugging Pandas

Summary
