Notes on the Book "Effective Pandas"

2022-03-07 2023-05-30

The book I introduced in the last post has turned out to be so fantastic that I have decided to take notes on it. This is a living document; I started from Chapter 21 and am working forward from there. I will return to the earlier chapters if I still have the time and energy. The author recently appeared on the Real Python Podcast episode "Becoming More Effective at Manipulating Data with Pandas" and talked about the book.

Table of Contents

1. Introduction
2. Installation
3. Data Structures
4. Series Introduction
5. Series Deep Dive
6. Operators (& Dunder Methods)
7. Aggregate Methods
8. Conversion Methods
9. Manipulation Methods
10. Indexing Operations
11. String Manipulation
12. Date and Time Manipulation
13. Plotting with a Series
14. Dates in the Index
15. Categorical Manipulation
16. Dataframes
17. Similarities with Series and DataFrame
18. Math Methods in DataFrames
19. Looping and Aggregation
20. Column Types, .assign, and Memory Usage
21. Creating and Updating Columns
22. Dealing with Missing and Duplicated Data
23. Sorting Columns and Indexes
24. Filtering and Indexing Operations
25. Plotting with Dataframes
26. Reshaping Dataframes with Dummies
27. Reshaping by Pivoting and Grouping
28. More Aggregations
29. Cross-tabulation Deep Dive
30. Melting, Transposing, and Stacking Data
31. Working with Time Series
32. Joining Dataframes
33. Exporting Data
34. Styling Dataframes
35. Debugging Pandas
36. Summary

Introduction

Installation

Data Structures

Series Introduction

Series Deep Dive

Operators (& Dunder Methods)

Aggregate Methods

Conversion Methods

Manipulation Methods

Indexing Operations

String Manipulation

Date and Time Manipulation

Plotting with a Series

Dates in the Index

Categorical Manipulation

Dataframes

Similarities with `Series` and `DataFrame`

Math Methods in DataFrames

Looping and Aggregation

Column Types, `.assign`, and Memory Usage

Creating and Updating Columns

p. 179. I think the 8th row of the code .replace({'Yes' : True, 'No' : False, np.nan : False}) could be made more concise like .eq('Yes').
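
To check this suggestion, here is a minimal sketch on a made-up series (the values are my assumption, not the book's data). It compares the book's .replace call with .eq('Yes'); a missing value compares as False under .eq, so the np.nan entry is not needed.

```python
import numpy as np
import pandas as pd

# Hypothetical answers, including a missing value (object dtype on purpose)
answers = pd.Series(['Yes', 'No', np.nan, 'Yes'], dtype=object)

# The book's approach: map each possible value explicitly
via_replace = answers.replace({'Yes': True, 'No': False, np.nan: False})

# The shorter alternative: NaN == 'Yes' evaluates to False
via_eq = answers.eq('Yes')

print(via_eq.tolist())
```

One difference to keep in mind: .eq('Yes') turns every value other than 'Yes' into False, while .replace leaves values it does not list untouched.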

Dealing with Missing and Duplicated Data

Sorting Columns and Indexes

p. 194. This is the code in the book for sorting a list of American presidents by their last names.

>>> (pres
 .sort_values(
     by='President',
     key=lambda name_ser: name_ser
          .str.split()
          .apply(lambda val: val[-1]))
)

   Seq  President          Party                  ...
2    2  John Adams         Federalist             ...
6    6  John Quincy Adams  Democratic-Republican  ...

The book uses an apply method with a second lambda function, as shown in the above example. Indexing the split result directly with [-1] does not work, because [-1] is then applied to the series as a whole rather than to each list of name parts.

>>> (pres
 .sort_values(
     by='President',
     key=lambda name_ser: name_ser
          .str.split()[-1])
)

ValueError
    ....
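
One possible way around this, which is my own sketch rather than the book's code: the .str accessor also supports indexing, so .str[-1] picks the last element of each list and a single lambda suffices. The tiny pres_demo frame below is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the book's pres dataframe
pres_demo = pd.DataFrame({
    'President': ['George Washington', 'John Quincy Adams',
                  'Thomas Jefferson', 'John Adams'],
})

# .str.split() yields a list per row; .str[-1] takes the last item of each list
by_last_name = pres_demo.sort_values(
    by='President',
    key=lambda name_ser: name_ser.str.split().str[-1],
)

print(by_last_name)
```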

Filtering and Indexing Operations

pp. 199-200. I don't see the point of first setting the index to the 'President' column and then resetting it, instead of directly resetting it.

(pres
 .set_index('President')
 .reset_index()
)

p. 201. Note. Now I know why '&' in df.loc[...] sometimes requires parentheses around its operands! The book says that "the & operator has higher precedence than >=" and the like. As a result, Python interprets the double condition pres.Average_rank < 10 & pres.Party == 'Republican' as pres.Average_rank < (10 & pres.Party) == 'Republican', when what you meant was (pres.Average_rank < 10) & (pres.Party == 'Republican'). The author recommends that "you should always put parentheses around multiple conditions in index operations." I will keep that in mind.
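
A minimal sketch of the difference, using a made-up pres_demo frame (its values are my assumption). The parenthesized mask works, while the unparenthesized version raises an error; the exact exception type may vary by pandas version, so the sketch catches any exception.

```python
import pandas as pd

# Made-up stand-in for the book's pres dataframe
pres_demo = pd.DataFrame({
    'Average_rank': [3, 25, 8],
    'Party': ['Republican', 'Republican', 'Democratic'],
})

# With parentheses, each comparison is evaluated before &
mask = (pres_demo.Average_rank < 10) & (pres_demo.Party == 'Republican')
print(pres_demo.loc[mask])

# Without parentheses, 10 & pres_demo.Party is evaluated first and fails
try:
    pres_demo.loc[pres_demo.Average_rank < 10 & pres_demo.Party == 'Republican']
    unparenthesized_failed = False
except Exception:
    unparenthesized_failed = True

print(unparenthesized_failed)
```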

p. 201. The use of the prefix '@' in the query method was totally new to me. It appears to let the query string refer to variables defined outside it.

>>> Roosevelts = ['Theodore Roosevelt', 'Franklin D. Roosevelt']
>>> (pres
 .query('President.isin(@Roosevelts)')
 [['President','Party']]
)

    President              Party
25  Theodore Roosevelt     Republican
31  Franklin D. Roosevelt  Democratic

Plotting with Dataframes

p. 224

(pres
 .set_index('President')
 .loc[:,'Background':'Average_rank']
 .iloc[:9] # This is synonymous with .head(9)
 .T
)

Reshaping Dataframes with Dummies

pp. 231-2. The line .filter( like = r'jb.role.*t' ) in the code doesn't seem to make sense, because like performs a plain substring match rather than a regex match. I suppose that it should be .filter( like = 'jb.role'). The r prefix is harmless but unnecessary.
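
To illustrate how the like and regex parameters of .filter differ, here is a small sketch. The column names are hypothetical and not those of the book's dataframe.

```python
import pandas as pd

# Hypothetical column names, chosen only for illustration
demo = pd.DataFrame(columns=['job.role.Analyst', 'job.role.CIO', 'age'])

# like= keeps columns whose name contains the given text literally
substring_cols = list(demo.filter(like='job.role').columns)

# A pattern passed to like= is also taken literally, so nothing matches
literal_cols = list(demo.filter(like=r'job.role.*t').columns)

# regex= treats the same kind of pattern as a regular expression
regex_cols = list(demo.filter(regex=r'job\.role.*t').columns)

print(substring_cols, literal_cols, regex_cols)
```

So if a pattern such as '.*t' is really intended, regex= would be the parameter to use, and there the r prefix is worth keeping.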

I don't believe I have grasped this chapter well.

Reshaping by Pivoting and Grouping

p. 238. jb2 as defined on p. 181 does not work here; the age column raises a TypeError, seemingly because its dtype is Int64. Converting it to float resolves the issue.

>>> (jb2
 # .astype({'age' : float})  This line is necessary
  .pivot_table(index='country_live',columns='employment_status',values='age',aggfunc='mean')
)

TypeError: Int64

pd.crosstab(
    index=jb2.country_live,
    columns=jb2.employment_status,
    values=jb2
        .age
        .astype(float), # So is this line.
    aggfunc="mean",
)

Interestingly, the groupby method doesn't require a conversion of the age column into float numbers.
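
A minimal sketch of the groupby route on made-up data (the rows below are my assumption, not jb2). It keeps age as Int64 and gets the same kind of table by grouping on both columns and unstacking. Whether pivot_table raises the TypeError may depend on the pandas version, so the sketch only shows the groupby side.

```python
import pandas as pd

# Made-up stand-in for jb2 with a nullable Int64 age column
demo = pd.DataFrame({
    'country_live': ['A', 'A', 'B', 'B'],
    'employment_status': ['full', 'part', 'full', 'full'],
    'age': pd.array([30, 40, 20, 30], dtype='Int64'),
})

# Group on both keys, average age, then move employment_status into columns
by_group = (demo
 .groupby(['country_live', 'employment_status'])
 .age
 .mean()
 .unstack()
)

print(by_group)
```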

p. 248. How useful pivot_table is! Somehow, though, the min and max columns come out in the reverse order.

>>> (jb2
 .pivot_table(
     index='country_live',
     values=['age', 'company_size'],
     aggfunc=({'age': ('min', 'max'), 'company_size': 'mean'})
    )
)

p. 249. This inconsistency in Pandas is annoying; mean requires quotation marks, while min and max work with or without them. (Unquoted min and max refer to Python's built-in functions; there is no built-in mean.)

(jb2
 .groupby('country_live')
 .agg(age_min=('age',min),
      age_max=('age','max'),
      team_size_mean=('team_size','mean')
     )
)
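
The difference can be checked with a short sketch: min and max are Python built-in names, so writing them without quotes passes a function object, while a bare mean is not defined at all. The demo frame is made up, and quoting all three names is simply my preference for consistency.

```python
import builtins
import pandas as pd

# min and max exist as built-in names, so they can be written without quotes
print(hasattr(builtins, 'min'), hasattr(builtins, 'max'))

# mean does not, so a bare `mean` would raise a NameError
print(hasattr(builtins, 'mean'))

# Made-up data; string names work uniformly for all three aggregations
demo = pd.DataFrame({'country_live': ['A', 'A', 'B'], 'age': [30, 40, 20]})
summary = (demo
 .groupby('country_live')
 .agg(age_min=('age', 'min'),
      age_max=('age', 'max'),
      age_mean=('age', 'mean'))
)

print(summary)
```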

pp. 252-253. It's wonderful that pivot_table and groupby accept not only column names but also functions.

def even_grouper(idx):
    return 'not multiples of 3' if idx % 3 else 'multiples of 3'

jb2.pivot_table(index=even_grouper,aggfunc='size')
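
The same idea works with groupby. In this sketch on a made-up frame, the function receives each index label and its return value becomes the group key; the name multiple_of_3_grouper is mine, not the book's.

```python
import pandas as pd

def multiple_of_3_grouper(idx):
    # idx is an index label; idx % 3 is 0 (falsy) for multiples of 3
    return 'not multiples of 3' if idx % 3 else 'multiples of 3'

# Made-up data with the default index 0..6
demo = pd.DataFrame({'age': [21, 34, 45, 29, 52, 38, 41]})

# Passing a function to groupby groups the rows by the function's result
counts = demo.groupby(multiple_of_3_grouper).size()

print(counts)
```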

More Aggregations

The .transform method, which is new to me, assigns aggregated information of groups such as size and mean to each member.
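
A small sketch of the contrast between an ordinary aggregation and .transform, on made-up data: the aggregation returns one row per group, while .transform returns one value per original row, so it can be assigned straight back as a column.

```python
import pandas as pd

# Made-up data for illustration
demo = pd.DataFrame({
    'country_live': ['A', 'A', 'B'],
    'age': [30, 40, 20],
})

# Ordinary aggregation: one value per group
agg_mean = demo.groupby('country_live').age.mean()

# transform: the group's mean repeated for every member of the group
row_mean = demo.groupby('country_live').age.transform('mean')

with_mean = demo.assign(country_age_mean=row_mean)
print(with_mean)
```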

#28.4 Exercises

#1
(jb2
 .assign(pvmsum=(jb2
                 .groupby('python3_version_most')
                 .age
                 .transform('sum')
                )
        )
)

#2
(jb2
 .groupby('country_live')
 .filter(lambda g:g.age.size>=3)
)