Notes on the Book "Effective Pandas"
The book I introduced in the last post has turned out to be so fantastic that I have decided to make notes of it. This is a live document; I started from Chapter 21 and am going on from there. I will return to the earlier chapters if I still have time and power. The author recently appeared in Real Python Podcast: Becoming More Effective at Manipulating Data with Pandas and talked about the book.
- 1. Introduction
- 2. Installation
- 3. Data Structures
- 4. Seris Introduction
- 5. Series Deep Dive
- 6. Operators (& Dunder Methods)
- 7. Aggregate Methods
- 8. Conversion Methods
- 9. Manipulation Methods
- 10. Indexing Operations
- 11. String Manipulation
- 12. Date and Time Manipulation
- 13. Plotting with a Series
- 14. Dates in the Index
- 15. Categorical Manipulation
- 16. Dataframes
- 17. Similarities with Series and DataFrame
- 18. Math Methods in DataFrames
- 19. Looping and Aggregation
- 20. Columns Types, .assign, and Memory Usage
- 21. Creating and Updating Columns
- 22. Dealing with Missing and Duplicated Data
- 23. Sorting Columns and Indexes
- 24. Filtering and Indexing Operations
- 25. Plotting with Dataframes
- 26. Reshaping Dataframes with Dummies
- 27. Reshaping by Pivoting and Grouping
- 28. More Aggregations
- 29. Cross-tabulation Deep Dive
- 30. Melting, Transposing, and Stacking Data
- 31. Working with Time Series
- 32. Joining Dataframes
- 33. Exporting Data
- 34. Styling Dataframes
- 35. Debugging Pandas
- 36. Summary
Introduction
Installation
Data Structures
Seris Introduction
Series Deep Dive
Operators (& Dunder Methods)
Aggregate Methods
Conversion Methods
Manipulation Methods
Indexing Operations
String Manipulation
Date and Time Manipulation
Plotting with a Series
Dates in the Index
Categorical Manipulation
Dataframes
Similarities with Series and DataFrame
Math Methods in DataFrames
Looping and Aggregation
Columns Types, .assign, and Memory Usage
Creating and Updating Columns
p. 179. I think the 8th row of the code .replace({'Yes' : True, 'No' : False, np.nan : False}) could be made more concise like .eq('Yes').
Dealing with Missing and Duplicated Data
Sorting Columns and Indexes
p. 194. Ths is a code in the book to sort a list of American presidents according to ther last names.
>>> (pres
.sort_values(
by='President',
key=lambda name_ser: name_ser
.str.split()
.apply(lambda val: val[-1]))
)
Seq President Party ...
2 2 John Adams Federalist ...
6 6 John Quincy Adams Democratic-Republican ...
An apply method and a second lambda function are necessary as shown in the above example; you cannot split a series and access the last string by indexing [-1] in a single lambda function.
>>> (pres
.sort_values(
by='President',
key=lambda name_ser: name_ser
.str.split()[-1])
)
ValueError
....
Filtering and Indexing Operations
pp. 199-200. I don't see the point in first setting the index for the 'President' column and then resetting it, instead of directly resetting it.
(pres
.set_index('President')
.reset_index()
)
p. 201. Note. Now I know why '&' in df.loc[...] sometimes requires parentheses next to it! The book says that "the & operator has higher precedence than >=" and the likes and that as a result, Pandas interprets the double condition pres.Average_rank < 10 & pres.Party == 'Republican' as pres.Average_rank < (10 & pres.Party) == 'Republican' while you meant (pres.Average_rank < 10) & (pres.Party == 'Republican'), for example. The author recommends that "you should always put parentheses around multiple conditions in index operations." I will keep that in mind.
p. 201. The use of the prefix '@' in the query method was totally new to me. It looks to incorporate variables defined outside into the query string.
>>> Roosevelts = ['Theodore Roosevelt', 'Franklin D. Roosevelt']
>>> (pres
.query('President.isin(@Roosevelts)')
[['President','Party']]
)
President Party
25 Theodore Roosevelt Republican
31 Franklin D. Roosevelt Democratic
Plotting with Dataframes
p. 224
(pres
.set_index('President')
.loc[:,'Background':'Average_rank']
.iloc[:9] # This is synonymous with .head(9)
.T
)
Reshaping Dataframes with Dummies
pp. 231-2. The row .filter( like = r'jb.role.*t' ) in the codes doesn't seem to make sense. I suppose that it should be .filter( like = 'jb.role'). The r prefix does no harm while it is unnecessary.
I don't believe I have grasped this chapter well.
Reshaping by Pivoting and Grouping
p.238. jb2 defined in p. 181 does not work here; the age columns gives a TypeError seemingly because its dtype is Int64. Switching to a float resolves the issue.
>>> (jb2
# .astype({'age' : float}) This line is necessary
.pivot_table(index='country_live',columns='employment_status',values='age',aggfunc='mean')
)
TypeError: Int64
pd.crosstab(
index=jb2.country_live,
columns=jb2.employment_status,
values=jb2
.age
.astype(float), # So is this line.
aggfunc="mean",
)
Interestingly, the groupby method doesn't require a conversion of the age column into float numbers.
p. 248. How useful pivot_table is! The order of min and max is somehow reversed.
>>> (jb2
.pivot_table(
index='country_live',
values=['age', 'company_size'],
aggfunc=({'age': ('min', 'max'), 'company_size': 'mean'})
)
)
p. 249. This inconsistency of Pandas is annoying; mean requires quotation marks while min and max do with or without them.
(jb2
.groupby('country_live')
.agg(age_min=('age',min),
age_max=('age','max'),
team_size_mean=('team_size','mean')
)
)
pp. 252-253. It's wonderful pivot_table and groupby don't only accept values but functions.
def even_grouper(idx):
return 'not multiiples of 3' if idx % 3 else 'multiples of 3'
jb2.pivot_table(index=even_grouper,aggfunc='size')
More Aggregations
The .transform method, which is new to me, assigns aggregated information of groups such as size and mean to each member.
#28.4 Excercises
#1
(jb2
.assign(pvmsum=(jb2
.groupby('python3_version_most')
.age
.transform('sum')
)
)
)
#2
(jb2
.groupby('country_live')
.filter(lambda g:g.age.size>=3)
)
2 thoughts on “Notes on the Book "Effective Pandas"”
Comments are closed.