表記に揺らぎがあるリストの項目をPythonで名寄せして対応づける
2台のパソコンの重複するファイルを整理していると、同じファイルなのに名前が微妙に違って困ることがある。Levenshtein距離を用いて名寄せするPythonスクリプトを書いた。使用にあたり、先ずターミナルでpip install python-Levenshtein
または pip install levenshtei
nと入力して Levenshtein距離のライブラリをインストールする必要がある。まだならPandasも。
from collections.abc import Sequence
import numpy as np
import pandas as pd
import Levenshtein
def get_ratios(ser1: Sequence, ser2: Sequence):
ratios = []
for t1 in ser1:
row = []
for t2 in ser2:
row.append(Levenshtein.ratio(t1, t2))
ratios.append(row)
return pd.DataFrame(ratios, index=ser1, columns=ser2)
def make_series(arr, name=None):
if type(arr) == pd.core.series.Series:
return arr
else:
return pd.Series(arr, name=name)
def sort_key(arr):
lst = list(pd.concat([ser1, ser2]))
return [lst.index(t) if t in lst else None for t in arr]
def align_lists(ser1, ser2, threshold=0):
ser1 = make_series(ser1, name="series1")
ser2 = make_series(ser2, name="series2")
if len(ser1) < len(ser2):
ser_s, ser_l = ser1, ser2
else:
ser_s, ser_l = ser2, ser1
ratios = get_ratios(ser_l, ser_s)
common = pd.DataFrame(
np.vstack([ser_l.iloc[np.argmax(ratios, axis=0)], ser_s]),
index=[ser_l.name, ser_s.name],
).T
ser_s_in_pairs = ratios.max(axis=0)[ratios.max(axis=0) > threshold]
pairs = (
pd.DataFrame(ser_s_in_pairs)
.assign(
**{
ser_l.name: ser_l.iloc[
np.argmax(ratios.loc[:, ser_s_in_pairs.index], axis=0)
].values
}
)
.reset_index()
.rename(columns={0: "score"})
)
table = (
pd.concat(
[
pairs,
pd.DataFrame(ser_l[~ser_l.isin(pairs[ser_l.name])]).assign(score=0),
pd.DataFrame(ser_s[~ser_s.isin(pairs[ser_s.name])]).assign(score=0),
],
ignore_index=True,
)
.sort_values([ser1.name, ser2.name], key=sort_key)
.fillna("")
.reset_index(drop=True)[[ser1.name, ser2.name, "score"]]
)
return table
実施例
以下は最近のヒット映画のリストだが、1と2で内容が若干異なり、表記が異なるものも紛れている。
List 1
- Black Panther: Wakanda Forever
- Barbie
- The Super Mario Brothers Movie
- Spider-Man: Across the Spider-Verse
- Guardians of the Galaxy Vol. 3
- Oppenheimer
- The Little Mermaid
- Ant-Man and the Wasp: Quantumania
- John Wick: Chapter 4
- Sound of Freedom
- Taylor Swift: The Eras Tour
List 2
- The Little Mermaid
- Taylor Swift's Eras Tour
- Ant-Man and the Wasp: Quantumania
- The Super Mario Bros. Movie
- Top Gun: Maverick
- Oppenheimer
- Barbecue
- Guardians of the Galaxy Volume Three
- John Wick 4.0
- sound of freedom
- Avatar: The Way of Water
これを整理してみよう。
Script
list1 = pd.Series(
[
"Black Panther: Wakanda Forever",
"Barbie",
"The Super Mario Brothers Movie",
"Spider-Man: Across the Spider-Verse",
"Guardians of the Galaxy Vol. 3",
"Oppenheimer",
"The Little Mermaid",
"Ant-Man and the Wasp: Quantumania",
"John Wick: Chapter 4",
"Sound of Freedom",
"Taylor Swift: The Eras Tour",
],
name="List 1",
)
list2 = pd.Series(
[
"The Little Mermaid",
"Taylor Swift's Eras Tour",
"Ant-Man and the Wasp: Quantumania",
"The Super Mario Bros. Movie",
"Top Gun: Maverick",
"Oppenheimer",
"Barbecue",
"Guardians of the Galaxy Volume Three",
"John Wick 4.0",
"sound of freedom",
"Avatar: The Way of Water",
],
name="List 2",
)
align_lists(list1,list2,threshold=0.6)
結果
List 1 | List 2 | score | |
---|---|---|---|
0 | The Little Mermaid | The Little Mermaid | 1.0 |
1 | Taylor Swift: The Eras Tour | Taylor Swift's Eras Tour | 0.86 |
2 | Ant-Man and the Wasp: Quantumania | Ant-Man and the Wasp: Quantumania | 1.0 |
3 | The Super Mario Brothers Movie | The Super Mario Bros. Movie | 0.91 |
4 | Oppenheimer | Oppenheimer | 1.0 |
5 | Barbie | Barbecue | 0.71 |
6 | Guardians of the Galaxy Vol. 3 | Guardians of the Galaxy Volume Three | 0.85 |
7 | John Wick: Chapter 4 | John Wick 4.0 | 0.67 |
8 | Sound of Freedom | sound of freedom | 0.88 |
9 | Black Panther: Wakanda Forever | 0.0 | |
10 | Spider-Man: Across the Spider-Verse | 0.0 | |
11 | Top Gun: Maverick | 0.0 | |
12 | Avatar: The Way of Water | 0.0 |