Fuzzy matching

Record linking and fuzzy matching both describe the process of joining two data sets that do not share a common unique identifier. Examples include joining files on people’s names, or merging data sets that contain only an organization’s name and address.
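To make the idea concrete, here is a minimal sketch that uses only Python's standard library (difflib). The two lists, the helper name best_match and the 0.6 threshold are made up for illustration. It links each name in one list to its most similar name in the other, and leaves it unlinked when nothing is close enough.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.6):
    """Return (candidate, score) for the most similar candidate, or None.

    The threshold of 0.6 is an arbitrary illustrative cut-off.
    """
    scored = [
        (c, SequenceMatcher(None, name.lower(), c.lower()).ratio())
        for c in candidates
    ]
    best = max(scored, key=lambda pair: pair[1], default=None)
    if best is None or best[1] < threshold:
        return None
    return best

# Two hypothetical lists that describe the same companies with different spellings.
customers = ["Acme Corporation", "Globex Inc", "Initech LLC"]
invoices = ["ACME Corp.", "Globex Incorporated", "Umbrella Group"]

# Link every invoice name to its closest customer name, if any.
links = {name: best_match(name, customers) for name in invoices}

for name, match in links.items():
    print(name, "->", match[0] if match else "no match")
```

This compares every name with every candidate, so it only suits small lists. It also shows the usual trade-off: a lower threshold links more records but lets in more false matches.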

Everyone and his uncle seems to have a solution, but none of them work. Here is a good article on this subject. It mentions the fuzzymatcher and recordlinkage modules.



An easier solution is mentioned in the comments of that page:

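#!pip install string-grouper

import pandas as pd
from string_grouper import match_strings

# The file has no header row, so name the two columns explicitly.
df = pd.read_csv("company_list.csv", encoding="ISO-8859-1", header=None)
df.columns = ["serial", "company_name"]

# Find pairs of similar names within the company_name column.
ndf = match_strings(df["company_name"])
# Drop the trivial match of each row with itself and save the rest for review.
ndf[ndf["left_index"] != ndf["right_index"]].to_csv("to_study.csv")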
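
For intuition about the scores in the output: as far as its documentation describes, string_grouper compares strings by cosine similarity over TF-IDF weighted character n-grams. The sketch below is not the library's code. It is a simplified standard-library illustration that uses raw trigram counts with no TF-IDF weighting, and the function names and sample company names are hypothetical.

```python
import math
from collections import Counter

def trigrams(text):
    """Count the overlapping 3-character substrings of a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine_similarity(a, b):
    """Cosine similarity between the trigram count vectors of two strings."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(n * n for n in va.values()))
    norm_b = math.sqrt(sum(n * n for n in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # strings shorter than three characters have no trigrams
    return dot / (norm_a * norm_b)

# Hypothetical company names, compared pairwise.
names = ["Acme Corporation", "Acme Corporation Ltd", "Globex Inc"]
pairs = [
    (a, b, cosine_similarity(a, b))
    for i, a in enumerate(names)
    for b in names[i + 1:]
]

for a, b, score in pairs:
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

Names that share most of their trigrams score close to 1, and names with none in common score 0. Keeping only pairs above a similarity cut-off is what yields a short list of likely duplicates to study.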
