Record linking and fuzzy matching are terms for the process of joining two data sets that do not share a common unique identifier. Examples include joining files on people’s names, or merging data sets that only have an organization’s name and address.
Everyone and his uncle seems to have a solution, but none of them work. Here is a good article on this subject. It mentions the fuzzymatcher and recordlinkage modules.
An easier solution is mentioned in the comments on that page:
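#!pip install string-grouper
import pandas as pd
from string_grouper import match_strings
# The file has no header row: column 0 is a serial number, column 1 the company name
df = pd.read_csv("company_list.csv", encoding="ISO-8859-1", header=None)
df.columns = ["serial", "company_name"]
# Match the company names against each other; the result lists pairs of similar names
ndf = match_strings(df["company_name"])
# Drop self-matches so only pairs of different rows are left to review
ndf[ndf["left_index"] != ndf["right_index"]].to_csv("to_study.csv")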
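To see what "similar pairs within one column" means without installing anything, here is a minimal sketch that uses only the standard library's difflib. This is a swapped-in technique for illustration, not how string_grouper works internally, and it compares every pair of rows, so it only suits small lists. The helper names (normalize, similar_pairs), the 0.85 threshold, and the sample company names are all assumptions made up for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similar_pairs(names, threshold=0.85):
    """Return (i, j, score) for each pair of distinct rows scoring >= threshold."""
    normalized = [normalize(n) for n in names]
    pairs = []
    # combinations() yields each unordered pair once and never pairs a row with itself
    for i, j in combinations(range(len(names)), 2):
        score = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs


# Hypothetical sample data, not taken from company_list.csv
companies = ["Acme Corp.", "ACME Corp", "Globex Inc", "Initech LLC"]
for i, j, score in similar_pairs(companies):
    print(companies[i], "<->", companies[j], score)
```

Like the filter on left_index and right_index above, this leaves out self-matches, so what remains is a short list of candidate duplicates to review by hand. The threshold is a judgment call: lower it to catch more variants at the cost of more false matches.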