Wednesday, May 29, 2013

Fuzzy VLOOKUP (string matching) in Excel

Recently, I faced a problem of this measure in Excel:

  • File1 contained a list of names and some data
  • File2 contained a list of names and some data
  • File1 and File2 had many common names but the names didn't match exactly
  • Names in File1 had to be matched with names in File2 and some data retrieved for the name from File2 into File1

The problem was, Excel functions like VLOOKUP match exact strings, the match won't work if strings differ (VLOOKUP has a flag for approximate matching but it's very inadequate and trivial). Microsoft also provides a fuzzy-match plugin for excel but I needed this functionality for my office and office IT guys won't let me install the plugin. You might want to check it out if your IT people aren't that brutal :D

So, I wrote a small macro containing is a fuzzy version of VLOOKUP. It is kind of rough and not very inefficient but does the job. Please go through the comments in the macro.

As an example consider branch_data.xslx as File1 and population.xslx as File2.
branch_data.xlsm is the macro enabled file with the array function FuzzyVLookup applied on columns Population, Matched CIty and Confidence(%) which matches city names from population.xslx . Press Alt+F11 to view the macro.
Keep all files in same folder and the function in branch_data.xslm should work( remember to enable macros when prompted). 

Note: This function returns multiple values and uses array formula. Check the "Method to Return a Variable-Size Result Array" at http://support.microsoft.com/kb/110693 .

The macro uses Levenshtein's Algorithm for fuzzy matching of strings( the algorithm calculates the number of insertions and  deletions in one string to convert it into other string and uses that metrics as a measure of match)

Credits: 
Implementation of Levenshtein in VBA

Here is the code itself( for those who are too lazy to download the branch_data.xlsm file) :
'------------------------------------------------------------------------------------------------------------
Option Explicit

' Levenshtein3 tweaked for UTLIMATE speed and CORRECT results
' Solution based on Longs
' Intermediate arrays holding Asc()make difference
' even Fixed length Arrays have impact on speed (small indeed)
' Levenshtein version 3 will return correct percentage(0...100)

Function FuzzyPercent(ByVal String1 As String, ByVal String2 As String) As Long

Dim I As Long, j As Long, string1_length As Long, string2_length As Long
Dim distance(0 To 60, 0 To 50) As Long, smStr1(1 To 60) As Long, smStr2(1 To 50) As Long
Dim min1 As Long, min2 As Long, min3 As Long, minmin As Long, MaxL As Long

string1_length = Len(String1):  string2_length = Len(String2)

distance(0, 0) = 0
For I = 1 To string1_length:    distance(I, 0) = I: smStr1(I) = Asc(LCase(Mid$(String1, I, 1))): Next
For j = 1 To string2_length:    distance(0, j) = j: smStr2(j) = Asc(LCase(Mid$(String2, j, 1))): Next
For I = 1 To string1_length
    For j = 1 To string2_length
        If smStr1(I) = smStr2(j) Then
            distance(I, j) = distance(I - 1, j - 1)
        Else
            min1 = distance(I - 1, j) + 1
            min2 = distance(I, j - 1) + 1
            min3 = distance(I - 1, j - 1) + 1
            If min2 < min1 Then
                If min2 < min3 Then minmin = min2 Else minmin = min3
            Else
                If min1 < min3 Then minmin = min1 Else minmin = min3
            End If
            distance(I, j) = minmin
        End If
    Next
Next

' Levenshtein3 will properly return a percent match (100%=exact) based on similarities and Lengths etc...
MaxL = string1_length: If string2_length > MaxL Then MaxL = string2_length
FuzzyPercent = 100 - CLng(distance(string1_length, string2_length) * 100 / MaxL)

End Function


Function FuzzyVLookup(ByVal LookupValue As String, _
                      ByVal TableArray As Range, _
                      ByVal ValIndex As Integer, _
                      Optional ValIndex1 As Integer) As Variant
                      
'********************************************************************************
'**This function must be called by selecting three columns then entering the function and pressing ctrl+shift+enter**'
'**this function compares the LookUpValue with the values in 1st column of TableArray range and returns the        **'
'**values from columns valIndex, ValIndex1 from the range and also percentage match                                **'

'**LookupValue: the value for which a match is to be found in a range of values**'
'**TableArray: the range in which the match for LookUpValue is to be found     **'
'**ValIndex: index of a column in TableArray range whose value is to be retrieved on match**'
'**ValIndex1: (optional) additional index of a column in TableArray range  whose value is to be retrieved on match**'


Dim R As Range

Dim strListString As String
Dim strWork As String
Dim I As Integer
Dim lEndRow As Long

Dim Row As Integer
Dim sngCurPercent  As Long
Dim sngMinPercent As Long
Dim arr As Variant
'--------------------------------------------------------------
'--    Validation                                            --
'--------------------------------------------------------------
ReDim arr(1 To 5)
Row = 0
sngMinPercent = 0

lEndRow = TableArray.Rows.Count
If VarType(TableArray.Cells(lEndRow, 1).Value) = vbEmpty Then
    lEndRow = TableArray.Cells(lEndRow, 1).End(xlUp).Row
End If

'---------------
'-- Main loop --
'---------------
For Each R In Range(TableArray.Cells(1, 1), TableArray.Cells(lEndRow, 1))

    strListString = R.Offset(0, 0).Text 'the city name in the range is in column 0
    
    '------------------------------------------------
    '-- Fuzzy match strings & get percentage match --
    '------------------------------------------------
    sngCurPercent = FuzzyPercent(String1:=LookupValue, _
                                 String2:=strListString)
    
    If sngCurPercent >= sngMinPercent Then
        Row = R.Row
        sngMinPercent = sngCurPercent
    End If
    
Next R

'-----------------------------------
'-- Return column entry specified --
'-----------------------------------
arr(1) = TableArray.Cells(Row, ValIndex) 'return the column value for matched row at ValIndex
arr(2) = TableArray.Cells(Row, ValIndex1) 'return the column value for matched row at ValIndex1
arr(3) = sngMinPercent 'return the match % for matched row
FuzzyVLookup = arr


End Function