Exploring Python's Difflib: A Guide to Finding Differences
# Introduction to Difflib
Lately, I've taken the initiative to delve into the built-in libraries of Python. This endeavor has proven to be quite enjoyable, revealing a plethora of features that offer ready-to-use solutions for various challenges.
One particularly interesting library is Difflib. Since it comes pre-installed with Python 3, there's no need for additional downloads—simply import it with the following command:
import difflib as dl
Now, let’s get started!
# 1. Identifying Changes
Most of you are likely familiar with Git. If so, you might have encountered a conflict in a raw code file that looks something like this:
<<<<<<< HEAD:file.txt
Hello world
=======
Goodbye
>>>>>>> 77976da35a11db4580b80ae27e8d65caf5208086:file.txt
GUI tools for Git such as Atlassian Sourcetree display changes visually.
There, a line prefixed with a minus sign was removed in the new version, while a plus sign marks a line that was added.
In Python, we can replicate this functionality easily using Difflib and a single line of code. The first method I'll demonstrate is context_diff().
Let’s create two lists with some string elements:
s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'JavaScript', 'C', 'PHP']
Now, we can generate a comparison report:
dl.context_diff(s1, s2)
The context_diff() function returns a generator, so we loop over it to display all the differences:
for diff in dl.context_diff(s1, s2):
    print(diff)
In the output, the 2nd and 3rd elements are each marked with an exclamation point (!), which indicates a line that changed between the two lists.
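To make the markers concrete, here is a minimal sketch that collects the report into a list and picks out the changed lines. Passing lineterm='' is my own choice for tidy printing (the snippet above uses the default), and the names report and changed are just illustrative.

```python
import difflib as dl

s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'JavaScript', 'C', 'PHP']

# lineterm='' is an assumption made for tidy printing: by default the
# header lines end with '\n', which prints as extra blank lines.
report = list(dl.context_diff(s1, s2, lineterm=''))

for line in report:
    print(line)

# Changed lines carry a '! ' prefix; unchanged lines start with two spaces.
changed = [line for line in report if line.startswith('! ')]
print(changed)
```

The "before" half of the report lists the old values and the "after" half lists the new ones, so each changed element shows up twice: once per side.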
Let’s modify the lists to show a more intricate comparison:
s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'Java', 'PHP', 'Swift']
The result indicates that "C++" has been removed and "Swift" added.
For a representation similar to Sourcetree's output, we can use unified_diff():
dl.unified_diff(s1, s2)
This function merges both versions into a single block, prefixing removed lines with - and added lines with +, which is more compact and usually easier to read.
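As a rough illustration, the sketch below runs unified_diff() on the second pair of lists and extracts what was removed and added. Again, lineterm='' and the variable names (unified, removed, added) are my own assumptions, not part of the original snippet.

```python
import difflib as dl

s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'Java', 'PHP', 'Swift']

# lineterm='' keeps the '---', '+++' and '@@' header lines free of
# trailing newlines (an assumption made here for tidy printing).
unified = list(dl.unified_diff(s1, s2, lineterm=''))

for line in unified:
    print(line)

# Skip the '---' / '+++' file headers, then strip the one-character prefix.
removed = [l[1:] for l in unified if l.startswith('-') and not l.startswith('---')]
added = [l[1:] for l in unified if l.startswith('+') and not l.startswith('+++')]
print(removed, added)
```

Unlike context_diff(), each element appears only once here, which is why the result reads much like a Git diff.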
# 2. Detailed Character Comparison
In the previous section, we focused on row-level differences. But what if we want to compare the elements character by character? We can achieve this with the ndiff() function.
Consider comparing these two lists of words:
['tree', 'house', 'landing']
['tree', 'horse', 'lending']
The second and third words each differ by just a single letter.
Passing both lists to ndiff() not only displays what has changed (with - and + signs) but also marks the specific letters that differ with the ^ symbol, on guide lines that start with ?.
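A minimal sketch of that comparison follows. The variable names (words_before, words_after, delta, guides) are mine, and stripping the trailing newline is only there because the ? guide lines end with one.

```python
import difflib as dl

words_before = ['tree', 'house', 'landing']
words_after = ['tree', 'horse', 'lending']

# ndiff() returns a generator of delta lines. The '?' guide lines end
# with a newline, so strip it to avoid blank lines when printing.
delta = [line.rstrip('\n') for line in dl.ndiff(words_before, words_after)]

for line in delta:
    print(line)

# Guide lines start with '? ' and use '^' to point at changed characters.
guides = [line for line in delta if line.startswith('? ')]
print(len(guides))
```

Each changed word appears as a - / + pair, and every pair is accompanied by guide lines whose ^ sits under the letter that differs.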
# 3. Finding Close Matches
Have you ever typed "teh" and had it automatically corrected to "the"? This common occurrence can be implemented in your Python applications using the get_close_matches() function.
Suppose we have a list of potential candidates and an input:
dl.get_close_matches('thme', ['them', 'that', 'this'])
This returns ['them'], since "them" is the only candidate close enough to the typo "thme". If there are no close matches, the function returns an empty list.
We can also adjust the similarity threshold with the cutoff parameter, a float between 0 and 1 (the default is 0.6). A higher value means stricter matching.
To limit the number of matches returned, we can use the n parameter (the default is 3), which caps the result at the n best matches.
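The sketch below tries the same lookup with a few different settings. The specific cutoff values (0.9 and 0.3) and the variable names are arbitrary choices of mine, picked only to show the effect of each parameter.

```python
import difflib as dl

candidates = ['them', 'that', 'this']

# Default settings: cutoff=0.6, n=3.
default = dl.get_close_matches('thme', candidates)

# A strict cutoff rejects every candidate, so the result is empty.
strict = dl.get_close_matches('thme', candidates, cutoff=0.9)

# A loose cutoff lets all three candidates through, best match first.
loose = dl.get_close_matches('thme', candidates, cutoff=0.3)

# n=1 keeps only the single best match, even with the loose cutoff.
top_one = dl.get_close_matches('thme', candidates, n=1, cutoff=0.3)

print(default, strict, loose, top_one)
```

Results are ordered by similarity, so the first element is always the best guess for an autocorrect-style suggestion.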
# 4. Transforming Strings from A to B
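If you're familiar with Information Retrieval, you might expect the functions above to rely on Levenshtein Distance to work out how to transform one string into another. Difflib actually takes a different approach, similar to the Ratcliff/Obershelp "gestalt pattern matching" algorithm, which repeatedly looks for the longest matching block, but the underlying question is the same.
Levenshtein Distance is the minimum number of substitutions, insertions, and deletions required to change string A into string B. While I won't delve into the algorithm here, you can find more details on the concept on its Wikipedia page.
With Difflib, the SequenceMatcher class lets us list the steps that transform one string into another. Keep in mind that these steps tend to look natural to a human reader but are not guaranteed to be the shortest possible edit sequence.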
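For reference, the distance described above can be sketched as a textbook dynamic-programming function. This is not something Difflib provides; the function name levenshtein and its row-by-row layout are my own, shown only to make the definition concrete.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (textbook formulation)."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into '' takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein('kitten', 'sitting'))  # 3
print(levenshtein('teh', 'the'))         # 2
```

Note that a swapped pair of letters such as "teh" and "the" counts as two edits under this definition.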
For instance, if we have the strings abcde and fabdc, we can determine how to convert the former into the latter:
s1 = 'abcde'
s2 = 'fabdc'
seq_matcher = dl.SequenceMatcher(None, s1, s2)
Using the get_opcodes() method, we can retrieve a list of tuples indicating the necessary modifications:
for tag, i1, i2, j1, j2 in seq_matcher.get_opcodes():
    print(f'{tag:<7} s1[{i1}:{i2}] --> s2[{j1}:{j2}] {s1[i1:i2]!r:>6} --> {s2[j1:j2]!r}')
Each tuple holds a tag (equal, insert, delete, or replace) plus the index ranges in s1 and s2 that the operation applies to.
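As a quick sanity check, the sketch below prints the opcodes and then replays them against s1 to rebuild s2. The replay loop and the name rebuilt are my own additions for illustration.

```python
import difflib as dl

s1 = 'abcde'
s2 = 'fabdc'
seq_matcher = dl.SequenceMatcher(None, s1, s2)
opcodes = seq_matcher.get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print(f'{tag:<7} s1[{i1}:{i2}] --> s2[{j1}:{j2}] {s1[i1:i2]!r:>6} --> {s2[j1:j2]!r}')

# Replay the opcodes: keep 'equal' spans from s1, take 'insert' and
# 'replace' spans from s2, and skip 'delete' spans entirely.
rebuilt = []
for tag, i1, i2, j1, j2 in opcodes:
    if tag == 'equal':
        rebuilt.append(s1[i1:i2])
    elif tag in ('insert', 'replace'):
        rebuilt.append(s2[j1:j2])

print(''.join(rebuilt) == s2)  # True
```

For this pair the tags come out as insert, equal, insert, equal, delete: "f" and "d" are inserted, "ab" and "c" are kept, and "de" is deleted. That script touches four characters, whereas the Levenshtein Distance for the same pair is 3 (insert "f", delete "c", substitute "e" with "c").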
Additionally, we can tell SequenceMatcher to treat certain characters as junk by passing a function as its first argument:
seq_matcher = dl.SequenceMatcher(lambda c: c in 'abc', s1, s2)
Here the characters "a", "b", and "c" are treated as junk: the matcher will not use them to anchor a match, so "abc" ends up being replaced as a whole block rather than matched letter by letter.
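To see the effect, this sketch compares the opcodes with and without the junk function. The variable names plain and with_junk are mine.

```python
import difflib as dl

s1 = 'abcde'
s2 = 'fabdc'

# Without a junk function, 'ab' and 'c' are found as matching blocks.
plain = dl.SequenceMatcher(None, s1, s2).get_opcodes()

# With 'a', 'b' and 'c' marked as junk, only 'd' can anchor a match.
with_junk = dl.SequenceMatcher(lambda c: c in 'abc', s1, s2).get_opcodes()

for tag, i1, i2, j1, j2 in with_junk:
    print(f'{tag:<7} {s1[i1:i2]!r:>6} --> {s2[j1:j2]!r}')
```

With the junk function in place, the only match left is "d", so "abc" is replaced by "fab" as a block and "e" is replaced by "c".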
Lastly, here's how we might utilize this function in practice:
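One possible use, sketched under my own assumptions, is a small helper that walks the opcodes and renders the changes inline. The name inline_diff and the [-...-] / {+...+} markers are hypothetical choices for this illustration, not part of Difflib.

```python
import difflib as dl


def inline_diff(old: str, new: str) -> str:
    """Render the edits from old to new as a single annotated string.

    Deleted text is wrapped in [-...-] and inserted text in {+...+};
    both markers are arbitrary conventions chosen for this sketch.
    """
    parts = []
    matcher = dl.SequenceMatcher(None, old, new)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            parts.append(old[i1:i2])
        if tag in ('delete', 'replace'):
            parts.append(f'[-{old[i1:i2]}-]')
        if tag in ('insert', 'replace'):
            parts.append(f'{{+{new[j1:j2]}+}}')
    return ''.join(parts)


print(inline_diff('abcde', 'fabdc'))  # {+f+}ab{+d+}c[-de-]
print(inline_diff('house', 'horse'))  # ho[-u-]{+r+}se
```

A 'replace' opcode contributes both a deletion and an insertion, which is why the two conditions are checked independently rather than with elif.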
# Conclusion
In this article, I introduced the Python built-in library Difflib, which allows you to generate reports that highlight differences between two lists or strings. It also helps find the closest matching strings based on user input. Finally, we explored the SequenceMatcher class, which exposes the individual operations needed to turn one sequence into another.