Finding Duplicate Files with Python
source link: https://www.pythoncentral.io/finding-duplicate-files-with-python/
Sometimes we need to find duplicate files in our file system, or inside a specific folder. In this tutorial we are going to write a Python script to do this. The script works in Python 3.x.
The program receives a folder or a list of folders to scan, then traverses the given directories and finds the duplicated files in them.
The program computes a hash for every file, which allows us to find duplicated files even when their names differ. All of the files that we find are stored in a dictionary, with the hash as the key and the list of paths that share that hash as the value: { hash: [list of paths] }.
To start, import the os, sys and hashlib libraries.
Then we need a function to calculate the MD5 hash of a given file. The function receives the path to the file and returns the hexadecimal digest of that file's contents.
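A minimal sketch of such a function is shown below. The name hashfile and the blocksize parameter are assumptions made for illustration; reading the file in chunks is one common way to avoid loading a large file into memory at once.

```python
import hashlib


def hashfile(path, blocksize=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    hasher = hashlib.md5()
    # Open in binary mode so the bytes are hashed exactly as stored on disk.
    with open(path, 'rb') as afile:
        buf = afile.read(blocksize)
        # read() returns an empty bytes object at end of file, ending the loop.
        while buf:
            hasher.update(buf)
            buf = afile.read(blocksize)
    return hasher.hexdigest()
```

Two files with the same content produce the same digest regardless of their names, which is what the rest of the script relies on.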
Now we need a function to scan a directory for duplicated files.
The findDup function uses os.walk to traverse the given directory. If you need a more comprehensive guide about it, take a look at the How to Traverse a Directory Tree in Python article. For each directory it visits, os.walk yields only the bare file names, so we use os.path.join to build the full path to each file. Then we compute the file's hash and store the path in the dups dictionary under that hash.
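The traversal described above could be sketched as follows. This is an illustration under assumptions rather than a definitive listing: the parameter and variable names are made up, and a compact hashfile helper is included only so the snippet runs on its own.

```python
import hashlib
import os


def hashfile(path, blocksize=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    hasher = hashlib.md5()
    with open(path, 'rb') as afile:
        for buf in iter(lambda: afile.read(blocksize), b''):
            hasher.update(buf)
    return hasher.hexdigest()


def findDup(parentFolder):
    """Return {hash: [paths]} for every file found under parentFolder."""
    dups = {}
    for dirName, subdirs, fileList in os.walk(parentFolder):
        for filename in fileList:
            # os.walk gives bare file names, so build the full path first.
            path = os.path.join(dirName, filename)
            file_hash = hashfile(path)
            # Group paths by hash; a list longer than one means duplicates.
            dups.setdefault(file_hash, []).append(path)
    return dups
```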
When findDup finishes traversing the directory, it returns a dictionary that maps each hash to the files with that content; any entry with more than one path is a set of duplicates. If we are going to traverse several directories, we need a method to merge two dictionaries.
joinDicts takes two dictionaries, iterates over the second one and checks whether each key exists in the first. If it does, the method appends the values from the second dictionary to the ones in the first. If the key does not exist, it stores it in the first dictionary. At the end of the method, the first dictionary contains all the information.
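The merge just described might look like the sketch below. The parameter names and the sample paths are hypothetical; the function changes the first dictionary in place and returns nothing.

```python
def joinDicts(dict1, dict2):
    """Merge dict2 into dict1 in place, combining the path lists per hash."""
    for key in dict2:
        if key in dict1:
            # Hash already seen: extend the existing list of paths.
            dict1[key] = dict1[key] + dict2[key]
        else:
            # New hash: store a copy so the two dictionaries share no lists.
            dict1[key] = list(dict2[key])


first = {'abc': ['/folder1/a.txt']}
second = {'abc': ['/folder2/copy.txt'], 'def': ['/folder2/b.txt']}
joinDicts(first, second)
# first is now:
# {'abc': ['/folder1/a.txt', '/folder2/copy.txt'], 'def': ['/folder2/b.txt']}
```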
To be able to run this script from the command line, we need to receive the folders as parameters, and then call findDup for every folder.
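A rough sketch of that loop is given below. The helper name scan_folders and its scan and merge parameters are hypothetical: they stand in for findDup and joinDicts so that the snippet can run on its own. In the actual script the folder list would come from sys.argv[1:].

```python
import os


def scan_folders(folders, scan, merge):
    """Run scan(folder) on every existing folder and merge the results."""
    dups = {}
    for folder in folders:
        if os.path.exists(folder):
            # merge is expected to update dups in place, as joinDicts does.
            merge(dups, scan(folder))
        else:
            # Report the bad argument and keep going with the other folders.
            print('%s is not a valid path, please verify' % folder)
    return dups
```

Skipping an invalid folder instead of stopping is a design choice of this sketch; a stricter script could exit on the first bad path.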
The os.path.exists function verifies that the given folder exists in the filesystem. To run this script use python dupFinder.py /folder1 ./folder2. Finally we need a method to print the results.
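One possible shape for that method is sketched here. The name printResults and the exact wording of the messages are assumptions; the only essential step is keeping the entries whose list holds more than one path.

```python
def printResults(dict1):
    """Print every group of paths that share a hash with at least one other file."""
    # Only lists with more than one path represent duplicated content.
    results = [paths for paths in dict1.values() if len(paths) > 1]
    if results:
        print('Duplicates found:')
        print('The following files are identical. The names may differ, but the content is the same.')
        print('___________________')
        for result in results:
            for path in result:
                print('\t\t%s' % path)
            print('___________________')
    else:
        print('No duplicate files found.')
```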
Putting everything together:
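The pieces described in this tutorial can be combined into a single file along the following lines. This is a sketch under the same assumptions as before: the function names follow the ones used in the text, while main, blocksize and the message wording are illustrative.

```python
# dupFinder.py
import hashlib
import os
import sys


def hashfile(path, blocksize=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    hasher = hashlib.md5()
    with open(path, 'rb') as afile:
        buf = afile.read(blocksize)
        while buf:
            hasher.update(buf)
            buf = afile.read(blocksize)
    return hasher.hexdigest()


def findDup(parentFolder):
    """Return {hash: [paths]} for every file found under parentFolder."""
    dups = {}
    for dirName, subdirs, fileList in os.walk(parentFolder):
        for filename in fileList:
            path = os.path.join(dirName, filename)
            dups.setdefault(hashfile(path), []).append(path)
    return dups


def joinDicts(dict1, dict2):
    """Merge dict2 into dict1 in place, combining the path lists per hash."""
    for key in dict2:
        if key in dict1:
            dict1[key] = dict1[key] + dict2[key]
        else:
            dict1[key] = list(dict2[key])


def printResults(dict1):
    """Print every group of paths that share the same hash."""
    results = [paths for paths in dict1.values() if len(paths) > 1]
    if not results:
        print('No duplicate files found.')
        return
    print('Duplicates found:')
    print('___________________')
    for result in results:
        for path in result:
            print('\t\t%s' % path)
        print('___________________')


def main(folders):
    """Scan every existing folder in `folders` and report duplicated files."""
    if not folders:
        print('Usage: python dupFinder.py folder1 [folder2 ...]')
        return
    dups = {}
    for folder in folders:
        if os.path.exists(folder):
            joinDicts(dups, findDup(folder))
        else:
            print('%s is not a valid path, please verify' % folder)
    printResults(dups)


if __name__ == '__main__':
    main(sys.argv[1:])
```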