I'm the author of a spiffy command-line thing called jdupes which will help you find and handle identical duplicate files. If you need more info or any help then feel free to ask. If you have lots of files and understand the ramifications of hard linking, that may be the best first step. Subsequently locating duplicates that are hard linked involves zero file data reading (use -H to enable hard link matching) and you'll save a lot of space.
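For example (a minimal sketch; /path/to/data is a placeholder, and -L only works when everything sits on the same filesystem):

# First pass: replace duplicate files with hard links to a single copy.
jdupes -rL /path/to/data

# Later passes: -H counts already-hard-linked files as matches, so those
# sets are reported without re-reading any file data.
jdupes -rH /path/to/data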
Also...it runs on Linux, Mac, Windows, and pretty much any POSIX-compliant machine. I've submitted a package for Synology NAS devices which was approved but they haven't yet included it.
I've seen jdupes mentioned here quite a few times and have been a lurker for a while. Let's talk about what stuff you'd like to see in my program or how it could be improved for your data set. Be sure to check issues for duplicate bugs/feature requests.
Just to be clear about one very common request: I'm not planning on adding any data-aware duplicate scanning like image or audio comparisons. I want to keep the scope of the program limited to exact file duplicate matching.
> Do a file hash (like MD5 or SHA) of all your photos. fdupes is your friend (on Linux).
jdupes is a better friend on any OS (inc Windows & Mac), and it's about 7 times faster as well :)
ah gotcha did it right then!
the non-linked files, probably gonna need to peek at trace if not debug logs when they import and see what's up; are they even importing?
jdupes -L -r "/data/tv/" "/data/tv/.torrents/"
will check those two folders recursively and replace any duplicate files with hard links back to a single copy
I'm the author of jdupes. Windows (since Vista) supports actual hard links on NTFS, and jdupes has supported hard linking on Windows since July 2015. jdupes uses the native CreateHardLink()
call on Windows to hard link just like on Linux/macOS/BSD/POSIX-compliant whatever else. NTFS does have a 1024-link limit, though, which jdupes is designed to work around.
For duplicate files I use,
It will be time consuming to make sure you really get it right, but I highly recommend you check out jdupes and Digikam for getting things together in one place and analyzing your media files for duplicates irrespective of their filenames. Good luck.
If you're running a Debian- or Arch-derived distribution, you'll find my fdupes fork called jdupes to be significantly faster than classic fdupes. The vast majority of command options are identical and if you ever need any help with it, I'm always happy to help. Also, your post inspired a new planned feature in jdupes so thanks for that! It's a great idea.
I'm seeing this rather weird behaviour when using BTRFS dedupe and would like to find out more from other users here: https://github.com/jbruchon/jdupes/issues/66
It appears that after a dedupe pass, file sizes are changed and now show as rounded up to the nearest block size. Has anyone else encountered this?
jdupes -M -r "/data/tv/" "/data/tv/.torrents/" <= this would check for duplicate files
jdupes -L -r "/data/tv/" "/data/tv/.torrents/" <= this would replace the duplicates with hard links
I'm the author of jdupes. This works, but is extremely slow. You are reading every single file to do this when there are many "cheap" heuristics to exclude by first, most notably the file size. If you want to see a simplified shell script version of what jdupes does, look at stupid_dupes.sh in the jdupes source repo ...and yes, it works, albeit slowly. Some other tools like dupd rely on secure hash algorithms and a persistent database of those hashes, though that requires reading all of the files to create the hashes in the first place, a potentially very time-consuming process. Better to exclude faster and sooner than to do all that work.
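If you're curious what "exclude by size first" looks like, here's a rough shell sketch of the idea (not jdupes itself; md5sum is just a stand-in hash, and GNU find is assumed for -printf):

# Group files by size first; only files whose size repeats ever get read and hashed.
find . -type f -printf '%s %p\n' | sort -n > sizes.txt
awk '{ print $1 }' sizes.txt | uniq -d > dupe_sizes.txt
while read -r size path; do
    # Hash only the candidates (files with a non-unique size).
    grep -qx "$size" dupe_sizes.txt && md5sum "$path"
done < sizes.txt | sort | uniq -w 32 -D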
Hi. I'm the author of jdupes. There is a non-working option called -I/--isolate that tries to prevent what you experienced from happening, but in general, when you run the program, it will do whatever you tell it to do on ALL duplicate files. You should not have run with -dN without first doing a run without those options to verify what the results would have been. -dN preserves the first file in each set and deletes all other files that are identical to it. -O is an ordering option, not an isolating option: it sorts duplicates so that matches in earlier-specified folders always appear at the top of each set. It does not prevent inter-folder matching, which is what -I is supposed to do but doesn't do properly (it ends up ignoring legitimate duplicates too, so it's not dangerous, just incomplete).
If the files are on the same logical volume, it might be better to hard link with -L than to delete with -dN. There will still be duplicate files but they will all point to the same single data area on disk and take up no extra space.
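Here is a hedged sketch of that safer workflow (directory names are placeholders):

# 1. Preview only: print the duplicate sets, change nothing.
jdupes -r /photos/master /photos/incoming

# 2a. After reviewing the output, delete the extras; -O sorts each set so the
#     copy in the first-listed directory comes first and is the one preserved.
jdupes -rdNO /photos/master /photos/incoming

# 2b. ...or, if everything is on one filesystem, hard link instead of deleting.
jdupes -rLO /photos/master /photos/incoming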
If you use Linux/macOS/BSD/MinGW, you can feed the output to a custom shell script that does what you want. The jdupes source code includes some example scripts to help you with this task. Just for you, I added some scripts called delete_but_exclude that will let you use regexes to exclude file paths from deletion.
No luck with Digikam. Instead I used jdupes to identify and delete duplicates. It seems to have worked well, in that it didn't just wipe my entire collection.
And, with several thousand files and duplicates, if I did lose a non-duplicate, I would have no way of knowing x) So I just closed my eyes and ran it after having run the "indexing". It works by making and comparing file hashes, so I guess it's as accurate as one can get short of manually comparing file by file.
> fdupes is in the repos, is there a specific reason you need this one? I'm not familiar enough with the project to gauge its usefulness compared to the other.
jdupes, which I wrote. I provide Windows binaries and you can use the recently enhanced -X filters to operate only on some files based on extension, modification time, substring path matching, and so on. If you need any help, please feel free to reach out to me.
Deduplication would have a performance impact on your system, as every block that gets written has its hash checked against the master table to determine if it's a duplicate.
Have you tried jdupes to find and remove the duplicates? https://github.com/jbruchon/jdupes
It sounds like you are on a Windows machine. I recommend jdupes for this particular task and release 1.13.3 has a Windows executable.
For files that are identical, I really like https://github.com/jbruchon/jdupes. It's like fdupes, just way faster.
For videos, I like https://www.video-comparer.com/, which got really fast in the last two updates.
> jdupes uses jodyhash for file data hashing. This hash is extremely fast with a low collision rate, but it still encounters collisions as any hash function will ("secure" or otherwise) due to the pigeonhole principle. This is why jdupes performs a full-file verification before declaring a match. It's slower than matching by hash only, but the pigeonhole principle puts all data sets larger than the hash at risk of collision, meaning a false duplicate detection and data loss. The slower completion time is not as important as data integrity. Checking for a match based on hashes alone is irresponsible, and using secure hashes like MD5 or the SHA families is orders of magnitude slower than jodyhash while still suffering from the risk imposed by the pigeonhole principle. An example of this problem is as follows: with 365 days in a year and 366 people, at least two people are guaranteed to share a birthday; likewise, even though SHA512 is a 512-bit (64-byte) hash, once the data streams being compared are 65 bytes (520 bits) or larger, some hash value is guaranteed to be shared by at least 256 distinct streams.
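To make the hash-then-verify point concrete, here is a hedged sketch (not jdupes internals; file_a and file_b are placeholders for two files whose sizes and hashes already matched):

# A matching hash only nominates a candidate pair; cmp makes the final call.
if [ "$(sha512sum < file_a)" = "$(sha512sum < file_b)" ]; then
    if cmp -s file_a file_b; then
        echo "verified duplicate"
    else
        echo "hash collision: same hash, different contents"
    fi
fi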
I've always used fdupes or jdupes for finding duplicate files. Both of them work on the hash of the file, among other things.
However, I think they only run on Linux and are command-line only.
https://github.com/jbruchon/jdupes/releases/tag/v1.12
Try out the new -t option which disables the file change check code and see if you still run into the same problems. If you're not on Windows and can't build the code yourself then let me know what platform you're on and I'll see if I can build you a static binary or generic tarball.
we can't have it all, sadly :)
Have you tried file-level dedupe tools like jdupes? I use it on my photo collection, since I'm constantly copying stuff off SD cards and phones, and it's helped me clear out thousands of duplicate files. It can create soft/hard links to files instead of deleting, though. Very neat tool.
I dug up a very old benchmark that was done about six weeks after I forked fdupes into my own separate project (it was called "fdupes-jody" back then) and the benchmark showed rdfind was slower at the time. Of course, this was three years ago and both rdfind and jdupes are actively developed, so take it with a grain of salt. Most of my work at that point was plucking the low-hanging optimization fruit.
You'll probably notice that a program in that benchmark called dupd blows every other program away. The trick behind dupd is that it uses a SQLite database to cache file information and then picks duplicates out of that database, so it works very differently; without a previously built database it's on par with current jdupes. I had a very friendly "competition" with the dupd author, and our test results basically boiled down to this: both tools are fast, and each is optimized for the hardware its author tests on.
In short, jdupes is about as fast as it gets in a portable package that doesn't use a database. In the future I'll be adding hash databases, but in the present it's optimized to do the fastest one-shot dupe scanning possible on lots of data sitting on rotating hard drives. At various times I've used it on data sets with millions of files and on files ranging from a few KB to several GB each. I also do a fair amount of data recovery work, which results in lots of duplicate recovered files that need to be cleaned up; that makes an ideal test scenario for duplicate finding.
I recently added an option that allows tuning the I/O chunk size up to 16777216 (16 MiB) which may help with thrashing during byte-for-byte comparisons. Remember that rmlint is faster because it does the equivalent of "jdupes -Q" by default which skips that full-file comparison and risks data loss, however small that risk may be.
I added a feature request on your behalf. It's an intriguing feature idea. I have to write some of the 2.0 framework before I can implement it but I'm definitely wanting to make it happen.
If you want to speed things up a bit with your interim solution, I'd suggest you get jodyhash and build a list of 4K 64-bit hashes to compare (find -type f | while IFS= read -r X; do echo "$(dd if="$X" bs=4096 count=1 2>/dev/null | jodyhash) $X"; done > files.txt), then copy the list to the remote host, do the same thing there, merge the lists, cut, sort, uniq -d. That will produce a far shorter list of duplicate candidates based on the first 4K block of each file, and you can fully hash only those particular files.
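Spelled out a little more (a sketch of the same idea; jodyhash reading stdin follows the one-liner above, and the host and path names are placeholders):

# On each machine: hash the first 4 KiB of every file.
find /data -type f | while IFS= read -r f; do
    printf '%s %s\n' "$(dd if="$f" bs=4096 count=1 2>/dev/null | jodyhash)" "$f"
done > local.txt

# Bring the other machine's list over, then keep only hashes present in both.
scp otherhost:local.txt remote.txt
cut -d' ' -f1 local.txt remote.txt | sort | uniq -d > candidates.txt

# Only files whose first-4K hash shows up on both sides need a full hash/compare.
grep -F -f candidates.txt local.txt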
I wrote jdupes which does what you want. It's a command-line tool, but it'll let you scan your library and blow away duplicates in one easy shot. I haven't compiled an OS X binary for the latest version yet but you can use an older version.
jdupes -nrdN -x 1M ~/Music
That'll kill all of the 100% duplicate files in one shot. The "-x 1M" will exclude files under one megabyte in size which is a little safer since you know your music files will certainly be larger than that.
I'm the author of jdupes and if you're OK with the command line...it's exactly what you're looking for. You could even set up a scheduled task or cron job that does the dedupe work automatically.
jdupes if you don't mind using the command line. I'm the author; feel free to ask if you have questions. Since it's a month after your post I assume you've already handled it, but on the off chance you have not...
I'm the author of jdupes. If you're OK with using Command Prompt, it'll do exactly what you want and is probably the fastest safe duplicate finder you'll ever find. I know you've got it taken care of already but keep jdupes in mind for the next go-round! I'm happy to help if help is needed.
I'm the author of jdupes. If you're okay with using the command line to do it, it'll get the job done and it runs on Windows, Linux, and Mac OS. If you use a Debian-derived distribution or Arch Linux it's probably already available. To find and auto-delete files under one or more directories:
jdupes -rdN dir1 [dir2...]
The --help option will give you a good run-down of the options.
I have a feature like that planned for jdupes but in the meantime give this one-liner a whirl:
find -type f | rev | sort | sed 's#\([^/]*\)#\1 \1#' | rev | uniq -f 1 -D | sed 's#\ [^ ]*$##'
I'm the author of jdupes (an fdupes fork) and I'm intrigued by your particular problem. The architecture of jdupes would be fairly easy to adapt to handle this. Are you looking for something that works over a network link sort of like rsync except for duplicate file scanning?
Not yet. Support for excluding directories from final actions is planned in 2.0. Relevant issue: Exclusion of selected directories from automatic deletion
If you can get all your drives mounted under one directory, you can do it in one line like this:
find /mnt/ -type f -print0 | xargs -0 sha512sum | sort | uniq -D -w 128 | tee results.txt
If you can get all the drives mounted at once, but all the locations are not under a single directory, you can pass multiple directories to the find command given above.
If you cannot mount all your drives at once, you can do it in two lines as /u/technifocal suggests, but the second line can be shortened to:
sort *.log | uniq -D -w 128
Those solutions require reading every file and computing a hash for every file. Programs like jdupes (or the older, slower fdupes) are more efficient and more reliable. I'd use jdupes if possible.
https://github.com/jbruchon/jdupes
A similar question was asked 6 months ago. You should have been able to find it with a simple search: