Dedup - Identifying and displaying duplicate files

Work on this project has… kinda stalled while I try to figure out a few issues. I’m writing them out to get a better idea of what I’m actually trying to do.

Identifying the files is fairly simple: I have the content hashes of the files, which I can compare. Problem is, the SQL query to get the list of duplicate files just returns the list of duplicate IDs, one row per hash, as a direct result of GROUP BY content_hash. So I can’t extract the files unless I do a separate query for each of the duplicate IDs. And there’s not an option to not use GROUP BY, since I’m actually doing SELECT *, COUNT(*) GROUP BY content_hash HAVING COUNT(*) > 1, and the query fails without the GROUP BY.

At this point, I’m fairly certain doing a SELECT path WHERE content_hash = xyz is the best bet, though I’m not too certain how that will scale up into the thousands of files. If each query takes 0.01sec, 1000 duplicates means the program will take 10 secs to just get the list of files – hopefully displaying will be pretty fast.
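For comparison, here is a rough sketch of fetching every duplicate path in one round trip instead of one query per hash. It feeds the GROUP BY result into a subquery and groups the rows in Python. The table name `files`, the `id` column, and the helper names `find_duplicates` and `demo` are assumptions made up for illustration; only `path` and `content_hash` come from the description above. It uses SQLite via Python's standard `sqlite3` module, which may not match the actual database.

```python
import sqlite3
from collections import defaultdict


def find_duplicates(conn):
    """Return {content_hash: [path, ...]} for every hash stored more than once."""
    query = """
        SELECT content_hash, path
        FROM files                      -- hypothetical table name
        WHERE content_hash IN (
            SELECT content_hash
            FROM files
            GROUP BY content_hash
            HAVING COUNT(*) > 1         -- keep only hashes with 2+ rows
        )
        ORDER BY content_hash, path
    """
    groups = defaultdict(list)
    for content_hash, path in conn.execute(query):
        groups[content_hash].append(path)
    return dict(groups)


def demo():
    """Build a tiny in-memory database and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT, content_hash TEXT)"
    )
    conn.executemany(
        "INSERT INTO files (path, content_hash) VALUES (?, ?)",
        [
            ("/a/x.txt", "aaa"),
            ("/b/x.txt", "aaa"),
            ("/a/y.txt", "bbb"),
            ("/c/z.txt", "ccc"),
            ("/c/z (2).txt", "ccc"),
        ],
    )
    return find_duplicates(conn)
```

Whether this actually beats the per-hash loop would need measuring on real data; an index on `content_hash` would likely matter more than the exact query shape.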

I’ll probably be implementing this next for testing.

But that’s a good lead in to my next point – displaying the duplicate files.

I was thinking a tree would be easiest. Then I could navigate the tree looking for duplicate files. But, wait. What if you want to see the location of the other duplicates? Have a second pane showing another tree with the folder highlighted? My use case for this is simple – I have files consolidated into a few folders, but if I’m removing duplicates, I’d want to remove the duplicates that haven’t been consolidated.

And, for that matter, how should I handle selecting the duplicate files? Manually is straightforward if I have the tree: go and check each file. But automatically? Have a right-click option for ‘Check all duplicate files in this folder and sub-folders’? How do I make sure that whatever I delete, I’ve still got one copy?
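One possible shape for the ‘check everything in this folder’ action, with a guard so that at least one copy of each file survives. This is only a sketch under assumptions: `groups` is a hypothetical dict mapping each content hash to its list of paths, paths are POSIX-style, and the function names are invented here. When every copy of a file lives under the chosen folder, it arbitrarily spares the first path in sorted order.

```python
from pathlib import PurePosixPath


def is_under(path, folder):
    """True if `path` sits somewhere below `folder` (sub-folders included)."""
    return PurePosixPath(folder) in PurePosixPath(path).parents


def select_in_folder(groups, folder):
    """Pick duplicates under `folder` for deletion, never all copies of a file."""
    selected = set()
    for paths in groups.values():
        inside = [p for p in paths if is_under(p, folder)]
        outside = [p for p in paths if not is_under(p, folder)]
        if not outside:
            # Every copy lives under this folder: spare the first one.
            inside = sorted(inside)[1:]
        selected.update(inside)
    return selected


def unsafe_groups(groups, selected):
    """Return the hashes for which every single copy is marked for deletion."""
    selected = set(selected)
    return [h for h, paths in groups.items() if all(p in selected for p in paths)]
```

Running `unsafe_groups` once more right before deleting would also catch manual selections that tick every copy of a file.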

And what happens if I’ve got 2 duplicate files in the same folder? How do I specify which file to select for deletion? Preserve the shortest filename? If one ends in a number and the other doesn’t, choose the numberless name on the assumption that the other was a copy+paste that Windows just renamed to file (2).ext? (GNOME I believe just does Copy of file.name, so that’s easy. Actually, I think Windows does the same if you copy & paste in the same directory, so, huh.)
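The naming rules floated above could be tried out as a simple sort key. This is a guess at one workable ordering, not a settled design: names that look like copies (a "Copy of " prefix, or a stem ending in a number, with or without a closing parenthesis) lose to names that don't, and the shortest filename wins after that. `looks_like_copy` and `pick_keeper` are hypothetical names.

```python
import re
from pathlib import PurePosixPath


def looks_like_copy(path):
    """Heuristic: 'Copy of x', 'x (2)' or 'x2' style names are probably copies."""
    stem = PurePosixPath(path).stem
    return stem.startswith("Copy of ") or re.search(r"\d\)?$", stem) is not None


def pick_keeper(paths):
    """Choose the path to keep; everything else in `paths` is a deletion candidate."""
    return min(
        paths,
        key=lambda p: (
            looks_like_copy(p),           # non-copies sort first (False < True)
            len(PurePosixPath(p).name),   # then the shortest filename
            p,                            # then a stable tie-break on the full path
        ),
    )
```

The number rule will misfire on files that legitimately end in digits (a year, say), so it probably needs to stay a suggestion that can be overridden rather than an automatic choice.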

SO MANY DECISIONS. D:

dedup, hard choices

This entry was posted on December 11, 2011, 8:47 pm by Kyle Lexmond and is filed under Programming.
