So i might have 10 copies of the same song and they are exactly the same. these are the ones i want to reduce down to 1.
Is the MP3 data the same, ie, byte-for-byte exact?
I had a similar problem in having to match FLAC files that had different metadata but same audio data. The audio data would be at a different offset in each file, sometimes thousands of bytes apart.
The initial solution was to open an audio file, grab a 100-byte block of audio data from the middle, and then search each and every other audio file for that block. This worked 100% but was no speed demon. And in your case, with millions of files, it wouldn't be viable.
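For what it's worth, here is a rough Python sketch of that brute-force check. It is not my original code, and the function names and the simple "middle of the whole file" offset are assumptions for illustration. It assumes the tags are small compared to the audio, so the middle of the file lands inside the audio data.

```python
from pathlib import Path

BLOCK_SIZE = 100  # size of the probe block, as described above


def middle_block(data: bytes, size: int = BLOCK_SIZE) -> bytes:
    """Return `size` bytes taken from the middle of `data`."""
    start = max(0, (len(data) - size) // 2)
    return data[start:start + size]


def shares_middle_block(data_a: bytes, data_b: bytes, size: int = BLOCK_SIZE) -> bool:
    """True if the middle block of A appears anywhere in B."""
    block = middle_block(data_a, size)
    # Refuse to match on a short block (file A smaller than the probe size).
    return len(block) == size and block in data_b


def files_share_audio(path_a: str, path_b: str) -> bool:
    """Hypothetical helper: read both files fully and compare."""
    return shares_middle_block(Path(path_a).read_bytes(), Path(path_b).read_bytes())
```

Every file has to be read and scanned once per candidate, which is why this gets slow quickly as the collection grows.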
The faster solution was to pull some fingerprints from each audio file, and then put them in a database (ok, text file in my case) and then search for duplicates much like you are doing with the SQL statement and COUNT clause.
The fingerprints were "a run of 9 bytes of ascending value", e.g. 200, 10, 35, 37, 58, 121, 143, 157, 193, 227, 181 (the run is the nine bytes from 10 to 227). I think 9 bytes gave on average 4 fingerprints per MB. I checked that all of the fingerprints in file A were matched in file B, and vice-versa, but what I found was that one fingerprint match between files was enough to identify those that had identical audio content. If the audio content had been altered in the slightest bit, then the MP3 encoded output would alter entirely.
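A minimal Python sketch of the fingerprint idea, in case it helps. This is a reconstruction rather than my actual code. It assumes "a run of 9" means a strictly ascending run of exactly 9 bytes, bounded on both sides by a non-ascending byte (as in the example above), and it uses an in-memory dict where I used a text file. The names `ascending_runs` and `group_by_fingerprint` are made up for illustration.

```python
from collections import defaultdict


def ascending_runs(data: bytes, run_len: int = 9) -> set:
    """Collect every maximal strictly ascending run of exactly `run_len` bytes."""
    fingerprints = set()
    start = 0  # index where the current ascending run began
    for i in range(1, len(data) + 1):
        # A run ends at the end of the data or when the next byte does not increase.
        if i == len(data) or data[i] <= data[i - 1]:
            if i - start == run_len:
                fingerprints.add(data[start:i])
            start = i
    return fingerprints


def group_by_fingerprint(files: dict) -> list:
    """Map fingerprint -> file names, and return the groups sharing a fingerprint.

    `files` maps a file name to its raw bytes (an assumption for this sketch).
    """
    index = defaultdict(set)
    for name, data in files.items():
        for fp in ascending_runs(data):
            index[fp].add(name)
    return [names for names in index.values() if len(names) > 1]


example = bytes([200, 10, 35, 37, 58, 121, 143, 157, 193, 227, 181])
print(ascending_runs(example))  # one fingerprint: the nine bytes 10 .. 227
```

Each returned group is a set of files that share at least one fingerprint, which plays the same role as your SQL query with COUNT. A file can show up in more than one group if it has several fingerprints, so you would still want to merge the groups before deleting anything.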