[Solved] Duplicate finder



  • What would be a good starting point for writing a program that finds files with similar content but different names?



  • If you want to find identical files, a hash such as MD5 or SHA-1 will help. If you want to find files that differ a little but are quite similar overall, you need something more complex, such as the Levenshtein distance (if you are comparing plain text) or some diff analysis (for example, if the diff is smaller than 10% of the smaller file's size, treat the files as similar enough).
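
To make the "similar but not identical" case concrete, here is a minimal sketch of the Levenshtein distance in plain C++. The names `levenshtein` and `similarEnough` are made up for illustration, and applying the 10% threshold to the edit distance (rather than to a diff) is an assumption. It compares bytes, not Unicode characters, and costs time proportional to the product of both lengths, so it only suits small text files.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Dynamic-programming Levenshtein distance that keeps a single row
// of the distance matrix D in memory.
std::size_t levenshtein(const std::string &a, const std::string &b)
{
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        row[j] = j;                          // distance from "" to the first j chars of b

    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];           // D[i-1][j-1]
        row[0] = i;                          // distance from the first i chars of a to ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t above = row[j];                       // D[i-1][j]
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = std::min({ above + 1,        // deletion
                                row[j - 1] + 1,   // insertion
                                diag + cost });   // substitution or match
            diag = above;
        }
    }
    return row[b.size()];
}

// Hypothetical rule: the texts count as similar if the edit distance
// is at most 10% of the shorter text's length.
bool similarEnough(const std::string &a, const std::string &b)
{
    const std::size_t shorter = std::min(a.size(), b.size());
    return levenshtein(a, b) * 10 <= shorter;
}
```

With this rule, two 20-character strings that differ in one character count as similar, while "abc" and "xyz" do not.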



  • Moved to Brainstorm forum, as it has nothing to do with Qt programming as such (initially moved to C++ Gurus, but it's not C++ related either, sorry).



  • Volker, I also wondered where it should be moved, but I forgot about Brainstorm and left it for someone with a better idea. Thanks.



  • Maybe I've misunderstood something, but how can MD5 or SHA-1 help me find duplicates? I thought they were cryptographic algorithms for government use :)



  • If two hashes are equal, it is very likely that the two files they were computed from have bitwise identical content (although this is not guaranteed: the set of all possible file contents is infinite, whereas the number of possible hash values is finite, so there cannot be a one-to-one mapping between them). You can store the hashes in a database and search for duplicates there. For "similar" files you'll have to go with Levenshtein or other means, as Denis stated.



  • But how can i store file info in a hash ?



  • You can pass its contents through a hash function, and you will have its hash at the end.



  • [quote author="alex.dadaev" date="1295897294"]But how can i store file info in a hash ?[/quote]

    You cannot. This "web page":http://lmgtfy.com/?q=hash+function has some explanations for you.



  • Okay :)



  • Is there any way to use QHash methods in QCryptographicHash?
    I'd like to build a comparison table for the files that I hash.


  • QHash is a hash table, a data structure optimized for random access based on a key value.

    QCryptographicHash is used to calculate cryptographic hash values from input data. They are completely different things:-)

    So, no, you cannot use QHash's methods in QCryptographicHash. Just use the result of a QCryptographicHash as a key in a QHash and you should be set. Just make sure to reset the QCryptographicHash whenever you are done with a file, or you will not get the same hash values for the same files (since the second one will still have all the data of the first one "prepended").
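
The two points above (digest as table key, reset between files) can be sketched without Qt. This is only an illustration under assumptions: `Fnv1a64` is a small non-cryptographic stand-in for QCryptographicHash, `std::unordered_map` stands in for QHash, and `groupByDigest` plus the in-memory "files" are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// FNV-1a constants for the 64-bit variant.
const std::uint64_t kFnvOffsetBasis = 14695981039346656037ULL;
const std::uint64_t kFnvPrime = 1099511628211ULL;

// Stand-in for QCryptographicHash: an incremental 64-bit FNV-1a hasher.
// NOT cryptographic; it only keeps this sketch free of dependencies.
class Fnv1a64
{
public:
    void addData(const char *data, std::size_t len)
    {
        for (std::size_t i = 0; i < len; ++i) {
            state_ ^= static_cast<unsigned char>(data[i]);
            state_ *= kFnvPrime;
        }
    }
    void reset() { state_ = kFnvOffsetBasis; }
    std::uint64_t result() const { return state_; }

private:
    std::uint64_t state_ = kFnvOffsetBasis;
};

// Maps a digest to the names of all inputs that produced it.
using DigestTable = std::unordered_map<std::uint64_t, std::vector<std::string>>;

// Each element pairs a (hypothetical) file name with its contents.
DigestTable groupByDigest(const std::vector<std::pair<std::string, std::string>> &files)
{
    DigestTable table;
    Fnv1a64 hasher;
    for (const auto &f : files) {
        hasher.reset();   // otherwise the previous file's bytes stay mixed into the state
        hasher.addData(f.second.data(), f.second.size());
        table[hasher.result()].push_back(f.first);
    }
    return table;
}
```

Any table entry whose list holds more than one name is a candidate set of duplicates. With a real cryptographic hash the candidates are almost certainly identical; with a weak hash like this one, a byte-wise check of the candidates would be advisable.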



  • Yes, that's possible. You can use the following function as a start for your project:

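    @
    #include <QCryptographicHash>
    #include <QFile>
    #include <QString>

    #define MY_SHA1_BUFFER_SIZE 4096

    // Returns the SHA-1 of the file's contents as a hex string,
    // or an empty string if the file cannot be opened or read.
    QString getSha1HashFromFile( const QString &fn )
    {
        QCryptographicHash ch( QCryptographicHash::Sha1 );
        QFile file( fn );
        if( !file.open( QIODevice::ReadOnly ) )
            return QString();

        // Feed the file to the hash in fixed-size blocks.
        char buf[MY_SHA1_BUFFER_SIZE];
        while( !file.atEnd() ) {
            qint64 read = file.read( buf, MY_SHA1_BUFFER_SIZE );
            if( read < 0 )   // read error
                return QString();
            ch.addData( buf, read );
        }
        file.close();
        return QString( ch.result().toHex() );
    }
    @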



  • How can I make a QString from a QFileInfoList?
    Is there any way to do that?



  • QFileInfo::path() ???



  • No.

    QFileInfoList is a typedef for QList<QFileInfo>.

    You know how many properties a QFileInfo object describing a single file has, don't you?

    If you want a single string from this big bunch of information, you will have to construct it yourself.



  • OK, I forgot to add the iteration with for(...)...



  • But how can I construct the path of a single file?



  • Read the docs on "QFileInfo":http://doc.qt.nokia.com/stable/qfileinfo.html - we did it too. Everything you need is documented there. Yes, it takes some 5 minutes to read it all through, but if you're too lazy we can't help you. If you have concrete questions or problems with any of the methods, ask them.



  • If you read the documentation, you would find it ...

    @
    QFileInfoList list;
    for(int i = 0; i < list.size(); ++i)
    {
        // full path including the file name
        QString filePath = list[i].absoluteFilePath();
    }
    @



  • I've already done it myself :) The reason I ask such dumb questions is that I'm just starting out with Qt and with programming in general, and I just don't want to make stupid mistakes.
    @
    QString path[list.size()];
    for (int i = 0; i < list.size(); ++i) {
        QFileInfo fileInfo = list.at(i);
        path[i] = fileInfo.path();
    }
    @



  • I would suggest using a QStringList instead of QString path[xx];



  • First: Then show us your code and we will comment on it. Don't ask dumb questions that are clearly answered in the very good API docs the Trolls have created for us. It is very likely that you will not get any answer (apart from "RTFM"). We all put a valuable amount of time into DevNet to answer questions, and with that silly game you are stealing this time!

    Second: Do not use C-Style arrays in C++ if you are not absolutely forced to. Use the fine "Container Classes":http://doc.qt.nokia.com/stable/containers.html of Qt (or the equivalents of C++ standard library or boost). In your case "QStringList":http://doc.qt.nokia.com/stable/qstringlist.html is what you want.

    Third: C-Style arrays of unknown size at compile time are not supported by all compilers and therefore not portable. I leave you to google or bing to search for the details.



  • Don't you think a hash computation is a bit of overkill for detecting duplicate files?
    Just read your files' contents into memory blocks and compare them with memcmp. If you may have big files, it would be wiser to compare them block by block rather than the whole files at once.

    Upd. function name



  • memcpy copies in memory and does not compare them.

    AFAIK, he wanted to search for duplicates, so hashes would be faster. You don't want to do a full comparison of every file with every other file...



  • Gerolf, thank you for the function name correction.
    And yes, you are right about the question. I misinterpreted the OP's goal :)



  • What do you think, is this correct?
    @
    while(it.hasNext())
    {
    it.next();
    if(it.peekPrevious().key()==it.peekNext().key())
    std::cout<<it.peekPrevious().value()<<"="
    <<it.peekNext().value()<<std::endl;
    }
    @



  • No it is not correct.

    It compares only adjacent entries in your container.

    If your container is a map (QMap or QHash) and you use insert() to populate it then you will get no duplicates at all, since every key occurs only once, hence the keys at different positions are all distinct.

    You must use insertMulti() and values() to get a list of all entries with the same hash value. Or use QMultiMap/QMultiHash with the aforementioned methods.
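
The same idea can be sketched with the standard library, where `std::multimap` plays the role of a multi-map filled via insertMulti() and `equal_range()` plays the role of values(key). The function name `duplicateGroups` and the placeholder keys in the usage note are assumptions, not real digests.

```cpp
#include <map>
#include <string>
#include <vector>

// key: digest as a string, value: file name
using MultiTable = std::multimap<std::string, std::string>;

// For every key that occurs more than once, returns the list of
// file names stored under it.
std::vector<std::vector<std::string>> duplicateGroups(const MultiTable &table)
{
    std::vector<std::vector<std::string>> groups;
    for (auto it = table.begin(); it != table.end(); ) {
        const auto range = table.equal_range(it->first);   // all entries with this key
        std::vector<std::string> names;
        for (auto e = range.first; e != range.second; ++e)
            names.push_back(e->second);
        if (names.size() > 1)
            groups.push_back(names);                       // more than one file per digest
        it = range.second;                                 // jump to the next distinct key
    }
    return groups;
}
```

For example, after `table.emplace("d41d", "empty1.txt")`, `table.emplace("9e10", "fox.txt")` and `table.emplace("d41d", "empty2.txt")`, the function returns a single group holding the two "d41d" names. Unlike a comparison of adjacent iterator positions, it does not rely on how the caller walks the container.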



  • Yes, you're right, thanks



  • i can't understand why it's not working :(
    @
    while(it.hasNext())
    {
    it.next();
    if(it.key()==it.peekNext().key()) {
    std::cout << "i've got you" << std::endl;
    }
    }
    @

    PS: i've used insertMulti() to add item to hash, as you said



  • maybe like this?
    @int compare_flag;

    while(it.hasNext())
    {
        it.next();
        compare_flag = QString::compare(it.key(),it.peekNext().key(),Qt::CaseSensitive);
                       if(compare_flag==0) {
            std::cout << "i've got you" << std::endl;
        }
    }@


  • The keys in a (hash) map are always distinct. You will never find two identical keys so your comparison will never be true.

    And even if you had identical keys in your container you would only find them if they are adjacent in the list.

    But I'm starting to have a kind of déjà vu...

    To make things clearer for us to understand: You do have a multi hash/multi map. What do you put in there and what do you expect to come out?



  • @QHash<QString,int> FilesHash;@
    The QString key is the MD5 hash.
    The int value is just the number of the file.

    As output, I want to see the names of the similar files.



  • OK, let's make things clearer step by step. It seems you should familiarize yourself with the concept of a map.

    A map (QHash is one) stores values associated with keys. Every key only exists once in the map - I wrote that several times, let's prove it:

    @
    QHash<QString, int> myHash;
    myHash.insert("abc", 2);
    myHash.insert("def", 3);
    myHash.insert("abc", 5);

    qDebug() << "hash keys:" << myHash.keys();

    QHash<QString, int> myMultiHash;
    myMultiHash.insertMulti("abc", 2);
    myMultiHash.insertMulti("def", 3);
    myMultiHash.insertMulti("abc", 5);

    qDebug() << "multi hash keys:" << myMultiHash.keys();
    @

    What will the output be?

    What will happen if you compare every key with every other?



  • insertMulti allows you to store items with identical keys



  • without overwriting them



  • what do you think about it?
    @
    bool ok;
    QHashIterator<QString,int> it(FilesHash);
    QHashIterator<QString,int> begin(FilesHash);
    QHashIterator<QString,int> end(FilesHash);
    while(it.hasNext()) {
    it.next();
    begin = qLowerBound(FilesHash.begin(), FilesHash.end(), it.key());
    end = qUpperBound(begin, FilesHash.end(), it.key());
    iter = begin;
    while(iter!=end) {
    if(*i=*it) {
    ok = true;
    } else { ok = false; }
    }
    }
    @



  • Why can't I do it like this?
    @QHashIterator<QString,int> iter(FilesHash);

    while(it.hasNext()) {
        it.next();
        iter = qBinaryFind(FilesHash.begin(), FilesHash.end(), it.key());
    }@


  • please note that the problem has been solved :)



  • you can do it on your own:
    go to your first post and click edit :-)
    and edit the title.

