add Space character every 3 letters



  • Hi,

    I am making a DNA viewer that displays a sequence the way a hexadecimal viewer does.
    So I would like to add a space every 3 letters in my QByteArray.

        QByteArray seq = "ACGTATAGTACGTACG";
        seq = transform(seq, 3);
        // seq is now "ACG TAT AGT ACG TAC G"
    

    What is the most efficient way to do that? QString and QByteArray have many methods.



  • @dridk2

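    QByteArray seq = "ACGTATAGTACGTACG";

    const int originalSize = seq.size(); // seq grows as spaces are inserted
    int cnt = 0;                         // number of spaces inserted so far

    // Visit every group boundary of the original sequence; each earlier
    // insert shifts the following boundaries right by one.
    for (int i = 3; i < originalSize; i += 3) {
        seq.insert(i + cnt++, ' ');
    }

    qDebug() << seq; // "ACG TAT AGT ACG TAC G"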
    

  • Qt Champions 2016

    I don't know if it's efficient enough, but it's certainly a one-liner:

    QString split = QString("ACGTATAGTACGTACG").replace(QRegularExpression("(.{3})"), "\\1 ");
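
    One detail worth checking with this pattern: when the length is an exact multiple of 3, every group matches, so the result ends with a trailing space. The sketch below shows that behaviour, plus a lookahead variant that only adds a space when another character follows. It uses std::regex from the standard library rather than QRegularExpression (an assumption, made so it compiles without Qt), and both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: same idea as the one-liner, "$1 " appends a space
// after every complete group of three characters.
std::string groupNaive(const std::string& seq)
{
    static const std::regex re("(.{3})");
    return std::regex_replace(seq, re, "$1 ");
}

// Hypothetical variant: the lookahead (?=.) requires one more character
// after the group, so no space is appended at the very end.
std::string groupNoTrailingSpace(const std::string& seq)
{
    static const std::regex re("(.{3})(?=.)");
    return std::regex_replace(seq, re, "$1 ");
}
```

    Lookahead is standard PCRE syntax, so the second pattern should carry over to QRegularExpression, though that has not been tested here.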
    


  • @kshegunov
    Your code is small and effective.
    Do my variant and yours probably take about the same time?



  • Since you are using QByteArray (i.e. 1 character is 1 byte), you can probably optimise it using std::memcpy on the data() pointer.

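    #include <QByteArray>
    #include <cstring>  // std::memcpy
    #include <iterator> // std::distance

    QByteArray transform(const QByteArray& seq, int span){
        if(seq.isEmpty() || span<=0) return QByteArray();
        const int oldArrSize = seq.size();
        // Pre-fill with spaces: one after each full group, except when a full group ends the sequence.
        QByteArray result(oldArrSize  + (oldArrSize /span) - (oldArrSize %span==0),' ');
        auto sourceIter = seq.cbegin();
        auto destIter = result.data();
        const auto srcEnd=seq.cend();
        // Copy one group per iteration; the last group may be shorter than span.
        for(int dstnc = std::distance(sourceIter,srcEnd);dstnc>0;dstnc-=span){
            std::memcpy(destIter,sourceIter,qMin(dstnc,span));
            destIter+=span+1;
            sourceIter+=span;
        }
        return result;
    }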
    

    EDIT:

    The code I had before corrupted memory when seq.size() % span != 0.


  • Qt Champions 2016

    @Taz742 said in add Space character every 3 letters:

    Do my variant and yours probably take about the same time?

    I'd even speculate mine may be faster, even though it uses a regular expression. The problem with your piece of code is that at each insert of a new space you're copying the data after that position - the data has to be shifted, which might be rather heavy. The regular expression code (assuming it can optimize the expression well internally) can do it with a single memory allocation. In fact your code can be modified so it uses one allocation, by just using a resulting byte array and copying the data in chunks of 3 bytes, then setting a space, and then repeating.

    Edit: My view hadn't updated, basically what @VRonin wrote is what I was talking about.
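
    The single-allocation idea described above (copy the data in chunks of 3 bytes into a result buffer, with a space between chunks) can be sketched as follows. This is a minimal version in plain standard C++, using std::string instead of QByteArray as an assumption so that it compiles without Qt; insertSpaces is a hypothetical name, not something from this thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies seq into a new string, putting one space
// between consecutive groups of span characters.
std::string insertSpaces(const std::string& seq, std::size_t span)
{
    // Nothing to group: return the input unchanged.
    if (seq.empty() || span == 0)
        return seq;

    // The final length is known up front: all characters plus one
    // separator between groups, so a single reservation is enough.
    const std::size_t groups = (seq.size() + span - 1) / span;
    std::string result;
    result.reserve(seq.size() + groups - 1);

    for (std::size_t pos = 0; pos < seq.size(); pos += span) {
        if (pos != 0)
            result.push_back(' ');
        // append(str, pos, count) clamps count to the characters that remain,
        // so a shorter last group is handled without extra checks.
        result.append(seq, pos, span);
    }
    return result;
}
```

    Because nothing is ever inserted in the middle, no bytes have to be shifted; each input character is copied exactly once.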


  • Qt Champions 2016

    Hi
    Quick test. It might have logic issues; it's just for fun.

    using namespace std::chrono;
    void MainWindow::on_pushButton_clicked() {
      high_resolution_clock::time_point t1 = high_resolution_clock::now();
    
      for (int var = 0; var < 10000; ++var) {
        int cnt = 0;
        QByteArray seq = "ACGTATAGTACGTACG";
        for(int i = 3; i < seq.size() - 3; i++) {
          if(i % 3 == 0) {
            seq.insert(i + cnt++, ' ');
          }
        }
      }
      high_resolution_clock::time_point t2 = high_resolution_clock::now();
    
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
    
      qDebug() << "time: " << duration ;
    }
    
    void MainWindow::on_pushButton_2_clicked() {
    
      high_resolution_clock::time_point t1 = high_resolution_clock::now();
    
      for (int var = 0; var < 10000; ++var) {
        QString split = QString("ACGTATAGTACGTACG").replace(QRegularExpression("(.{3})"), "\\1 ");
      }
      high_resolution_clock::time_point t2 = high_resolution_clock::now();
    
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
    
      qDebug() << "time QRegularExpression: " << duration ;
    }
    

    Result (in microseconds, for 10000 iterations per click):
    time: 9001
    time: 8002
    time: 8001
    time: 8004
    time: 8001
    time: 8001
    time: 7995
    time: 8001
    time: 8001
    time: 8001
    time QRegularExpression: 161033
    time QRegularExpression: 162033
    time QRegularExpression: 161032
    time QRegularExpression: 161032
    time QRegularExpression: 162032
    time QRegularExpression: 162032
    time QRegularExpression: 162033



  • @mrjj Was that debug or release mode?


  • Qt Champions 2016

    @VRonin
    Both were debug builds (but I just ran them, not under the debugger).
    Do you think that affects the results unevenly?
    I will try in release just to be sure.



  • @mrjj I think it's fair to optimize a bit:

    QRegularExpression re{"(.{3})"};
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    ...
    ...QString("ACGTATAGTACGTACG").replace(re, "\\1 ");
    

  • Qt Champions 2016

    @Eeli-K
    Yes, it is fairer to take the construction of "re" out of the timed section.
    I will try that as well.


  • Qt Champions 2016

    Designing benchmarking tests isn't exactly trivial, but I'd suggest something too (probably the raw insert will outperform the rx, but still for the sake of argument):

    Don't use the same fixed-size input string; use input that ranges from very short to very long. And do the benchmarking in batches, e.g. run the same benchmark at least 30-40 times and record the time for each run. Then you get data that can be put into a histogram and analyzed statistically.
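
    A rough sketch of such a harness, in plain standard C++ so it does not depend on Qt. Everything here (Sample, runBenchmark, the generated ACGT input) is a hypothetical illustration of the suggestion above, not code that was run for this thread.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One timing sample: input length, output length and elapsed time.
struct Sample {
    std::size_t inputLength;
    std::size_t outputLength; // kept so the result of fn is actually used
    long long microseconds;
};

// Hypothetical harness: runs fn on a generated ACGT sequence of each
// requested length, `runs` times per length, and records every run
// separately so the samples can be put into a histogram afterwards.
std::vector<Sample> runBenchmark(
    const std::function<std::string(const std::string&)>& fn,
    const std::vector<std::size_t>& lengths,
    int runs)
{
    using Clock = std::chrono::steady_clock;
    static const char bases[] = {'A', 'C', 'G', 'T'};

    std::vector<Sample> samples;
    for (std::size_t length : lengths) {
        // Build the input outside the timed region.
        std::string input(length, 'A');
        for (std::size_t i = 0; i < length; ++i)
            input[i] = bases[i % 4];

        for (int run = 0; run < runs; ++run) {
            const auto start = Clock::now();
            const std::string output = fn(input);
            const auto stop = Clock::now();

            const auto elapsed =
                std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
            samples.push_back({length, output.size(),
                               static_cast<long long>(elapsed.count())});
        }
    }
    return samples;
}
```

    Passing each candidate as fn with, say, lengths {16, 1000, 100000} and 30 runs gives one sample per run and per length. For very short inputs a single call may take less than a microsecond, so each timed run would probably need to repeat the call many times.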


  • Qt Champions 2016

    @kshegunov
    Yep, varying the input lengths might alter the result significantly, so I will try that too.



  • Oh, I was not notified by email of all your answers! Thanks a lot! I will try it.
    By the way, you can join the team for this small project!
    https://github.com/labsquare/cuteFasta
    Preview on twitter : https://twitter.com/labsquare/status/884146483406266368

