Normalizing PCM Audio
-
Hi guys,
This is more of an audio question and not directly related to Qt, but I'm not sure if I'm making a mistake somewhere in my logic. I posted a similar question on Stack Overflow as well but couldn't find the answer there. Link here.
Basically I'm using a QIODevice and QAudioInput to read and process audio data (I started from a Qt example and modified it a little). The issue is that I'm trying to normalize the signal coming from the Windows Stereo Mix and Microphone inputs and I'm not sure where the midpoint of the signal should be. With 16-bit WAV files the signal has the expected dynamic range from -32,768 to 32,767 with 0 as the midpoint (when no signal is coming in), but with Stereo Mix or Microphone the 8-bit signed signal varies from -128 to 127 with the midpoint sitting at -128, which is not consistent. Here is the code that processes the audio data:

AudioIODevice::AudioIODevice(QObject *parent, const QAudioFormat &deviceFormat) :
    QIODevice(parent),
    format(deviceFormat)
{
    int sampleSize = format.sampleSize();

    switch (format.sampleType()) {
    case QAudioFormat::UnSignedInt:
        minValue = 0.0f;
        maxValue = static_cast<float>(std::pow(2,sampleSize) - 1);
        break;
    case QAudioFormat::SignedInt:
        minValue = static_cast<float>((std::pow(2,sampleSize)/2) * (-1));
        maxValue = static_cast<float>((std::pow(2,sampleSize)/2) - 1);
        break;
    case QAudioFormat::Float:
        break;
    default:
        break;
    }
}

qint64 AudioIODevice::writeData(const char *data, qint64 len)
{
    unsigned int sampleBytes = format.sampleSize() / 8;                 //Number of bytes for each interleaved channel sample
    unsigned int combSampleBytes = format.channelCount() * sampleBytes; //Number of bytes for all channel samples
    unsigned int numSamples = len / combSampleBytes;                    //Total number of samples

    if(format.sampleSize() % 8 != 0 || len % sampleBytes != 0)
        return -1;

    //Prepare our output buffer
    buffer.clear();
    buffer.resize(numSamples,0);

    const unsigned char* uData = reinterpret_cast<const unsigned char*>(data);

    for(unsigned int i = 0; i < numSamples; i++)
    {
        float monoValue = minValue;
        float value = minValue;

        //Process data for all interleaved samples
        for(unsigned int j = 0; j < format.channelCount(); j++)
        {
            switch (format.sampleType()) {
            case QAudioFormat::UnSignedInt:
                switch(format.sampleSize()) {
                case 8:
                    value = *reinterpret_cast<const quint8*>(uData);
                    break;
                case 16:
                    value = (format.byteOrder()==QAudioFormat::LittleEndian)?
                                (qFromLittleEndian<quint16>(*reinterpret_cast<const quint16*>(uData))):
                                (qFromBigEndian<quint16>(*reinterpret_cast<const quint16*>(uData)));
                    break;
                case 32:
                    value = (format.byteOrder()==QAudioFormat::LittleEndian)?
                                (qFromLittleEndian<quint32>(*reinterpret_cast<const quint32*>(uData))):
                                (qFromBigEndian<quint32>(*reinterpret_cast<const quint32*>(uData)));
                    break;
                default:
                    break;
                }
                break;
            case QAudioFormat::SignedInt:
                switch(format.sampleSize()) {
                case 8:
                    value = *reinterpret_cast<const qint8*>(uData);
                    break;
                case 16:
                    value = (format.byteOrder()==QAudioFormat::LittleEndian)?
                                (qFromLittleEndian<qint16>(*reinterpret_cast<const qint16*>(uData))):
                                (qFromBigEndian<qint16>(*reinterpret_cast<const qint16*>(uData)));
                    break;
                case 32:
                    value = (format.byteOrder()==QAudioFormat::LittleEndian)?
                                (qFromLittleEndian<qint32>(*reinterpret_cast<const qint32*>(uData))):
                                (qFromBigEndian<qint32>(*reinterpret_cast<const qint32*>(uData)));
                    break;
                default:
                    break;
                }
                break;
            case QAudioFormat::Float:
                break;
            default:
                break;
            }

            monoValue = std::max(value,monoValue);
            uData += sampleBytes;   //Get data for the next sample
        }

        buffer[i] = (monoValue - minValue) / (maxValue - minValue); //Normalize the value to [0-1]
    }

    emit bufferReady();
    return len;
}
Should I be expecting 0 as the midpoint of a signed 8-bit PCM signal or is there no standard for this and I have to figure out another way?
Cheers!
-
Ok, so it looks like you're asking about a complete sound sample and not "on the fly", which is good... because you really cannot normalize sound "on the fly". Keep in mind that sound energy is non-linear: it propagates in 3 dimensions, so the dB scale is logarithmic, not linear. The mathematical midpoint will depend on the PCM format you are using, signed vs unsigned, but as I mentioned, 128 as the midpoint of an unsigned [0..255] range is not an auditory midpoint. You should probably map your midpoint on a logarithmic scale within the available range, and be careful about signed conversions. I never use signed data to represent PCM data because the electronics of the sound card are always at some positive voltage level.
-
@Kent-Dorfman Thanks a lot for your response! Yes, the writeData method gives me a buffer of data which has a format already determined by the QAudioFormat of the device supplying it (including unsigned / signed).
I understand the need to convert to a logarithmic scale and I do this at a later stage when I take the FFT of the data, but I need a reference amplitude for that conversion and I'm not sure what I should use.
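For reference, this is the kind of conversion I mean, assuming full scale (1.0 after normalization) is the reference amplitude; the helper name and the -96 dB floor are just placeholders I picked:

#include <algorithm>
#include <cmath>

// Convert a sample that has already been normalized to [-1, 1] into dBFS.
// The reference is full scale (1.0), so 0 dBFS is the loudest possible
// sample and quieter samples come out negative. The floor value is an
// arbitrary choice to avoid log10(0).
float toDbfs(float normalizedSample)
{
    const float floorDb = -96.0f;               // practical floor for 16-bit audio
    float magnitude = std::fabs(normalizedSample);
    if (magnitude < 1e-9f)
        return floorDb;
    return std::max(floorDb, 20.0f * std::log10(magnitude));
}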
One thing I noticed is that you can set the sample type (unsigned / signed) of the QAudioFormat, but whether it actually takes effect depends on whether that setting is supported by the device. I'll try messing around with that to see if it does something useful.
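Something like this is what I have in mind (a sketch using the Qt 5 QAudioDeviceInfo API; the rate, channel count and size are just example values, and the device may still substitute its own type via nearestFormat):

#include <QAudioDeviceInfo>
#include <QAudioFormat>
#include <QDebug>

// Ask the default input device for signed 16-bit samples; if that exact
// format is not supported, fall back to whatever the device says is closest.
QAudioFormat requestSignedFormat()
{
    QAudioFormat requested;
    requested.setSampleRate(44100);
    requested.setChannelCount(2);
    requested.setSampleSize(16);
    requested.setCodec("audio/pcm");
    requested.setByteOrder(QAudioFormat::LittleEndian);
    requested.setSampleType(QAudioFormat::SignedInt);

    QAudioDeviceInfo device = QAudioDeviceInfo::defaultInputDevice();
    if (!device.isFormatSupported(requested)) {
        qDebug() << "Requested format not supported, using nearest format";
        requested = device.nearestFormat(requested);
    }
    return requested;
}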
-
Hi,
You might want to check out DSPfilters. It might offer what you need.
-
@SGaist Thanks for that link! The library looks cool, I'm going through the source to see how they did things but I'm trying to build my own little audio engine.
Do you know by any chance if there is a set standard for how an audio signal is supposed to be processed, as in the midpoint, min and max levels?
-
@rtavakko said in Normalizing PCM Audio:
Do you know by any chance if there is a set standard for how an audio signal is supposed to be processed, as in the midpoint, min and max levels?
I am sorry but I am not sure I understand exactly what you are looking for.
-
@SGaist The issue I'm stuck on is that I'm not sure where a 'silent' audio signal represented as signed int data should sit.
For example, 16-bit WAV files are in the range -32,768 to 32,767 and 0 is the midpoint, but the 8-bit signal that I get from a microphone or other live feeds is in the -128 to 127 range with -128 being the midpoint (silent signal).
So I'm stuck trying to find a universal way to normalize audio to the 0 - 1 range with 0.5 as the midpoint. Everything I've read so far suggests that 0 should always be the midpoint for signed audio, but I don't know for sure at this point.
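To put it in code, this is the mapping I'm after, assuming the convention where 0 means silence for signed data (the function name is just for illustration):

#include <QtGlobal>

// Map a signed 8-bit sample to [0, 1] assuming 0 is silence, -128 is the
// minimum and 127 is the maximum; a silent signal then lands near 0.5.
// For an unsigned 8-bit format the midpoint would be 128 instead and the
// mapping would simply be sample / 255.0f.
float normalizeSigned8(qint8 sample)
{
    return (static_cast<float>(sample) + 128.0f) / 255.0f;
}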
-
For PCM audio a silent signal is represented by a continuous stream of values that are all the same. It is the changes in amplitude that form the sound waveform. You can have silence at any output amplitude if the sample values don't change; obviously, any change to those values creates a waveform. So you cannot look for silence the way you are thinking.
If you use 8 kHz as your sample rate and create a u16 stream of shorts such as
16384,0,16384,0,16384,0... then you will get a loud 8 kHz (harsh) tone.
1000,0,1000,0,1000,0... gives the same harsh 8 kHz tone, but at a greatly diminished volume.
Any stream of x,x,x,x,x,x,x... will create silence.
Download and play with Audacity, and programmatically create audio files to experiment with different effects: sine, square, sawtooth waveforms of different amplitudes.
EDIT - actually I screwed up. If the sample rate is 8 kHz, then you can only reproduce frequencies up to 4 kHz, since it's the change that forms the wave, not the data points themselves.
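Something like this is all you need to make a raw test file to experiment with (a sketch; the file name, frequency and amplitude are arbitrary, and the byte order is whatever the host machine uses):

#include <QFile>
#include <QIODevice>
#include <QtGlobal>
#include <cmath>
#include <vector>

// Write one second of a sine tone as raw signed 16-bit PCM. Audacity can
// load this via File > Import > Raw Data with matching settings.
void writeSineTone()
{
    const int sampleRate = 8000;        // samples per second
    const double frequency = 1000.0;    // Hz; must stay below sampleRate / 2
    const double amplitude = 0.25;      // fraction of full scale
    const double pi = 3.14159265358979323846;

    std::vector<qint16> samples(sampleRate);
    for (int n = 0; n < sampleRate; ++n)
        samples[n] = static_cast<qint16>(amplitude * 32767.0 *
                                         std::sin(2.0 * pi * frequency * n / sampleRate));

    QFile file("tone_1khz.raw");
    if (file.open(QIODevice::WriteOnly))
        file.write(reinterpret_cast<const char*>(samples.data()),
                   static_cast<qint64>(samples.size() * sizeof(qint16)));
}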
-
@Kent-Dorfman I understand the concept but I'm still not sure how I would go about normalizing the signal. I'm still processing the signal as it comes in, as an instantaneous set of values. Do I need to compare each value in the array to the previous one and set it to the lower limit of a dBFS scale if they are equal?
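Just to make sure I follow, is the idea to do something like this on each incoming block once the samples are centered on 0 and scaled to [-1, 1]? (A rough sketch; the -60 dBFS threshold is an arbitrary guess on my part.)

#include <cmath>
#include <vector>

// Measure the level of a block of samples via RMS and treat the block as
// silent if it falls below a threshold in dBFS.
bool isSilent(const std::vector<float> &block)
{
    if (block.empty())
        return true;

    double sumSquares = 0.0;
    for (float s : block)
        sumSquares += static_cast<double>(s) * s;

    double rms = std::sqrt(sumSquares / block.size());
    double dbfs = (rms > 1e-9) ? 20.0 * std::log10(rms) : -120.0;
    return dbfs < -60.0;
}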
-
After a few months of searching for a definitive answer to this topic, I've reached the conclusion that my original assumption was correct. This page describes how to determine the midpoint of a standard PCM audio signal:
https://gist.github.com/endolith/e8597a58bcd11a6462f33fa8eb75c43d
For example, an 8-bit signed PCM signal has these ranges:
Min: -128
Max: 127
Midpoint: 0
As to why the signal I'm getting from my soundcard sits at -128 when there is no sound, I'm going to assume that this is related to a driver problem, or that this particular piece of hardware does not follow the PCM standard.
Converting to the logarithmic scale is, in my understanding, not related to this issue, because you should be able to normalize the signal in the time domain, even though you will most likely need to convert it to the log scale eventually if you are doing anything in the frequency domain (e.g. FFT).
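For anyone hitting the same -128 resting level, this is the kind of workaround I have in mind: estimate the DC offset of each block and remove it before normalizing, so that silence lands at 0.5 either way. (Just a sketch for the 8-bit signed case; the function is not part of the code posted above.)

#include <QtGlobal>
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Estimate the DC offset of a block of raw 8-bit signed samples and remove it
// before mapping to [0, 1]. The mean is roughly 0 for a well-behaved device
// and roughly -128 for the misbehaving one described above, so silence ends
// up near 0.5 in both cases.
std::vector<float> normalizeWithDcRemoval(const std::vector<qint8> &raw)
{
    std::vector<float> out(raw.size());
    if (raw.empty())
        return out;

    double mean = std::accumulate(raw.begin(), raw.end(), 0.0) / raw.size();

    for (std::size_t i = 0; i < raw.size(); ++i) {
        double centered = static_cast<double>(raw[i]) - mean;   // move the midpoint to 0
        double normalized = (centered + 128.0) / 255.0;         // map [-128, 127] to [0, 1]
        out[i] = static_cast<float>(std::min(1.0, std::max(0.0, normalized)));
    }
    return out;
}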
If anyone has any input, please feel free to add it.