Thursday, March 15, 2012

Fast Hashing of Variable-length text strings

The following Hash algorithm based on a research paper written by Peter Pearson (a scientist at Lawrence Livermore National Lab) has an ability to generate unique hash numbers for strings with lengths no longer than 256 bytes. That means, we can have up to 256 buckets in a hash table, but it guarantees the uniqueness (free of hash-collision), thus no need to have linked-list in buckets.

The beauty of this algorithm is that it is designed to work on small microprocessors or microcontrollers, which sometimes don't have luxury to do complex arithmetics in their Instruction set (or, require more instructions).  Instead, the following algorithm uses XOR and table lookup which should be easy to do in small CPUs.  On more modern and high-end CPUs, XOR operation translated directly to mnemonic XOR in machine-code (a single instruction) and done even in almost wire-speed.

One of the use I can think of is for creating MAC table for virtual bridging.  An Ethernet MAC address takes only 6 octets, so this algorithm can be used to lookup 256-bucket MAC table.  The hash-key is MACaddr, while the output is port.

For example:

A MAC table structure might look like below:
+-------------------+---------+------+
|      MACAddr      +  port   |  Age |
+-------------------+---------+------+
| XX:XX:XX:XX:XX:XX |    1    |  6   |
| YY:YY:YY:YY:YY:YY |    2    |  7   |
| ZZ:ZZ:ZZ:ZZ:ZZ:ZZ |    3    |  102 |
| AA:BB:BB:BB:BB:BB |    4    |  33  |
| AA:AA:AA:AA:AA:AA |    5    |  76  |
+-------------------+---------+------+

Learning:
port = PearsonHash(SourceMacAddr);
MACTable[port].addr = SourceMacAddr;
MACTable[port].port = port;

Forwarding:
inspect destination MAC address from incoming ethernet frame

//Flooding
if (DestMacAddr == Broadcast)
      Send to all ports, except the original incoming port, with this frame

// a small filtering is done here to prevent loop
if (DestMacAddr != SourceMacAddr)
{
     port = PearsonHash(DestMacAddr);
     if port is invalid:
          flooding
     else
     DataForward(EthernetFrame, port);
}

Aging:
The following statements can be executed periodically.

for each entry/bucket in the MAC table:
      if MACTable[i].age < 1:
            Flush(MACTable[i])

A challenge might be how to expand this algorithm to handle, say, 64000 buckets/entries?  How to generate the pseudorandom hash table?

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


static unsigned char PseudoRandomHash[256] = {
     1,   87,   49,  12, 176, 178,  102, 166, 121, 193,   6,  84, 249, 230,  44,  163, 
     14, 197,  213, 181, 161,  85,  218,  80,  64, 239,  24, 226, 236, 142,  38,  200, 
    110, 177,  104, 103, 141, 253,  255,  50,  77, 101,  81,  18,  45,  96,  31,  222, 
     25, 107,  190,  70,  86, 237,  240,  34,  72, 242,  20, 214, 244, 227, 149,  235, 
     97, 234,   57,  22,  60, 250,   82,  175, 208,   5, 127, 199, 111,  62, 135,  248, 
    174, 169,  211,  58,  66, 154,  106, 195, 245, 171,  17, 187, 182, 179,   0,  243, 
    132,  56,  148,  75, 128, 133,  158, 100, 130, 126,  91,  13, 153, 246, 216,  219,
    119,  68,  223,  78,  83,  88,  201,  99, 122,  11,  92,  32, 136, 114,  52,   10, 
    138,  30,   48, 183, 156,  35,   61,  26, 143,  74, 251,  94, 129, 162,  63,  152, 
    170,   7,  115, 167, 241, 206,    3, 150,  55,  59, 151, 220,  90,  53,  23,  131, 
    125, 173,   15, 238,  79,  95,   89,  16, 105, 137, 225, 224, 217, 160,  37,  123, 
    118,  73,    2, 157,  46, 116,    9, 145, 134, 228, 207, 212, 202, 215,  69,  229, 
     27, 188,   67, 124, 168, 252,   42,   4,  29, 108,  21, 247,  19, 205,  39,  203, 
    233,  40,  186, 147, 198, 192,  155,  33, 164, 191,  98, 204, 165, 180, 117,   76, 
    140,  36,  210, 172,  41,  54,  159,   8, 185, 232, 113, 196, 231,  47, 146,  120, 
     51,  65,   28, 144, 254, 221,   93, 189, 194, 139, 112,  43,  71, 109, 184,  209}; 
     
#define N   256
     
     
int PearsonHash(const char *str)
{
    int n = strlen(str);
    int i;
    unsigned char *h;
    int result;
    
    if (n > N)
    {
        puts("Pearson Hashing can only be used for string length <256 characters");
        return -1;
    }
    h = (unsigned char *)malloc(N);
    bzero(h, sizeof(char)*N);
    h[0] = 0;
    for(i=1; i<n; i++)
    {
        h[i] = PseudoRandomHash[(h[i-1] ^ str[i])& 0xFF];
        //printf("h[%u]=%0x\n", i, h[i]);
        result = h[i];
    }
    free(h);
    return result;
}

     
int main()
{
    int i;
    int n;
    const char MyStringBase[] = "Ahlan Wa Sahlan Bib, ";
    char *StrTable[N];
    
    for(i=0; i<N; i++)
    {
        printf(" %03u ", PseudoRandomHash[i]);
        if ((i+1)%16 == 0)
            printf("\n");
    }
    for(i=0; i<N; i++)
    {
        StrTable[i] = (char *)malloc(N);
        sprintf(StrTable[i], "%s%u", MyStringBase, i);
    }    
    for(i=0; i<N; i++)
    {
        printf("\n%s: \tHash Index=%u\n", StrTable[i], PearsonHash(StrTable[i]));
        free(StrTable[i]);
    }    
}     

2 comments:

  1. The first character of the string is never checked so "tree" and "free" for example will return the same result.

    ReplyDelete