Discussion: daitch_mokotoff module

From: Dag Lem

Hello,

Please find attached a patch for the daitch_mokotoff module.

This implements the Daitch-Mokotoff Soundex System, as described in
https://www.avotaynu.com/soundex.htm

The module is used in production at Finance Norway.

In order to verify correctness, I have compared generated soundex codes
with corresponding results from the implementation by Stephen P. Morse
at https://stevemorse.org/census/soundex.html

Where soundex codes differ, the daitch_mokotoff module has been found
to be correct. The Morse implementation uses a few unofficial rules,
and also has an error in the handling of adjacent identical code
digits. Please see daitch_mokotoff.c for further references and
comments.
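
To illustrate the point about adjacent identical code digits, here is a
minimal sketch, not the tree-based code from daitch_mokotoff.c. It joins
per-letter codes for a single spelling and ignores alternate codes. The
helper name sketch_join_codes and the DM_SKETCH_MAIN guard are hypothetical,
and the per-letter codes (B = 7, non-initial E = not coded, S = 4, ST = 43)
are assumed from the rule tables. The first digit of a code is dropped when
it repeats the last digit of the preceding letter's code, while an uncoded
letter ("X") in between keeps the two digits apart.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_CODE_DIGITS 6

/*
 * Join per-letter codes into one six-digit soundex code.
 * "X" marks a letter that is not coded; it still separates its neighbours.
 * Hypothetical helper for illustration only, not part of the patch.
 */
static void
sketch_join_codes(const char *const codes[], size_t ncodes,
                  char out[SKETCH_CODE_DIGITS + 1])
{
    size_t      len = 0;
    size_t      i;
    char        last = '\0';    /* last digit of the previous letter's code */

    for (i = 0; i < ncodes && len < SKETCH_CODE_DIGITS; i++)
    {
        const char *code = codes[i];

        if (code[0] == '\0')
            continue;           /* ignore empty entries */

        if (code[0] == 'X')
        {
            last = 'X';         /* uncoded letter resets the comparison */
            continue;
        }

        /* Drop the leading digit if it repeats the previous letter's digit. */
        if (code[0] == last)
            code++;

        /* Digits inside one code (e.g. "66") are never collapsed. */
        for (; *code && len < SKETCH_CODE_DIGITS; code++)
            out[len++] = *code;

        last = codes[i][strlen(codes[i]) - 1];
    }

    /* Pad with zeros to six digits. */
    while (len < SKETCH_CODE_DIGITS)
        out[len++] = '0';
    out[len] = '\0';
}

#ifdef DM_SKETCH_MAIN
int
main(void)
{
    const char *const best[] = {"7", "X", "43"};        /* B E ST */
    const char *const besst[] = {"7", "X", "4", "43"};  /* B E S ST */
    char        a[SKETCH_CODE_DIGITS + 1];
    char        b[SKETCH_CODE_DIGITS + 1];

    sketch_join_codes(best, 3, a);
    sketch_join_codes(besst, 4, b);
    assert(strcmp(a, b) == 0);
    printf("%s %s\n", a, b);
    return 0;
}
#endif
```

Compiled with -DDM_SKETCH_MAIN, this yields 743000 for both "BEST" and
"BESST", whereas appending the digits without the collapse would give 744300
for "BESST".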

For reference, detailed instructions for soundex code comparison are
attached.


Best regards

Dag Lem

diff --git a/contrib/Makefile b/contrib/Makefile
index 87bf87ab90..5e1111a729 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -14,6 +14,7 @@ SUBDIRS = \
         btree_gist    \
         citext        \
         cube        \
+        daitch_mokotoff    \
         dblink        \
         dict_int    \
         dict_xsyn    \
diff --git a/contrib/daitch_mokotoff/Makefile b/contrib/daitch_mokotoff/Makefile
new file mode 100644
index 0000000000..baec5e31d4
--- /dev/null
+++ b/contrib/daitch_mokotoff/Makefile
@@ -0,0 +1,25 @@
+# contrib/daitch_mokotoff/Makefile
+
+MODULE_big = daitch_mokotoff
+OBJS = \
+    $(WIN32RES) \
+    daitch_mokotoff.o
+
+EXTENSION = daitch_mokotoff
+DATA = daitch_mokotoff--1.0.sql
+PGFILEDESC = "daitch_mokotoff - Daitch-Mokotoff Soundex System"
+
+HEADERS = daitch_mokotoff.h
+
+REGRESS = daitch_mokotoff
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/daitch_mokotoff
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
new file mode 100644
index 0000000000..0b5a643175
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
@@ -0,0 +1,8 @@
+/* contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION daitch_mokotoff" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.c b/contrib/daitch_mokotoff/daitch_mokotoff.c
new file mode 100644
index 0000000000..9e66aee434
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.c
@@ -0,0 +1,551 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/phoneticinfo.htm (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - "J" is considered a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Adjacent identical code digits are not collapsed correctly in dmsoundex.php
+ *   when double digit codes are involved. E.g. "BESST" yields 744300 instead of
+ *   743000 as for "BEST".
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ * - Both dmlat.php and dmrules.txt have the same unofficial rules for "UE".
+ * - Coding of MN/NM + M/N differs between dmsoundex.php and DaitchMokotoffSoundex.java
+ * - Neither dmsoundex.php nor DaitchMokotoffSoundex.java yields all valid codes for e.g.
+ *   "CJC" (550000 540000 545000 450000 400000 440000).
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#ifndef DM_MAIN
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#else                            /* DM_MAIN */
+
+#include <stdio.h>
+
+#endif                            /* DM_MAIN */
+
+#include <ctype.h>
+#include <stdlib.h>
+#include <string.h>
+
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+#ifndef DM_MAIN
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string,
+               *tmp_soundex;
+    text       *soundex;
+
+    /*
+     * The maximum theoretical soundex size is several KB, however in practice
+     * anything but contrived synthetic inputs will yield a soundex size of
+     * less than 100 bytes. We thus allocate and free a temporary work buffer,
+     * and return only the actual soundex result.
+     */
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
+
+    if (!(tmp_soundex && _daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS)))
+    {
+        pfree(tmp_soundex);
+        ereport(ERROR,
+                (errcode(ERRCODE_OUT_OF_MEMORY),
+                 errmsg("unable to allocate temporary memory for soundex")));
+    }
+
+    soundex = cstring_to_text(pg_any_to_server(tmp_soundex, strlen(tmp_soundex), PG_UTF8));
+    pfree(tmp_soundex);
+
+    PG_RETURN_TEXT_P(soundex);
+}
+
+#endif                            /* DM_MAIN */
+
+
+typedef dm_node dm_nodes[DM_MAX_NODES];
+typedef dm_node * dm_leaves[DM_MAX_LEAVES];
+
+
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .next_nodes = {NULL}
+};
+
+
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+static void
+add_next_code_digit(dm_node * node, char code_digit)
+{
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+static void
+set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+        leaves_next[(*num_leaves_next)++] = node;
+    }
+}
+
+
+static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes, dm_node * node, char code_digit)
+{
+    dm_node   **next_nodes;
+    dm_node    *next_node;
+
+    for (next_nodes = node->next_nodes; (next_node = *next_nodes); next_nodes++)
+    {
+        if (next_node->code_digit == code_digit)
+        {
+            return next_node;
+        }
+    }
+
+    next_node = &nodes[(*num_nodes)++];
+    *next_nodes = next_node;
+
+    *next_node = start_node;
+    memcpy(next_node->soundex, node->soundex, sizeof(next_node->soundex));
+    next_node->soundex_length = node->soundex_length;
+    next_node->soundex[next_node->soundex_length++] = code_digit;
+    next_node->code_digit = code_digit;
+
+    return next_node;
+}
+
+
+static int
+update_node(dm_nodes nodes, dm_node * node, int *num_nodes, dm_leaves leaves_next, int *num_leaves_next, int char_number, char next_code_digit, char next_code_digit_2)
+{
+    int            i;
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, char_number);
+
+    if (node->soundex_length == DM_MAX_CODE_DIGITS)
+    {
+        /* Keep completed soundex code. */
+        set_leaf(leaves_next, num_leaves_next, node);
+        return 0;
+    }
+
+    if (next_code_digit == 'X' ||
+        node->prev_code_digits[0] == next_code_digit ||
+        node->prev_code_digits[1] == next_code_digit)
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_node(nodes, num_nodes, node, next_code_digit);
+        initialize_node(node, char_number);
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_digit);
+
+        if (next_code_digit_2)
+        {
+            update_node(nodes, dirty_nodes[i], num_nodes, leaves_next, num_leaves_next, char_number, next_code_digit_2, '\0');
+        }
+        else
+        {
+            set_leaf(leaves_next, num_leaves_next, dirty_nodes[i]);
+        }
+    }
+
+    return 1;
+}
+
+
+static int
+update_leaves(dm_nodes nodes, int *num_nodes, dm_leaves leaves[2], int *ix_leaves, int *num_leaves, int char_number, dm_codes codes)
+{
+    int            i,
+                j;
+    char       *code;
+    int            num_leaves_next = 0;
+    int            ix_leaves_next = (*ix_leaves + 1) % 2;
+    int            finished = 1;
+
+    for (i = 0; i < *num_leaves; i++)
+    {
+        dm_node    *node = leaves[*ix_leaves][i];
+
+        /* One or two alternate code sequences. */
+        for (j = 0; j < 2 && (code = codes[j]) && code[0]; j++)
+        {
+            /* One or two sequential code digits. */
+            if (update_node(nodes, node, num_nodes, leaves[ix_leaves_next], &num_leaves_next, char_number, code[0], code[1]))
+            {
+                finished = 0;
+            }
+        }
+    }
+
+    *ix_leaves = ix_leaves_next;
+    *num_leaves = num_leaves_next;
+
+    return finished;
+}
+
+
+static const char tr_accents_iso8859_1[] =
+/*
+"ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDsaaaaaaeceeeeiiiidnooooo/ouuuuydy";
+
+static char
+unaccent_iso8859_1(unsigned char c)
+{
+    return c >= 192 ? tr_accents_iso8859_1[c - 192] : c;
+}
+
+
+/* Convert a UTF-8 character to ISO-8859-1.
+ * Unconvertible characters are returned as '?'.
+ * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
+ */
+static char
+utf8_to_iso8859_1(char *str, int *len)
+{
+    const char    unknown = '?';
+    unsigned char c;
+    unsigned int code_point;
+
+    *len = 1;
+    c = (unsigned char) str[0];
+    if (c < 0x80)
+    {
+        /* ASCII code point. */
+        return c;
+    }
+    else if (c < 0xE0)
+    {
+        /* Two-byte character. */
+        if (!str[1])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            return unknown;
+        }
+        *len = 2;
+        code_point = ((c & 0x1F) << 6) | (str[1] & 0x3F);
+        if (code_point < 0x100)
+        {
+            /* ISO-8859-1 code point. */
+            return code_point;
+        }
+        else if (code_point == 0x0104 || code_point == 0x0105)
+        {
+            /* Ą/ą */
+            return '[';
+        }
+        else if (code_point == 0x0118 || code_point == 0x0119)
+        {
+            /* Ę/ę */
+            return '\\';
+        }
+        else if (code_point == 0x0162 || code_point == 0x0163 ||
+                 code_point == 0x021A || code_point == 0x021B)
+        {
+            /* Ţ/ţ or Ț/ț */
+            return ']';
+        }
+        else
+        {
+            return unknown;
+        }
+    }
+    else if (c < 0xF0)
+    {
+        /* Three-byte character. */
+        if (!str[2])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            *len = 2;
+        }
+        else
+        {
+            *len = 3;
+        }
+        return unknown;
+    }
+    else
+    {
+        /* Four-byte character. */
+        if (!str[3])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            *len = 3;
+        }
+        else
+        {
+            *len = 4;
+        }
+        return unknown;
+    }
+}
+
+
+static char
+read_char(char *str, int *len)
+{
+    return toupper(unaccent_iso8859_1(utf8_to_iso8859_1(str, len)));
+}
+
+
+static char
+read_valid_char(char *str, int *len)
+{
+    int            c;
+    int            i,
+                ilen;
+
+    for (i = 0, ilen = 0; (c = read_char(&str[i], &ilen)) && (c < 'A' || c > ']'); i += ilen)
+    {
+    }
+
+    *len = i + ilen;
+    return c;
+}
+
+
+static char *
+_daitch_mokotoff(char *word, char *soundex, size_t n)
+{
+    int            c,
+                cmp;
+    int            i,
+                ilen,
+                i_ok,
+                j,
+                jlen,
+                k;
+    int            first_letter = 1;
+    int            ix_leaves = 0;
+    int            num_nodes = 0,
+                num_leaves = 0;
+    dm_letter  *letter,
+               *letters;
+    dm_codes   *codes_ok,
+                codes;
+
+    dm_node    *nodes = malloc(sizeof(dm_nodes));
+    dm_leaves  *leaves = malloc(2 * sizeof(dm_leaves));
+
+    if (!nodes || !leaves)
+    {
+        /* Out of memory - clean up and return. */
+        free(leaves);
+        free(nodes);
+        *soundex = '\0';
+        return NULL;
+    }
+
+    /* Starting point. */
+    nodes[num_nodes++] = start_node;
+    leaves[ix_leaves][num_leaves++] = &nodes[0];
+
+    for (i = 0; (c = read_valid_char(&word[i], &ilen)); i += ilen)
+    {
+        /* First letter in sequence. */
+        letter = &letter_[c - 'A'];
+        codes_ok = letter->codes;
+        i_ok = i;
+
+        /* Subsequent letters. */
+        for (j = i + ilen; (letters = letter->letters) && (c = read_valid_char(&word[j], &jlen)); j += jlen)
+        {
+            for (k = 0; (cmp = letters[k].letter); k++)
+            {
+                if (cmp == c)
+                {
+                    /* Coding for letter found. */
+                    break;
+                }
+            }
+            if (!cmp)
+            {
+                /* The sequence of letters has no coding. */
+                break;
+            }
+
+            letter = &letters[k];
+            if (letter->codes)
+            {
+                codes_ok = letter->codes;
+                i_ok = j;
+                ilen = jlen;
+            }
+        }
+
+        /* Determine which code to use. */
+        if (first_letter)
+        {
+            /* This is the first letter. */
+            j = 0;
+            first_letter = 0;
+        }
+        else if ((c = read_valid_char(&word[i_ok + ilen], &jlen)) && strchr(DM_VOWELS, c))
+        {
+            /* The next letter is a vowel. */
+            j = 1;
+        }
+        else
+        {
+            /* All other cases. */
+            j = 2;
+        }
+        memcpy(codes, codes_ok[j], sizeof(codes));
+
+        /* Update leaves. */
+        if (update_leaves(nodes, &num_nodes, leaves, &ix_leaves, &num_leaves, i, codes))
+        {
+            /* All soundex codes are completed to six digits. */
+            break;
+        }
+
+        /* Prepare for next letter sequence. */
+        i = i_ok;
+    }
+
+    /* Concatenate all generated soundex codes. */
+    for (i = 0, j = 0; i < num_leaves && j + DM_MAX_CODE_DIGITS + 1 <= n; i++, j += DM_MAX_CODE_DIGITS + 1)
+    {
+        memcpy(&soundex[j], leaves[ix_leaves][i]->soundex, DM_MAX_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string. */
+    soundex[j - (j != 0)] = '\0';
+
+    free(leaves);
+    free(nodes);
+
+    return soundex;
+}
+
+
+#ifdef DM_MAIN
+
+/* For testing */
+
+int
+main(int argc, char **argv)
+{
+    char       *soundex;
+
+    if (argc != 2)
+    {
+        fprintf(stderr, "Usage: %s string\n", argv[0]);
+        return -1;
+    }
+
+    soundex = malloc(DM_MAX_SOUNDEX_CHARS);
+
+    if (!_daitch_mokotoff(argv[1], soundex, DM_MAX_SOUNDEX_CHARS))
+    {
+        free(soundex);
+        fprintf(stderr, "Unable to allocate memory for soundex trees\n");
+        return -1;
+    }
+
+    printf("%s\n", soundex);
+    free(soundex);
+
+    return 0;
+}
+
+#endif
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.control b/contrib/daitch_mokotoff/daitch_mokotoff.control
new file mode 100644
index 0000000000..c5aed8e46e
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.control
@@ -0,0 +1,5 @@
+# daitch_mokotoff extension
+comment = 'Daitch-Mokotoff Soundex System'
+default_version = '1.0'
+module_pathname = '$libdir/daitch_mokotoff'
+relocatable = true
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.h b/contrib/daitch_mokotoff/daitch_mokotoff.h
new file mode 100644
index 0000000000..8fcb98f1cf
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.h
@@ -0,0 +1,1110 @@
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS 6
+#define DM_MAX_ALTERNATE_CODES 5
+#define DM_MAX_NODES 1564
+#define DM_MAX_LEAVES 1250
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+#define DM_VOWELS "AEIOUY"
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[2];    /* One or two alternate code sequences */
+
+struct dm_letter
+{
+    char        letter;
+    struct dm_letter *letters;
+    dm_codes   *codes;
+};
+
+struct dm_node
+{
+    int            soundex_length;
+    char        soundex[DM_MAX_CODE_DIGITS + 1];
+    int            is_leaf;
+    int            last_update;
+    char        code_digit;
+
+    /*
+     * One or two alternate code digits leading to this node - repeated code
+     * digits and 'X' lead back to the same node.
+     */
+    char        prev_code_digits[2];
+    char        next_code_digits[2];
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+static dm_codes codes_0_1_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "1"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_0_7_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "7"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_0_X_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "X"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1_1_X[] =
+{
+    {
+        "1"
+    },
+    {
+        "1"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1_X_X[] =
+{
+    {
+        "1"
+    },
+    {
+        "X"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1or4_Xor4_Xor4[] =
+{
+    {
+        "1", "4"
+    },
+    {
+        "X", "4"
+    },
+    {
+        "X", "4"
+    }
+};
+static dm_codes codes_2_43_43[] =
+{
+    {
+        "2"
+    },
+    {
+        "43"
+    },
+    {
+        "43"
+    }
+};
+static dm_codes codes_2_4_4[] =
+{
+    {
+        "2"
+    },
+    {
+        "4"
+    },
+    {
+        "4"
+    }
+};
+static dm_codes codes_3_3_3[] =
+{
+    {
+        "3"
+    },
+    {
+        "3"
+    },
+    {
+        "3"
+    }
+};
+static dm_codes codes_3or4_3or4_3or4[] =
+{
+    {
+        "3", "4"
+    },
+    {
+        "3", "4"
+    },
+    {
+        "3", "4"
+    }
+};
+static dm_codes codes_4_4_4[] =
+{
+    {
+        "4"
+    },
+    {
+        "4"
+    },
+    {
+        "4"
+    }
+};
+static dm_codes codes_5_54_54[] =
+{
+    {
+        "5"
+    },
+    {
+        "54"
+    },
+    {
+        "54"
+    }
+};
+static dm_codes codes_5_5_5[] =
+{
+    {
+        "5"
+    },
+    {
+        "5"
+    },
+    {
+        "5"
+    }
+};
+static dm_codes codes_5_5_X[] =
+{
+    {
+        "5"
+    },
+    {
+        "5"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_5or45_5or45_5or45[] =
+{
+    {
+        "5", "45"
+    },
+    {
+        "5", "45"
+    },
+    {
+        "5", "45"
+    }
+};
+static dm_codes codes_5or4_5or4_5or4[] =
+{
+    {
+        "5", "4"
+    },
+    {
+        "5", "4"
+    },
+    {
+        "5", "4"
+    }
+};
+static dm_codes codes_66_66_66[] =
+{
+    {
+        "66"
+    },
+    {
+        "66"
+    },
+    {
+        "66"
+    }
+};
+static dm_codes codes_6_6_6[] =
+{
+    {
+        "6"
+    },
+    {
+        "6"
+    },
+    {
+        "6"
+    }
+};
+static dm_codes codes_7_7_7[] =
+{
+    {
+        "7"
+    },
+    {
+        "7"
+    },
+    {
+        "7"
+    }
+};
+static dm_codes codes_8_8_8[] =
+{
+    {
+        "8"
+    },
+    {
+        "8"
+    },
+    {
+        "8"
+    }
+};
+static dm_codes codes_94or4_94or4_94or4[] =
+{
+    {
+        "94", "4"
+    },
+    {
+        "94", "4"
+    },
+    {
+        "94", "4"
+    }
+};
+static dm_codes codes_9_9_9[] =
+{
+    {
+        "9"
+    },
+    {
+        "9"
+    },
+    {
+        "9"
+    }
+};
+static dm_codes codes_X_X_6orX[] =
+{
+    {
+        "X"
+    },
+    {
+        "X"
+    },
+    {
+        "6", "X"
+    }
+};
+
+static dm_letter letter_A[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_0_7_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CH[] =
+{
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CS[] =
+{
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_C[] =
+{
+    {
+        'H', letter_CH, codes_5or4_5or4_5or4
+    },
+    {
+        'K', NULL, codes_5or45_5or45_5or45
+    },
+    {
+        'S', letter_CS, codes_4_4_4
+    },
+    {
+        'Z', letter_CZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DS[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DZ[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_D[] =
+{
+    {
+        'R', letter_DR, NULL
+    },
+    {
+        'S', letter_DS, codes_4_4_4
+    },
+    {
+        'T', NULL, codes_3_3_3
+    },
+    {
+        'Z', letter_DZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_E[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_1_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_F[] =
+{
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_I[] =
+{
+    {
+        'A', NULL, codes_1_X_X
+    },
+    {
+        'E', NULL, codes_1_X_X
+    },
+    {
+        'O', NULL, codes_1_X_X
+    },
+    {
+        'U', NULL, codes_1_X_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_K[] =
+{
+    {
+        'H', NULL, codes_5_5_5
+    },
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_M[] =
+{
+    {
+        'N', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_N[] =
+{
+    {
+        'M', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_O[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_P[] =
+{
+    {
+        'F', NULL, codes_7_7_7
+    },
+    {
+        'H', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_R[] =
+{
+    {
+        'S', NULL, codes_94or4_94or4_94or4
+    },
+    {
+        'Z', NULL, codes_94or4_94or4_94or4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTS[] =
+{
+    {
+        'C', letter_SCHTSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHT[] =
+{
+    {
+        'C', letter_SCHTC, NULL
+    },
+    {
+        'S', letter_SCHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCH[] =
+{
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SCHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SC[] =
+{
+    {
+        'H', letter_SCH, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTS[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHT[] =
+{
+    {
+        'C', letter_SHTC, NULL
+    },
+    {
+        'S', letter_SHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SH[] =
+{
+    {
+        'C', letter_SHC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STR[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STS[] =
+{
+    {
+        'C', letter_STSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ST[] =
+{
+    {
+        'C', letter_STC, NULL
+    },
+    {
+        'R', letter_STR, NULL
+    },
+    {
+        'S', letter_STS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZC[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZ[] =
+{
+    {
+        'C', letter_SZC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', NULL, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_S[] =
+{
+    {
+        'C', letter_SC, codes_2_4_4
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'H', letter_SH, codes_4_4_4
+    },
+    {
+        'T', letter_ST, codes_2_43_43
+    },
+    {
+        'Z', letter_SZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TS[] =
+{
+    {
+        'C', letter_TSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTS[] =
+{
+    {
+        'C', letter_TTSC, NULL
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TT[] =
+{
+    {
+        'C', letter_TTC, NULL
+    },
+    {
+        'S', letter_TTS, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_T[] =
+{
+    {
+        'C', letter_TC, codes_4_4_4
+    },
+    {
+        'H', NULL, codes_3_3_3
+    },
+    {
+        'R', letter_TR, NULL
+    },
+    {
+        'S', letter_TS, codes_4_4_4
+    },
+    {
+        'T', letter_TT, NULL
+    },
+    {
+        'Z', letter_TZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_U[] =
+{
+    {
+        'E', NULL, codes_0_X_X
+    },
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZD[] =
+{
+    {
+        'Z', letter_ZDZ, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHD[] =
+{
+    {
+        'Z', letter_ZHDZ, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZH[] =
+{
+    {
+        'D', letter_ZHD, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZS[] =
+{
+    {
+        'C', letter_ZSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_Z[] =
+{
+    {
+        'D', letter_ZD, codes_2_43_43
+    },
+    {
+        'H', letter_ZH, codes_4_4_4
+    },
+    {
+        'S', letter_ZS, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_[] =
+{
+    {
+        'A', letter_A, codes_0_X_X
+    },
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        'C', letter_C, codes_5or4_5or4_5or4
+    },
+    {
+        'D', letter_D, codes_3_3_3
+    },
+    {
+        'E', letter_E, codes_0_X_X
+    },
+    {
+        'F', letter_F, codes_7_7_7
+    },
+    {
+        'G', NULL, codes_5_5_5
+    },
+    {
+        'H', NULL, codes_5_5_X
+    },
+    {
+        'I', letter_I, codes_0_X_X
+    },
+    {
+        'J', NULL, codes_1or4_Xor4_Xor4
+    },
+    {
+        'K', letter_K, codes_5_5_5
+    },
+    {
+        'L', NULL, codes_8_8_8
+    },
+    {
+        'M', letter_M, codes_6_6_6
+    },
+    {
+        'N', letter_N, codes_6_6_6
+    },
+    {
+        'O', letter_O, codes_0_X_X
+    },
+    {
+        'P', letter_P, codes_7_7_7
+    },
+    {
+        'Q', NULL, codes_5_5_5
+    },
+    {
+        'R', letter_R, codes_9_9_9
+    },
+    {
+        'S', letter_S, codes_4_4_4
+    },
+    {
+        'T', letter_T, codes_3_3_3
+    },
+    {
+        'U', letter_U, codes_0_X_X
+    },
+    {
+        'V', NULL, codes_7_7_7
+    },
+    {
+        'W', NULL, codes_7_7_7
+    },
+    {
+        'X', NULL, codes_5_54_54
+    },
+    {
+        'Y', NULL, codes_1_X_X
+    },
+    {
+        'Z', letter_Z, codes_4_4_4
+    },
+    {
+        'a', NULL, codes_X_X_6orX
+    },
+    {
+        'e', NULL, codes_X_X_6orX
+    },
+    {
+        't', NULL, codes_3or4_3or4_3or4
+    },
+    {
+        '\0'
+    }
+};
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff_header.pl b/contrib/daitch_mokotoff/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..26e67fe5df
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff_header.pl
@@ -0,0 +1,274 @@
+#!/usr/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/\|/) ] } split(/,/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character can never yield more than two alternate codes, and is not considered here.
+    for my $c (@codes[1..2]) {
+        if (@$c > 1) {
+            for my $i (0..1) {
+                my ($a, $b) = (substr($c->[$i], -1, 1), substr($c->[($i + 1)%2], 0, 1));
+                $alternates{$a}{$b} = 1 if $a ne $b;
+            }
+        }
+    }
+    my $key = "codes_" . join("_", map { join("or", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Find the maximum number of alternate codes in one position.
+my $alt_x = $alternates{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (defined delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+my $max_alt = (reverse sort (2, map { scalar keys %$_ } values %alternates))[0];
+
+# The maximum number of nodes and leaves in the soundex tree.
+# These are safe estimates, but in practice somewhat higher than the actual maximums.
+# Note that the first character can never yield more than two alternate codes,
+# hence the calculations are performed as sums of two subtrees.
+my $digits = 6;
+# Number of nodes (sum of geometric progression).
+my $max_nodes = 2 + 2*(1 - $max_alt**($digits - 1))/(1 - $max_alt);
+# Number of leaves (exponential of base number).
+my $max_leaves = 2*$max_alt**($digits - 2);
+
+print <<EOF;
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS $digits
+#define DM_MAX_ALTERNATE_CODES $max_alt
+#define DM_MAX_NODES $max_nodes
+#define DM_MAX_LEAVES $max_leaves
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+#define DM_VOWELS "AEIOUY"
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[2];    /* One or two alternate code sequences */
+
+struct dm_letter
+{
+    char        letter;
+    struct dm_letter *letters;
+    dm_codes   *codes;
+};
+
+struct dm_node
+{
+    int            soundex_length;
+    char        soundex[DM_MAX_CODE_DIGITS + 1];
+    int            is_leaf;
+    int            last_update;
+    char        code_digit;
+
+    /*
+     * One or two alternate code digits leading to this node - repeated code
+     * digits and 'X' lead back to the same node.
+     */
+    char        prev_code_digits[2];
+    char        next_code_digits[2];
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+EOF
+
+for my $key (sort keys %codes) {
+    print "static dm_codes $key\[\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print "\n";
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print "static dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print "$_,\n";
+    }
+    print "\t{\n\t\t'\\0'\n\t}\n";
+    print "};\n";
+}
+
+hash2code($table, '');
+
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# X = NC (not coded)
+#
+# Note that the following letters are coded with substitute letters
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5|4,5|4,5|4
+CK                        5|45,5|45,5|45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5|4,5|4,5|4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1|4,X|4,X|4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94|4,94|4,94|4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3|4,3|4,3|4
+T                        3,3,3
+UI,UJ,UY                0,1,X
+U,UE                    0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/daitch_mokotoff/expected/daitch_mokotoff.out b/contrib/daitch_mokotoff/expected/daitch_mokotoff.out
new file mode 100644
index 0000000000..b7db809746
--- /dev/null
+++ b/contrib/daitch_mokotoff/expected/daitch_mokotoff.out
@@ -0,0 +1,451 @@
+CREATE EXTENSION daitch_mokotoff;
+SELECT daitch_mokotoff('GOLDEN');
+ daitch_mokotoff
+-----------------
+ 583600
+(1 row)
+
+SELECT daitch_mokotoff('Alpert');
+ daitch_mokotoff
+-----------------
+ 087930
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+SELECT daitch_mokotoff('Haber');
+ daitch_mokotoff
+-----------------
+ 579000
+(1 row)
+
+SELECT daitch_mokotoff('Manheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+SELECT daitch_mokotoff('Mintz');
+ daitch_mokotoff
+-----------------
+ 664000
+(1 row)
+
+SELECT daitch_mokotoff('Topf');
+ daitch_mokotoff
+-----------------
+ 370000
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+SELECT daitch_mokotoff('Ben Aron');
+ daitch_mokotoff
+-----------------
+ 769600
+(1 row)
+
+SELECT daitch_mokotoff('AUERBACH');
+ daitch_mokotoff
+-----------------
+ 097500 097400
+(1 row)
+
+SELECT daitch_mokotoff('OHRBACH');
+ daitch_mokotoff
+-----------------
+ 097500 097400
+(1 row)
+
+SELECT daitch_mokotoff('LIPSHITZ');
+ daitch_mokotoff
+-----------------
+ 874400
+(1 row)
+
+SELECT daitch_mokotoff('LIPPSZYC');
+ daitch_mokotoff
+-----------------
+ 874500 874400
+(1 row)
+
+SELECT daitch_mokotoff('LEWINSKY');
+ daitch_mokotoff
+-----------------
+ 876450
+(1 row)
+
+SELECT daitch_mokotoff('LEVINSKI');
+ daitch_mokotoff
+-----------------
+ 876450
+(1 row)
+
+SELECT daitch_mokotoff('SZLAMAWICZ');
+ daitch_mokotoff
+-----------------
+ 486740
+(1 row)
+
+SELECT daitch_mokotoff('SHLAMOVITZ');
+ daitch_mokotoff
+-----------------
+ 486740
+(1 row)
+
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+SELECT daitch_mokotoff('Peters');
+ daitch_mokotoff
+-----------------
+ 739400 734000
+(1 row)
+
+SELECT daitch_mokotoff('Peterson');
+ daitch_mokotoff
+-----------------
+ 739460 734600
+(1 row)
+
+SELECT daitch_mokotoff('Moskowitz');
+ daitch_mokotoff
+-----------------
+ 645740
+(1 row)
+
+SELECT daitch_mokotoff('Moskovitz');
+ daitch_mokotoff
+-----------------
+ 645740
+(1 row)
+
+SELECT daitch_mokotoff('Auerbach');
+ daitch_mokotoff
+-----------------
+ 097500 097400
+(1 row)
+
+SELECT daitch_mokotoff('Uhrbach');
+ daitch_mokotoff
+-----------------
+ 097500 097400
+(1 row)
+
+SELECT daitch_mokotoff('Jackson');
+       daitch_mokotoff
+-----------------------------
+ 154600 145460 454600 445460
+(1 row)
+
+SELECT daitch_mokotoff('Jackson-Jackson');
+                            daitch_mokotoff
+-----------------------------------------------------------------------
+ 154654 154645 154644 145465 145464 454654 454645 454644 445465 445464
+(1 row)
+
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('schmidt');
+ daitch_mokotoff
+-----------------
+ 463000
+(1 row)
+
+SELECT daitch_mokotoff('schneider');
+ daitch_mokotoff
+-----------------
+ 463900
+(1 row)
+
+SELECT daitch_mokotoff('fischer');
+ daitch_mokotoff
+-----------------
+ 749000
+(1 row)
+
+SELECT daitch_mokotoff('weber');
+ daitch_mokotoff
+-----------------
+ 779000
+(1 row)
+
+SELECT daitch_mokotoff('meyer');
+ daitch_mokotoff
+-----------------
+ 619000
+(1 row)
+
+SELECT daitch_mokotoff('wagner');
+ daitch_mokotoff
+-----------------
+ 756900
+(1 row)
+
+SELECT daitch_mokotoff('schulz');
+ daitch_mokotoff
+-----------------
+ 484000
+(1 row)
+
+SELECT daitch_mokotoff('becker');
+ daitch_mokotoff
+-----------------
+ 759000 745900
+(1 row)
+
+SELECT daitch_mokotoff('hoffmann');
+ daitch_mokotoff
+-----------------
+ 576600
+(1 row)
+
+SELECT daitch_mokotoff('schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Strasburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+SELECT daitch_mokotoff('Eregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+SELECT daitch_mokotoff('AKSSOL');
+ daitch_mokotoff
+-----------------
+ 054800
+(1 row)
+
+SELECT daitch_mokotoff('GERSCHFELD');
+       daitch_mokotoff
+-----------------------------
+ 594578 594783 545783 547830
+(1 row)
+
+SELECT daitch_mokotoff('OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OB''rien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OBr''ien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OBri''en');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OBrie''n');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OBrien''');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('-KINGSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('K-INGSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KI-NGSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KIN-GSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KING-SMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGS-MITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSM-ITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMI-TH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMIT-H');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMITH-');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff(E' \t\n\r Washington \t\n\r ');
+ daitch_mokotoff
+-----------------
+ 746536
+(1 row)
+
+SELECT daitch_mokotoff('Washington');
+ daitch_mokotoff
+-----------------
+ 746536
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
diff --git a/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql b/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql
new file mode 100644
index 0000000000..eafc24ee87
--- /dev/null
+++ b/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql
@@ -0,0 +1,121 @@
+CREATE EXTENSION daitch_mokotoff;
+
+
+-- https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('GOLDEN');
+SELECT daitch_mokotoff('Alpert');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+SELECT daitch_mokotoff('Haber');
+SELECT daitch_mokotoff('Manheim');
+SELECT daitch_mokotoff('Mintz');
+SELECT daitch_mokotoff('Topf');
+SELECT daitch_mokotoff('Kleinman');
+SELECT daitch_mokotoff('Ben Aron');
+
+SELECT daitch_mokotoff('AUERBACH');
+SELECT daitch_mokotoff('OHRBACH');
+SELECT daitch_mokotoff('LIPSHITZ');
+SELECT daitch_mokotoff('LIPPSZYC');
+SELECT daitch_mokotoff('LEWINSKY');
+SELECT daitch_mokotoff('LEVINSKI');
+SELECT daitch_mokotoff('SZLAMAWICZ');
+SELECT daitch_mokotoff('SHLAMOVITZ');
+
+
+-- https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+
+-- https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+SELECT daitch_mokotoff('Peters');
+SELECT daitch_mokotoff('Peterson');
+SELECT daitch_mokotoff('Moskowitz');
+SELECT daitch_mokotoff('Moskovitz');
+SELECT daitch_mokotoff('Auerbach');
+SELECT daitch_mokotoff('Uhrbach');
+SELECT daitch_mokotoff('Jackson');
+SELECT daitch_mokotoff('Jackson-Jackson');
+
+
+-- Perl Text::Phonetic::DaitchMokotoff 006_daitchmokotoff.t
+-- Tests covered above are omitted.
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('schmidt');
+SELECT daitch_mokotoff('schneider');
+SELECT daitch_mokotoff('fischer');
+SELECT daitch_mokotoff('weber');
+SELECT daitch_mokotoff('meyer');
+SELECT daitch_mokotoff('wagner');
+SELECT daitch_mokotoff('schulz');
+SELECT daitch_mokotoff('becker');
+SELECT daitch_mokotoff('hoffmann');
+SELECT daitch_mokotoff('schäfer');
+
+
+-- Apache Commons DaitchMokotoffSoundexTest.java
+
+-- testAccentedCharacterFolding
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Strasburg');
+
+SELECT daitch_mokotoff('Éregon');
+SELECT daitch_mokotoff('Eregon');
+
+-- testAdjacentCodes
+SELECT daitch_mokotoff('AKSSOL');
+SELECT daitch_mokotoff('GERSCHFELD');
+
+-- testEncodeBasic
+-- Tests covered above are omitted.
+
+-- testEncodeIgnoreApostrophes
+SELECT daitch_mokotoff('OBrien');
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+SELECT daitch_mokotoff('OB''rien');
+SELECT daitch_mokotoff('OBr''ien');
+SELECT daitch_mokotoff('OBri''en');
+SELECT daitch_mokotoff('OBrie''n');
+SELECT daitch_mokotoff('OBrien''');
+
+-- testEncodeIgnoreHyphens
+SELECT daitch_mokotoff('KINGSMITH');
+SELECT daitch_mokotoff('-KINGSMITH');
+SELECT daitch_mokotoff('K-INGSMITH');
+SELECT daitch_mokotoff('KI-NGSMITH');
+SELECT daitch_mokotoff('KIN-GSMITH');
+SELECT daitch_mokotoff('KING-SMITH');
+SELECT daitch_mokotoff('KINGS-MITH');
+SELECT daitch_mokotoff('KINGSM-ITH');
+SELECT daitch_mokotoff('KINGSMI-TH');
+SELECT daitch_mokotoff('KINGSMIT-H');
+SELECT daitch_mokotoff('KINGSMITH-');
+
+-- testEncodeIgnoreTrimmable
+SELECT daitch_mokotoff(E' \t\n\r Washington \t\n\r ');
+SELECT daitch_mokotoff('Washington');
+
+-- testSoundexBasic
+-- Tests covered above are omitted.
+
+-- testSoundexBasic2
+-- Tests covered above are omitted.
+
+-- testSoundexBasic3
+-- Tests covered above are omitted.
+
+-- testSpecialRomanianCharacters
+SELECT daitch_mokotoff('ţamas'); -- t-cedilla
+SELECT daitch_mokotoff('țamas'); -- t-comma
+
+
+-- Contrived case which is not handled correctly by other implementations.
+SELECT daitch_mokotoff('CJC');
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index d3ca4b6932..0788db060d 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -105,6 +105,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
  &citext;
  &cube;
  &dblink;
+ &daitch-mokotoff;
  &dict-int;
  &dict-xsyn;
  &earthdistance;
diff --git a/doc/src/sgml/daitch-mokotoff.sgml b/doc/src/sgml/daitch-mokotoff.sgml
new file mode 100644
index 0000000000..6dbc817eca
--- /dev/null
+++ b/doc/src/sgml/daitch-mokotoff.sgml
@@ -0,0 +1,92 @@
+<!-- doc/src/sgml/daitch-mokotoff.sgml -->
+
+<sect1 id="daitch-mokotoff" xreflabel="daitch_mokotoff">
+ <title>daitch_mokotoff</title>
+
+ <indexterm zone="daitch-mokotoff">
+  <primary>daitch_mokotoff</primary>
+ </indexterm>
+
+ <para>
+  The <filename>daitch_mokotoff</filename> module provides an implementation
+  of the Daitch-Mokotoff Soundex System.
+ </para>
+
+ <para>
+  Compared to the American Soundex System implemented in
+  the <filename>fuzzystrmatch</filename> module, the major improvements of the
+  Daitch-Mokotoff Soundex System are:
+
+  <itemizedlist spacing="compact" mark="bullet">
+   <listitem>
+    <para>
+     Information is coded to the first six meaningful letters rather than
+     four.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     The initial letter is coded rather than kept as is.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Where two consecutive letters have a single sound, they are coded as a
+     single number.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     When a letter or combination of letters may have two different sounds,
+     it is double coded under the two different codes.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     A letter or combination of letters maps into ten possible codes rather
+     than seven.
+    </para>
+   </listitem>
+  </itemizedlist>
+
+  The <function>daitch_mokotoff</function> function shown in
+  <xref linkend="functions-daitch-mokotoff-table"/> generates Daitch-Mokotoff
+  soundex codes for matching of similar-sounding names.
+ </para>
+
+ <table id="functions-daitch-mokotoff-table">
+  <title><filename>daitch_mokotoff</filename> Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <function>daitch_mokotoff</function> ( <parameter>name</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Generates Daitch-Mokotoff soundex codes.
+       </para></entry>
+      </row>
+     </tbody>
+  </tgroup>
+ </table>
+
+ <indexterm>
+  <primary>daitch_mokotoff</primary>
+ </indexterm>
+ <para>
+  <function>daitch_mokotoff</function> generates Daitch-Mokotoff soundex codes for the specified
+  <parameter>name</parameter>. Since alternate soundex codes are separated by spaces, the returned text is suited for use
+  in Full Text Search, see <xref linkend="textsearch"/>.
+ </para>
+
+</sect1>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 89454e99b9..9ac43e5928 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -117,6 +117,7 @@
 <!ENTITY btree-gist      SYSTEM "btree-gist.sgml">
 <!ENTITY citext          SYSTEM "citext.sgml">
 <!ENTITY cube            SYSTEM "cube.sgml">
+<!ENTITY daitch-mokotoff SYSTEM "daitch-mokotoff.sgml">
 <!ENTITY dblink          SYSTEM "dblink.sgml">
 <!ENTITY dict-int        SYSTEM "dict-int.sgml">
 <!ENTITY dict-xsyn       SYSTEM "dict-xsyn.sgml">

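The dm_letter tables in the generated header form a trie: each entry names a letter, an optional table of continuation letters, and the codes that apply if a letter combination ends at that entry. The following is a minimal sketch of how such a trie can be walked to find the longest matching combination. It is not taken from daitch_mokotoff.c; the function longest_match, the name letter_root, and the cut-down tables (only S, SH, ST and T, with no deeper levels) are assumptions made for illustration, and the top level is scanned linearly here rather than looked up in O(1).

```c
#include <stddef.h>

typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
typedef dm_code dm_codes[2];    /* One or two alternate code sequences */

typedef struct dm_letter
{
    char        letter;
    const struct dm_letter *letters;    /* Continuation letters, or NULL */
    const dm_codes *codes;      /* Codes if a combination ends here, or NULL */
} dm_letter;

/*
 * Hypothetical excerpt of the tables. Each codes array holds three entries:
 * start of name, before a vowel, any other position.
 */
static const dm_codes codes_2_43_43[] = {{"2"}, {"43"}, {"43"}};
static const dm_codes codes_3_3_3[] = {{"3"}, {"3"}, {"3"}};
static const dm_codes codes_4_4_4[] = {{"4"}, {"4"}, {"4"}};

static const dm_letter letter_S[] = {
    {'H', NULL, codes_4_4_4},
    {'T', NULL, codes_2_43_43},
    {'\0', NULL, NULL}
};

const dm_letter letter_root[] = {
    {'S', letter_S, codes_4_4_4},
    {'T', NULL, codes_3_3_3},
    {'\0', NULL, NULL}
};

/*
 * Walk the trie along str and return the codes of the longest letter
 * combination found. *len receives the number of characters consumed.
 * Returns NULL (and *len == 0) if not even the first character matches.
 */
const dm_codes *
longest_match(const dm_letter *letters, const char *str, size_t *len)
{
    const dm_codes *best = NULL;
    size_t      depth = 0;

    *len = 0;
    while (letters != NULL && str[depth] != '\0')
    {
        const dm_letter *l = letters;

        /* Linear scan of the current level, ending at the '\0' sentinel. */
        while (l->letter != '\0' && l->letter != str[depth])
            l++;
        if (l->letter == '\0')
            break;              /* No continuation for this character */

        depth++;
        if (l->codes != NULL)
        {
            /* A complete combination ends here; remember it. */
            best = l->codes;
            *len = depth;
        }
        letters = l->letters;   /* Descend, or stop if NULL */
    }

    return best;
}
```

With these cut-down tables, longest_match(letter_root, "STA", &len) consumes two characters and returns the ST codes, whose start-of-name alternative is "2"; for "SA" it falls back to the single letter S with code "4". Entries with NULL codes, such as 'C' under TT in the real tables, are passed through without being recorded as a match.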

Re: daitch_mokotoff module

From
Dag Lem
Date:
Please find attached an updated patch, with the following fixes:

* Replaced remaining malloc/free with palloc/pfree.
* Made "make check" pass.
* Updated notes on other implementations.
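For anyone reviewing the allocation sizes: daitch_mokotoff_header.pl bounds the soundex tree with DM_MAX_NODES and DM_MAX_LEAVES, and DM_MAX_SOUNDEX_CHARS follows from DM_MAX_NODES. The arithmetic can be sketched in C as below. The function names are hypothetical and exist only for this illustration; the sketch assumes max_alt >= 2 (which the script enforces) and digits >= 2, and it makes no claim about the value of max_alt that the script actually computes.

```c
/* Integer power base^exp for small, non-negative exponents. */
static unsigned long
ipow(unsigned long base, unsigned int exp)
{
    unsigned long result = 1;

    while (exp-- > 0)
        result *= base;
    return result;
}

/*
 * Upper bound on nodes: 2 + 2*(1 - a^(d-1))/(1 - a), written as the
 * equivalent sum 2 + 2*(1 + a + ... + a^(d-2)) to stay in integers.
 */
unsigned long
dm_max_nodes(unsigned long max_alt, unsigned int digits)
{
    unsigned long sum = 0;
    unsigned int i;

    for (i = 0; i + 1 < digits; i++)
        sum += ipow(max_alt, i);
    return 2 + 2 * sum;
}

/* Upper bound on leaves: 2*a^(d-2). */
unsigned long
dm_max_leaves(unsigned long max_alt, unsigned int digits)
{
    return 2 * ipow(max_alt, digits - 2);
}

/* Mirrors DM_MAX_SOUNDEX_CHARS: each code takes digits + 1 characters. */
unsigned long
dm_max_soundex_chars(unsigned long max_alt, unsigned int digits)
{
    return dm_max_nodes(max_alt, digits) * (digits + 1);
}
```

As a worked case, max_alt = 2 and digits = 6 give 2 + 2*(1 + 2 + 4 + 8 + 16) = 64 nodes, 2*2^4 = 32 leaves, and 64*7 = 448 characters. The leading factor of two reflects the script's note that the first character can never yield more than two alternate codes.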

Best regards

Dag Lem

diff --git a/contrib/Makefile b/contrib/Makefile
index 87bf87ab90..5e1111a729 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -14,6 +14,7 @@ SUBDIRS = \
         btree_gist    \
         citext        \
         cube        \
+        daitch_mokotoff    \
         dblink        \
         dict_int    \
         dict_xsyn    \
diff --git a/contrib/daitch_mokotoff/Makefile b/contrib/daitch_mokotoff/Makefile
new file mode 100644
index 0000000000..baec5e31d4
--- /dev/null
+++ b/contrib/daitch_mokotoff/Makefile
@@ -0,0 +1,25 @@
+# contrib/daitch_mokotoff/Makefile
+
+MODULE_big = daitch_mokotoff
+OBJS = \
+    $(WIN32RES) \
+    daitch_mokotoff.o
+
+EXTENSION = daitch_mokotoff
+DATA = daitch_mokotoff--1.0.sql
+PGFILEDESC = "daitch_mokotoff - Daitch-Mokotoff Soundex System"
+
+HEADERS = daitch_mokotoff.h
+
+REGRESS = daitch_mokotoff
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/daitch_mokotoff
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
new file mode 100644
index 0000000000..0b5a643175
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
@@ -0,0 +1,8 @@
+/* contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION daitch_mokotoff" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.c b/contrib/daitch_mokotoff/daitch_mokotoff.c
new file mode 100644
index 0000000000..a7f1fd8541
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.c
@@ -0,0 +1,549 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - "J" is considered a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ * - Both dmlat.php and dmrules.txt have the same unofficial rules for "UE".
+ * - Coding of MN/NM + M/N differs between dmsoundex.php and DaitchMokotoffSoundex.java
+ * - No other known implementation yields the correct set of codes for e.g.
+ *   "CJC" (550000 540000 545000 450000 400000 440000).
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#ifndef DM_MAIN
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#else                            /* DM_MAIN */
+
+#include <stdio.h>
+#include <stdlib.h>
+
+void *
+palloc(size_t size)
+{
+    void       *ptr;
+
+    ptr = malloc(size);
+
+    if (ptr == NULL)
+    {
+        fprintf(stderr, "Unable to allocate memory\n");
+        exit(EXIT_FAILURE);
+    }
+
+    return ptr;
+}
+
+#define pfree free
+
+#endif                            /* DM_MAIN */
+
+#include <ctype.h>
+#include <string.h>
+
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+#ifndef DM_MAIN
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string,
+               *tmp_soundex;
+    text       *soundex;
+
+    /*
+     * The maximum theoretical soundex size is several KB, however in practice
+     * anything but contrived synthetic inputs will yield a soundex size of
+     * less than 100 bytes. We thus allocate and free a temporary work buffer,
+     * and return only the actual soundex result.
+     */
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
+
+    _daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS);
+
+    soundex = cstring_to_text(pg_any_to_server(tmp_soundex, strlen(tmp_soundex), PG_UTF8));
+    pfree(tmp_soundex);
+
+    PG_RETURN_TEXT_P(soundex);
+}
+
+#endif                            /* DM_MAIN */
+
+
+typedef dm_node dm_nodes[DM_MAX_NODES];
+typedef dm_node * dm_leaves[DM_MAX_LEAVES];
+
+
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .next_nodes = {NULL}
+};
+
+
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+static void
+add_next_code_digit(dm_node * node, char code_digit)
+{
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+static void
+set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+        leaves_next[(*num_leaves_next)++] = node;
+    }
+}
+
+
+static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes, dm_node * node, char code_digit)
+{
+    dm_node   **next_nodes;
+    dm_node    *next_node;
+
+    for (next_nodes = node->next_nodes; (next_node = *next_nodes); next_nodes++)
+    {
+        if (next_node->code_digit == code_digit)
+        {
+            return next_node;
+        }
+    }
+
+    next_node = &nodes[(*num_nodes)++];
+    *next_nodes = next_node;
+
+    *next_node = start_node;
+    memcpy(next_node->soundex, node->soundex, sizeof(next_node->soundex));
+    next_node->soundex_length = node->soundex_length;
+    next_node->soundex[next_node->soundex_length++] = code_digit;
+    next_node->code_digit = code_digit;
+
+    return next_node;
+}
+
+
+static int
+update_node(dm_nodes nodes, dm_node * node, int *num_nodes, dm_leaves leaves_next, int *num_leaves_next, int char_number, char next_code_digit, char next_code_digit_2)
+{
+    int            i;
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, char_number);
+
+    if (node->soundex_length == DM_MAX_CODE_DIGITS)
+    {
+        /* Keep completed soundex code. */
+        set_leaf(leaves_next, num_leaves_next, node);
+        return 0;
+    }
+
+    if (next_code_digit == 'X' ||
+        node->prev_code_digits[0] == next_code_digit ||
+        node->prev_code_digits[1] == next_code_digit)
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_node(nodes, num_nodes, node, next_code_digit);
+        initialize_node(node, char_number);
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_digit);
+
+        if (next_code_digit_2)
+        {
+            update_node(nodes, dirty_nodes[i], num_nodes, leaves_next, num_leaves_next, char_number, next_code_digit_2, '\0');
+        }
+        else
+        {
+            set_leaf(leaves_next, num_leaves_next, dirty_nodes[i]);
+        }
+    }
+
+    return 1;
+}
+
+
+static int
+update_leaves(dm_nodes nodes, int *num_nodes, dm_leaves leaves[2], int *ix_leaves, int *num_leaves, int char_number, dm_codes codes)
+{
+    int            i,
+                j;
+    char       *code;
+    int            num_leaves_next = 0;
+    int            ix_leaves_next = (*ix_leaves + 1) % 2;
+    int            finished = 1;
+
+    for (i = 0; i < *num_leaves; i++)
+    {
+        dm_node    *node = leaves[*ix_leaves][i];
+
+        /* One or two alternate code sequences. */
+        for (j = 0; j < 2 && (code = codes[j]) && code[0]; j++)
+        {
+            /* One or two sequential code digits. */
+            if (update_node(nodes, node, num_nodes, leaves[ix_leaves_next], &num_leaves_next, char_number, code[0], code[1]))
+            {
+                finished = 0;
+            }
+        }
+    }
+
+    *ix_leaves = ix_leaves_next;
+    *num_leaves = num_leaves_next;
+
+    return finished;
+}
+
+
+static const char tr_accents_iso8859_1[] =
+/*
+"ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDsaaaaaaeceeeeiiiidnooooo/ouuuuydy";
+
+static char
+unaccent_iso8859_1(unsigned char c)
+{
+    return c >= 192 ? tr_accents_iso8859_1[c - 192] : c;
+}
+
+
+/* Convert a UTF-8 character to ISO-8859-1.
+ * Unconvertible characters are returned as '?'.
+ * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
+ */
+static char
+utf8_to_iso8859_1(char *str, int *len)
+{
+    const char    unknown = '?';
+    unsigned char c;
+    unsigned int code_point;
+
+    *len = 1;
+    c = (unsigned char) str[0];
+    if (c < 0x80)
+    {
+        /* ASCII code point. */
+        return c;
+    }
+    else if (c < 0xE0)
+    {
+        /* Two-byte character. */
+        if (!str[1])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            return unknown;
+        }
+        *len = 2;
+        code_point = ((c & 0x1F) << 6) | (str[1] & 0x3F);
+        if (code_point < 0x100)
+        {
+            /* ISO-8859-1 code point. */
+            return code_point;
+        }
+        else if (code_point == 0x0104 || code_point == 0x0105)
+        {
+            /* Ą/ą */
+            return '[';
+        }
+        else if (code_point == 0x0118 || code_point == 0x0119)
+        {
+            /* Ę/ę */
+            return '\\';
+        }
+        else if (code_point == 0x0162 || code_point == 0x0163 ||
+                 code_point == 0x021A || code_point == 0x021B)
+        {
+            /* Ţ/ţ or Ț/ț */
+            return ']';
+        }
+        else
+        {
+            return unknown;
+        }
+    }
+    else if (c < 0xF0)
+    {
+        /* Three-byte character. */
+        if (!str[2])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            *len = 2;
+        }
+        else
+        {
+            *len = 3;
+        }
+        return unknown;
+    }
+    else
+    {
+        /* Four-byte character. */
+        if (!str[3])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            *len = 3;
+        }
+        else
+        {
+            *len = 4;
+        }
+        return unknown;
+    }
+}
+
+
+static char
+read_char(char *str, int *len)
+{
+    /* Cast to unsigned char: passing a negative char to toupper() is undefined. */
+    return toupper((unsigned char) unaccent_iso8859_1(utf8_to_iso8859_1(str, len)));
+}
+
+
+static char
+read_valid_char(char *str, int *len)
+{
+    int            c;
+    int            i,
+                ilen;
+
+    for (i = 0, ilen = 0; (c = read_char(&str[i], &ilen)) && (c < 'A' || c > ']'); i += ilen)
+    {
+    }
+
+    *len = i + ilen;
+    return c;
+}
+
+
+static char *
+_daitch_mokotoff(char *word, char *soundex, size_t n)
+{
+    int            c,
+                cmp;
+    int            i,
+                ilen,
+                i_ok,
+                j,
+                jlen,
+                k;
+    int            first_letter = 1;
+    int            ix_leaves = 0;
+    int            num_nodes = 0,
+                num_leaves = 0;
+    dm_letter  *letter,
+               *letters;
+    dm_codes   *codes_ok,
+                codes;
+
+    dm_node    *nodes = palloc(sizeof(dm_nodes));
+    dm_leaves  *leaves = palloc(2 * sizeof(dm_leaves));
+
+    /* Starting point. */
+    nodes[num_nodes++] = start_node;
+    leaves[ix_leaves][num_leaves++] = &nodes[0];
+
+    for (i = 0; (c = read_valid_char(&word[i], &ilen)); i += ilen)
+    {
+        /* First letter in sequence. */
+        letter = &letter_[c - 'A'];
+        codes_ok = letter->codes;
+        i_ok = i;
+
+        /* Subsequent letters. */
+        for (j = i + ilen; (letters = letter->letters) && (c = read_valid_char(&word[j], &jlen)); j += jlen)
+        {
+            for (k = 0; (cmp = letters[k].letter); k++)
+            {
+                if (cmp == c)
+                {
+                    /* Coding for letter found. */
+                    break;
+                }
+            }
+            if (!cmp)
+            {
+                /* The sequence of letters has no coding. */
+                break;
+            }
+
+            letter = &letters[k];
+            if (letter->codes)
+            {
+                codes_ok = letter->codes;
+                i_ok = j;
+                ilen = jlen;
+            }
+        }
+
+        /* Determine which code to use. */
+        if (first_letter)
+        {
+            /* This is the first letter. */
+            j = 0;
+            first_letter = 0;
+        }
+        else if ((c = read_valid_char(&word[i_ok + ilen], &jlen)) && strchr(DM_VOWELS, c))
+        {
+            /* The next letter is a vowel. */
+            j = 1;
+        }
+        else
+        {
+            /* All other cases. */
+            j = 2;
+        }
+        memcpy(codes, codes_ok[j], sizeof(codes));
+
+        /* Update leaves. */
+        if (update_leaves(nodes, &num_nodes, leaves, &ix_leaves, &num_leaves, i, codes))
+        {
+            /* All soundex codes are completed to six digits. */
+            break;
+        }
+
+        /* Prepare for next letter sequence. */
+        i = i_ok;
+    }
+
+    /* Concatenate all generated soundex codes. */
+    for (i = 0, j = 0; i < num_leaves && j + DM_MAX_CODE_DIGITS + 1 <= n; i++, j += DM_MAX_CODE_DIGITS + 1)
+    {
+        memcpy(&soundex[j], leaves[ix_leaves][i]->soundex, DM_MAX_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string. */
+    soundex[j - (j != 0)] = '\0';
+
+    pfree(leaves);
+    pfree(nodes);
+
+    return soundex;
+}
+
+
+#ifdef DM_MAIN
+
+/* For testing */
+
+int
+main(int argc, char **argv)
+{
+    char       *soundex;
+
+    if (argc != 2)
+    {
+        fprintf(stderr, "Usage: %s string\n", argv[0]);
+        return -1;
+    }
+
+    soundex = palloc(DM_MAX_SOUNDEX_CHARS);
+
+    _daitch_mokotoff(argv[1], soundex, DM_MAX_SOUNDEX_CHARS);
+
+    printf("%s\n", soundex);
+    pfree(soundex);
+
+    return 0;
+}
+
+#endif
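
To illustrate the two-byte branch of utf8_to_iso8859_1(), here is a minimal
sketch that decodes a two-byte UTF-8 sequence into a code point. The helper
name decode_utf8_2byte is hypothetical and not part of the patch. Unlike the
patch, it also rejects a malformed lead or continuation byte; that stricter
check is an assumption made for the example, not behaviour of the module.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): decode a two-byte UTF-8 sequence.
 * Returns the code point, or -1 if str does not start with a well-formed
 * two-byte sequence (lead byte 110xxxxx, continuation byte 10xxxxxx).
 */
static long
decode_utf8_2byte(const char *str)
{
    unsigned char lead;
    unsigned char cont;

    assert(str != NULL);

    lead = (unsigned char) str[0];
    if ((lead & 0xE0) != 0xC0)
        return -1;              /* not a two-byte lead byte */

    cont = (unsigned char) str[1];
    if ((cont & 0xC0) != 0x80)
        return -1;              /* cut short or not a continuation byte */

    /* Five payload bits from the lead byte, six from the continuation. */
    return ((long) (lead & 0x1F) << 6) | (long) (cont & 0x3F);
}
```

With this sketch, the bytes C3 85 (Å) decode to 0xC5, which lies within
ISO-8859-1, while C4 84 (Ą) decode to 0x0104, one of the code points the
patch maps to '['.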
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.control b/contrib/daitch_mokotoff/daitch_mokotoff.control
new file mode 100644
index 0000000000..c5aed8e46e
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.control
@@ -0,0 +1,5 @@
+# daitch_mokotoff extension
+comment = 'Daitch-Mokotoff Soundex System'
+default_version = '1.0'
+module_pathname = '$libdir/daitch_mokotoff'
+relocatable = true
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.h b/contrib/daitch_mokotoff/daitch_mokotoff.h
new file mode 100644
index 0000000000..8fcb98f1cf
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.h
@@ -0,0 +1,1110 @@
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS 6
+#define DM_MAX_ALTERNATE_CODES 5
+#define DM_MAX_NODES 1564
+#define DM_MAX_LEAVES 1250
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+#define DM_VOWELS "AEIOUY"
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[2];    /* One or two alternate code sequences */
+
+struct dm_letter
+{
+    char        letter;
+    struct dm_letter *letters;
+    dm_codes   *codes;
+};
+
+struct dm_node
+{
+    int            soundex_length;
+    char        soundex[DM_MAX_CODE_DIGITS + 1];
+    int            is_leaf;
+    int            last_update;
+    char        code_digit;
+
+    /*
+     * One or two alternate code digits leading to this node - repeated code
+     * digits and 'X' lead back to the same node.
+     */
+    char        prev_code_digits[2];
+    char        next_code_digits[2];
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+static dm_codes codes_0_1_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "1"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_0_7_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "7"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_0_X_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "X"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1_1_X[] =
+{
+    {
+        "1"
+    },
+    {
+        "1"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1_X_X[] =
+{
+    {
+        "1"
+    },
+    {
+        "X"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1or4_Xor4_Xor4[] =
+{
+    {
+        "1", "4"
+    },
+    {
+        "X", "4"
+    },
+    {
+        "X", "4"
+    }
+};
+static dm_codes codes_2_43_43[] =
+{
+    {
+        "2"
+    },
+    {
+        "43"
+    },
+    {
+        "43"
+    }
+};
+static dm_codes codes_2_4_4[] =
+{
+    {
+        "2"
+    },
+    {
+        "4"
+    },
+    {
+        "4"
+    }
+};
+static dm_codes codes_3_3_3[] =
+{
+    {
+        "3"
+    },
+    {
+        "3"
+    },
+    {
+        "3"
+    }
+};
+static dm_codes codes_3or4_3or4_3or4[] =
+{
+    {
+        "3", "4"
+    },
+    {
+        "3", "4"
+    },
+    {
+        "3", "4"
+    }
+};
+static dm_codes codes_4_4_4[] =
+{
+    {
+        "4"
+    },
+    {
+        "4"
+    },
+    {
+        "4"
+    }
+};
+static dm_codes codes_5_54_54[] =
+{
+    {
+        "5"
+    },
+    {
+        "54"
+    },
+    {
+        "54"
+    }
+};
+static dm_codes codes_5_5_5[] =
+{
+    {
+        "5"
+    },
+    {
+        "5"
+    },
+    {
+        "5"
+    }
+};
+static dm_codes codes_5_5_X[] =
+{
+    {
+        "5"
+    },
+    {
+        "5"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_5or45_5or45_5or45[] =
+{
+    {
+        "5", "45"
+    },
+    {
+        "5", "45"
+    },
+    {
+        "5", "45"
+    }
+};
+static dm_codes codes_5or4_5or4_5or4[] =
+{
+    {
+        "5", "4"
+    },
+    {
+        "5", "4"
+    },
+    {
+        "5", "4"
+    }
+};
+static dm_codes codes_66_66_66[] =
+{
+    {
+        "66"
+    },
+    {
+        "66"
+    },
+    {
+        "66"
+    }
+};
+static dm_codes codes_6_6_6[] =
+{
+    {
+        "6"
+    },
+    {
+        "6"
+    },
+    {
+        "6"
+    }
+};
+static dm_codes codes_7_7_7[] =
+{
+    {
+        "7"
+    },
+    {
+        "7"
+    },
+    {
+        "7"
+    }
+};
+static dm_codes codes_8_8_8[] =
+{
+    {
+        "8"
+    },
+    {
+        "8"
+    },
+    {
+        "8"
+    }
+};
+static dm_codes codes_94or4_94or4_94or4[] =
+{
+    {
+        "94", "4"
+    },
+    {
+        "94", "4"
+    },
+    {
+        "94", "4"
+    }
+};
+static dm_codes codes_9_9_9[] =
+{
+    {
+        "9"
+    },
+    {
+        "9"
+    },
+    {
+        "9"
+    }
+};
+static dm_codes codes_X_X_6orX[] =
+{
+    {
+        "X"
+    },
+    {
+        "X"
+    },
+    {
+        "6", "X"
+    }
+};
+
+static dm_letter letter_A[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_0_7_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CH[] =
+{
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CS[] =
+{
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_C[] =
+{
+    {
+        'H', letter_CH, codes_5or4_5or4_5or4
+    },
+    {
+        'K', NULL, codes_5or45_5or45_5or45
+    },
+    {
+        'S', letter_CS, codes_4_4_4
+    },
+    {
+        'Z', letter_CZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DS[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DZ[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_D[] =
+{
+    {
+        'R', letter_DR, NULL
+    },
+    {
+        'S', letter_DS, codes_4_4_4
+    },
+    {
+        'T', NULL, codes_3_3_3
+    },
+    {
+        'Z', letter_DZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_E[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_1_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_F[] =
+{
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_I[] =
+{
+    {
+        'A', NULL, codes_1_X_X
+    },
+    {
+        'E', NULL, codes_1_X_X
+    },
+    {
+        'O', NULL, codes_1_X_X
+    },
+    {
+        'U', NULL, codes_1_X_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_K[] =
+{
+    {
+        'H', NULL, codes_5_5_5
+    },
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_M[] =
+{
+    {
+        'N', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_N[] =
+{
+    {
+        'M', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_O[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_P[] =
+{
+    {
+        'F', NULL, codes_7_7_7
+    },
+    {
+        'H', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_R[] =
+{
+    {
+        'S', NULL, codes_94or4_94or4_94or4
+    },
+    {
+        'Z', NULL, codes_94or4_94or4_94or4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTS[] =
+{
+    {
+        'C', letter_SCHTSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHT[] =
+{
+    {
+        'C', letter_SCHTC, NULL
+    },
+    {
+        'S', letter_SCHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCH[] =
+{
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SCHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SC[] =
+{
+    {
+        'H', letter_SCH, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTS[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHT[] =
+{
+    {
+        'C', letter_SHTC, NULL
+    },
+    {
+        'S', letter_SHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SH[] =
+{
+    {
+        'C', letter_SHC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STR[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STS[] =
+{
+    {
+        'C', letter_STSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ST[] =
+{
+    {
+        'C', letter_STC, NULL
+    },
+    {
+        'R', letter_STR, NULL
+    },
+    {
+        'S', letter_STS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZC[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZ[] =
+{
+    {
+        'C', letter_SZC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', NULL, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_S[] =
+{
+    {
+        'C', letter_SC, codes_2_4_4
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'H', letter_SH, codes_4_4_4
+    },
+    {
+        'T', letter_ST, codes_2_43_43
+    },
+    {
+        'Z', letter_SZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TS[] =
+{
+    {
+        'C', letter_TSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTS[] =
+{
+    {
+        'C', letter_TTSC, NULL
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TT[] =
+{
+    {
+        'C', letter_TTC, NULL
+    },
+    {
+        'S', letter_TTS, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_T[] =
+{
+    {
+        'C', letter_TC, codes_4_4_4
+    },
+    {
+        'H', NULL, codes_3_3_3
+    },
+    {
+        'R', letter_TR, NULL
+    },
+    {
+        'S', letter_TS, codes_4_4_4
+    },
+    {
+        'T', letter_TT, NULL
+    },
+    {
+        'Z', letter_TZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_U[] =
+{
+    {
+        'E', NULL, codes_0_X_X
+    },
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZD[] =
+{
+    {
+        'Z', letter_ZDZ, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHD[] =
+{
+    {
+        'Z', letter_ZHDZ, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZH[] =
+{
+    {
+        'D', letter_ZHD, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZS[] =
+{
+    {
+        'C', letter_ZSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_Z[] =
+{
+    {
+        'D', letter_ZD, codes_2_43_43
+    },
+    {
+        'H', letter_ZH, codes_4_4_4
+    },
+    {
+        'S', letter_ZS, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_[] =
+{
+    {
+        'A', letter_A, codes_0_X_X
+    },
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        'C', letter_C, codes_5or4_5or4_5or4
+    },
+    {
+        'D', letter_D, codes_3_3_3
+    },
+    {
+        'E', letter_E, codes_0_X_X
+    },
+    {
+        'F', letter_F, codes_7_7_7
+    },
+    {
+        'G', NULL, codes_5_5_5
+    },
+    {
+        'H', NULL, codes_5_5_X
+    },
+    {
+        'I', letter_I, codes_0_X_X
+    },
+    {
+        'J', NULL, codes_1or4_Xor4_Xor4
+    },
+    {
+        'K', letter_K, codes_5_5_5
+    },
+    {
+        'L', NULL, codes_8_8_8
+    },
+    {
+        'M', letter_M, codes_6_6_6
+    },
+    {
+        'N', letter_N, codes_6_6_6
+    },
+    {
+        'O', letter_O, codes_0_X_X
+    },
+    {
+        'P', letter_P, codes_7_7_7
+    },
+    {
+        'Q', NULL, codes_5_5_5
+    },
+    {
+        'R', letter_R, codes_9_9_9
+    },
+    {
+        'S', letter_S, codes_4_4_4
+    },
+    {
+        'T', letter_T, codes_3_3_3
+    },
+    {
+        'U', letter_U, codes_0_X_X
+    },
+    {
+        'V', NULL, codes_7_7_7
+    },
+    {
+        'W', NULL, codes_7_7_7
+    },
+    {
+        'X', NULL, codes_5_54_54
+    },
+    {
+        'Y', NULL, codes_1_X_X
+    },
+    {
+        'Z', letter_Z, codes_4_4_4
+    },
+    {
+        'a', NULL, codes_X_X_6orX
+    },
+    {
+        'e', NULL, codes_X_X_6orX
+    },
+    {
+        't', NULL, codes_3or4_3or4_3or4
+    },
+    {
+        '\0'
+    }
+};
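
The letter tables above form a tree in which each entry may carry a coding,
a pointer to longer letter sequences, or both. The following minimal sketch
shows the longest-match traversal on a small excerpt of that tree. The names
struct letter, letter_root and longest_match are hypothetical, each entry
holds only the start-of-name code as a plain string, and the root is searched
linearly instead of being indexed by c - 'A' as in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for dm_letter: one code string instead of dm_codes. */
struct letter
{
    char        letter;
    const struct letter *letters;   /* longer sequences, or NULL */
    const char *code;               /* coding for this sequence, or NULL */
};

/* Excerpt of the S subtree: S, SC, SCH, ST, STC (no coding), STCH. */
static const struct letter after_STC[] = {{'H', NULL, "2"}, {'\0', NULL, NULL}};
static const struct letter after_ST[] = {{'C', after_STC, NULL}, {'\0', NULL, NULL}};
static const struct letter after_SC[] = {{'H', NULL, "4"}, {'\0', NULL, NULL}};
static const struct letter after_S[] = {
    {'C', after_SC, "2"}, {'T', after_ST, "2"}, {'\0', NULL, NULL}
};
static const struct letter letter_root[] = {{'S', after_S, "4"}, {'\0', NULL, NULL}};

/*
 * Return the code of the longest prefix of word that has a coding, and
 * store the prefix length in *len. Returns NULL (and *len = 0) if even the
 * first letter has no coding.
 */
static const char *
longest_match(const struct letter *table, const char *word, size_t *len)
{
    const char *code = NULL;
    size_t      i;

    assert(len != NULL);
    *len = 0;

    for (i = 0; table != NULL && word[i] != '\0'; i++)
    {
        const struct letter *entry;

        for (entry = table; entry->letter != '\0'; entry++)
        {
            if (entry->letter == word[i])
                break;
        }
        if (entry->letter == '\0')
            break;              /* the sequence of letters has no coding */

        if (entry->code != NULL)
        {
            /* Remember the longest sequence seen so far that has a coding. */
            code = entry->code;
            *len = i + 1;
        }
        table = entry->letters;
    }

    return code;
}
```

For "STCX" the traversal passes through STC, which has no coding of its own,
fails to match X, and therefore falls back to ST with a matched length of 2.
This mirrors how codes_ok and i_ok are only updated in _daitch_mokotoff()
when letter->codes is set.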
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff_header.pl b/contrib/daitch_mokotoff/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..26e67fe5df
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff_header.pl
@@ -0,0 +1,274 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/\|/) ] } split(/,/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character can never yield more than two alternate codes, and is not considered here.
+    for my $c (@codes[1..2]) {
+        if (@$c > 1) {
+            for my $i (0..1) {
+                my ($a, $b) = (substr($c->[$i], -1, 1), substr($c->[($i + 1)%2], 0, 1));
+                $alternates{$a}{$b} = 1 if $a ne $b;
+            }
+        }
+    }
+    my $key = "codes_" . join("_", map { join("or", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Find the maximum number of alternate codes in one position.
+my $alt_x = $alternates{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (defined delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+my $max_alt = (sort { $b <=> $a } (2, map { scalar keys %$_ } values %alternates))[0];
+
+# The maximum number of nodes and leaves in the soundex tree.
+# These are safe estimates, but in practice somewhat higher than the actual maximums.
+# Note that the first character can never yield more than two alternate codes,
+# hence the calculations are performed as sums of two subtrees.
+my $digits = 6;
+# Number of nodes (sum of geometric progression).
+my $max_nodes = 2 + 2*(1 - $max_alt**($digits - 1))/(1 - $max_alt);
+# Number of leaves (exponential of base number).
+my $max_leaves = 2*$max_alt**($digits - 2);
+
+print <<EOF;
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS $digits
+#define DM_MAX_ALTERNATE_CODES $max_alt
+#define DM_MAX_NODES $max_nodes
+#define DM_MAX_LEAVES $max_leaves
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+#define DM_VOWELS "AEIOUY"
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[2];    /* One or two alternate code sequences */
+
+struct dm_letter
+{
+    char        letter;
+    struct dm_letter *letters;
+    dm_codes   *codes;
+};
+
+struct dm_node
+{
+    int            soundex_length;
+    char        soundex[DM_MAX_CODE_DIGITS + 1];
+    int            is_leaf;
+    int            last_update;
+    char        code_digit;
+
+    /*
+     * One or two alternate code digits leading to this node - repeated code
+     * digits and 'X' lead back to the same node.
+     */
+    char        prev_code_digits[2];
+    char        next_code_digits[2];
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+EOF
+
+for my $key (sort keys %codes) {
+    print "static dm_codes $key\[\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print "\n";
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print "static dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print "$_,\n";
+    }
+    print "\t{\n\t\t'\\0'\n\t}\n";
+    print "};\n";
+}
+
+hash2code($table, '');
+
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# X = NC (not coded)
+#
+# Note that the following letters are coded with substitute letters
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5|4,5|4,5|4
+CK                        5|45,5|45,5|45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5|4,5|4,5|4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1|4,X|4,X|4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94|4,94|4,94|4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3|4,3|4,3|4
+T                        3,3,3
+UI,UJ,UY                0,1,X
+U,UE                    0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/daitch_mokotoff/expected/daitch_mokotoff.out b/contrib/daitch_mokotoff/expected/daitch_mokotoff.out
new file mode 100644
index 0000000000..57825aeb99
--- /dev/null
+++ b/contrib/daitch_mokotoff/expected/daitch_mokotoff.out
@@ -0,0 +1,472 @@
+CREATE EXTENSION daitch_mokotoff;
+-- https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('GOLDEN');
+ daitch_mokotoff
+-----------------
+ 583600
+(1 row)
+
+SELECT daitch_mokotoff('Alpert');
+ daitch_mokotoff
+-----------------
+ 087930
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+SELECT daitch_mokotoff('Haber');
+ daitch_mokotoff
+-----------------
+ 579000
+(1 row)
+
+SELECT daitch_mokotoff('Manheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+SELECT daitch_mokotoff('Mintz');
+ daitch_mokotoff
+-----------------
+ 664000
+(1 row)
+
+SELECT daitch_mokotoff('Topf');
+ daitch_mokotoff
+-----------------
+ 370000
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+SELECT daitch_mokotoff('Ben Aron');
+ daitch_mokotoff
+-----------------
+ 769600
+(1 row)
+
+SELECT daitch_mokotoff('AUERBACH');
+ daitch_mokotoff
+-----------------
+ 097500 097400
+(1 row)
+
+SELECT daitch_mokotoff('OHRBACH');
+ daitch_mokotoff
+-----------------
+ 097500 097400
+(1 row)
+
+SELECT daitch_mokotoff('LIPSHITZ');
+ daitch_mokotoff
+-----------------
+ 874400
+(1 row)
+
+SELECT daitch_mokotoff('LIPPSZYC');
+ daitch_mokotoff
+-----------------
+ 874500 874400
+(1 row)
+
+SELECT daitch_mokotoff('LEWINSKY');
+ daitch_mokotoff
+-----------------
+ 876450
+(1 row)
+
+SELECT daitch_mokotoff('LEVINSKI');
+ daitch_mokotoff
+-----------------
+ 876450
+(1 row)
+
+SELECT daitch_mokotoff('SZLAMAWICZ');
+ daitch_mokotoff
+-----------------
+ 486740
+(1 row)
+
+SELECT daitch_mokotoff('SHLAMOVITZ');
+ daitch_mokotoff
+-----------------
+ 486740
+(1 row)
+
+-- https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+SELECT daitch_mokotoff('Peters');
+ daitch_mokotoff
+-----------------
+ 739400 734000
+(1 row)
+
+SELECT daitch_mokotoff('Peterson');
+ daitch_mokotoff
+-----------------
+ 739460 734600
+(1 row)
+
+SELECT daitch_mokotoff('Moskowitz');
+ daitch_mokotoff
+-----------------
+ 645740
+(1 row)
+
+SELECT daitch_mokotoff('Moskovitz');
+ daitch_mokotoff
+-----------------
+ 645740
+(1 row)
+
+SELECT daitch_mokotoff('Auerbach');
+ daitch_mokotoff
+-----------------
+ 097500 097400
+(1 row)
+
+SELECT daitch_mokotoff('Uhrbach');
+ daitch_mokotoff
+-----------------
+ 097500 097400
+(1 row)
+
+SELECT daitch_mokotoff('Jackson');
+       daitch_mokotoff
+-----------------------------
+ 154600 145460 454600 445460
+(1 row)
+
+SELECT daitch_mokotoff('Jackson-Jackson');
+                            daitch_mokotoff
+-----------------------------------------------------------------------
+ 154654 154645 154644 145465 145464 454654 454645 454644 445465 445464
+(1 row)
+
+-- Perl Text::Phonetic::DaitchMokotoff 006_daitchmokotoff.t
+-- Tests covered above are omitted.
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('schmidt');
+ daitch_mokotoff
+-----------------
+ 463000
+(1 row)
+
+SELECT daitch_mokotoff('schneider');
+ daitch_mokotoff
+-----------------
+ 463900
+(1 row)
+
+SELECT daitch_mokotoff('fischer');
+ daitch_mokotoff
+-----------------
+ 749000
+(1 row)
+
+SELECT daitch_mokotoff('weber');
+ daitch_mokotoff
+-----------------
+ 779000
+(1 row)
+
+SELECT daitch_mokotoff('meyer');
+ daitch_mokotoff
+-----------------
+ 619000
+(1 row)
+
+SELECT daitch_mokotoff('wagner');
+ daitch_mokotoff
+-----------------
+ 756900
+(1 row)
+
+SELECT daitch_mokotoff('schulz');
+ daitch_mokotoff
+-----------------
+ 484000
+(1 row)
+
+SELECT daitch_mokotoff('becker');
+ daitch_mokotoff
+-----------------
+ 759000 745900
+(1 row)
+
+SELECT daitch_mokotoff('hoffmann');
+ daitch_mokotoff
+-----------------
+ 576600
+(1 row)
+
+SELECT daitch_mokotoff('schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+-- Apache Commons DaitchMokotoffSoundexTest.java
+-- testAccentedCharacterFolding
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Strasburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+SELECT daitch_mokotoff('Eregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- testAdjacentCodes
+SELECT daitch_mokotoff('AKSSOL');
+ daitch_mokotoff
+-----------------
+ 054800
+(1 row)
+
+SELECT daitch_mokotoff('GERSCHFELD');
+       daitch_mokotoff
+-----------------------------
+ 594578 594783 545783 547830
+(1 row)
+
+-- testEncodeBasic
+-- Tests covered above are omitted.
+-- testEncodeIgnoreApostrophes
+SELECT daitch_mokotoff('OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OB''rien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OBr''ien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OBri''en');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OBrie''n');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('OBrien''');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- testEncodeIgnoreHyphens
+SELECT daitch_mokotoff('KINGSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('-KINGSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('K-INGSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KI-NGSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KIN-GSMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KING-SMITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGS-MITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSM-ITH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMI-TH');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMIT-H');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMITH-');
+ daitch_mokotoff
+-----------------
+ 565463
+(1 row)
+
+-- testEncodeIgnoreTrimmable
+SELECT daitch_mokotoff(E' \t\n\r Washington \t\n\r ');
+ daitch_mokotoff
+-----------------
+ 746536
+(1 row)
+
+SELECT daitch_mokotoff('Washington');
+ daitch_mokotoff
+-----------------
+ 746536
+(1 row)
+
+-- testSoundexBasic
+-- Tests covered above are omitted.
+-- testSoundexBasic2
+-- Tests covered above are omitted.
+-- testSoundexBasic3
+-- Tests covered above are omitted.
+-- testSpecialRomanianCharacters
+SELECT daitch_mokotoff('ţamas'); -- t-cedila
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas'); -- t-comma
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+-- Contrived case which is not handled correctly by other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
diff --git a/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql b/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql
new file mode 100644
index 0000000000..eafc24ee87
--- /dev/null
+++ b/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql
@@ -0,0 +1,121 @@
+CREATE EXTENSION daitch_mokotoff;
+
+
+-- https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('GOLDEN');
+SELECT daitch_mokotoff('Alpert');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+SELECT daitch_mokotoff('Haber');
+SELECT daitch_mokotoff('Manheim');
+SELECT daitch_mokotoff('Mintz');
+SELECT daitch_mokotoff('Topf');
+SELECT daitch_mokotoff('Kleinman');
+SELECT daitch_mokotoff('Ben Aron');
+
+SELECT daitch_mokotoff('AUERBACH');
+SELECT daitch_mokotoff('OHRBACH');
+SELECT daitch_mokotoff('LIPSHITZ');
+SELECT daitch_mokotoff('LIPPSZYC');
+SELECT daitch_mokotoff('LEWINSKY');
+SELECT daitch_mokotoff('LEVINSKI');
+SELECT daitch_mokotoff('SZLAMAWICZ');
+SELECT daitch_mokotoff('SHLAMOVITZ');
+
+
+-- https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+
+-- https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+SELECT daitch_mokotoff('Peters');
+SELECT daitch_mokotoff('Peterson');
+SELECT daitch_mokotoff('Moskowitz');
+SELECT daitch_mokotoff('Moskovitz');
+SELECT daitch_mokotoff('Auerbach');
+SELECT daitch_mokotoff('Uhrbach');
+SELECT daitch_mokotoff('Jackson');
+SELECT daitch_mokotoff('Jackson-Jackson');
+
+
+-- Perl Text::Phonetic::DaitchMokotoff 006_daitchmokotoff.t
+-- Tests covered above are omitted.
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('schmidt');
+SELECT daitch_mokotoff('schneider');
+SELECT daitch_mokotoff('fischer');
+SELECT daitch_mokotoff('weber');
+SELECT daitch_mokotoff('meyer');
+SELECT daitch_mokotoff('wagner');
+SELECT daitch_mokotoff('schulz');
+SELECT daitch_mokotoff('becker');
+SELECT daitch_mokotoff('hoffmann');
+SELECT daitch_mokotoff('schäfer');
+
+
+-- Apache Commons DaitchMokotoffSoundexTest.java
+
+-- testAccentedCharacterFolding
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Strasburg');
+
+SELECT daitch_mokotoff('Éregon');
+SELECT daitch_mokotoff('Eregon');
+
+-- testAdjacentCodes
+SELECT daitch_mokotoff('AKSSOL');
+SELECT daitch_mokotoff('GERSCHFELD');
+
+-- testEncodeBasic
+-- Tests covered above are omitted.
+
+-- testEncodeIgnoreApostrophes
+SELECT daitch_mokotoff('OBrien');
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+SELECT daitch_mokotoff('OB''rien');
+SELECT daitch_mokotoff('OBr''ien');
+SELECT daitch_mokotoff('OBri''en');
+SELECT daitch_mokotoff('OBrie''n');
+SELECT daitch_mokotoff('OBrien''');
+
+-- testEncodeIgnoreHyphens
+SELECT daitch_mokotoff('KINGSMITH');
+SELECT daitch_mokotoff('-KINGSMITH');
+SELECT daitch_mokotoff('K-INGSMITH');
+SELECT daitch_mokotoff('KI-NGSMITH');
+SELECT daitch_mokotoff('KIN-GSMITH');
+SELECT daitch_mokotoff('KING-SMITH');
+SELECT daitch_mokotoff('KINGS-MITH');
+SELECT daitch_mokotoff('KINGSM-ITH');
+SELECT daitch_mokotoff('KINGSMI-TH');
+SELECT daitch_mokotoff('KINGSMIT-H');
+SELECT daitch_mokotoff('KINGSMITH-');
+
+-- testEncodeIgnoreTrimmable
+SELECT daitch_mokotoff(E' \t\n\r Washington \t\n\r ');
+SELECT daitch_mokotoff('Washington');
+
+-- testSoundexBasic
+-- Tests covered above are omitted.
+
+-- testSoundexBasic2
+-- Tests covered above are omitted.
+
+-- testSoundexBasic3
+-- Tests covered above are omitted.
+
+-- testSpecialRomanianCharacters
+SELECT daitch_mokotoff('ţamas'); -- t-cedila
+SELECT daitch_mokotoff('țamas'); -- t-comma
+
+
+-- Contrived case which is not handled correctly by other implementations.
+SELECT daitch_mokotoff('CJC');
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index d3ca4b6932..0788db060d 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -105,6 +105,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
  &citext;
  &cube;
  &dblink;
+ &daitch-mokotoff;
  &dict-int;
  &dict-xsyn;
  &earthdistance;
diff --git a/doc/src/sgml/daitch-mokotoff.sgml b/doc/src/sgml/daitch-mokotoff.sgml
new file mode 100644
index 0000000000..6dbc817eca
--- /dev/null
+++ b/doc/src/sgml/daitch-mokotoff.sgml
@@ -0,0 +1,92 @@
+<!-- doc/src/sgml/daitch-mokotoff.sgml -->
+
+<sect1 id="daitch-mokotoff" xreflabel="daitch_mokotoff">
+ <title>daitch_mokotoff</title>
+
+ <indexterm zone="daitch-mokotoff">
+  <primary>daitch_mokotoff</primary>
+ </indexterm>
+
+ <para>
+  The <filename>daitch_mokotoff</filename> module provides an implementation
+  of the Daitch-Mokotoff Soundex System.
+ </para>
+
+ <para>
+  Compared to the American Soundex System implemented in
+  the <filename>fuzzystrmatch</filename> module, the major improvements of the
+  Daitch-Mokotoff Soundex System are:
+
+  <itemizedlist spacing="compact" mark="bullet">
+   <listitem>
+    <para>
+     Information is coded to the first six meaningful letters rather than
+     four.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     The initial letter is coded rather than kept as is.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Where two consecutive letters have a single sound, they are coded as a
+     single number.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     When a letter or combination of letters may have two different sounds,
+     it is double coded under the two different codes.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     A letter or combination of letters maps into ten possible codes rather
+     than seven.
+    </para>
+   </listitem>
+  </itemizedlist>
+
+  The <function>daitch_mokotoff</function> function shown in
+  <xref linkend="functions-daitch-mokotoff-table"/> generates Daitch-Mokotoff
+  soundex codes for matching of similar-sounding names.
+ </para>
+
+ <table id="functions-daitch-mokotoff-table">
+  <title><filename>daitch_mokotoff</filename> Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <function>daitch_mokotoff</function> ( <parameter>name</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Generates Daitch-Mokotoff soundex codes.
+       </para></entry>
+      </row>
+     </tbody>
+  </tgroup>
+ </table>
+
+ <indexterm>
+  <primary>daitch_mokotoff</primary>
+ </indexterm>
+ <para>
+  <function>daitch_mokotoff</function> generates Daitch-Mokotoff soundex codes for the specified <parameter>name</parameter>. Since alternate soundex codes are separated by spaces, the returned text is suited for use in Full Text Search, see <xref linkend="textsearch"/>.
+ </para>
+
+</sect1>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 89454e99b9..9ac43e5928 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -117,6 +117,7 @@
 <!ENTITY btree-gist      SYSTEM "btree-gist.sgml">
 <!ENTITY citext          SYSTEM "citext.sgml">
 <!ENTITY cube            SYSTEM "cube.sgml">
+<!ENTITY daitch-mokotoff SYSTEM "daitch-mokotoff.sgml">
 <!ENTITY dblink          SYSTEM "dblink.sgml">
 <!ENTITY dict-int        SYSTEM "dict-int.sgml">
 <!ENTITY dict-xsyn       SYSTEM "dict-xsyn.sgml">
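
Editorial illustration of the letter tree that daitch_mokotoff_header.pl generates: the script links each character of a letter combination to the next and stores the sound code at the last character. The sketch below shows how a longest-match lookup over such a tree could work. It is an assumption about the lookup side, which is not shown in this part of the thread. The struct is a simplified stand-in for dm_letter, the names letter_node and longest_match are hypothetical, and only a handful of the "S" rules from the __DATA__ table are included.

```c
#include <stddef.h>

/* Simplified stand-in for the generated dm_letter table (hypothetical layout). */
struct letter_node
{
    char        letter;                     /* '\0' terminates a sibling list */
    const struct letter_node *children;     /* following letters, or NULL */
    const char *codes;                      /* "start,before vowel,other", or NULL */
};

/* A few rules copied from the __DATA__ table: S, SC, SCH, SH, ST. */
static const struct letter_node sketch_after_sc[] = {
    {'H', NULL, "4,4,4"},                   /* SCH */
    {'\0', NULL, NULL}
};

static const struct letter_node sketch_after_s[] = {
    {'C', sketch_after_sc, "2,4,4"},        /* SC */
    {'H', NULL, "4,4,4"},                   /* SH */
    {'T', NULL, "2,43,43"},                 /* ST */
    {'\0', NULL, NULL}
};

static const struct letter_node sketch_root[] = {
    {'S', sketch_after_s, "4,4,4"},         /* S */
    {'\0', NULL, NULL}
};

/*
 * Walk the tree along s and remember the deepest node that carries codes.
 * Returns the codes of the longest matching letter combination, or NULL,
 * and stores the number of input characters consumed in *matched.
 */
static const char *
longest_match(const struct letter_node *level, const char *s, size_t *matched)
{
    const char *best = NULL;
    size_t      depth = 0;

    *matched = 0;
    while (level != NULL && s[depth] != '\0')
    {
        const struct letter_node *n = level;

        /* Linear scan of the sibling list for the current letter. */
        while (n->letter != '\0' && n->letter != s[depth])
            n++;
        if (n->letter == '\0')
            break;                          /* no transition for this letter */

        depth++;
        if (n->codes != NULL)
        {
            best = n->codes;
            *matched = depth;
        }
        level = n->children;
    }
    return best;
}
```

With these tables, looking up "SCHMIDT" consumes three letters (SCH), "STONE" consumes two (ST), and an input starting with a letter absent from the sketch tables yields NULL.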


Re: daitch_mokotoff module

From
Tomas Vondra
Date:
On 12/13/21 14:38, Dag Lem wrote:
> Please find attached an updated patch, with the following fixes:
> 
> * Replaced remaining malloc/free with palloc/pfree.
> * Made "make check" pass.
> * Updated notes on other implementations.
> 

Thanks, looks interesting. A couple generic comments, based on a quick 
code review.

1) Can the extension be marked as trusted, just like fuzzystrmatch?

2) The docs really need an explanation of what the extension is for, not 
just a link to fuzzystrmatch. Also, a couple examples would be helpful, 
I guess - similarly to fuzzystrmatch. The last line in the docs is 
annoyingly long.

3) What's daitch_mokotoff_header.pl for? I mean, it generates the header, 
but when do we need to run it?

4) It seems to require perl-open, which is a module we did not need 
until now. Not sure how well supported it is, but maybe we can use a 
more standard module?

5) Do we need to keep DM_MAIN? It seems to be meant for some kind of 
testing, but our regression tests certainly don't need it (or the palloc 
mockup). I suggest getting rid of it.

6) I really don't understand some of the comments in daitch_mokotoff.sql, 
like for example:

-- testEncodeBasic
-- Tests covered above are omitted.

Also, comments with names of Java methods seem pretty confusing. It'd be 
better to actually explain which rules the tests are checking.

7) There are almost no comments in the .c file (ignoring the comment on 
top). Short functions like initialize_node are probably fine without 
one, but e.g. update_node would deserve one.

8) Some of the lines are pretty long (e.g. the update_node signature is 
almost 170 chars). That should be wrapped. Maybe try running pgindent on 
the code, that'll show which parts need better formatting (so as not to 
get broken later).

9) I'm sure there's a better way to get the number of valid chars than this:

   for (i = 0, ilen = 0;
        (c = read_char(&str[i], &ilen)) && (c < 'A' || c > ']');
        i += ilen)
   {
   }

Say, a while loop or something?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: daitch_mokotoff module

From
Andrew Dunstan
Date:
On 12/13/21 09:26, Tomas Vondra wrote:
> On 12/13/21 14:38, Dag Lem wrote:
>> Please find attached an updated patch, with the following fixes:
>>
>> * Replaced remaining malloc/free with palloc/pfree.
>> * Made "make check" pass.
>> * Updated notes on other implementations.
>>
>
> Thanks, looks interesting. A couple generic comments, based on a quick
> code review.
>
> 1) Can the extension be marked as trusted, just like fuzzystrmatch?
>
> 2) The docs really need an explanation of what the extension is for,
> not just a link to fuzzystrmatch. Also, a couple examples would be
> helpful, I guess - similarly to fuzzystrmatch. The last line in the
> docs is annoyingly long.


It's not clear to me why we need a new module for this. Wouldn't it be
better just to add the new function to fuzzystrmatch?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: daitch_mokotoff module

From
Tomas Vondra
Date:
On 12/13/21 16:05, Andrew Dunstan wrote:
> 
> On 12/13/21 09:26, Tomas Vondra wrote:
>> On 12/13/21 14:38, Dag Lem wrote:
>>> Please find attached an updated patch, with the following fixes:
>>>
>>> * Replaced remaining malloc/free with palloc/pfree.
>>> * Made "make check" pass.
>>> * Updated notes on other implementations.
>>>
>>
>> Thanks, looks interesting. A couple generic comments, based on a quick
>> code review.
>>
>> 1) Can the extension be marked as trusted, just like fuzzystrmatch?
>>
>> 2) The docs really need an explanation of what the extension is for,
>> not just a link to fuzzystrmatch. Also, a couple examples would be
>> helpful, I guess - similarly to fuzzystrmatch. The last line in the
>> docs is annoyingly long.
> 
> 
> It's not clear to me why we need a new module for this. Wouldn't it be
> better just to add the new function to fuzzystrmatch?
> 

Yeah, that's a valid point. I think we're quite conservative about 
adding more contrib modules, and adding a function to an existing one 
works around a lot of that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: daitch_mokotoff module

From
Dag Lem
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

[...]

>
> Thanks, looks interesting. A couple generic comments, based on a quick
> code review.

Thank you for the constructive review!

>
> 1) Can the extension be marked as trusted, just like fuzzystrmatch?

I have now moved the daitch_mokotoff function into the fuzzystrmatch
module, as suggested by Andrew Dunstan.

>
> 2) The docs really need an explanation of what the extension is for,
> not just a link to fuzzystrmatch. Also, a couple examples would be
> helpful, I guess - similarly to fuzzystrmatch. The last line in the
> docs is annoyingly long.

Please see the updated documentation for the fuzzystrmatch module.
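
As an editorial illustration of the kind of example requested above: daitch_mokotoff returns alternate codes separated by spaces, so two names can be treated as similar-sounding when their code lists share at least one code. The sketch below performs that comparison on the client side in C. The function name dm_codes_overlap is hypothetical and not part of the patch. Inside the database, the documentation points to Full Text Search for this instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if the space-separated code lists a and b have at least one
 * code in common.  Hypothetical helper, for illustration only.
 */
static bool
dm_codes_overlap(const char *a, const char *b)
{
    const char *p = a;

    while (*p != '\0')
    {
        size_t      len = strcspn(p, " ");  /* length of the current code in a */

        if (len > 0)
        {
            const char *q = b;

            while (*q != '\0')
            {
                size_t      qlen = strcspn(q, " ");

                if (qlen == len && strncmp(p, q, len) == 0)
                    return true;
                q += qlen;
                q += strspn(q, " ");        /* skip separators */
            }
        }
        p += len;
        p += strspn(p, " ");                /* skip separators */
    }
    return false;
}
```

Using values from the regression test output in the patch, the lists for 'Ceniow' ("567000 467000") and 'Tsenyuv' ("467000") overlap on 467000, while 'GOLDEN' ("583600") and 'Alpert' ("087930") do not overlap.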

>
> 3) What's daitch_mokotoff_header.pl for? I mean, it generates the
> header, but when do we need to run it?

It only has to be run if the soundex rules are changed. I have now
made the dependencies explicit in the fuzzystrmatch Makefile.

>
> 4) It seems to require perl-open, which is a module we did not need
> until now. Not sure how well supported it is, but maybe we can use a
> more standard module?

I believe Perl I/O layers have been part of Perl core for two decades
now :-)

>
> 5) Do we need to keep DM_MAIN? It seems to be meant for some kind of
> testing, but our regression tests certainly don't need it (or the
> palloc mockup). I suggest getting rid of it.

Done. BTW this was modeled after dmetaphone.c

>
> 6) I really don't understand some of the comments in
> daitch_mokotoff.sql, like for example:
>
> -- testEncodeBasic
> -- Tests covered above are omitted.
>
> Also, comments with names of Java methods seem pretty confusing. It'd
> be better to actually explain which rules the tests are checking.

The tests were copied from various web sites and implementations. I have
cut down on the number of tests and made the comments more to the point.

>
> 7) There are almost no comments in the .c file (ignoring the comment
> on top). Short functions like initialize_node are probably fine
> without one, but e.g. update_node would deserve one.

I have added more comments to both the .h and the .c file.

>
> 8) Some of the lines are pretty long (e.g. the update_node signature
> is almost 170 chars). That should be wrapped. Maybe try running
> pgindent on the code, that'll show which parts need better formatting
> (so as not to get broken later).

Fixed. I did run pgindent earlier, however it didn't catch those long
lines.

>
> 9) I'm sure there's a better way to get the number of valid chars than this:
>
>   for (i = 0, ilen = 0;
>        (c = read_char(&str[i], &ilen)) && (c < 'A' || c > ']');
>        i += ilen)
>   {
>   }
>
> Say, a while loop or something?

The code gets to the next encodable character, skipping any other
characters. I have now added a comment which should hopefully make this
clearer, and broken up the for loop for readability.
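
Editorial sketch of the skipping step described above, written as a while loop. It is not the code from the revised patch. read_char_ascii is a hypothetical single-byte stand-in for read_char, which in the patch also reports how many bytes it consumed. The range 'A' to ']' is taken from the quoted loop and includes the substitute letters '[', '\\' and ']' mentioned in the code table comments.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for read_char: handles single-byte input only,
 * returns the uppercased character and stores the bytes consumed in *ilen.
 */
static char
read_char_ascii(const char *str, int *ilen)
{
    *ilen = (*str != '\0') ? 1 : 0;
    return (char) toupper((unsigned char) *str);
}

/*
 * Starting at offset start, skip everything that is not encodable and
 * return the offset of the next encodable character.  The character itself
 * is stored in *c, which is '\0' when the end of the string was reached.
 */
static size_t
skip_to_encodable(const char *str, size_t start, char *c, int *ilen)
{
    size_t      i = start;

    while ((*c = read_char_ascii(&str[i], ilen)) != '\0' &&
           (*c < 'A' || *c > ']'))
        i += *ilen;

    return i;
}
```

Called on "-'K" from offset 0, the sketch skips the hyphen and the apostrophe and stops at offset 2 with *c set to 'K', which matches the hyphen and apostrophe test cases in the regression suite being ignored.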

Please find attached the revised patch.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..826e529e3e 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,11 +3,12 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

 REGRESS = fuzzystrmatch
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..302e9a6d86
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,516 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - "J" is considered a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ * - Both dmlat.php and dmrules.txt have the same unofficial rules for "UE".
+ * - Coding of MN/NM + M/N differs between dmsoundex.php and DaitchMokotoffSoundex.java
+ * - No other known implementation yields the correct set of codes for e.g.
+ *   "CJC" (550000 540000 545000 450000 400000 440000).
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#include <ctype.h>
+#include <string.h>
+
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string,
+               *tmp_soundex;
+    text       *soundex;
+
+    /*
+     * The maximum theoretical soundex size is several KB, however in practice
+     * anything but contrived synthetic inputs will yield a soundex size of
+     * less than 100 bytes. We thus allocate and free a temporary work buffer,
+     * and return only the actual soundex result.
+     */
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
+
+    _daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS);
+
+    soundex = cstring_to_text(pg_any_to_server(tmp_soundex, strlen(tmp_soundex), PG_UTF8));
+    pfree(tmp_soundex);
+
+    PG_RETURN_TEXT_P(soundex);
+}
+
+
+typedef dm_node dm_nodes[DM_MAX_NODES];
+typedef dm_node * dm_leaves[DM_MAX_LEAVES];
+
+
+/* Template for new node in soundex code tree */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .next_nodes = {NULL}
+};
+
+
+/* Initialize soundex code tree node for next code digit */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit */
+static void
+add_next_code_digit(dm_node * node, char code_digit)
+{
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf (soundex code completed) */
+static void
+set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+        leaves_next[(*num_leaves_next)++] = node;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node */
+static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
+                                     dm_node * node, char code_digit)
+{
+    dm_node   **next_nodes;
+    dm_node    *next_node;
+
+    for (next_nodes = node->next_nodes; (next_node = *next_nodes); next_nodes++)
+    {
+        if (next_node->code_digit == code_digit)
+        {
+            return next_node;
+        }
+    }
+
+    next_node = &nodes[(*num_nodes)++];
+    *next_nodes = next_node;
+
+    *next_node = start_node;
+    memcpy(next_node->soundex, node->soundex, sizeof(next_node->soundex));
+    next_node->soundex_length = node->soundex_length;
+    next_node->soundex[next_node->soundex_length++] = code_digit;
+    next_node->code_digit = code_digit;
+
+    return next_node;
+}
+
+
+/* Update node for next code digit(s) */
+static int
+update_node(dm_nodes nodes, dm_node * node, int *num_nodes,
+            dm_leaves leaves_next, int *num_leaves_next,
+            int char_number, char next_code_digit, char next_code_digit_2)
+{
+    int            i;
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, char_number);
+
+    if (node->soundex_length == DM_MAX_CODE_DIGITS)
+    {
+        /* Keep completed soundex code. */
+        set_leaf(leaves_next, num_leaves_next, node);
+        return 0;
+    }
+
+    if (next_code_digit == 'X' ||
+        node->prev_code_digits[0] == next_code_digit ||
+        node->prev_code_digits[1] == next_code_digit)
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_node(nodes, num_nodes, node, next_code_digit);
+        initialize_node(node, char_number);
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_digit);
+
+        if (next_code_digit_2)
+        {
+            update_node(nodes, dirty_nodes[i], num_nodes,
+                        leaves_next, num_leaves_next,
+                        char_number, next_code_digit_2, '\0');
+        }
+        else
+        {
+            set_leaf(leaves_next, num_leaves_next, dirty_nodes[i]);
+        }
+    }
+
+    return 1;
+}
+
+
+/* Mark completed soundex node leaves. Return 1 when all nodes are completed */
+static int
+update_leaves(dm_nodes nodes, int *num_nodes,
+              dm_leaves leaves[2], int *ix_leaves, int *num_leaves,
+              int char_number, dm_codes codes)
+{
+    int            i,
+                j;
+    char       *code;
+    int            num_leaves_next = 0;
+    int            ix_leaves_next = (*ix_leaves + 1) % 2;
+    int            finished = 1;
+
+    for (i = 0; i < *num_leaves; i++)
+    {
+        dm_node    *node = leaves[*ix_leaves][i];
+
+        /* One or two alternate code sequences. */
+        for (j = 0; j < 2 && (code = codes[j]) && code[0]; j++)
+        {
+            /* One or two sequential code digits. */
+            if (update_node(nodes, node, num_nodes,
+                            leaves[ix_leaves_next], &num_leaves_next,
+                            char_number, code[0], code[1]))
+            {
+                finished = 0;
+            }
+        }
+    }
+
+    *ix_leaves = ix_leaves_next;
+    *num_leaves = num_leaves_next;
+
+    return finished;
+}
+
+
+/* Mapping from ISO8859-1 to ASCII */
+static const char tr_accents_iso8859_1[] =
+/*
+"ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDsaaaaaaeceeeeiiiidnooooo/ouuuuydy";
+
+static char
+unaccent_iso8859_1(unsigned char c)
+{
+    return c >= 192 ? tr_accents_iso8859_1[c - 192] : c;
+}
+
+
+/* Convert a UTF-8 character to ISO-8859-1.
+ * Unconvertible characters are returned as '?'.
+ * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
+ */
+static char
+utf8_to_iso8859_1(char *str, int *len)
+{
+    const char    unknown = '?';
+    unsigned char c;
+    unsigned int code_point;
+
+    *len = 1;
+    c = (unsigned char) str[0];
+    if (c < 0x80)
+    {
+        /* ASCII code point. */
+        return c;
+    }
+    else if (c < 0xE0)
+    {
+        /* Two-byte character. */
+        if (!str[1])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            return unknown;
+        }
+        *len = 2;
+        code_point = ((c & 0x1F) << 6) | (str[1] & 0x3F);
+        if (code_point < 0x100)
+        {
+            /* ISO-8859-1 code point. */
+            return code_point;
+        }
+        else if (code_point == 0x0104 || code_point == 0x0105)
+        {
+            /* Ą/ą */
+            return '[';
+        }
+        else if (code_point == 0x0118 || code_point == 0x0119)
+        {
+            /* Ę/ę */
+            return '\\';
+        }
+        else if (code_point == 0x0162 || code_point == 0x0163 ||
+                 code_point == 0x021A || code_point == 0x021B)
+        {
+            /* Ţ/ţ or Ț/ț */
+            return ']';
+        }
+        else
+        {
+            return unknown;
+        }
+    }
+    else if (c < 0xF0)
+    {
+        /* Three-byte character. */
+        if (!str[2])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            *len = 2;
+        }
+        else
+        {
+            *len = 3;
+        }
+        return unknown;
+    }
+    else
+    {
+        /* Four-byte character. */
+        if (!str[3])
+        {
+            /* The UTF-8 character is cut short (invalid code point). */
+            *len = 3;
+        }
+        else
+        {
+            *len = 4;
+        }
+        return unknown;
+    }
+}
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII */
+static char
+read_char(char *str, int *len)
+{
+    return toupper((unsigned char) unaccent_iso8859_1(utf8_to_iso8859_1(str, len)));
+}
+
+
+/* Return next character in the character set [A..\]], skipping any other characters */
+static char
+read_valid_char(char *str, int *len)
+{
+    int            c;
+    int            i,
+                ilen;
+
+    for (i = 0, ilen = 0;
+         (c = read_char(&str[i], &ilen)) && (c < 'A' || c > ']');
+         i += ilen)
+    {
+    }
+
+    *len = i + ilen;
+    return c;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space */
+static char *
+_daitch_mokotoff(char *word, char *soundex, size_t n)
+{
+    int            c,
+                cmp;
+    int            i,
+                ilen,
+                i_ok,
+                j,
+                jlen,
+                k;
+    int            first_letter = 1;
+    int            ix_leaves = 0;
+    int            num_nodes = 0,
+                num_leaves = 0;
+    dm_letter  *letter,
+               *letters;
+    dm_codes   *codes_ok,
+                codes;
+
+    dm_node    *nodes = palloc(sizeof(dm_nodes));
+    dm_leaves  *leaves = palloc(2 * sizeof(dm_leaves));
+
+    /* Starting point. */
+    nodes[num_nodes++] = start_node;
+    leaves[ix_leaves][num_leaves++] = &nodes[0];
+
+    for (i = 0; (c = read_valid_char(&word[i], &ilen)); i += ilen)
+    {
+        /* First letter in sequence. */
+        letter = &letter_[c - 'A'];
+        codes_ok = letter->codes;
+        i_ok = i;
+
+        /* Subsequent letters. */
+        for (j = i + ilen;
+             (letters = letter->letters) && (c = read_valid_char(&word[j], &jlen));
+             j += jlen)
+        {
+            for (k = 0; (cmp = letters[k].letter); k++)
+            {
+                if (cmp == c)
+                {
+                    /* Coding for letter found. */
+                    break;
+                }
+            }
+            if (!cmp)
+            {
+                /* The sequence of letters has no coding. */
+                break;
+            }
+
+            letter = &letters[k];
+            if (letter->codes)
+            {
+                codes_ok = letter->codes;
+                i_ok = j;
+                ilen = jlen;
+            }
+        }
+
+        /* Determine which code to use. */
+        if (first_letter)
+        {
+            /* This is the first letter. */
+            j = 0;
+            first_letter = 0;
+        }
+        else if ((c = read_valid_char(&word[i_ok + ilen], &jlen)) && strchr(DM_VOWELS, c))
+        {
+            /* The next letter is a vowel. */
+            j = 1;
+        }
+        else
+        {
+            /* All other cases. */
+            j = 2;
+        }
+        memcpy(codes, codes_ok[j], sizeof(codes));
+
+        /* Update leaves. */
+        if (update_leaves(nodes, &num_nodes,
+                          leaves, &ix_leaves, &num_leaves,
+                          i, codes))
+        {
+            /* All soundex codes are completed to six digits. */
+            break;
+        }
+
+        /* Prepare for next letter sequence. */
+        i = i_ok;
+    }
+
+    /* Concatenate all generated soundex codes. */
+    for (i = 0, j = 0;
+         i < num_leaves && j + DM_MAX_CODE_DIGITS + 1 <= n;
+         i++, j += DM_MAX_CODE_DIGITS + 1)
+    {
+        memcpy(&soundex[j], leaves[ix_leaves][i]->soundex, DM_MAX_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string. */
+    soundex[j - (j != 0)] = '\0';
+
+    pfree(leaves);
+    pfree(nodes);
+
+    return soundex;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.h b/contrib/fuzzystrmatch/daitch_mokotoff.h
new file mode 100644
index 0000000000..071760495c
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.h
@@ -0,0 +1,1115 @@
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS 6
+#define DM_MAX_ALTERNATE_CODES 5
+#define DM_MAX_NODES 1564
+#define DM_MAX_LEAVES 1250
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+#define DM_VOWELS "AEIOUY"
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[2];    /* One or two alternate code sequences */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    struct dm_letter *letters;    /* List of possible successive letters */
+    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Completed soundex codes are leaves */
+    int            last_update;    /* Character index for last update of node */
+    char        code_digit;        /* Current code digit */
+
+    /*
+     * One or two alternate code digits leading to this node - repeated code
+     * digits and 'X' lead back to the same node.
+     */
+    char        prev_code_digits[2];
+    char        next_code_digits[2];
+    /* Branching nodes */
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+static dm_codes codes_0_1_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "1"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_0_7_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "7"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_0_X_X[] =
+{
+    {
+        "0"
+    },
+    {
+        "X"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1_1_X[] =
+{
+    {
+        "1"
+    },
+    {
+        "1"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1_X_X[] =
+{
+    {
+        "1"
+    },
+    {
+        "X"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_1or4_Xor4_Xor4[] =
+{
+    {
+        "1", "4"
+    },
+    {
+        "X", "4"
+    },
+    {
+        "X", "4"
+    }
+};
+static dm_codes codes_2_43_43[] =
+{
+    {
+        "2"
+    },
+    {
+        "43"
+    },
+    {
+        "43"
+    }
+};
+static dm_codes codes_2_4_4[] =
+{
+    {
+        "2"
+    },
+    {
+        "4"
+    },
+    {
+        "4"
+    }
+};
+static dm_codes codes_3_3_3[] =
+{
+    {
+        "3"
+    },
+    {
+        "3"
+    },
+    {
+        "3"
+    }
+};
+static dm_codes codes_3or4_3or4_3or4[] =
+{
+    {
+        "3", "4"
+    },
+    {
+        "3", "4"
+    },
+    {
+        "3", "4"
+    }
+};
+static dm_codes codes_4_4_4[] =
+{
+    {
+        "4"
+    },
+    {
+        "4"
+    },
+    {
+        "4"
+    }
+};
+static dm_codes codes_5_54_54[] =
+{
+    {
+        "5"
+    },
+    {
+        "54"
+    },
+    {
+        "54"
+    }
+};
+static dm_codes codes_5_5_5[] =
+{
+    {
+        "5"
+    },
+    {
+        "5"
+    },
+    {
+        "5"
+    }
+};
+static dm_codes codes_5_5_X[] =
+{
+    {
+        "5"
+    },
+    {
+        "5"
+    },
+    {
+        "X"
+    }
+};
+static dm_codes codes_5or45_5or45_5or45[] =
+{
+    {
+        "5", "45"
+    },
+    {
+        "5", "45"
+    },
+    {
+        "5", "45"
+    }
+};
+static dm_codes codes_5or4_5or4_5or4[] =
+{
+    {
+        "5", "4"
+    },
+    {
+        "5", "4"
+    },
+    {
+        "5", "4"
+    }
+};
+static dm_codes codes_66_66_66[] =
+{
+    {
+        "66"
+    },
+    {
+        "66"
+    },
+    {
+        "66"
+    }
+};
+static dm_codes codes_6_6_6[] =
+{
+    {
+        "6"
+    },
+    {
+        "6"
+    },
+    {
+        "6"
+    }
+};
+static dm_codes codes_7_7_7[] =
+{
+    {
+        "7"
+    },
+    {
+        "7"
+    },
+    {
+        "7"
+    }
+};
+static dm_codes codes_8_8_8[] =
+{
+    {
+        "8"
+    },
+    {
+        "8"
+    },
+    {
+        "8"
+    }
+};
+static dm_codes codes_94or4_94or4_94or4[] =
+{
+    {
+        "94", "4"
+    },
+    {
+        "94", "4"
+    },
+    {
+        "94", "4"
+    }
+};
+static dm_codes codes_9_9_9[] =
+{
+    {
+        "9"
+    },
+    {
+        "9"
+    },
+    {
+        "9"
+    }
+};
+static dm_codes codes_X_X_6orX[] =
+{
+    {
+        "X"
+    },
+    {
+        "X"
+    },
+    {
+        "6", "X"
+    }
+};
+
+/* Coding for alternative following letters in sequence. */
+static dm_letter letter_A[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_0_7_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CH[] =
+{
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CS[] =
+{
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_C[] =
+{
+    {
+        'H', letter_CH, codes_5or4_5or4_5or4
+    },
+    {
+        'K', NULL, codes_5or45_5or45_5or45
+    },
+    {
+        'S', letter_CS, codes_4_4_4
+    },
+    {
+        'Z', letter_CZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DS[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DZ[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_D[] =
+{
+    {
+        'R', letter_DR, NULL
+    },
+    {
+        'S', letter_DS, codes_4_4_4
+    },
+    {
+        'T', NULL, codes_3_3_3
+    },
+    {
+        'Z', letter_DZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_E[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_1_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_F[] =
+{
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_I[] =
+{
+    {
+        'A', NULL, codes_1_X_X
+    },
+    {
+        'E', NULL, codes_1_X_X
+    },
+    {
+        'O', NULL, codes_1_X_X
+    },
+    {
+        'U', NULL, codes_1_X_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_K[] =
+{
+    {
+        'H', NULL, codes_5_5_5
+    },
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_M[] =
+{
+    {
+        'N', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_N[] =
+{
+    {
+        'M', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_O[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_P[] =
+{
+    {
+        'F', NULL, codes_7_7_7
+    },
+    {
+        'H', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_R[] =
+{
+    {
+        'S', NULL, codes_94or4_94or4_94or4
+    },
+    {
+        'Z', NULL, codes_94or4_94or4_94or4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTS[] =
+{
+    {
+        'C', letter_SCHTSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHT[] =
+{
+    {
+        'C', letter_SCHTC, NULL
+    },
+    {
+        'S', letter_SCHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCH[] =
+{
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SCHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SC[] =
+{
+    {
+        'H', letter_SCH, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTS[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHT[] =
+{
+    {
+        'C', letter_SHTC, NULL
+    },
+    {
+        'S', letter_SHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SH[] =
+{
+    {
+        'C', letter_SHC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STR[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STS[] =
+{
+    {
+        'C', letter_STSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ST[] =
+{
+    {
+        'C', letter_STC, NULL
+    },
+    {
+        'R', letter_STR, NULL
+    },
+    {
+        'S', letter_STS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZC[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZ[] =
+{
+    {
+        'C', letter_SZC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', NULL, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_S[] =
+{
+    {
+        'C', letter_SC, codes_2_4_4
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'H', letter_SH, codes_4_4_4
+    },
+    {
+        'T', letter_ST, codes_2_43_43
+    },
+    {
+        'Z', letter_SZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TS[] =
+{
+    {
+        'C', letter_TSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTS[] =
+{
+    {
+        'C', letter_TTSC, NULL
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TT[] =
+{
+    {
+        'C', letter_TTC, NULL
+    },
+    {
+        'S', letter_TTS, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_T[] =
+{
+    {
+        'C', letter_TC, codes_4_4_4
+    },
+    {
+        'H', NULL, codes_3_3_3
+    },
+    {
+        'R', letter_TR, NULL
+    },
+    {
+        'S', letter_TS, codes_4_4_4
+    },
+    {
+        'T', letter_TT, NULL
+    },
+    {
+        'Z', letter_TZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_U[] =
+{
+    {
+        'E', NULL, codes_0_X_X
+    },
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZD[] =
+{
+    {
+        'Z', letter_ZDZ, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHD[] =
+{
+    {
+        'Z', letter_ZHDZ, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZH[] =
+{
+    {
+        'D', letter_ZHD, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZS[] =
+{
+    {
+        'C', letter_ZSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_Z[] =
+{
+    {
+        'D', letter_ZD, codes_2_43_43
+    },
+    {
+        'H', letter_ZH, codes_4_4_4
+    },
+    {
+        'S', letter_ZS, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_[] =
+{
+    {
+        'A', letter_A, codes_0_X_X
+    },
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        'C', letter_C, codes_5or4_5or4_5or4
+    },
+    {
+        'D', letter_D, codes_3_3_3
+    },
+    {
+        'E', letter_E, codes_0_X_X
+    },
+    {
+        'F', letter_F, codes_7_7_7
+    },
+    {
+        'G', NULL, codes_5_5_5
+    },
+    {
+        'H', NULL, codes_5_5_X
+    },
+    {
+        'I', letter_I, codes_0_X_X
+    },
+    {
+        'J', NULL, codes_1or4_Xor4_Xor4
+    },
+    {
+        'K', letter_K, codes_5_5_5
+    },
+    {
+        'L', NULL, codes_8_8_8
+    },
+    {
+        'M', letter_M, codes_6_6_6
+    },
+    {
+        'N', letter_N, codes_6_6_6
+    },
+    {
+        'O', letter_O, codes_0_X_X
+    },
+    {
+        'P', letter_P, codes_7_7_7
+    },
+    {
+        'Q', NULL, codes_5_5_5
+    },
+    {
+        'R', letter_R, codes_9_9_9
+    },
+    {
+        'S', letter_S, codes_4_4_4
+    },
+    {
+        'T', letter_T, codes_3_3_3
+    },
+    {
+        'U', letter_U, codes_0_X_X
+    },
+    {
+        'V', NULL, codes_7_7_7
+    },
+    {
+        'W', NULL, codes_7_7_7
+    },
+    {
+        'X', NULL, codes_5_54_54
+    },
+    {
+        'Y', NULL, codes_1_X_X
+    },
+    {
+        'Z', letter_Z, codes_4_4_4
+    },
+    {
+        'a', NULL, codes_X_X_6orX
+    },
+    {
+        'e', NULL, codes_X_X_6orX
+    },
+    {
+        't', NULL, codes_3or4_3or4_3or4
+    },
+    {
+        '\0'
+    }
+};
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..30cf9d3909
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,281 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/\|/) ] } split(/,/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character can never yield more than two alternate codes, and is not considered here.
+    for my $c (@codes[1..2]) {
+        if (@$c > 1) {
+            for my $i (0..1) {
+                my ($a, $b) = (substr($c->[$i], -1, 1), substr($c->[($i + 1)%2], 0, 1));
+                $alternates{$a}{$b} = 1 if $a ne $b;
+            }
+        }
+    }
+    my $key = "codes_" . join("_", map { join("or", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Find the maximum number of alternate codes in one position.
+my $alt_x = $alternates{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (defined delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+my $max_alt = (reverse sort (2, map { scalar keys %$_ } values %alternates))[0];
+
+# The maximum number of nodes and leaves in the soundex tree.
+# These are safe estimates, but in practice somewhat higher than the actual maximums.
+# Note that the first character can never yield more than two alternate codes,
+# hence the calculations are performed as sums of two subtrees.
+my $digits = 6;
+# Number of nodes (sum of geometric progression).
+my $max_nodes = 2 + 2*(1 - $max_alt**($digits - 1))/(1 - $max_alt);
+# Number of leaves (exponential of base number).
+my $max_leaves = 2*$max_alt**($digits - 2);
+
+print <<EOF;
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS $digits
+#define DM_MAX_ALTERNATE_CODES $max_alt
+#define DM_MAX_NODES $max_nodes
+#define DM_MAX_LEAVES $max_leaves
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+#define DM_VOWELS "AEIOUY"
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[2];    /* One or two alternate code sequences */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    struct dm_letter *letters;    /* List of possible successive letters */
+    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Completed soundex codes are leaves */
+    int            last_update;    /* Character index for last update of node */
+    char        code_digit;        /* Current code digit */
+
+    /*
+     * One or two alternate code digits leading to this node - repeated code
+     * digits and 'X' lead back to the same node.
+     */
+    char        prev_code_digits[2];
+    char        next_code_digits[2];
+    /* Branching nodes */
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print "static dm_codes $key\[\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print "static dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print "$_,\n";
+    }
+    print "\t{\n\t\t'\\0'\n\t}\n";
+    print "};\n";
+}
+
+hash2code($table, '');
+
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# X = NC (not coded)
+#
+# Note that the following letters are coded with substitute letters
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5|4,5|4,5|4
+CK                        5|45,5|45,5|45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5|4,5|4,5|4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1|4,X|4,X|4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94|4,94|4,94|4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3|4,3|4,3|4
+T                        3,3,3
+UI,UJ,UY                0,1,X
+U,UE                    0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..1f7708fff0 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,194 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+SELECT daitch_mokotoff('Besst');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+-- Contrived case which is not handled correctly by other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
similarity index 92%
rename from contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
rename to contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
index 41de9d949b..2a8a100699 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
@@ -42,3 +42,7 @@ LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
 CREATE FUNCTION dmetaphone_alt (text) RETURNS text
 AS 'MODULE_PATHNAME', 'dmetaphone_alt'
 LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..85ac755ddd 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,55 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+SELECT daitch_mokotoff('Besst');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
+
+-- Contrived case which is not handled correctly by other implementations.
+SELECT daitch_mokotoff('CJC');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..a5dbd535f4 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,98 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   Any alternative soundex codes are separated by spaces, which makes the returned
+   text suited for use in Full Text Search, see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no computation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('john');
+INSERT INTO s VALUES ('joan');
+INSERT INTO s VALUES ('wobbly');
+INSERT INTO s VALUES ('jack');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+</programlisting>
+ </sect2>
+
 </sect1>

Re: daitch_mokotoff module

From
Andres Freund
Date:
On 2021-12-21 22:41:18 +0100, Dag Lem wrote:
> This is my very first code contribution to PostgreSQL, and I would be
> grateful for any advice on how to proceed in order to get the patch
> accepted.

Currently the tests don't seem to pass on any platform:
https://cirrus-ci.com/task/5941863248035840?logs=test_world#L572
https://api.cirrus-ci.com/v1/artifact/task/5941863248035840/regress_diffs/contrib/fuzzystrmatch/regression.diffs

Greetings,

Andres Freund



Re: daitch_mokotoff module

From
Thomas Munro
Date:
On Mon, Jan 3, 2022 at 10:32 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-12-21 22:41:18 +0100, Dag Lem wrote:
> > This is my very first code contribution to PostgreSQL, and I would be
> > grateful for any advice on how to proceed in order to get the patch
> > accepted.
>
> Currently the tests don't seem to pass on any platform:
> https://cirrus-ci.com/task/5941863248035840?logs=test_world#L572
> https://api.cirrus-ci.com/v1/artifact/task/5941863248035840/regress_diffs/contrib/fuzzystrmatch/regression.diffs

Erm, it looks like something weird is happening somewhere in cfbot's
pipeline, because Dag's patch says:

+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)

... but it's failing like:

 SELECT daitch_mokotoff('Straßburg');
  daitch_mokotoff
 -----------------
- 294795
+ 297950
 (1 row)

It's possible that I broke cfbot when upgrading to Python 3 a few
months back (ie encoding snafu when using the "requests" module to
pull patches down from the archives).  I'll try to fix this soon.



Re: daitch_mokotoff module

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Erm, it looks like something weird is happening somewhere in cfbot's
> pipeline, because Dag's patch says:

> +SELECT daitch_mokotoff('Straßburg');
> + daitch_mokotoff
> +-----------------
> + 294795
> +(1 row)

... so, that test case is guaranteed to fail in non-UTF8 encodings,
I suppose?  I wonder what the LANG environment is in that cfbot
instance.

(We do have methods for dealing with non-ASCII test cases, but
I can't see that this patch is using any of them.)

            regards, tom lane



Re: daitch_mokotoff module

From
Dag Lem
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Thomas Munro <thomas.munro@gmail.com> writes:
>> Erm, it looks like something weird is happening somewhere in cfbot's
>> pipeline, because Dag's patch says:
>
>> +SELECT daitch_mokotoff('Straßburg');
>> + daitch_mokotoff
>> +-----------------
>> + 294795
>> +(1 row)
>
> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
> I suppose?  I wonder what the LANG environment is in that cfbot
> instance.
>
> (We do have methods for dealing with non-ASCII test cases, but
> I can't see that this patch is using any of them.)
>
>             regards, tom lane
>

I naively assumed that tests would be run in an UTF8 environment.

Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
two other modules are using UTF8 characters in tests - citext and
unaccent.

The citext tests seem to be commented out - "Multibyte sanity
tests. Uncomment to run."

Looking into the unaccent module, I don't quite understand how it will
work with various encodings, since it doesn't seem to decode its input -
will it fail if run under anything but ASCII or UTF8?

In any case, I see that unaccent.sql starts as follows:


CREATE EXTENSION unaccent;

-- must have a UTF8 database
SELECT getdatabaseencoding();

SET client_encoding TO 'UTF8';


Would doing the same thing in fuzzystrmatch.sql fix the problem with
failing tests? Should I prepare a new patch?


Best regards

Dag Lem



Re: daitch_mokotoff module

From
Tom Lane
Date:
Dag Lem <dag@nimrod.no> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> (We do have methods for dealing with non-ASCII test cases, but
>> I can't see that this patch is using any of them.)

> I naively assumed that tests would be run in an UTF8 environment.

Nope, not necessarily.

Our current best practice for this is to separate out encoding-dependent
test cases into their own test script, and guard the script with an
initial test on database encoding.  You can see an example in
src/test/modules/test_regex/sql/test_regex_utf8.sql
and the two associated expected-files.  It's a good idea to also cover
as much as you can with pure-ASCII test cases that will run regardless
of the prevailing encoding.

> Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
> two other modules are using UTF8 characters in tests - citext and
> unaccent.

Yeah, neither of those have been upgraded to said best practice.
(If you feel like doing the legwork to improve that situation,
that'd be great.)

> Looking into the unaccent module, I don't quite understand how it will
> work with various encodings, since it doesn't seem to decode its input -
> will it fail if run under anything but ASCII or UTF8?

Its Makefile seems to be forcing the test database to use UTF8.
I think this is a less-than-best-practice choice, because then
we have zero test coverage for other encodings; but it does
prevent test failures.

            regards, tom lane



Re: daitch_mokotoff module

From
Andres Freund
Date:
Hi,

On 2022-01-02 21:41:53 -0500, Tom Lane wrote:
> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
> I suppose?  I wonder what the LANG environment is in that cfbot
> instance.

LANG="en_US.UTF-8"

But it looks to me like the problem is in the commit cfbot creates, rather
than the test run itself:

https://github.com/postgresql-cfbot/postgresql/commit/d5b4ec87cfd65dc08d26e1b789bd254405c90a66#diff-388d4bb360a3b24c425e29a85899315dc02f9c1dd9b9bc9aaa828876bdfea50aR56

Greetings,

Andres Freund



Re: daitch_mokotoff module

From
Dag Lem
Date:
Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2022-01-02 21:41:53 -0500, Tom Lane wrote:
>> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
>> I suppose?  I wonder what the LANG environment is in that cfbot
>> instance.
>
> LANG="en_US.UTF-8"
>
> But it looks to me like the problem is in the commit cfbot creates, rather
> than the test run itself:
>
> https://github.com/postgresql-cfbot/postgresql/commit/d5b4ec87cfd65dc08d26e1b789bd254405c90a66#diff-388d4bb360a3b24c425e29a85899315dc02f9c1dd9b9bc9aaa828876bdfea50aR56
>
> Greetings,
>
> Andres Freund
>
>

I have now separated out the UTF8-dependent tests, hopefully according
to the current best practice (based on src/test/modules/test_regex/ and
https://www.postgresql.org/docs/14/regress-variant.html).

However I guess this won't make any difference wrt. actually running the
tests, as long as there seems to be an encoding problem in the cfbot
pipeline.

Is there anything else I can do? Could perhaps fuzzystrmatch_utf8 simply
be commented out from the Makefile for the time being?

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "postgres.h"
+
+#include "daitch_mokotoff.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#include <ctype.h>
+#include <string.h>
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string,
+               *tmp_soundex;
+    text       *soundex;
+
+    /*
+     * The maximum theoretical soundex size is several KB, however in practice
+     * anything but contrived synthetic inputs will yield a soundex size of
+     * less than 100 bytes. We thus allocate and free a temporary work buffer,
+     * and return only the actual soundex result.
+     */
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
+
+    if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
+    {
+        /* No encodable characters in input. */
+        pfree(tmp_soundex);
+        PG_RETURN_NULL();
+    }
+
+    soundex = cstring_to_text(pg_any_to_server(tmp_soundex, strlen(tmp_soundex), PG_UTF8));
+    pfree(tmp_soundex);
+
+    PG_RETURN_TEXT_P(soundex);
+}
+
+
+typedef dm_node dm_nodes[DM_MAX_NODES];
+typedef dm_node * dm_leaves[DM_MAX_LEAVES];
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .next_nodes = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+        leaves_next[(*num_leaves_next)++] = node;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
+                                     dm_node * node, char code_digit)
+{
+    dm_node   **next_nodes;
+    dm_node    *next_node;
+
+    for (next_nodes = node->next_nodes; (next_node = *next_nodes); next_nodes++)
+    {
+        if (next_node->code_digit == code_digit)
+        {
+            return next_node;
+        }
+    }
+
+    next_node = &nodes[(*num_nodes)++];
+    *next_nodes = next_node;
+
+    *next_node = start_node;
+    memcpy(next_node->soundex, node->soundex, sizeof(next_node->soundex));
+    next_node->soundex_length = node->soundex_length;
+    next_node->soundex[next_node->soundex_length++] = code_digit;
+    next_node->code_digit = code_digit;
+    next_node->next_code_index = node->prev_code_index;
+
+    return next_node;
+}
+
+
+/* Update node for next code digit(s). */
+static int
+update_node(dm_nodes nodes, dm_node * node, int *num_nodes,
+            dm_leaves leaves_next, int *num_leaves_next,
+            int letter_no, int prev_code_index, int next_code_index,
+            char *next_code_digits, int digit_no)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->soundex_length == DM_MAX_CODE_DIGITS)
+    {
+        /* Keep completed soundex code. */
+        set_leaf(leaves_next, num_leaves_next, node);
+        return 0;
+    }
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return 1;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_node(nodes, num_nodes, node, next_code_digit);
+        initialize_node(node, letter_no);
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(nodes, dirty_nodes[i], num_nodes,
+                        leaves_next, num_leaves_next,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no);
+        }
+        else
+        {
+            set_leaf(leaves_next, num_leaves_next, dirty_nodes[i]);
+        }
+    }
+
+    return 1;
+}
+
+
+/* Update soundex tree leaf nodes. Return 1 when all nodes are completed. */
+static int
+update_leaves(dm_nodes nodes, int *num_nodes,
+              dm_leaves leaves[2], int *ix_leaves, int *num_leaves,
+              int letter_no, dm_codes * codes, dm_codes * next_codes)
+{
+    int            i,
+                j,
+                k,
+                code_index;
+    dm_code    *code,
+               *next_code;
+    int            num_leaves_next = 0;
+    int            ix_leaves_next = (*ix_leaves + 1) & 1;    /* Alternate ix: 0, 1 */
+    int            finished = 1;
+
+    for (i = 0; i < *num_leaves; i++)
+    {
+        dm_node    *node = leaves[*ix_leaves][i];
+
+        /* One or two alternate code sequences. */
+        for (j = 0; j < 2 && (code = codes[j]) && code[0][0]; j++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (k = 0; k < 2 && (next_code = next_codes[k]) && next_code[0][0]; k++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                if (update_node(nodes, node, num_nodes,
+                                leaves[ix_leaves_next], &num_leaves_next,
+                                letter_no, prev_code_index, code_index,
+                                code[code_index], 0))
+                {
+                    finished = 0;
+                }
+            }
+        }
+    }
+
+    *ix_leaves = ix_leaves_next;
+    *num_leaves = num_leaves_next;
+
+    return finished;
+}
+
+
+/* Mapping from ISO8859-1 to ASCII */
+static const char tr_accents_iso8859_1[] =
+/*
+"ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDsaaaaaaeceeeeiiiidnooooo/ouuuuydy";
+
+static char
+unaccent_iso8859_1(unsigned char c)
+{
+    return c >= 192 ? tr_accents_iso8859_1[c - 192] : c;
+}
+
+
+/* Convert a UTF-8 character to ISO-8859-1.
+ * Unconvertible characters are returned as '?'.
+ * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
+ */
+static char
+utf8_to_iso8859_1(char *str, int *ix)
+{
+    const char    unknown = '?';
+    unsigned char c,
+                c2;
+    unsigned int code_point;
+
+    /* First byte. */
+    c = (unsigned char) str[(*ix)++];
+    if (c < 0x80)
+    {
+        /* ASCII code point. */
+        if (c >= '[' && c <= ']')
+        {
+            /* Codes reserved for Ą, Ę, and Ţ/Ț. */
+            return unknown;
+        }
+
+        return c;
+    }
+
+    /* Second byte. */
+    c2 = (unsigned char) str[(*ix)++];
+    if (!c2)
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+        return unknown;
+    }
+
+    if (c < 0xE0)
+    {
+        /* Two-byte character. */
+        code_point = ((c & 0x1F) << 6) | (c2 & 0x3F);
+        if (code_point < 0x100)
+        {
+            /* ISO-8859-1 code point. */
+            return code_point;
+        }
+        else if (code_point == 0x0104 || code_point == 0x0105)
+        {
+            /* Ą/ą */
+            return '[';
+        }
+        else if (code_point == 0x0118 || code_point == 0x0119)
+        {
+            /* Ę/ę */
+            return '\\';
+        }
+        else if (code_point == 0x0162 || code_point == 0x0163 ||
+                 code_point == 0x021A || code_point == 0x021B)
+        {
+            /* Ţ/ţ or Ț/ț */
+            return ']';
+        }
+        else
+        {
+            return unknown;
+        }
+    }
+
+    /* Third byte. */
+    if (!str[(*ix)++])
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+        return unknown;
+    }
+
+    if (c < 0xF0)
+    {
+        /* Three-byte character. */
+        return unknown;
+    }
+
+    /* Fourth byte. */
+    if (!str[(*ix)++])
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+    }
+
+    return unknown;
+}
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(char *str, int *ix)
+{
+    return toupper((unsigned char) unaccent_iso8859_1(utf8_to_iso8859_1(str, ix)));
+}
+
+
+/* Convert input to ASCII, skipping any characters not in [A-\]]. */
+static void
+normalize_input(char *src, char *dst)
+{
+    int            c;
+    int            i = 0,
+                j = 0;
+
+    while ((c = read_char(src, &i)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            dst[j++] = c;
+        }
+    }
+
+    dst[j] = '\0';
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static dm_codes *
+read_letter(char *str, int *ix)
+{
+    int            c,
+                cmp;
+    int            i = *ix,
+                j;
+    dm_letter  *letters;
+    dm_codes   *codes;
+
+    /* First letter in sequence. */
+    if (!(c = str[i++]))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    *ix = i;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = str[i++]))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static char *
+_daitch_mokotoff(char *word, char *soundex, size_t n)
+{
+    int            i = 0,
+                j;
+    int            letter_no = 0;
+    int            ix_leaves = 0;
+    int            num_nodes = 0,
+                num_leaves = 0;
+    dm_codes   *codes,
+               *next_codes;
+    dm_node    *nodes;
+    dm_leaves  *leaves;
+
+    /* Convert input to encodable ASCII characters, stored in soundex buffer. */
+    normalize_input(word, soundex);
+    if (!soundex[0])
+    {
+        /* No encodable character in input. */
+        return NULL;
+    }
+
+    /* Allocate memory for node tree. */
+    nodes = palloc(sizeof(dm_nodes));
+    leaves = palloc(2 * sizeof(dm_leaves));
+
+    /* Starting point. */
+    nodes[num_nodes++] = start_node;
+    leaves[ix_leaves][num_leaves++] = &nodes[0];
+
+    codes = read_letter(soundex, &i);
+
+    while (codes)
+    {
+        next_codes = read_letter(soundex, &i);
+
+        /* Update leaf nodes. */
+        if (update_leaves(nodes, &num_nodes,
+                          leaves, &ix_leaves, &num_leaves,
+                          letter_no, codes, next_codes ? next_codes : end_codes))
+        {
+            /* All soundex codes are completed to six digits. */
+            break;
+        }
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Concatenate all generated soundex codes. */
+    for (i = 0, j = 0;
+         i < num_leaves && j + DM_MAX_CODE_DIGITS + 1 <= n;
+         i++, j += DM_MAX_CODE_DIGITS + 1)
+    {
+        memcpy(&soundex[j], leaves[ix_leaves][i]->soundex, DM_MAX_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string. */
+    soundex[j - (j != 0)] = '\0';
+
+    pfree(leaves);
+    pfree(nodes);
+
+    return soundex;
+}
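The two-byte branch of utf8_to_iso8859_1() above can be tried in isolation. The following is a minimal standalone sketch, not part of the patch: the names decode_utf8_2byte and substitute_for are hypothetical, and unlike the patch the sketch also checks the lead and continuation byte patterns. It only illustrates how the code point is assembled and how Ą/ą, Ę/ę and Ţ/ţ/Ț/ț are mapped to the reserved characters '[', '\' and ']'.

```c
#include <assert.h>

/*
 * Decode a two-byte UTF-8 sequence into its code point.
 * Returns -1 if the bytes do not form a two-byte sequence.
 * (The validation is an addition of this sketch; the patch does not do it.)
 */
static int
decode_utf8_2byte(unsigned char b1, unsigned char b2)
{
    if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80)
        return -1;

    /* Five payload bits from the lead byte, six from the continuation. */
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F);
}

/*
 * Map a code point to the single-byte value used for further processing,
 * following the same cases as the two-byte branch in the patch.
 */
static char
substitute_for(int code_point)
{
    assert(code_point >= 0);

    if (code_point < 0x100)
        return (char) code_point;   /* ISO-8859-1 range, passed through */

    switch (code_point)
    {
        case 0x0104:
        case 0x0105:
            return '[';             /* A with ogonek, either case */
        case 0x0118:
        case 0x0119:
            return '\\';            /* E with ogonek, either case */
        case 0x0162:
        case 0x0163:
        case 0x021A:
        case 0x021B:
            return ']';             /* T with cedilla or comma below */
        default:
            return '?';             /* not convertible */
    }
}
```

For instance, "Ą" is encoded as the bytes 0xC4 0x84, which decode to U+0104 and are then replaced by '['.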
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.h b/contrib/fuzzystrmatch/daitch_mokotoff.h
new file mode 100644
index 0000000000..8426069825
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.h
@@ -0,0 +1,999 @@
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS 6
+#define DM_MAX_ALTERNATE_CODES 5
+#define DM_MAX_NODES 1564
+#define DM_MAX_LEAVES 1250
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    struct dm_letter *letters;    /* List of possible successive letters */
+    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+static dm_codes codes_0_1_X[2] =
+{
+    {
+        "0", "1", "X"
+    }
+};
+static dm_codes codes_0_7_X[2] =
+{
+    {
+        "0", "7", "X"
+    }
+};
+static dm_codes codes_0_X_X[2] =
+{
+    {
+        "0", "X", "X"
+    }
+};
+static dm_codes codes_1_1_X[2] =
+{
+    {
+        "1", "1", "X"
+    }
+};
+static dm_codes codes_1_X_X[2] =
+{
+    {
+        "1", "X", "X"
+    }
+};
+static dm_codes codes_1_X_X_or_4_4_4[2] =
+{
+    {
+        "1", "X", "X"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_2_43_43[2] =
+{
+    {
+        "2", "43", "43"
+    }
+};
+static dm_codes codes_2_4_4[2] =
+{
+    {
+        "2", "4", "4"
+    }
+};
+static dm_codes codes_3_3_3[2] =
+{
+    {
+        "3", "3", "3"
+    }
+};
+static dm_codes codes_3_3_3_or_4_4_4[2] =
+{
+    {
+        "3", "3", "3"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_4_4_4[2] =
+{
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_5_54_54[2] =
+{
+    {
+        "5", "54", "54"
+    }
+};
+static dm_codes codes_5_5_5[2] =
+{
+    {
+        "5", "5", "5"
+    }
+};
+static dm_codes codes_5_5_5_or_45_45_45[2] =
+{
+    {
+        "5", "5", "5"
+    },
+    {
+        "45", "45", "45"
+    }
+};
+static dm_codes codes_5_5_5_or_4_4_4[2] =
+{
+    {
+        "5", "5", "5"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_5_5_X[2] =
+{
+    {
+        "5", "5", "X"
+    }
+};
+static dm_codes codes_66_66_66[2] =
+{
+    {
+        "66", "66", "66"
+    }
+};
+static dm_codes codes_6_6_6[2] =
+{
+    {
+        "6", "6", "6"
+    }
+};
+static dm_codes codes_7_7_7[2] =
+{
+    {
+        "7", "7", "7"
+    }
+};
+static dm_codes codes_8_8_8[2] =
+{
+    {
+        "8", "8", "8"
+    }
+};
+static dm_codes codes_94_94_94_or_4_4_4[2] =
+{
+    {
+        "94", "94", "94"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_9_9_9[2] =
+{
+    {
+        "9", "9", "9"
+    }
+};
+static dm_codes codes_X_X_6_or_X_X_X[2] =
+{
+    {
+        "X", "X", "6"
+    },
+    {
+        "X", "X", "X"
+    }
+};
+
+/* Coding for alternative following letters in sequence. */
+static dm_letter letter_A[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_0_7_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CH[] =
+{
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CS[] =
+{
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_C[] =
+{
+    {
+        'H', letter_CH, codes_5_5_5_or_4_4_4
+    },
+    {
+        'K', NULL, codes_5_5_5_or_45_45_45
+    },
+    {
+        'S', letter_CS, codes_4_4_4
+    },
+    {
+        'Z', letter_CZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DS[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DZ[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_D[] =
+{
+    {
+        'R', letter_DR, NULL
+    },
+    {
+        'S', letter_DS, codes_4_4_4
+    },
+    {
+        'T', NULL, codes_3_3_3
+    },
+    {
+        'Z', letter_DZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_E[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_1_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_F[] =
+{
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_I[] =
+{
+    {
+        'A', NULL, codes_1_X_X
+    },
+    {
+        'E', NULL, codes_1_X_X
+    },
+    {
+        'O', NULL, codes_1_X_X
+    },
+    {
+        'U', NULL, codes_1_X_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_K[] =
+{
+    {
+        'H', NULL, codes_5_5_5
+    },
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_M[] =
+{
+    {
+        'N', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_N[] =
+{
+    {
+        'M', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_O[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_P[] =
+{
+    {
+        'F', NULL, codes_7_7_7
+    },
+    {
+        'H', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_R[] =
+{
+    {
+        'S', NULL, codes_94_94_94_or_4_4_4
+    },
+    {
+        'Z', NULL, codes_94_94_94_or_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTS[] =
+{
+    {
+        'C', letter_SCHTSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHT[] =
+{
+    {
+        'C', letter_SCHTC, NULL
+    },
+    {
+        'S', letter_SCHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCH[] =
+{
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SCHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SC[] =
+{
+    {
+        'H', letter_SCH, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTS[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHT[] =
+{
+    {
+        'C', letter_SHTC, NULL
+    },
+    {
+        'S', letter_SHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SH[] =
+{
+    {
+        'C', letter_SHC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STR[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STS[] =
+{
+    {
+        'C', letter_STSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ST[] =
+{
+    {
+        'C', letter_STC, NULL
+    },
+    {
+        'R', letter_STR, NULL
+    },
+    {
+        'S', letter_STS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZC[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZ[] =
+{
+    {
+        'C', letter_SZC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', NULL, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_S[] =
+{
+    {
+        'C', letter_SC, codes_2_4_4
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'H', letter_SH, codes_4_4_4
+    },
+    {
+        'T', letter_ST, codes_2_43_43
+    },
+    {
+        'Z', letter_SZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TS[] =
+{
+    {
+        'C', letter_TSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTS[] =
+{
+    {
+        'C', letter_TTSC, NULL
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TT[] =
+{
+    {
+        'C', letter_TTC, NULL
+    },
+    {
+        'S', letter_TTS, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_T[] =
+{
+    {
+        'C', letter_TC, codes_4_4_4
+    },
+    {
+        'H', NULL, codes_3_3_3
+    },
+    {
+        'R', letter_TR, NULL
+    },
+    {
+        'S', letter_TS, codes_4_4_4
+    },
+    {
+        'T', letter_TT, NULL
+    },
+    {
+        'Z', letter_TZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_U[] =
+{
+    {
+        'E', NULL, codes_0_1_X
+    },
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZD[] =
+{
+    {
+        'Z', letter_ZDZ, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHD[] =
+{
+    {
+        'Z', letter_ZHDZ, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZH[] =
+{
+    {
+        'D', letter_ZHD, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZS[] =
+{
+    {
+        'C', letter_ZSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_Z[] =
+{
+    {
+        'D', letter_ZD, codes_2_43_43
+    },
+    {
+        'H', letter_ZH, codes_4_4_4
+    },
+    {
+        'S', letter_ZS, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_[] =
+{
+    {
+        'A', letter_A, codes_0_X_X
+    },
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        'C', letter_C, codes_5_5_5_or_4_4_4
+    },
+    {
+        'D', letter_D, codes_3_3_3
+    },
+    {
+        'E', letter_E, codes_0_X_X
+    },
+    {
+        'F', letter_F, codes_7_7_7
+    },
+    {
+        'G', NULL, codes_5_5_5
+    },
+    {
+        'H', NULL, codes_5_5_X
+    },
+    {
+        'I', letter_I, codes_0_X_X
+    },
+    {
+        'J', NULL, codes_1_X_X_or_4_4_4
+    },
+    {
+        'K', letter_K, codes_5_5_5
+    },
+    {
+        'L', NULL, codes_8_8_8
+    },
+    {
+        'M', letter_M, codes_6_6_6
+    },
+    {
+        'N', letter_N, codes_6_6_6
+    },
+    {
+        'O', letter_O, codes_0_X_X
+    },
+    {
+        'P', letter_P, codes_7_7_7
+    },
+    {
+        'Q', NULL, codes_5_5_5
+    },
+    {
+        'R', letter_R, codes_9_9_9
+    },
+    {
+        'S', letter_S, codes_4_4_4
+    },
+    {
+        'T', letter_T, codes_3_3_3
+    },
+    {
+        'U', letter_U, codes_0_X_X
+    },
+    {
+        'V', NULL, codes_7_7_7
+    },
+    {
+        'W', NULL, codes_7_7_7
+    },
+    {
+        'X', NULL, codes_5_54_54
+    },
+    {
+        'Y', NULL, codes_1_X_X
+    },
+    {
+        'Z', letter_Z, codes_4_4_4
+    },
+    {
+        'a', NULL, codes_X_X_6_or_X_X_X
+    },
+    {
+        'e', NULL, codes_X_X_6_or_X_X_X
+    },
+    {
+        't', NULL, codes_3_3_3_or_4_4_4
+    },
+    {
+        '\0'
+    }
+};
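The longest-match walk that read_letter() performs over these tables can be sketched on a cut-down trie. This is only an illustration under stated assumptions: struct demo_letter, the demo_* tables and longest_match are hypothetical names, each node carries a single code string (the "any other" column) instead of a dm_codes pointer, and only the S, ST, STR, STRS and STRZ entries are reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified counterpart of dm_letter. */
struct demo_letter
{
    char        letter;                 /* '\0' terminates a sibling list */
    const struct demo_letter *letters;  /* possible successors, or NULL */
    const char *code;                   /* code for sequence so far, or NULL */
};

static const struct demo_letter demo_STR[] = {
    {'S', NULL, "4"},           /* "STRS" */
    {'Z', NULL, "4"},           /* "STRZ" */
    {'\0', NULL, NULL}
};
static const struct demo_letter demo_ST[] = {
    {'R', demo_STR, NULL},      /* "STR" has no code of its own */
    {'\0', NULL, NULL}
};
static const struct demo_letter demo_S[] = {
    {'T', demo_ST, "43"},       /* "ST" */
    {'\0', NULL, NULL}
};
static const struct demo_letter demo_root = {'S', demo_S, "4"};

/*
 * Follow the trie as far as the input allows, remembering the last node
 * that carried a code. *consumed receives the length of that sequence.
 */
static const char *
longest_match(const struct demo_letter *root, const char *str, size_t *consumed)
{
    const struct demo_letter *node = root;
    const char *code = root->code;
    size_t      i = 1;

    assert(str != NULL && str[0] == root->letter);
    *consumed = 1;

    while (node->letters != NULL && str[i] != '\0')
    {
        const struct demo_letter *next = node->letters;

        /* Linear search among the possible successors. */
        while (next->letter != '\0' && next->letter != str[i])
            next++;
        if (next->letter == '\0')
            break;              /* no longer sequence continues here */

        node = next;
        i++;
        if (node->code != NULL)
        {
            /* A coded sequence ends here; remember it. */
            code = node->code;
            *consumed = i;
        }
    }

    return code;
}
```

With these tables, "STRS" is consumed as one four-letter sequence, while "STRA" falls back to "ST": the walk reaches the uncoded "STR" node, finds no successor for 'A', and returns the code of the last coded node, leaving "RA" for the next call.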
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..3e97e000ee
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,288 @@
+#!/usr/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character can never yield more than two alternate codes, and is not considered here.
+    if (@codes > 1) {
+        for my $j (1..2) {
+            for my $i (0..1) {
+                my ($a, $b) = (substr($codes[$i][$j], -1, 1), substr($codes[($i + 1)%2][$j], 0, 1));
+                $alternates{$a}{$b} = 1 if $a ne $b;
+            }
+        }
+    }
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Find the maximum number of alternate codes in one position.
+my $alt_x = $alternates{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (defined delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+my $max_alt = (reverse sort { $a <=> $b } (2, map { scalar keys %$_ } values %alternates))[0];
+
+# The maximum number of nodes and leaves in the soundex tree.
+# These are safe estimates, but in practice somewhat higher than the actual maximums.
+# Note that the first character can never yield more than two alternate codes,
+# hence the calculations are performed as sums of two subtrees.
+my $digits = 6;
+# Number of nodes (sum of geometric progression).
+my $max_nodes = 2 + 2*(1 - $max_alt**($digits - 1))/(1 - $max_alt);
+# Number of leaves (exponential of base number).
+my $max_leaves = 2*$max_alt**($digits - 2);
+
+print <<EOF;
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS $digits
+#define DM_MAX_ALTERNATE_CODES $max_alt
+#define DM_MAX_NODES $max_nodes
+#define DM_MAX_LEAVES $max_leaves
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    struct dm_letter *letters;    /* List of possible successive letters */
+    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print "static dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print "static dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print "$_,\n";
+    }
+    print "\t{\n\t\t'\\0'\n\t}\n";
+    print "};\n";
+}
+
+hash2code($table, '');
+
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# X = NC (not coded)
+#
+# Note that the following letters are coded with substitute letters
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" below does not correspond to the table referred to above,
+# however it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..f62ddad4ee 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..32d8260383
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
similarity index 92%
rename from contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
rename to contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
index 41de9d949b..2a8a100699 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
@@ -42,3 +42,7 @@ LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
 CREATE FUNCTION dmetaphone_alt (text) RETURNS text
 AS 'MODULE_PATHNAME', 'dmetaphone_alt'
 LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   Alternative soundex codes are separated by spaces, which makes the returned
+   text suitable for use in Full Text Search; see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>

Re: daitch_mokotoff module

From
Thomas Munro
Date:
On Wed, Jan 5, 2022 at 2:49 AM Dag Lem <dag@nimrod.no> wrote:
> However I guess this won't make any difference wrt. actually running the
> tests, as long as there seems to be an encoding problem in the cfbot

Fixed -- I told it to pull down patches as binary, not text.  Now it
makes commits that look healthier, and so far all the Unix systems
have survived CI:

https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f
http://cfbot.cputube.org/dag-lem.html



Re: daitch_mokotoff module

From
Dag Lem
Date:
Thomas Munro <thomas.munro@gmail.com> writes:

> On Wed, Jan 5, 2022 at 2:49 AM Dag Lem <dag@nimrod.no> wrote:
>> However I guess this won't make any difference wrt. actually running the
>> tests, as long as there seems to be an encoding problem in the cfbot
>
> Fixed -- I told it to pull down patches as binary, not text.  Now it
> makes commits that look healthier, and so far all the Unix systems
> have survived CI:
>
> https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f
> http://cfbot.cputube.org/dag-lem.html
>

Great!

Dag



[PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]

From
Dag Lem
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Dag Lem <dag@nimrod.no> writes:
>
>> Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
>> two other modules are using UTF8 characters in tests - citext and
>> unaccent.
>
> Yeah, neither of those have been upgraded to said best practice.
> (If you feel like doing the legwork to improve that situation,
> that'd be great.)
>

Please find attached a patch to run the previously commented-out
UTF8-dependent tests for citext, according to best practice. For now I
don't dare to touch the unaccent module, which seems to be UTF8-only
anyway.


Best regards

Dag Lem

diff --git a/contrib/citext/Makefile b/contrib/citext/Makefile
index a7de52928d..789932fe36 100644
--- a/contrib/citext/Makefile
+++ b/contrib/citext/Makefile
@@ -11,7 +11,7 @@ DATA = citext--1.4.sql \
     citext--1.0--1.1.sql
 PGFILEDESC = "citext - case-insensitive character string data type"

-REGRESS = citext
+REGRESS = citext citext_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out
index 3bac0534fb..48b4de8993 100644
--- a/contrib/citext/expected/citext.out
+++ b/contrib/citext/expected/citext.out
@@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t;
  t
 (1 row)

--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
  t
diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out
index 57fc863f7a..8ab4d4224e 100644
--- a/contrib/citext/expected/citext_1.out
+++ b/contrib/citext/expected/citext_1.out
@@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t;
  t
 (1 row)

--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
  t
diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
new file mode 100644
index 0000000000..1f4fa79aff
--- /dev/null
+++ b/contrib/citext/expected/citext_utf8.out
@@ -0,0 +1,119 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS citext;
+-- Multibyte sanity tests.
+SELECT 'À'::citext =  'À'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'À'::citext =  'à'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
+ f
+---
+ f
+(1 row)
+
+SELECT 'À'::citext <> 'B'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+-- Test combining characters making up canonically equivalent strings.
+SELECT 'Ä'::text   <> 'Ä'::text   AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'Ä'::citext <> 'Ä'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+-- Test the Turkish dotted I. The lowercase is a single byte while the
+-- uppercase is multibyte. This is why the comparison code can't be optimized
+-- to compare string lengths.
+SELECT 'i'::citext = 'İ'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+-- Regression.
+SELECT 'láska'::citext <> 'laská'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
+ zero
+------
+    0
+(1 row)
+
+SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
+ zero
+------
+    0
+(1 row)
+
+SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
+ zero
+------
+    0
+(1 row)
+
+SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
+ positive
+----------
+       15
+(1 row)
+
+SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
+ negative
+----------
+      -15
+(1 row)
+
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/citext/sql/citext.sql b/contrib/citext/sql/citext.sql
index 55fb1d11a6..bd62ab8047 100644
--- a/contrib/citext/sql/citext.sql
+++ b/contrib/citext/sql/citext.sql
@@ -19,34 +19,6 @@ SELECT 'a'::citext = 'b'::citext AS f;
 SELECT 'a'::citext = 'ab'::citext AS f;
 SELECT 'a'::citext <> 'ab'::citext AS t;

--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
-
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
-
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
-
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
-
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
-
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
 SELECT 'b'::citext >  'A'::citext AS t;
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
new file mode 100644
index 0000000000..91822b85c2
--- /dev/null
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -0,0 +1,42 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS citext;
+
+-- Multibyte sanity tests.
+SELECT 'À'::citext =  'À'::citext AS t;
+SELECT 'À'::citext =  'à'::citext AS t;
+SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
+SELECT 'À'::citext <> 'B'::citext AS t;
+
+-- Test combining characters making up canonically equivalent strings.
+SELECT 'Ä'::text   <> 'Ä'::text   AS t;
+SELECT 'Ä'::citext <> 'Ä'::citext AS t;
+
+-- Test the Turkish dotted I. The lowercase is a single byte while the
+-- uppercase is multibyte. This is why the comparison code can't be optimized
+-- to compare string lengths.
+SELECT 'i'::citext = 'İ'::citext AS t;
+
+-- Regression.
+SELECT 'láska'::citext <> 'laská'::citext AS t;
+
+SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
+SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
+SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
+SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
+SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
+SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
+SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
+SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
+SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
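
The comment on the Turkish dotted I in the test file above can be checked at the byte level. The following is only a minimal sketch, not code from citext: `utf8_byte_length` and `byte_lengths_differ` are hypothetical helpers, and the uppercase letter is written as an explicit byte escape so the example does not depend on the source file encoding.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: number of bytes in a NUL-terminated UTF-8 string. */
static size_t
utf8_byte_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/*
 * Returns 1 when the two strings differ in byte length.
 *
 * U+0069 (i) is a single byte in UTF-8, while U+0130 (İ) is the two bytes
 * 0xC4 0xB0.  Since the regression test expects the two to compare equal as
 * citext, a difference in byte length cannot serve as a shortcut for
 * "not equal".
 */
static int
byte_lengths_differ(const char *a, const char *b)
{
    return utf8_byte_length(a) != utf8_byte_length(b);
}
```

Calling `byte_lengths_differ("i", "\xC4\xB0")` reports a length difference for the very pair that the test expects to be equal.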

Re: [PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]

From
Tom Lane
Date:
Dag Lem <dag@nimrod.no> writes:
> Please find attached a patch to run the previously commented-out
> UTF8-dependent tests for citext, according to best practice. For now I
> don't dare to touch the unaccent module, which seems to be UTF8-only
> anyway.

I tried this on a bunch of different locale settings and concluded that
we need to restrict the locale to avoid failures: it falls over with
locale C.  With that, it passes on all UTF8 LANG settings on RHEL8
and FreeBSD 12, and all except am_ET.UTF-8 on current macOS.  I'm not
sure what the deal is with am_ET, but macOS has a long and sad history
of wonky UTF8 locales, so I was actually expecting worse.  If the
buildfarm shows more problems, we can restrict it further --- I won't
be too upset if we end up restricting to just Linux systems, like
collate.linux.utf8.  Anyway, pushed to see what happens.

            regards, tom lane



Re: daitch_mokotoff module

From
Dag Lem
Date:
Dag Lem <dag@nimrod.no> writes:

> Thomas Munro <thomas.munro@gmail.com> writes:
>
>> On Wed, Jan 5, 2022 at 2:49 AM Dag Lem <dag@nimrod.no> wrote:
>>> However I guess this won't make any difference wrt. actually running the
>>> tests, as long as there seems to be an encoding problem in the cfbot
>>
>> Fixed -- I told it to pull down patches as binary, not text.  Now it
>> makes commits that look healthier, and so far all the Unix systems
>> have survived CI:
>>
>> https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f
>> http://cfbot.cputube.org/dag-lem.html
>>
>
> Great!
>
> Dag
>
>

After this I made the mistake of including a patch for citext in this
thread, which is now picked up by cfbot instead of the Daitch-Mokotoff
patch.

Attaching the original patch again in order to hopefully fix my mistake.

Best regards

Dag Lem
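
The header comment of daitch_mokotoff.c in the attached patch states that "BESST" must yield 743000, the same as "BEST", because identical adjacent code digits are collapsed even when a two-digit code is involved. The following is a simplified sketch of that one rule only, not the tree-based algorithm of the patch: it handles a single code sequence without alternates, `sketch_collapse` and `SKETCH_CODE_DIGITS` are hypothetical names, and the per-letter codes used in the usage note (B = 7, non-initial E uncoded, S = 4, ST = 43) are assumed from the Daitch-Mokotoff tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name; the patch itself uses DM_MAX_CODE_DIGITS. */
#define SKETCH_CODE_DIGITS 6

/*
 * Build one six-digit code from a list of per-letter codes, where "X" means
 * that the letter is not coded.  The first digit of a code is dropped when
 * it equals the last digit produced by the previous letter.
 */
static void
sketch_collapse(const char *const codes[], size_t ncodes,
                char out[SKETCH_CODE_DIGITS + 1])
{
    size_t      len = 0;
    char        prev = '\0';    /* last digit (or 'X') of the previous letter */

    assert(codes != NULL && out != NULL);

    for (size_t i = 0; i < ncodes && len < SKETCH_CODE_DIGITS; i++)
    {
        const char *code = codes[i];

        if (code[0] == '\0')
            continue;           /* ignore empty entries */

        if (code[0] == 'X')
        {
            /* An uncoded letter separates otherwise identical neighbours. */
            prev = 'X';
            continue;
        }

        for (size_t d = 0; code[d] && len < SKETCH_CODE_DIGITS; d++)
        {
            if (d == 0 && code[0] == prev)
                continue;       /* same as the previous letter's last digit */
            out[len++] = code[d];
        }
        prev = code[strlen(code) - 1];
    }

    /* Pad with zeros to six digits. */
    while (len < SKETCH_CODE_DIGITS)
        out[len++] = '0';
    out[len] = '\0';
}
```

Under the assumed letter codes, "BESST" corresponds to the list {"7", "X", "4", "43"} and "BEST" to {"7", "X", "43"}. In the first list the leading 4 of "43" repeats the preceding "4" and is dropped, so both inputs give the same six digits.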

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   which are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "daitch_mokotoff.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#include <ctype.h>
+#include <string.h>
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string,
+               *tmp_soundex;
+    text       *soundex;
+
+    /*
+     * The maximum theoretical soundex size is several KB, however in practice
+     * anything but contrived synthetic inputs will yield a soundex size of
+     * less than 100 bytes. We thus allocate and free a temporary work buffer,
+     * and return only the actual soundex result.
+     */
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
+
+    if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
+    {
+        /* No encodable characters in input. */
+        pfree(tmp_soundex);
+        PG_RETURN_NULL();
+    }
+
+    soundex = cstring_to_text(pg_any_to_server(tmp_soundex, strlen(tmp_soundex), PG_UTF8));
+    pfree(tmp_soundex);
+
+    PG_RETURN_TEXT_P(soundex);
+}
+
+
+typedef dm_node dm_nodes[DM_MAX_NODES];
+typedef dm_node * dm_leaves[DM_MAX_LEAVES];
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .next_nodes = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+        leaves_next[(*num_leaves_next)++] = node;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
+                                     dm_node * node, char code_digit)
+{
+    dm_node   **next_nodes;
+    dm_node    *next_node;
+
+    for (next_nodes = node->next_nodes; (next_node = *next_nodes); next_nodes++)
+    {
+        if (next_node->code_digit == code_digit)
+        {
+            return next_node;
+        }
+    }
+
+    next_node = &nodes[(*num_nodes)++];
+    *next_nodes = next_node;
+
+    *next_node = start_node;
+    memcpy(next_node->soundex, node->soundex, sizeof(next_node->soundex));
+    next_node->soundex_length = node->soundex_length;
+    next_node->soundex[next_node->soundex_length++] = code_digit;
+    next_node->code_digit = code_digit;
+    next_node->next_code_index = node->prev_code_index;
+
+    return next_node;
+}
+
+
+/* Update node for next code digit(s). */
+static int
+update_node(dm_nodes nodes, dm_node * node, int *num_nodes,
+            dm_leaves leaves_next, int *num_leaves_next,
+            int letter_no, int prev_code_index, int next_code_index,
+            char *next_code_digits, int digit_no)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->soundex_length == DM_MAX_CODE_DIGITS)
+    {
+        /* Keep completed soundex code. */
+        set_leaf(leaves_next, num_leaves_next, node);
+        return 0;
+    }
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return 1;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_node(nodes, num_nodes, node, next_code_digit);
+        initialize_node(node, letter_no);
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(nodes, dirty_nodes[i], num_nodes,
+                        leaves_next, num_leaves_next,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no);
+        }
+        else
+        {
+            set_leaf(leaves_next, num_leaves_next, dirty_nodes[i]);
+        }
+    }
+
+    return 1;
+}
+
+
+/* Update soundex tree leaf nodes. Return 1 when all nodes are completed. */
+static int
+update_leaves(dm_nodes nodes, int *num_nodes,
+              dm_leaves leaves[2], int *ix_leaves, int *num_leaves,
+              int letter_no, dm_codes * codes, dm_codes * next_codes)
+{
+    int            i,
+                j,
+                k,
+                code_index;
+    dm_code    *code,
+               *next_code;
+    int            num_leaves_next = 0;
+    int            ix_leaves_next = (*ix_leaves + 1) & 1;    /* Alternate ix: 0, 1 */
+    int            finished = 1;
+
+    for (i = 0; i < *num_leaves; i++)
+    {
+        dm_node    *node = leaves[*ix_leaves][i];
+
+        /* One or two alternate code sequences. */
+        for (j = 0; j < 2 && (code = codes[j]) && code[0][0]; j++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (k = 0; k < 2 && (next_code = next_codes[k]) && next_code[0][0]; k++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                if (update_node(nodes, node, num_nodes,
+                                leaves[ix_leaves_next], &num_leaves_next,
+                                letter_no, prev_code_index, code_index,
+                                code[code_index], 0))
+                {
+                    finished = 0;
+                }
+            }
+        }
+    }
+
+    *ix_leaves = ix_leaves_next;
+    *num_leaves = num_leaves_next;
+
+    return finished;
+}
+
+
+/* Mapping from ISO8859-1 to ASCII */
+static const char tr_accents_iso8859_1[] =
+/*
+"ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDsaaaaaaeceeeeiiiidnooooo/ouuuuydy";
+
+static char
+unaccent_iso8859_1(unsigned char c)
+{
+    return c >= 192 ? tr_accents_iso8859_1[c - 192] : c;
+}
+
+
+/* Convert a UTF-8 character to ISO-8859-1.
+ * Unconvertible characters are returned as '?'.
+ * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
+ */
+static char
+utf8_to_iso8859_1(char *str, int *ix)
+{
+    const char    unknown = '?';
+    unsigned char c,
+                c2;
+    unsigned int code_point;
+
+    /* First byte. */
+    c = (unsigned char) str[(*ix)++];
+    if (c < 0x80)
+    {
+        /* ASCII code point. */
+        if (c >= '[' && c <= ']')
+        {
+            /* Codes reserved for Ą, Ę, and Ţ/Ț. */
+            return unknown;
+        }
+
+        return c;
+    }
+
+    /* Second byte. */
+    c2 = (unsigned char) str[(*ix)++];
+    if (!c2)
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+        return unknown;
+    }
+
+    if (c < 0xE0)
+    {
+        /* Two-byte character. */
+        code_point = ((c & 0x1F) << 6) | (c2 & 0x3F);
+        if (code_point < 0x100)
+        {
+            /* ISO-8859-1 code point. */
+            return code_point;
+        }
+        else if (code_point == 0x0104 || code_point == 0x0105)
+        {
+            /* Ą/ą */
+            return '[';
+        }
+        else if (code_point == 0x0118 || code_point == 0x0119)
+        {
+            /* Ę/ę */
+            return '\\';
+        }
+        else if (code_point == 0x0162 || code_point == 0x0163 ||
+                 code_point == 0x021A || code_point == 0x021B)
+        {
+            /* Ţ/ţ or Ț/ț */
+            return ']';
+        }
+        else
+        {
+            return unknown;
+        }
+    }
+
+    /* Third byte. */
+    if (!str[(*ix)++])
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+        return unknown;
+    }
+
+    if (c < 0xF0)
+    {
+        /* Three-byte character. */
+        return unknown;
+    }
+
+    /* Fourth byte. */
+    if (!str[(*ix)++])
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+    }
+
+    return unknown;
+}
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(char *str, int *ix)
+{
+    return toupper(unaccent_iso8859_1(utf8_to_iso8859_1(str, ix)));
+}
+
+
+/* Convert input to ASCII, skipping any characters not in [A-\]]. */
+static void
+normalize_input(char *src, char *dst)
+{
+    int            c;
+    int            i = 0,
+                j = 0;
+
+    while ((c = read_char(src, &i)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            dst[j++] = c;
+        }
+    }
+
+    dst[j] = '\0';
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static dm_codes *
+read_letter(char *str, int *ix)
+{
+    int            c,
+                cmp;
+    int            i = *ix,
+                j;
+    dm_letter  *letters;
+    dm_codes   *codes;
+
+    /* First letter in sequence. */
+    if (!(c = str[i++]))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    *ix = i;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = str[i++]))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static char *
+_daitch_mokotoff(char *word, char *soundex, size_t n)
+{
+    int            i = 0,
+                j;
+    int            letter_no = 0;
+    int            ix_leaves = 0;
+    int            num_nodes = 0,
+                num_leaves = 0;
+    dm_codes   *codes,
+               *next_codes;
+    dm_node    *nodes;
+    dm_leaves  *leaves;
+
+    /* Convert input to encodable ASCII characters, stored in soundex buffer. */
+    normalize_input(word, soundex);
+    if (!soundex[0])
+    {
+        /* No encodable character in input. */
+        return NULL;
+    }
+
+    /* Allocate memory for node tree. */
+    nodes = palloc(sizeof(dm_nodes));
+    leaves = palloc(2 * sizeof(dm_leaves));
+
+    /* Starting point. */
+    nodes[num_nodes++] = start_node;
+    leaves[ix_leaves][num_leaves++] = &nodes[0];
+
+    codes = read_letter(soundex, &i);
+
+    while (codes)
+    {
+        next_codes = read_letter(soundex, &i);
+
+        /* Update leaf nodes. */
+        if (update_leaves(nodes, &num_nodes,
+                          leaves, &ix_leaves, &num_leaves,
+                          letter_no, codes, next_codes ? next_codes : end_codes))
+        {
+            /* All soundex codes are completed to six digits. */
+            break;
+        }
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Concatenate all generated soundex codes. */
+    for (i = 0, j = 0;
+         i < num_leaves && j + DM_MAX_CODE_DIGITS + 1 <= n;
+         i++, j += DM_MAX_CODE_DIGITS + 1)
+    {
+        memcpy(&soundex[j], leaves[ix_leaves][i]->soundex, DM_MAX_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string. */
+    soundex[j - (j != 0)] = '\0';
+
+    pfree(leaves);
+    pfree(nodes);
+
+    return soundex;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.h b/contrib/fuzzystrmatch/daitch_mokotoff.h
new file mode 100644
index 0000000000..8426069825
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.h
@@ -0,0 +1,999 @@
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS 6
+#define DM_MAX_ALTERNATE_CODES 5
+#define DM_MAX_NODES 1564
+#define DM_MAX_LEAVES 1250
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    struct dm_letter *letters;    /* List of possible successive letters */
+    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+static dm_codes codes_0_1_X[2] =
+{
+    {
+        "0", "1", "X"
+    }
+};
+static dm_codes codes_0_7_X[2] =
+{
+    {
+        "0", "7", "X"
+    }
+};
+static dm_codes codes_0_X_X[2] =
+{
+    {
+        "0", "X", "X"
+    }
+};
+static dm_codes codes_1_1_X[2] =
+{
+    {
+        "1", "1", "X"
+    }
+};
+static dm_codes codes_1_X_X[2] =
+{
+    {
+        "1", "X", "X"
+    }
+};
+static dm_codes codes_1_X_X_or_4_4_4[2] =
+{
+    {
+        "1", "X", "X"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_2_43_43[2] =
+{
+    {
+        "2", "43", "43"
+    }
+};
+static dm_codes codes_2_4_4[2] =
+{
+    {
+        "2", "4", "4"
+    }
+};
+static dm_codes codes_3_3_3[2] =
+{
+    {
+        "3", "3", "3"
+    }
+};
+static dm_codes codes_3_3_3_or_4_4_4[2] =
+{
+    {
+        "3", "3", "3"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_4_4_4[2] =
+{
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_5_54_54[2] =
+{
+    {
+        "5", "54", "54"
+    }
+};
+static dm_codes codes_5_5_5[2] =
+{
+    {
+        "5", "5", "5"
+    }
+};
+static dm_codes codes_5_5_5_or_45_45_45[2] =
+{
+    {
+        "5", "5", "5"
+    },
+    {
+        "45", "45", "45"
+    }
+};
+static dm_codes codes_5_5_5_or_4_4_4[2] =
+{
+    {
+        "5", "5", "5"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_5_5_X[2] =
+{
+    {
+        "5", "5", "X"
+    }
+};
+static dm_codes codes_66_66_66[2] =
+{
+    {
+        "66", "66", "66"
+    }
+};
+static dm_codes codes_6_6_6[2] =
+{
+    {
+        "6", "6", "6"
+    }
+};
+static dm_codes codes_7_7_7[2] =
+{
+    {
+        "7", "7", "7"
+    }
+};
+static dm_codes codes_8_8_8[2] =
+{
+    {
+        "8", "8", "8"
+    }
+};
+static dm_codes codes_94_94_94_or_4_4_4[2] =
+{
+    {
+        "94", "94", "94"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_9_9_9[2] =
+{
+    {
+        "9", "9", "9"
+    }
+};
+static dm_codes codes_X_X_6_or_X_X_X[2] =
+{
+    {
+        "X", "X", "6"
+    },
+    {
+        "X", "X", "X"
+    }
+};
+
+/* Coding for alternative following letters in sequence. */
+static dm_letter letter_A[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_0_7_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CH[] =
+{
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CS[] =
+{
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_C[] =
+{
+    {
+        'H', letter_CH, codes_5_5_5_or_4_4_4
+    },
+    {
+        'K', NULL, codes_5_5_5_or_45_45_45
+    },
+    {
+        'S', letter_CS, codes_4_4_4
+    },
+    {
+        'Z', letter_CZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DS[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DZ[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_D[] =
+{
+    {
+        'R', letter_DR, NULL
+    },
+    {
+        'S', letter_DS, codes_4_4_4
+    },
+    {
+        'T', NULL, codes_3_3_3
+    },
+    {
+        'Z', letter_DZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_E[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_1_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_F[] =
+{
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_I[] =
+{
+    {
+        'A', NULL, codes_1_X_X
+    },
+    {
+        'E', NULL, codes_1_X_X
+    },
+    {
+        'O', NULL, codes_1_X_X
+    },
+    {
+        'U', NULL, codes_1_X_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_K[] =
+{
+    {
+        'H', NULL, codes_5_5_5
+    },
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_M[] =
+{
+    {
+        'N', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_N[] =
+{
+    {
+        'M', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_O[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_P[] =
+{
+    {
+        'F', NULL, codes_7_7_7
+    },
+    {
+        'H', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_R[] =
+{
+    {
+        'S', NULL, codes_94_94_94_or_4_4_4
+    },
+    {
+        'Z', NULL, codes_94_94_94_or_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTS[] =
+{
+    {
+        'C', letter_SCHTSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHT[] =
+{
+    {
+        'C', letter_SCHTC, NULL
+    },
+    {
+        'S', letter_SCHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCH[] =
+{
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SCHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SC[] =
+{
+    {
+        'H', letter_SCH, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTS[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHT[] =
+{
+    {
+        'C', letter_SHTC, NULL
+    },
+    {
+        'S', letter_SHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SH[] =
+{
+    {
+        'C', letter_SHC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STR[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STS[] =
+{
+    {
+        'C', letter_STSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ST[] =
+{
+    {
+        'C', letter_STC, NULL
+    },
+    {
+        'R', letter_STR, NULL
+    },
+    {
+        'S', letter_STS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZC[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZ[] =
+{
+    {
+        'C', letter_SZC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', NULL, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_S[] =
+{
+    {
+        'C', letter_SC, codes_2_4_4
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'H', letter_SH, codes_4_4_4
+    },
+    {
+        'T', letter_ST, codes_2_43_43
+    },
+    {
+        'Z', letter_SZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TS[] =
+{
+    {
+        'C', letter_TSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTS[] =
+{
+    {
+        'C', letter_TTSC, NULL
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TT[] =
+{
+    {
+        'C', letter_TTC, NULL
+    },
+    {
+        'S', letter_TTS, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_T[] =
+{
+    {
+        'C', letter_TC, codes_4_4_4
+    },
+    {
+        'H', NULL, codes_3_3_3
+    },
+    {
+        'R', letter_TR, NULL
+    },
+    {
+        'S', letter_TS, codes_4_4_4
+    },
+    {
+        'T', letter_TT, NULL
+    },
+    {
+        'Z', letter_TZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_U[] =
+{
+    {
+        'E', NULL, codes_0_1_X
+    },
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZD[] =
+{
+    {
+        'Z', letter_ZDZ, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHD[] =
+{
+    {
+        'Z', letter_ZHDZ, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZH[] =
+{
+    {
+        'D', letter_ZHD, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZS[] =
+{
+    {
+        'C', letter_ZSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_Z[] =
+{
+    {
+        'D', letter_ZD, codes_2_43_43
+    },
+    {
+        'H', letter_ZH, codes_4_4_4
+    },
+    {
+        'S', letter_ZS, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_[] =
+{
+    {
+        'A', letter_A, codes_0_X_X
+    },
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        'C', letter_C, codes_5_5_5_or_4_4_4
+    },
+    {
+        'D', letter_D, codes_3_3_3
+    },
+    {
+        'E', letter_E, codes_0_X_X
+    },
+    {
+        'F', letter_F, codes_7_7_7
+    },
+    {
+        'G', NULL, codes_5_5_5
+    },
+    {
+        'H', NULL, codes_5_5_X
+    },
+    {
+        'I', letter_I, codes_0_X_X
+    },
+    {
+        'J', NULL, codes_1_X_X_or_4_4_4
+    },
+    {
+        'K', letter_K, codes_5_5_5
+    },
+    {
+        'L', NULL, codes_8_8_8
+    },
+    {
+        'M', letter_M, codes_6_6_6
+    },
+    {
+        'N', letter_N, codes_6_6_6
+    },
+    {
+        'O', letter_O, codes_0_X_X
+    },
+    {
+        'P', letter_P, codes_7_7_7
+    },
+    {
+        'Q', NULL, codes_5_5_5
+    },
+    {
+        'R', letter_R, codes_9_9_9
+    },
+    {
+        'S', letter_S, codes_4_4_4
+    },
+    {
+        'T', letter_T, codes_3_3_3
+    },
+    {
+        'U', letter_U, codes_0_X_X
+    },
+    {
+        'V', NULL, codes_7_7_7
+    },
+    {
+        'W', NULL, codes_7_7_7
+    },
+    {
+        'X', NULL, codes_5_54_54
+    },
+    {
+        'Y', NULL, codes_1_X_X
+    },
+    {
+        'Z', letter_Z, codes_4_4_4
+    },
+    {
+        'a', NULL, codes_X_X_6_or_X_X_X
+    },
+    {
+        'e', NULL, codes_X_X_6_or_X_X_X
+    },
+    {
+        't', NULL, codes_3_3_3_or_4_4_4
+    },
+    {
+        '\0'
+    }
+};
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..3e97e000ee
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,288 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character can never yield more than two alternate codes, and is not considered here.
+    if (@codes > 1) {
+        for my $j (1..2) {
+            for my $i (0..1) {
+                my ($a, $b) = (substr($codes[$i][$j], -1, 1), substr($codes[($i + 1)%2][$j], 0, 1));
+                $alternates{$a}{$b} = 1 if $a ne $b;
+            }
+        }
+    }
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Find the maximum number of alternate codes in one position.
+my $alt_x = $alternates{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (defined delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+my $max_alt = (sort { $b <=> $a } (2, map { scalar keys %$_ } values %alternates))[0];
+
+# The maximum number of nodes and leaves in the soundex tree.
+# These are safe estimates, but in practice somewhat higher than the actual maximums.
+# Note that the first character can never yield more than two alternate codes,
+# hence the calculations are performed as sums of two subtrees.
+my $digits = 6;
+# Number of nodes (sum of geometric progression).
+my $max_nodes = 2 + 2*(1 - $max_alt**($digits - 1))/(1 - $max_alt);
+# Number of leaves (exponential of base number).
+my $max_leaves = 2*$max_alt**($digits - 2);
+
+print <<EOF;
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS $digits
+#define DM_MAX_ALTERNATE_CODES $max_alt
+#define DM_MAX_NODES $max_nodes
+#define DM_MAX_LEAVES $max_leaves
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    struct dm_letter *letters;    /* List of possible successive letters */
+    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print "static dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print "static dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print "$_,\n";
+    }
+    print "\t{\n\t\t'\\0'\n\t}\n";
+    print "};\n";
+}
+
+hash2code($table, '');
+
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# X = NC (not coded)
+#
+# Note that the following letters are coded with substitute letters
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" below does not correspond to the table referred to above,
+# however it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..f62ddad4ee 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..32d8260383
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
similarity index 92%
rename from contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
rename to contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
index 41de9d949b..2a8a100699 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
@@ -42,3 +42,7 @@ LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
 CREATE FUNCTION dmetaphone_alt (text) RETURNS text
 AS 'MODULE_PATHNAME', 'dmetaphone_alt'
 LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   Alternative soundex codes are separated by spaces, which makes the returned
+   text suitable for use in Full Text Search; see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>

Re: daitch_mokotoff module

From
Dag Lem
Date:
Hi,

Just some minor adjustments to the patch:

* Removed call to locale-dependent toupper()
* Cleaned up input normalization
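
To illustrate the first point with a minimal sketch (this is not the code in
the patch, and the helper names ascii_upper and ascii_upper_str are made up
for this example): toupper() consults the current LC_CTYPE locale, so the
result for a given byte can vary with the locale settings. Comparing against
the ASCII range avoids that dependency:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: upper-case a single byte
 * without consulting the current locale. Only ASCII 'a'..'z' is changed,
 * every other byte value is returned as is.
 */
static unsigned char
ascii_upper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned char) (c - 'a' + 'A');
    return c;
}

/* Upper-case a NUL-terminated string in place, ASCII letters only. */
static char *
ascii_upper_str(char *s)
{
    char       *p;

    assert(s != NULL);
    for (p = s; *p; p++)
        *p = (char) ascii_upper((unsigned char) *p);
    return s;
}
```

With this, "besst, giorgio" becomes "BESST, GIORGIO" whatever the locale is.
The patch itself goes one step further and folds ISO-8859-1 letters to
upper-case ASCII through a lookup table.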

I have been asked to sign up to review a commitfest patch or patches -
unfortunately I've been ill with COVID-19 and it's not until now that
I feel well enough to have a look.

Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
have been reviewed now).

If there are other suggestions for a patch or patches to review for
someone new to PostgreSQL internals, I'd be grateful for that.


Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..ba87061845
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,587 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   which are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#include <string.h>
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string,
+               *tmp_soundex;
+    text       *soundex;
+
+    /*
+     * The maximum theoretical soundex size is several KB, however in practice
+     * anything but contrived synthetic inputs will yield a soundex size of
+     * less than 100 bytes. We thus allocate and free a temporary work buffer,
+     * and return only the actual soundex result.
+     */
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
+
+    if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
+    {
+        /* No encodable characters in input. */
+        pfree(tmp_soundex);
+        PG_RETURN_NULL();
+    }
+
+    soundex = cstring_to_text(pg_any_to_server(tmp_soundex, strlen(tmp_soundex), PG_UTF8));
+    pfree(tmp_soundex);
+
+    PG_RETURN_TEXT_P(soundex);
+}
+
+
+typedef dm_node dm_nodes[DM_MAX_NODES];
+typedef dm_node * dm_leaves[DM_MAX_LEAVES];
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .next_nodes = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+        leaves_next[(*num_leaves_next)++] = node;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
+                                     dm_node * node, char code_digit)
+{
+    dm_node   **next_nodes;
+    dm_node    *next_node;
+
+    for (next_nodes = node->next_nodes; (next_node = *next_nodes); next_nodes++)
+    {
+        if (next_node->code_digit == code_digit)
+        {
+            return next_node;
+        }
+    }
+
+    next_node = &nodes[(*num_nodes)++];
+    *next_nodes = next_node;
+
+    *next_node = start_node;
+    memcpy(next_node->soundex, node->soundex, sizeof(next_node->soundex));
+    next_node->soundex_length = node->soundex_length;
+    next_node->soundex[next_node->soundex_length++] = code_digit;
+    next_node->code_digit = code_digit;
+    next_node->next_code_index = node->prev_code_index;
+
+    return next_node;
+}
+
+
+/* Update node for next code digit(s). */
+static int
+update_node(dm_nodes nodes, dm_node * node, int *num_nodes,
+            dm_leaves leaves_next, int *num_leaves_next,
+            int letter_no, int prev_code_index, int next_code_index,
+            char *next_code_digits, int digit_no)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->soundex_length == DM_MAX_CODE_DIGITS)
+    {
+        /* Keep completed soundex code. */
+        set_leaf(leaves_next, num_leaves_next, node);
+        return 0;
+    }
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return 1;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_node(nodes, num_nodes, node, next_code_digit);
+        initialize_node(node, letter_no);
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(nodes, dirty_nodes[i], num_nodes,
+                        leaves_next, num_leaves_next,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no);
+        }
+        else
+        {
+            set_leaf(leaves_next, num_leaves_next, dirty_nodes[i]);
+        }
+    }
+
+    return 1;
+}
+
+
+/* Update soundex tree leaf nodes. Return 1 when all nodes are completed. */
+static int
+update_leaves(dm_nodes nodes, int *num_nodes,
+              dm_leaves leaves[2], int *ix_leaves, int *num_leaves,
+              int letter_no, dm_codes * codes, dm_codes * next_codes)
+{
+    int            i,
+                j,
+                k,
+                code_index;
+    dm_code    *code,
+               *next_code;
+    int            num_leaves_next = 0;
+    int            ix_leaves_next = (*ix_leaves + 1) & 1;    /* Alternate ix: 0, 1 */
+    int            finished = 1;
+
+    for (i = 0; i < *num_leaves; i++)
+    {
+        dm_node    *node = leaves[*ix_leaves][i];
+
+        /* One or two alternate code sequences. */
+        for (j = 0; j < 2 && (code = codes[j]) && code[0][0]; j++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (k = 0; k < 2 && (next_code = next_codes[k]) && next_code[0][0]; k++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                if (update_node(nodes, node, num_nodes,
+                                leaves[ix_leaves_next], &num_leaves_next,
+                                letter_no, prev_code_index, code_index,
+                                code[code_index], 0))
+                {
+                    finished = 0;
+                }
+            }
+        }
+    }
+
+    *ix_leaves = ix_leaves_next;
+    *num_leaves = num_leaves_next;
+
+    return finished;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char tr_iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !                             ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+static char
+iso8859_1_to_ascii_upper(unsigned char c)
+{
+    return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
+}
+
+
+/* Convert a UTF-8 character to ISO-8859-1.
+ * Unconvertible characters are returned as '?'.
+ * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
+ */
+static char
+utf8_to_iso8859_1(char *str, int *ix)
+{
+    const char    unknown = '?';
+    unsigned char c,
+                c2;
+    unsigned int code_point;
+
+    /* First byte. */
+    c = (unsigned char) str[(*ix)++];
+    if (c < 0x80)
+    {
+        /* ASCII code point. */
+        if (c >= '[' && c <= ']')
+        {
+            /* Codes reserved for Ą, Ę, and Ţ/Ț. */
+            return unknown;
+        }
+
+        return c;
+    }
+
+    /* Second byte. */
+    c2 = (unsigned char) str[(*ix)++];
+    if (!c2)
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+        return unknown;
+    }
+
+    if (c < 0xE0)
+    {
+        /* Two-byte character. */
+        code_point = ((c & 0x1F) << 6) | (c2 & 0x3F);
+        if (code_point < 0x100)
+        {
+            /* ISO-8859-1 code point. */
+            return code_point;
+        }
+        else if (code_point == 0x0104 || code_point == 0x0105)
+        {
+            /* Ą/ą */
+            return '[';
+        }
+        else if (code_point == 0x0118 || code_point == 0x0119)
+        {
+            /* Ę/ę */
+            return '\\';
+        }
+        else if (code_point == 0x0162 || code_point == 0x0163 ||
+                 code_point == 0x021A || code_point == 0x021B)
+        {
+            /* Ţ/ţ or Ț/ț */
+            return ']';
+        }
+        else
+        {
+            return unknown;
+        }
+    }
+
+    /* Third byte. */
+    if (!str[(*ix)++])
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+        return unknown;
+    }
+
+    if (c < 0xF0)
+    {
+        /* Three-byte character. */
+        return unknown;
+    }
+
+    /* Fourth byte. */
+    if (!str[(*ix)++])
+    {
+        /* The UTF-8 character is cut short (invalid code point). */
+        (*ix)--;
+    }
+
+    return unknown;
+}
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(char *str, int *ix)
+{
+    return iso8859_1_to_ascii_upper(utf8_to_iso8859_1(str, ix));
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char(str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    dm_letter  *letters;
+    dm_codes   *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static char *
+_daitch_mokotoff(char *word, char *soundex, size_t n)
+{
+    int            i = 0,
+                j;
+    int            letter_no = 0;
+    int            ix_leaves = 0;
+    int            num_nodes = 0,
+                num_leaves = 0;
+    dm_codes   *codes,
+               *next_codes;
+    dm_node    *nodes;
+    dm_leaves  *leaves;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return NULL;
+    }
+
+    /* Allocate memory for node tree. */
+    nodes = palloc(sizeof(dm_nodes));
+    leaves = palloc(2 * sizeof(dm_leaves));
+
+    /* Starting point. */
+    nodes[num_nodes++] = start_node;
+    leaves[ix_leaves][num_leaves++] = &nodes[0];
+
+    while (codes)
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        if (update_leaves(nodes, &num_nodes,
+                          leaves, &ix_leaves, &num_leaves,
+                          letter_no, codes, next_codes ? next_codes : end_codes))
+        {
+            /* All soundex codes are completed to six digits. */
+            break;
+        }
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Concatenate all generated soundex codes. */
+    for (i = 0, j = 0;
+         i < num_leaves && j + DM_MAX_CODE_DIGITS + 1 <= n;
+         i++, j += DM_MAX_CODE_DIGITS + 1)
+    {
+        memcpy(&soundex[j], leaves[ix_leaves][i]->soundex, DM_MAX_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string. */
+    soundex[j - 1] = '\0';
+
+    pfree(leaves);
+    pfree(nodes);
+
+    return soundex;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.h b/contrib/fuzzystrmatch/daitch_mokotoff.h
new file mode 100644
index 0000000000..8426069825
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.h
@@ -0,0 +1,999 @@
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS 6
+#define DM_MAX_ALTERNATE_CODES 5
+#define DM_MAX_NODES 1564
+#define DM_MAX_LEAVES 1250
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    struct dm_letter *letters;    /* List of possible successive letters */
+    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+static dm_codes codes_0_1_X[2] =
+{
+    {
+        "0", "1", "X"
+    }
+};
+static dm_codes codes_0_7_X[2] =
+{
+    {
+        "0", "7", "X"
+    }
+};
+static dm_codes codes_0_X_X[2] =
+{
+    {
+        "0", "X", "X"
+    }
+};
+static dm_codes codes_1_1_X[2] =
+{
+    {
+        "1", "1", "X"
+    }
+};
+static dm_codes codes_1_X_X[2] =
+{
+    {
+        "1", "X", "X"
+    }
+};
+static dm_codes codes_1_X_X_or_4_4_4[2] =
+{
+    {
+        "1", "X", "X"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_2_43_43[2] =
+{
+    {
+        "2", "43", "43"
+    }
+};
+static dm_codes codes_2_4_4[2] =
+{
+    {
+        "2", "4", "4"
+    }
+};
+static dm_codes codes_3_3_3[2] =
+{
+    {
+        "3", "3", "3"
+    }
+};
+static dm_codes codes_3_3_3_or_4_4_4[2] =
+{
+    {
+        "3", "3", "3"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_4_4_4[2] =
+{
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_5_54_54[2] =
+{
+    {
+        "5", "54", "54"
+    }
+};
+static dm_codes codes_5_5_5[2] =
+{
+    {
+        "5", "5", "5"
+    }
+};
+static dm_codes codes_5_5_5_or_45_45_45[2] =
+{
+    {
+        "5", "5", "5"
+    },
+    {
+        "45", "45", "45"
+    }
+};
+static dm_codes codes_5_5_5_or_4_4_4[2] =
+{
+    {
+        "5", "5", "5"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_5_5_X[2] =
+{
+    {
+        "5", "5", "X"
+    }
+};
+static dm_codes codes_66_66_66[2] =
+{
+    {
+        "66", "66", "66"
+    }
+};
+static dm_codes codes_6_6_6[2] =
+{
+    {
+        "6", "6", "6"
+    }
+};
+static dm_codes codes_7_7_7[2] =
+{
+    {
+        "7", "7", "7"
+    }
+};
+static dm_codes codes_8_8_8[2] =
+{
+    {
+        "8", "8", "8"
+    }
+};
+static dm_codes codes_94_94_94_or_4_4_4[2] =
+{
+    {
+        "94", "94", "94"
+    },
+    {
+        "4", "4", "4"
+    }
+};
+static dm_codes codes_9_9_9[2] =
+{
+    {
+        "9", "9", "9"
+    }
+};
+static dm_codes codes_X_X_6_or_X_X_X[2] =
+{
+    {
+        "X", "X", "6"
+    },
+    {
+        "X", "X", "X"
+    }
+};
+
+/* Coding for alternative following letters in sequence. */
+static dm_letter letter_A[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_0_7_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CH[] =
+{
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CS[] =
+{
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_CZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_C[] =
+{
+    {
+        'H', letter_CH, codes_5_5_5_or_4_4_4
+    },
+    {
+        'K', NULL, codes_5_5_5_or_45_45_45
+    },
+    {
+        'S', letter_CS, codes_4_4_4
+    },
+    {
+        'Z', letter_CZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DS[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_DZ[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_D[] =
+{
+    {
+        'R', letter_DR, NULL
+    },
+    {
+        'S', letter_DS, codes_4_4_4
+    },
+    {
+        'T', NULL, codes_3_3_3
+    },
+    {
+        'Z', letter_DZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_E[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'U', NULL, codes_1_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_F[] =
+{
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_I[] =
+{
+    {
+        'A', NULL, codes_1_X_X
+    },
+    {
+        'E', NULL, codes_1_X_X
+    },
+    {
+        'O', NULL, codes_1_X_X
+    },
+    {
+        'U', NULL, codes_1_X_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_K[] =
+{
+    {
+        'H', NULL, codes_5_5_5
+    },
+    {
+        'S', NULL, codes_5_54_54
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_M[] =
+{
+    {
+        'N', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_N[] =
+{
+    {
+        'M', NULL, codes_66_66_66
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_O[] =
+{
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_P[] =
+{
+    {
+        'F', NULL, codes_7_7_7
+    },
+    {
+        'H', NULL, codes_7_7_7
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_R[] =
+{
+    {
+        'S', NULL, codes_94_94_94_or_4_4_4
+    },
+    {
+        'Z', NULL, codes_94_94_94_or_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHTS[] =
+{
+    {
+        'C', letter_SCHTSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCHT[] =
+{
+    {
+        'C', letter_SCHTC, NULL
+    },
+    {
+        'S', letter_SCHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SCH[] =
+{
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SCHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SC[] =
+{
+    {
+        'H', letter_SCH, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHTS[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SHT[] =
+{
+    {
+        'C', letter_SHTC, NULL
+    },
+    {
+        'S', letter_SHTS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SH[] =
+{
+    {
+        'C', letter_SHC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', letter_SHT, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STR[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STSC[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_STS[] =
+{
+    {
+        'C', letter_STSC, NULL
+    },
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ST[] =
+{
+    {
+        'C', letter_STC, NULL
+    },
+    {
+        'R', letter_STR, NULL
+    },
+    {
+        'S', letter_STS, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZC[] =
+{
+    {
+        'S', NULL, codes_2_4_4
+    },
+    {
+        'Z', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_SZ[] =
+{
+    {
+        'C', letter_SZC, NULL
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'T', NULL, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_S[] =
+{
+    {
+        'C', letter_SC, codes_2_4_4
+    },
+    {
+        'D', NULL, codes_2_43_43
+    },
+    {
+        'H', letter_SH, codes_4_4_4
+    },
+    {
+        'T', letter_ST, codes_2_43_43
+    },
+    {
+        'Z', letter_SZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TR[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TS[] =
+{
+    {
+        'C', letter_TSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TTS[] =
+{
+    {
+        'C', letter_TTSC, NULL
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TT[] =
+{
+    {
+        'C', letter_TTC, NULL
+    },
+    {
+        'S', letter_TTS, codes_4_4_4
+    },
+    {
+        'Z', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_TZ[] =
+{
+    {
+        'S', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_T[] =
+{
+    {
+        'C', letter_TC, codes_4_4_4
+    },
+    {
+        'H', NULL, codes_3_3_3
+    },
+    {
+        'R', letter_TR, NULL
+    },
+    {
+        'S', letter_TS, codes_4_4_4
+    },
+    {
+        'T', letter_TT, NULL
+    },
+    {
+        'Z', letter_TZ, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_U[] =
+{
+    {
+        'E', NULL, codes_0_1_X
+    },
+    {
+        'I', NULL, codes_0_1_X
+    },
+    {
+        'J', NULL, codes_0_1_X
+    },
+    {
+        'Y', NULL, codes_0_1_X
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZD[] =
+{
+    {
+        'Z', letter_ZDZ, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHDZ[] =
+{
+    {
+        'H', NULL, codes_2_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZHD[] =
+{
+    {
+        'Z', letter_ZHDZ, NULL
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZH[] =
+{
+    {
+        'D', letter_ZHD, codes_2_43_43
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZSC[] =
+{
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_ZS[] =
+{
+    {
+        'C', letter_ZSC, NULL
+    },
+    {
+        'H', NULL, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_Z[] =
+{
+    {
+        'D', letter_ZD, codes_2_43_43
+    },
+    {
+        'H', letter_ZH, codes_4_4_4
+    },
+    {
+        'S', letter_ZS, codes_4_4_4
+    },
+    {
+        '\0'
+    }
+};
+static dm_letter letter_[] =
+{
+    {
+        'A', letter_A, codes_0_X_X
+    },
+    {
+        'B', NULL, codes_7_7_7
+    },
+    {
+        'C', letter_C, codes_5_5_5_or_4_4_4
+    },
+    {
+        'D', letter_D, codes_3_3_3
+    },
+    {
+        'E', letter_E, codes_0_X_X
+    },
+    {
+        'F', letter_F, codes_7_7_7
+    },
+    {
+        'G', NULL, codes_5_5_5
+    },
+    {
+        'H', NULL, codes_5_5_X
+    },
+    {
+        'I', letter_I, codes_0_X_X
+    },
+    {
+        'J', NULL, codes_1_X_X_or_4_4_4
+    },
+    {
+        'K', letter_K, codes_5_5_5
+    },
+    {
+        'L', NULL, codes_8_8_8
+    },
+    {
+        'M', letter_M, codes_6_6_6
+    },
+    {
+        'N', letter_N, codes_6_6_6
+    },
+    {
+        'O', letter_O, codes_0_X_X
+    },
+    {
+        'P', letter_P, codes_7_7_7
+    },
+    {
+        'Q', NULL, codes_5_5_5
+    },
+    {
+        'R', letter_R, codes_9_9_9
+    },
+    {
+        'S', letter_S, codes_4_4_4
+    },
+    {
+        'T', letter_T, codes_3_3_3
+    },
+    {
+        'U', letter_U, codes_0_X_X
+    },
+    {
+        'V', NULL, codes_7_7_7
+    },
+    {
+        'W', NULL, codes_7_7_7
+    },
+    {
+        'X', NULL, codes_5_54_54
+    },
+    {
+        'Y', NULL, codes_1_X_X
+    },
+    {
+        'Z', letter_Z, codes_4_4_4
+    },
+    {
+        'a', NULL, codes_X_X_6_or_X_X_X
+    },
+    {
+        'e', NULL, codes_X_X_6_or_X_X_X
+    },
+    {
+        't', NULL, codes_3_3_3_or_4_4_4
+    },
+    {
+        '\0'
+    }
+};
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..3e97e000ee
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,288 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character can never yield more than two alternate codes, and is not considered here.
+    if (@codes > 1) {
+        for my $j (1..2) {
+            for my $i (0..1) {
+                my ($a, $b) = (substr($codes[$i][$j], -1, 1), substr($codes[($i + 1)%2][$j], 0, 1));
+                $alternates{$a}{$b} = 1 if $a ne $b;
+            }
+        }
+    }
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Find the maximum number of alternate codes in one position.
+my $alt_x = $alternates{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (defined delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+my $max_alt = (sort { $b <=> $a } (2, map { scalar keys %$_ } values %alternates))[0];
+
+# The maximum number of nodes and leaves in the soundex tree.
+# These are safe estimates, but in practice somewhat higher than the actual maximums.
+# Note that the first character can never yield more than two alternate codes,
+# hence the calculations are performed as sums of two subtrees.
+my $digits = 6;
+# Number of nodes (sum of geometric progression).
+my $max_nodes = 2 + 2*(1 - $max_alt**($digits - 1))/(1 - $max_alt);
+# Number of leaves (exponential of base number).
+my $max_leaves = 2*$max_alt**($digits - 2);
+
+print <<EOF;
+/*
+ * Types and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include <stdlib.h>
+
+#define DM_MAX_CODE_DIGITS $digits
+#define DM_MAX_ALTERNATE_CODES $max_alt
+#define DM_MAX_NODES $max_nodes
+#define DM_MAX_LEAVES $max_leaves
+#define DM_MAX_SOUNDEX_CHARS (DM_MAX_NODES*(DM_MAX_CODE_DIGITS + 1))
+
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    struct dm_letter *letters;    /* List of possible successive letters */
+    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
+};
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
+};
+
+typedef struct dm_letter dm_letter;
+typedef struct dm_node dm_node;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print "static dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print "static dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print "$_,\n";
+    }
+    print "\t{\n\t\t'\\0'\n\t}\n";
+    print "};\n";
+}
+
+hash2code($table, '');
+
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# X = NC (not coded)
+#
+# Note that the following letters are coded with substitute letters
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" below does not correspond to the table referred to above,
+# however it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..f62ddad4ee 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..32d8260383
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
similarity index 92%
rename from contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
rename to contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
index 41de9d949b..2a8a100699 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch--1.1.sql
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.2.sql
@@ -42,3 +42,7 @@ LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
 CREATE FUNCTION dmetaphone_alt (text) RETURNS text
 AS 'MODULE_PATHNAME', 'dmetaphone_alt'
 LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   Any alternative soundex codes are separated by spaces, which makes the returned
+   text suited for use in Full Text Search, see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>
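
A note on the generated letter_* tables in the patch above: they form a trie in which each array lists the letters that may follow a given prefix, and the code for a letter combination is stored at its last character. Below is a minimal, standalone sketch of a longest-match lookup over such a trie. The sketch_* names and the single-string `codes` field are assumptions made for illustration (the patch's dm_letter carries a full dm_codes triple), only a hand-written excerpt of the table is included, and daitch_mokotoff.c may well walk the tables differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified stand-in for the patch's dm_letter (hypothetical): it keeps only
 * the "start of name" code as a string instead of the full dm_codes triple.
 */
typedef struct sketch_letter
{
    char        letter;                     /* Present letter in sequence */
    const struct sketch_letter *letters;    /* Possible successive letters */
    const char *codes;                      /* Code for the sequence, or NULL */
} sketch_letter;

/* Hand-written excerpt: S, SC, SCH, SH and T. Each array ends with '\0'. */
static const sketch_letter sketch_SC[] = {
    {'H', NULL, "4"},           /* SCH */
    {'\0', NULL, NULL}
};
static const sketch_letter sketch_S[] = {
    {'C', sketch_SC, "2"},      /* SC */
    {'H', NULL, "4"},           /* SH */
    {'\0', NULL, NULL}
};
static const sketch_letter sketch_root[] = {
    {'S', sketch_S, "4"},       /* S */
    {'T', NULL, "3"},           /* T */
    {'\0', NULL, NULL}
};

/*
 * Walk the trie from `letters`, consuming characters of `str`, and return the
 * code of the longest coded sequence found. *len receives the number of
 * characters covered by that sequence, or 0 if str starts with no coded
 * sequence (in which case NULL is returned).
 */
static const char *
sketch_longest_match(const sketch_letter *letters, const char *str, size_t *len)
{
    const char *best = NULL;
    size_t      depth = 0;

    assert(str != NULL && len != NULL);
    *len = 0;
    while (letters != NULL && str[depth] != '\0')
    {
        const sketch_letter *l = letters;

        /* Linear scan of the alternatives at this level. */
        while (l->letter != '\0' && l->letter != str[depth])
            l++;
        if (l->letter == '\0')
            break;              /* no transition for this character */
        depth++;
        if (l->codes != NULL)
        {
            best = l->codes;    /* longest coded prefix seen so far */
            *len = depth;
        }
        letters = l->letters;   /* descend; NULL ends the walk */
    }
    return best;
}
```

With this excerpt, calling sketch_longest_match(sketch_root, "SCHMIDT", &n) consumes "SCH" rather than stopping at "S" or "SC". Entries whose code is NULL (intermediate nodes such as 'C' in letter_ST) are passed through without updating the best match.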

Re: daitch_mokotoff module

From
Ian Lawrence Barwick
Date:
Hi Dag

On Thu, 3 Feb 2022 at 23:27, Dag Lem <dag@nimrod.no> wrote:
>
> Hi,
>
> Just some minor adjustments to the patch:
>
> * Removed call to locale-dependent toupper()
> * Cleaned up input normalization

This patch was marked as "Waiting on Author" in the CommitFest entry [1]
but I see you provided an updated version which hasn't received any feedback,
so I've moved this to the next CommitFest [2] and set it to "Needs Review".

[1] https://commitfest.postgresql.org/40/3451/
[2] https://commitfest.postgresql.org/41/3451/

> I have been asked to sign up to review a commitfest patch or patches -
> unfortunately I've been ill with COVID-19 and it's not until now that
> I feel well enough to have a look.
>
> Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
> as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
> have been reviewed now).
>
> If there are other suggestions for a patch or patches to review for
> someone new to PostgreSQL internals, I'd be grateful for that.

I see you provided some feedback on https://commitfest.postgresql.org/36/3468/,
though the patch seems to have not been accepted (but not conclusively rejected
either). If you still have the chance to review another patch (or more) it would
be much appreciated, as there's quite a few piling up. Things like documentation
or small improvements to client applications are always a good place to start.
Reviews can be provided at any time, there's no need to wait for the next
CommitFest.

Regards

Ian Barwick



Re: daitch_mokotoff module

From
Dag Lem
Date:
Hi Ian,

Ian Lawrence Barwick <barwick@gmail.com> writes:

> Hi Dag
>
> On Thu, 3 Feb 2022 at 23:27, Dag Lem <dag@nimrod.no> wrote:
>>
>> Hi,
>>
>> Just some minor adjustments to the patch:
>>
>> * Removed call to locale-dependent toupper()
>> * Cleaned up input normalization
>
> This patch was marked as "Waiting on Author" in the CommitFest entry [1]
> but I see you provided an updated version which hasn't received any feedback,
> so I've moved this to the next CommitFest [2] and set it to "Needs Review".
>
> [1] https://commitfest.postgresql.org/40/3451/
> [2] https://commitfest.postgresql.org/41/3451/
>
>> I have been asked to sign up to review a commitfest patch or patches -
>> unfortunately I've been ill with COVID-19 and it's not until now that
>> I feel well enough to have a look.
>>
>> Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
>> as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
>> have been reviewed now).
>>
>> If there are other suggestions for a patch or patches to review for
>> someone new to PostgreSQL internals, I'd be grateful for that.
>
> I see you provided some feedback on https://commitfest.postgresql.org/36/3468/,
> though the patch seems to have not been accepted (but not conclusively rejected
> either). If you still have the chance to review another patch (or more) it would
> be much appreciated, as there's quite a few piling up. Things like documentation
> or small improvements to client applications are always a good place to start.
> Reviews can be provided at any time, there's no need to wait for the next
> CommitFest.
>

OK, I'll try to find another patch to review.

Regards

Dag Lem



Re: daitch_mokotoff module

From
Andres Freund
Date:
Hi,

On 2022-02-03 15:27:32 +0100, Dag Lem wrote:
> Just some minor adjustments to the patch:
> 
> * Removed call to locale-dependent toupper()
> * Cleaned up input normalization

This patch currently fails in cfbot, likely because meson.build needs to be
adjusted (this didn't exist at the time you submitted this version of the
patch):

[23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File fuzzystrmatch--1.1.sql does not exist.


> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
>  PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"


The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
that. In recent years our approach has been to just keep the "base version" of
the upgrade script, with extension creation running through the upgrade
scripts.

>  
> +
> +#include "daitch_mokotoff.h"
> +
> +#include "postgres.h"

Postgres policy is that the include of "postgres.h" has to be the first
include in every .c file.


> +#include "utils/builtins.h"
> +#include "mb/pg_wchar.h"
> +
> +#include <string.h>
> +
> +/* Internal C implementation */
> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
> +
> +
> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
> +Datum
> +daitch_mokotoff(PG_FUNCTION_ARGS)
> +{
> +    text       *arg = PG_GETARG_TEXT_PP(0);
> +    char       *string,
> +               *tmp_soundex;
> +    text       *soundex;
> +
> +    /*
> +     * The maximum theoretical soundex size is several KB, however in practice
> +     * anything but contrived synthetic inputs will yield a soundex size of
> +     * less than 100 bytes. We thus allocate and free a temporary work buffer,
> +     * and return only the actual soundex result.
> +     */
> +    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
> +    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);

Seems that just using StringInfo to hold the soundex output would work better
than a static allocation?
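
As a rough illustration of the idea (not PostgreSQL code), a growable output buffer can replace a fixed worst-case allocation. The sketch below is a plain-C stand-in for what StringInfo provides; the names dm_buf, dm_buf_init and dm_buf_append are hypothetical, and in the module itself one would presumably use initStringInfo() and the appendStringInfo*() functions from lib/stringinfo.h instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for StringInfoData: data is always NUL-terminated. */
typedef struct dm_buf
{
    char   *data;
    size_t  len;     /* bytes used, excluding the trailing NUL */
    size_t  maxlen;  /* bytes allocated */
} dm_buf;

static void
dm_buf_init(dm_buf *buf)
{
    buf->maxlen = 16;           /* small initial size; grows on demand */
    buf->data = malloc(buf->maxlen);
    assert(buf->data != NULL);
    buf->data[0] = '\0';
    buf->len = 0;
}

/* Append n bytes, doubling the allocation until they fit. */
static void
dm_buf_append(dm_buf *buf, const char *src, size_t n)
{
    if (buf->len + n + 1 > buf->maxlen)
    {
        size_t newmax = buf->maxlen;

        while (buf->len + n + 1 > newmax)
            newmax *= 2;
        buf->data = realloc(buf->data, newmax);
        assert(buf->data != NULL);
        buf->maxlen = newmax;
    }
    memcpy(buf->data + buf->len, src, n);
    buf->len += n;
    buf->data[buf->len] = '\0';
}
```

With a buffer like this, each completed six-digit code plus its joining space can be appended as it is produced, and only the bytes actually used are returned.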


> +    if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))

We imo shouldn't introduce new functions starting with _.


> +/* Mark soundex code tree node as leaf. */
> +static void
> +set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
> +{
> +    if (!node->is_leaf)
> +    {
> +        node->is_leaf = 1;
> +        leaves_next[(*num_leaves_next)++] = node;
> +    }
> +}
> +
> +
> +/* Find next node corresponding to code digit, or create a new node. */
> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
> +                                     dm_node * node, char code_digit)

PG code style is to have a line break between a function definition's return
type and the function name - like you actually do above.




> +/* Mapping from ISO8859-1 to upper-case ASCII */
> +static const char tr_iso8859_1_to_ascii_upper[] =
> +/*
> +"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬
®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
> +*/
> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !
?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
> +
> +static char
> +iso8859_1_to_ascii_upper(unsigned char c)
> +{
> +    return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
> +}
> +
> +
> +/* Convert an UTF-8 character to ISO-8859-1.
> + * Unconvertable characters are returned as '?'.
> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
> + */
> +static char
> +utf8_to_iso8859_1(char *str, int *ix)

It seems decidedly not great to have custom encoding conversion routines in a
contrib module. Is there any way we can avoid this?


> +/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
> +static char *
> +_daitch_mokotoff(char *word, char *soundex, size_t n)
> +{
> +    int            i = 0,
> +                j;
> +    int            letter_no = 0;
> +    int            ix_leaves = 0;
> +    int            num_nodes = 0,
> +                num_leaves = 0;
> +    dm_codes   *codes,
> +               *next_codes;
> +    dm_node    *nodes;
> +    dm_leaves  *leaves;
> +
> +    /* First letter. */
> +    if (!(codes = read_letter(word, &i)))
> +    {
> +        /* No encodable character in input. */
> +        return NULL;
> +    }
> +
> +    /* Allocate memory for node tree. */
> +    nodes = palloc(sizeof(dm_nodes));
> +    leaves = palloc(2 * sizeof(dm_leaves));

So this allocates the worst case memory usage, is that right? That's quite a
bit of memory. Shouldn't nodes be allocated dynamically?

Instead of carefully freeing individual memory allocations, I think it would be
better to create a temporary memory context, allocate the necessary nodes etc
on demand, and destroy the temporary memory context at the end.
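
The appeal of a temporary memory context is that nodes can be allocated one at a time and then released together. The following plain-C sketch only mimics that behaviour with a hypothetical arena type; it is not how PostgreSQL memory contexts are implemented, and in the module the real calls would be AllocSetContextCreate(), MemoryContextSwitchTo() and MemoryContextDelete().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Header prepended to every allocation so the arena can find it again.
 * The union forces the payload that follows to be suitably aligned.
 */
typedef union arena_block
{
    union arena_block *next;
    max_align_t align;
} arena_block;

/* Hypothetical arena: a singly linked list of live allocations. */
typedef struct arena
{
    arena_block *blocks;
    size_t       count;     /* number of live allocations */
} arena;

/* Allocate size bytes on demand and remember the block in the arena. */
static void *
arena_alloc(arena *a, size_t size)
{
    arena_block *b = malloc(sizeof(arena_block) + size);

    assert(b != NULL);
    b->next = a->blocks;
    a->blocks = b;
    a->count++;
    return b + 1;           /* payload starts right after the header */
}

/* Release everything at once; no per-node bookkeeping by the caller. */
static void
arena_destroy(arena *a)
{
    while (a->blocks)
    {
        arena_block *next = a->blocks->next;

        free(a->blocks);
        a->blocks = next;
    }
    a->count = 0;
}
```

The caller never frees an individual node; a single arena_destroy() at the end of the function plays the role that deleting the temporary context would play.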


> +/* Codes for letter sequence at start of name, before a vowel, and any other. */
> +static dm_codes codes_0_1_X[2] =

Any reason these aren't all const?


It's not clear to me where the intended line between the .h and .c file is.


> +print <<EOF;
> +/*
> + * Types and lookup tables for Daitch-Mokotoff Soundex
> + *

If we generate the code, why is the generated header included in the commit?

> +/* Letter in input sequence */
> +struct dm_letter
> +{
> +    char        letter;            /* Present letter in sequence */
> +    struct dm_letter *letters;    /* List of possible successive letters */
> +    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
> +};
> +
> +/* Node in soundex code tree */
> +struct dm_node
> +{
> +    int            soundex_length; /* Length of generated soundex code */
> +    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
> +    int            is_leaf;        /* Candidate for complete soundex code */
> +    int            last_update;    /* Letter number for last update of node */
> +    char        code_digit;        /* Last code digit, 0 - 9 */
> +
> +    /*
> +     * One or two alternate code digits leading to this node. If there are two
> +     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
> +     * back to the same node.
> +     */
> +    char        prev_code_digits[2];
> +    /* One or two alternate code digits moving forward. */
> +    char        next_code_digits[2];
> +    /* ORed together code index(es) used to reach current node. */
> +    int            prev_code_index;
> +    int            next_code_index;
> +    /* Nodes branching out from this node. */
> +    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
> +};
> +
> +typedef struct dm_letter dm_letter;
> +typedef struct dm_node dm_node;

Why is all this in the generated header? It needs DM_MAX_ALTERNATE_CODES etc,
but it seems that the structs could just be defined in the .c file.


> +# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html

What does "adapted" mean here? And what's the path to updating the data?

Greetings,

Andres Freund



Re: daitch_mokotoff module

From
Dag Lem
Date:
Hi Andres,

Thank you for your detailed and constructive review!

I have made a conscientious effort to address all the issues you point
out; please see my comments below.

Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2022-02-03 15:27:32 +0100, Dag Lem wrote:

[...]

> [23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File fuzzystrmatch--1.1.sql does not exist.
>
>
>> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
>> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
>>  PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
>
>
> The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
> that. In recent years our approach has been to just keep the "base version" of
> the upgrade script, with extension creation running through the upgrade
> scripts.
>

OK, I have now kept fuzzystrmatch--1.1.sql, and omitted
fuzzystrmatch--1.2.sql

Both the Makefile and meson.build are updated to handle the new files,
including the generated header.

>>
>> +
>> +#include "daitch_mokotoff.h"
>> +
>> +#include "postgres.h"
>
> Postgres policy is that the include of "postgres.h" has to be the first
> include in every .c file.
>
>

OK, fixed.

>> +#include "utils/builtins.h"
>> +#include "mb/pg_wchar.h"
>> +
>> +#include <string.h>
>> +
>> +/* Internal C implementation */
>> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
>> +
>> +
>> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
>> +Datum
>> +daitch_mokotoff(PG_FUNCTION_ARGS)
>> +{
>> +    text       *arg = PG_GETARG_TEXT_PP(0);
>> +    char       *string,
>> +               *tmp_soundex;
>> +    text       *soundex;
>> +
>> +    /*
>> +     * The maximum theoretical soundex size is several KB, however in practice
>> +     * anything but contrived synthetic inputs will yield a soundex size of
>> +     * less than 100 bytes. We thus allocate and free a temporary work buffer,
>> +     * and return only the actual soundex result.
>> +     */
>> +    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
>> +    tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
>
> Seems that just using StringInfo to hold the soundex output would work better
> than a static allocation?
>

OK, fixed.

>
>> +    if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
>
> We imo shouldn't introduce new functions starting with _.
>

OK, fixed. Note that I just followed the existing pattern in
fuzzystrmatch.c there.

[...]

>> +/* Find next node corresponding to code digit, or create a new node. */
>> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
>> +                                     dm_node * node, char code_digit)
>
> PG code style is to have a line break between a function definition's return
> type and the function name - like you actually do above.
>

OK, fixed. Both pgindent and I must have missed that particular
function.

>> +/* Mapping from ISO8859-1 to upper-case ASCII */
>> +static const char tr_iso8859_1_to_ascii_upper[] =
>> +/*
>> +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬
>> ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
>> +*/
>> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ !
>> ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
>> +
>> +static char
>> +iso8859_1_to_ascii_upper(unsigned char c)
>> +{
>> +    return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
>> +}
>> +
>> +
>> +/* Convert an UTF-8 character to ISO-8859-1.
>> + * Unconvertable characters are returned as '?'.
>> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
>> + */
>> +static char
>> +utf8_to_iso8859_1(char *str, int *ix)
>
> It seems decidedly not great to have custom encoding conversion routines in a
> contrib module. Is there any way we can avoid this?
>

I have now replaced the custom UTF-8 decode with calls to
utf8_to_unicode and pg_utf_mblen, and simplified the subsequent
conversion to ASCII. Hopefully this makes the conversion code more
palatable.

I don't see how the conversion to ASCII could be substantially
simplified further. The conversion maps lowercase and 8 bit ISO8859-1
characters to ASCII via uppercasing, removal of accents, and discarding
of special characters. In addition to that, it maps (the non-ISO8859-1)
Ą, Ę, and Ţ/Ț from the coding chart to [, \, and ]. After this, a simple
O(1) table lookup can be used to retrieve the soundex code tree for a
letter sequence.
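
The mapping described above can be sketched roughly as follows. This is a simplified illustration, not the patch code: dm_to_ascii_upper, DM_SKIP and latin1_letters_upper are hypothetical names, the function takes an already decoded code point, and code points U+0080..U+00BF are simply skipped here instead of going through the full translation table.

```c
#include <assert.h>

#define DM_SKIP '\x1a'      /* substitute for code points that are skipped */

/* Upper-case ASCII equivalents for U+00C0..U+00FF (accents removed). */
static const char latin1_letters_upper[] =
    "AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDS"
    "AAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";

/* Map a Unicode code point to the ASCII character used for table lookup. */
static char
dm_to_ascii_upper(unsigned int cp)
{
    if (cp >= 0x5B && cp <= 0x5D)
        return DM_SKIP;     /* '[', '\\', ']' are reserved, see below */
    if (cp >= 0x61 && cp <= 0x7A)
        return (char) (cp - 0x20);  /* a-z to A-Z */
    if (cp < 0x80)
        return (char) cp;   /* other ASCII passes through unchanged */
    if (cp < 0xC0)
        return DM_SKIP;     /* simplification made in this sketch only */
    if (cp < 0x100)
        return latin1_letters_upper[cp - 0xC0];

    switch (cp)
    {
        case 0x0104:
        case 0x0105:
            return '[';     /* A with ogonek, upper and lower case */
        case 0x0118:
        case 0x0119:
            return '\\';    /* E with ogonek, upper and lower case */
        case 0x0162:
        case 0x0163:
        case 0x021A:
        case 0x021B:
            return ']';     /* T with cedilla or comma below */
        default:
            return DM_SKIP;
    }
}
```

Since every letter of interest ends up in the contiguous range 'A' to ']', the result minus 'A' can serve directly as an array index into the coding chart.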

>
>> +/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
>> +static char *
>> +_daitch_mokotoff(char *word, char *soundex, size_t n)
>> +{
>> +    int            i = 0,
>> +                j;
>> +    int            letter_no = 0;
>> +    int            ix_leaves = 0;
>> +    int            num_nodes = 0,
>> +                num_leaves = 0;
>> +    dm_codes   *codes,
>> +               *next_codes;
>> +    dm_node    *nodes;
>> +    dm_leaves  *leaves;
>> +
>> +    /* First letter. */
>> +    if (!(codes = read_letter(word, &i)))
>> +    {
>> +        /* No encodable character in input. */
>> +        return NULL;
>> +    }
>> +
>> +    /* Allocate memory for node tree. */
>> +    nodes = palloc(sizeof(dm_nodes));
>> +    leaves = palloc(2 * sizeof(dm_leaves));
>
> So this allocates the worst case memory usage, is that right? That's quite a
> bit of memory. Shouldn't nodes be allocated dynamically?
>
> Instead of carefully freeing individual memory allocations, I think it would be
> better to create a temporary memory context, allocate the necessary nodes etc
> on demand, and destroy the temporary memory context at the end.
>

Yes, the one-time allocation was intended to cover the worst case memory
usage. This was done to avoid any performance hit incurred by allocating
and deallocating memory for each new node in the soundex code tree.

I have rewritten the bookkeeping of nodes in the soundex code tree to use
linked lists, and have followed your advice to use a temporary memory
context for allocation.

I also made an optimization by excluding completed soundex nodes from
the next letter iteration. This seems to offset any allocation overhead
- the performance is more or less the same as before.
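
The alternating-index list can be pictured roughly as follows. This is a simplified sketch, not the patch code: leaf, leaf_list, list_append and advance are hypothetical names, and the done flag stands in for a node whose six-digit code is complete. Each node carries two next pointers, so the list for the next letter can be built while the list for the current letter is still being walked; completed nodes are simply not appended to the new list.

```c
#include <assert.h>
#include <stddef.h>

typedef struct leaf
{
    int          id;
    int          done;      /* stand-in for "soundex code is complete" */
    struct leaf *next[2];   /* one link per alternating list index */
} leaf;

typedef struct leaf_list
{
    leaf *first[2];
    leaf *last[2];
} leaf_list;

/* Append node to the list with index ix. */
static void
list_append(leaf_list *l, leaf *node, int ix)
{
    if (l->first[ix] == NULL)
        l->first[ix] = node;
    else
        l->last[ix]->next[ix] = node;
    l->last[ix] = node;
    node->next[ix] = NULL;
}

/* Walk list ix, carry unfinished nodes over to the other list; return new ix. */
static int
advance(leaf_list *l, int ix)
{
    int   ix_next = (ix + 1) & 1;   /* alternating index: 0, 1 */
    leaf *node;

    l->first[ix_next] = NULL;
    l->last[ix_next] = NULL;
    for (node = l->first[ix]; node; node = node->next[ix])
    {
        /* Only next[ix_next] is written, so the walk over next[ix] is safe. */
        if (!node->done)
            list_append(l, node, ix_next);
    }
    return ix_next;
}
```

Because the two lists use separate link fields, no node is copied or reallocated between iterations, and the work per letter is proportional to the number of codes still incomplete.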

>
>> +/* Codes for letter sequence at start of name, before a vowel, and any other. */
>> +static dm_codes codes_0_1_X[2] =
>
> Any reason these aren't all const?
>

No reason why they can't be :-) They are now changed to const.

>
> It's not clear to me where the intended line between the .h and .c file is.
>
>
>> +print <<EOF;
>> +/*
>> + * Types and lookup tables for Daitch-Mokotoff Soundex
>> + *
>
> If we generate the code, why is the generated header included in the commit?
>

This was mainly to have the content available for reference without
having to generate the header. I have removed the file - after the
change you suggest below, the struct declarations are available in the
.c file anyway.

>> +/* Letter in input sequence */
>> +struct dm_letter
>> +{
>> +    char        letter;            /* Present letter in sequence */
>> +    struct dm_letter *letters;    /* List of possible successive letters */
>> +    dm_codes   *codes;            /* Code sequence(s) for complete sequence */
>> +};
>> +
>> +/* Node in soundex code tree */
>> +struct dm_node
>> +{
>> +    int            soundex_length; /* Length of generated soundex code */
>> +    char        soundex[DM_MAX_CODE_DIGITS + 1];    /* Soundex code */
>> +    int            is_leaf;        /* Candidate for complete soundex code */
>> +    int            last_update;    /* Letter number for last update of node */
>> +    char        code_digit;        /* Last code digit, 0 - 9 */
>> +
>> +    /*
>> +     * One or two alternate code digits leading to this node. If there are two
>> +     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
>> +     * back to the same node.
>> +     */
>> +    char        prev_code_digits[2];
>> +    /* One or two alternate code digits moving forward. */
>> +    char        next_code_digits[2];
>> +    /* ORed together code index(es) used to reach current node. */
>> +    int            prev_code_index;
>> +    int            next_code_index;
>> +    /* Nodes branching out from this node. */
>> +    struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
>> +};
>> +
>> +typedef struct dm_letter dm_letter;
>> +typedef struct dm_node dm_node;
>
> Why is all this in the generated header? It needs DM_MAX_ALTERNATE_CODES etc,
> but it seems that the structs could just be defined in the .c file.
>

To accomplish this, I had to rearrange the code a bit. The structs are
now all declared in daitch_mokotoff.c, and the generated header is
included in between them.

>
>> +# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
>
> What does "adapted" mean here? And what's the path to updating the data?
>

It means that the original soundex coding chart, which is referred to,
has been converted to a machine readable format, with a few
modifications. These modifications are outlined further down in the
comments. I expanded a bit on the comments, hopefully making things
clearer.

I don't think there is much to be said about updating the data - that's
simply a question of modifying the table and regenerating the header
file. It goes without saying that making changes requires an
understanding of the soundex coding, which is explained in the
reference. However if anything should be unclear, please do point out
what should be explained better.

> Greetings,
>
> Andres Freund
>

Thanks again, and a Merry Christmas to you and all the other PostgreSQL
hackers!


Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+    rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..d4ad95c283
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,596 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+#define DM_CODE_DIGITS 6
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    const struct dm_letter *letters;    /* List of possible successive letters */
+    const        dm_codes *codes;    /* Code sequence(s) for complete sequence */
+};
+
+typedef struct dm_letter dm_letter;
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *children[DM_MAX_ALTERNATE_CODES + 1];
+    /* Next node in linked list. Alternating index for each iteration. */
+    struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int    daitch_mokotoff_coding(char *word, StringInfo soundex);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string;
+    StringInfoData soundex;
+    text       *retval;
+    MemoryContext old_ctx,
+                tmp_ctx;
+
+    tmp_ctx = AllocSetContextCreate(CurrentMemoryContext,
+                                    "daitch_mokotoff temporary context",
+                                    ALLOCSET_DEFAULT_SIZES);
+    old_ctx = MemoryContextSwitchTo(tmp_ctx);
+
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    initStringInfo(&soundex);
+
+    if (!daitch_mokotoff_coding(string, &soundex))
+    {
+        /* No encodable characters in input. */
+        MemoryContextSwitchTo(old_ctx);
+        MemoryContextDelete(tmp_ctx);
+        PG_RETURN_NULL();
+    }
+
+    string = pg_any_to_server(soundex.data, soundex.len, PG_UTF8);
+    MemoryContextSwitchTo(old_ctx);
+    retval = cstring_to_text(string);
+    MemoryContextDelete(tmp_ctx);
+
+    PG_RETURN_TEXT_P(retval);
+}
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .children = {NULL},
+    .next = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static const dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+
+        if (first_node[ix_node] == NULL)
+        {
+            first_node[ix_node] = node;
+        }
+        else
+        {
+            last_node[ix_node]->next[ix_node] = node;
+        }
+
+        last_node[ix_node] = node;
+        node->next[ix_node] = NULL;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node *
+find_or_create_child_node(dm_node * parent, char code_digit, StringInfo soundex)
+{
+    dm_node   **nodes;
+    dm_node    *node;
+    int            i;
+
+    for (nodes = parent->children, i = 0; (node = nodes[i]); i++)
+    {
+        if (node->code_digit == code_digit)
+        {
+            /* Found existing child node. Skip completed nodes. */
+            return node->soundex_length < DM_CODE_DIGITS ? node : NULL;
+        }
+    }
+
+    /* Create new child node. */
+    Assert(i < DM_MAX_ALTERNATE_CODES);
+    node = palloc(sizeof(dm_node));
+    nodes[i] = node;
+
+    *node = start_node;
+    memcpy(node->soundex, parent->soundex, sizeof(parent->soundex));
+    node->soundex_length = parent->soundex_length;
+    node->soundex[node->soundex_length++] = code_digit;
+    node->code_digit = code_digit;
+    node->next_code_index = node->prev_code_index;
+
+    if (node->soundex_length < DM_CODE_DIGITS)
+    {
+        return node;
+    }
+    else
+    {
+        /* Append completed soundex code to soundex string. */
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+        return NULL;
+    }
+}
+
+
+/* Update node for next code digit(s). */
+static void
+update_node(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node,
+            int letter_no, int prev_code_index, int next_code_index,
+            const char *next_code_digits, int digit_no,
+            StringInfo soundex)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_child_node(node, next_code_digit, soundex);
+        if (node)
+        {
+            initialize_node(node, letter_no);
+            dirty_nodes[num_dirty_nodes++] = node;
+        }
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(first_node, last_node, dirty_nodes[i], ix_node,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no,
+                        soundex);
+        }
+        else
+        {
+            /* Add incomplete leaf node to linked list. */
+            set_leaf(first_node, last_node, dirty_nodes[i], ix_node);
+        }
+    }
+}
+
+
+/* Update soundex tree leaf nodes. */
+static void
+update_leaves(dm_node * first_node[2], int *ix_node, int letter_no,
+              const dm_codes * codes, const dm_codes * next_codes,
+              StringInfo soundex)
+{
+    int            i,
+                j,
+                code_index;
+    dm_node    *node,
+               *last_node[2];
+    const        dm_code *code,
+               *next_code;
+    int            ix_node_next = (*ix_node + 1) & 1;    /* Alternating index: 0, 1 */
+
+    /* Initialize for new linked list of leaves. */
+    first_node[ix_node_next] = NULL;
+    last_node[ix_node_next] = NULL;
+
+    /* Process all nodes. */
+    for (node = first_node[*ix_node]; node; node = node->next[*ix_node])
+    {
+        /* One or two alternate code sequences. */
+        for (i = 0; i < 2 && (code = codes[i]) && code[0][0]; i++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (j = 0; j < 2 && (next_code = next_codes[j]) && next_code[0][0]; j++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                update_node(first_node, last_node, node, ix_node_next,
+                            letter_no, prev_code_index, code_index,
+                            code[code_index], 0,
+                            soundex);
+            }
+        }
+    }
+
+    *ix_node = ix_node_next;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬
®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !                             ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(unsigned char *str, int *ix)
+{
+    /* Substitute character for skipped code points. */
+    const char    na = '\x1a';
+    pg_wchar    c;
+
+    /* Decode UTF-8 character to ISO 10646 code point. */
+    str += *ix;
+    c = utf8_to_unicode(str);
+    *ix += pg_utf_mblen(str);
+
+    if (c >= (unsigned char) '[' && c <= (unsigned char) ']')
+    {
+        /* ASCII characters [, \, and ] are reserved for Ą, Ę, and Ţ/Ț. */
+        return na;
+    }
+    else if (c < 0x60)
+    {
+        /* Non-lowercase ASCII character. */
+        return c;
+    }
+    else if (c < 0x100)
+    {
+        /* ISO-8859-1 code point, converted to upper-case ASCII. */
+        return iso8859_1_to_ascii_upper[c - 0x60];
+    }
+    else
+    {
+        /* Conversion of non-ASCII characters in the coding chart. */
+        switch (c)
+        {
+            case 0x0104:
+            case 0x0105:
+                /* Ą/ą */
+                return '[';
+            case 0x0118:
+            case 0x0119:
+                /* Ę/ę */
+                return '\\';
+            case 0x0162:
+            case 0x0163:
+            case 0x021A:
+            case 0x021B:
+                /* Ţ/ţ or Ț/ț */
+                return ']';
+            default:
+                return na;
+        }
+    }
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char((unsigned char *) str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static const dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    const        dm_letter *letters;
+    const        dm_codes *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static int
+daitch_mokotoff_coding(char *word, StringInfo soundex)
+{
+    int            i = 0;
+    int            letter_no = 0;
+    int            ix_node = 0;
+    const        dm_codes *codes,
+               *next_codes;
+    dm_node    *first_node[2],
+               *node;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return 0;
+    }
+
+    /* Starting point. */
+    first_node[ix_node] = palloc(sizeof(dm_node));
+    *first_node[ix_node] = start_node;
+
+    /*
+     * Loop until either the word input is exhausted, or all generated soundex
+     * codes are completed to six digits.
+     */
+    while (codes && first_node[ix_node])
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        update_leaves(first_node, &ix_node, letter_no,
+                      codes, next_codes ? next_codes : end_codes,
+                      soundex);
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Append all remaining (incomplete) soundex codes. */
+    for (node = first_node[ix_node]; node; node = node->next[ix_node])
+    {
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string at the final space. */
+    soundex->len--;
+    soundex->data[soundex->len] = '\0';
+
+    return 1;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..807b5fb8c5
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,260 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+die "Usage: $0 OUTPUT_FILE\n" if @ARGV != 1;
+my $output_file = $ARGV[0];
+
+# Open the output file
+open my $OUTPUT, '>', $output_file
+  or die "Could not open output file $output_file: $!\n";
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character ("start of a name") can never yield more than two alternate codes,
+    # and is not considered here.
+    if (@codes > 1) {
+        for my $j (1..2) {  # Codes for "before a vowel" and "any other"
+            for my $i (0..1) {  # Alternate codes
+                # Identical code digits for adjacent letters are collapsed.
+                # For each possible non-transition due to code digit
+                # collapsing, find all alternate transitions.
+                my ($present, $next) = ($codes[$i][$j], $codes[($i + 1)%2][$j]);
+                next if length($present) != 1;
+                $next = $present ne substr($next, 0, 1) ? substr($next, 0, 1) : substr($next, -1, 1);
+                $alternates{$present}{$next} = 1;
+            }
+        }
+    }
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Add alternates by following transitions to 'X' (not coded).
+my $alt_x = $alternates{"X"};
+delete $alt_x->{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+
+# Find the maximum number of alternate codes in one position.
+# Add two for any additional final code digit transitions.
+my $max_alt = (sort { $b <=> $a } (map { scalar keys %$_ } values %alternates))[0] + 2;
+
+print $OUTPUT <<EOF;
+/*
+ * Constants and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#define DM_MAX_ALTERNATE_CODES $max_alt
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print $OUTPUT "static const dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print $OUTPUT <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print $OUTPUT "static const dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print $OUTPUT "$_,\n";
+    }
+    print $OUTPUT "\t{\n\t\t'\\0'\n\t}\n";
+    print $OUTPUT "};\n";
+}
+
+hash2code($table, '');
+
+close $OUTPUT;
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# The conversion from the coding chart to the table should be self
+# explanatory, but note the differences stated below.
+#
+# X = NC (not coded)
+#
+# The non-ASCII letters in the coding chart are coded with substitute
+# lowercase ASCII letters, which sort after the uppercase ASCII letters:
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" does not correspond to the coding chart; however,
+# it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+#
+# Note that the implementation assumes that vowels are assigned code
+# 0 or 1. "J" can be either a vowel or a consonant.
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..f62ddad4ee 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..32d8260383
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/meson.build b/contrib/fuzzystrmatch/meson.build
index e6d06149ce..6b4a13694f 100644
--- a/contrib/fuzzystrmatch/meson.build
+++ b/contrib/fuzzystrmatch/meson.build
@@ -1,7 +1,16 @@
 fuzzystrmatch_sources = files(
-  'fuzzystrmatch.c',
+  'daitch_mokotoff.c',
   'dmetaphone.c',
+  'fuzzystrmatch.c',
+)
+
+daitch_mokotoff_h = custom_target('daitch_mokotoff',
+  input: 'daitch_mokotoff_header.pl',
+  output: 'daitch_mokotoff.h',
+  command: [perl, '@INPUT@', '@OUTPUT@'],
 )
+generated_sources += daitch_mokotoff_h
+fuzzystrmatch_sources += daitch_mokotoff_h

 if host_system == 'windows'
   fuzzystrmatch_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
@@ -19,6 +28,7 @@ install_data(
   'fuzzystrmatch.control',
   'fuzzystrmatch--1.0--1.1.sql',
   'fuzzystrmatch--1.1.sql',
+  'fuzzystrmatch--1.1--1.2.sql',
   kwargs: contrib_data_args,
 )

@@ -29,6 +39,7 @@ tests += {
   'regress': {
     'sql': [
       'fuzzystrmatch',
+      'fuzzystrmatch_utf8',
     ],
   },
 }
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   Any alternative soundex codes are separated by space, which makes the returned
+   text suited for use in Full Text Search, see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>
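
As a reading aid for the patch above: for every letter, update_leaves() picks one of the three columns of the coding chart (start of name, before a vowel, any other). The standalone C sketch below isolates only that choice. It is not part of the patch. The names DM_START_OF_NAME, DM_BEFORE_VOWEL, DM_ANY_OTHER, starts_with_vowel_code and select_code_index are hypothetical, and the sketch relies on the assumption stated in daitch_mokotoff_header.pl that vowels are the letters assigned code 0 or 1.

```c
#include <assert.h>

/* Hypothetical names for the three columns of the coding chart. */
enum
{
    DM_START_OF_NAME = 0,
    DM_BEFORE_VOWEL = 1,
    DM_ANY_OTHER = 2
};

/*
 * Assumption taken from the patch: a letter is treated as a vowel when the
 * first digit of its start-of-name code is '0' or '1'.  Digits '2'..'9' and
 * 'X' (not coded) all compare greater than '1' in ASCII.
 */
static int
starts_with_vowel_code(const char *start_code)
{
    /* An empty code marks "no alternate" in the patch and is never tested. */
    assert(start_code[0] != '\0');
    return start_code[0] <= '1';
}

/*
 * Choose the chart column for the current letter, given its position in the
 * name and the start-of-name code of the letter that follows it.
 */
static int
select_code_index(int letter_no, const char *next_start_code)
{
    if (letter_no == 0)
        return DM_START_OF_NAME;    /* first letter of the name */
    if (starts_with_vowel_code(next_start_code))
        return DM_BEFORE_VOWEL;     /* the next letter is a vowel */
    return DM_ANY_OTHER;            /* all other cases */
}
```

For example, select_code_index(0, "5") picks the start-of-name column regardless of what follows, while select_code_index(2, "0") picks the before-a-vowel column.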

Re: daitch_mokotoff module

From
Dag Lem
Date:
I noticed that the Meson builds failed in Cfbot; the updated patch adds
a missing "include_directories" line to meson.build.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+    rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..d4ad95c283
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,596 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE";
+ *   these are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+#define DM_CODE_DIGITS 6
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    const struct dm_letter *letters;    /* List of possible successive letters */
+    const        dm_codes *codes;    /* Code sequence(s) for complete sequence */
+};
+
+typedef struct dm_letter dm_letter;
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *children[DM_MAX_ALTERNATE_CODES + 1];
+    /* Next node in linked list. Alternating index for each iteration. */
+    struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int    daitch_mokotoff_coding(char *word, StringInfo soundex);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string;
+    StringInfoData soundex;
+    text       *retval;
+    MemoryContext old_ctx,
+                tmp_ctx;
+
+    tmp_ctx = AllocSetContextCreate(CurrentMemoryContext,
+                                    "daitch_mokotoff temporary context",
+                                    ALLOCSET_DEFAULT_SIZES);
+    old_ctx = MemoryContextSwitchTo(tmp_ctx);
+
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    initStringInfo(&soundex);
+
+    if (!daitch_mokotoff_coding(string, &soundex))
+    {
+        /* No encodable characters in input. */
+        MemoryContextSwitchTo(old_ctx);
+        MemoryContextDelete(tmp_ctx);
+        PG_RETURN_NULL();
+    }
+
+    string = pg_any_to_server(soundex.data, soundex.len, PG_UTF8);
+    MemoryContextSwitchTo(old_ctx);
+    retval = cstring_to_text(string);
+    MemoryContextDelete(tmp_ctx);
+
+    PG_RETURN_TEXT_P(retval);
+}
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .children = {NULL},
+    .next = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static const dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+
+        if (first_node[ix_node] == NULL)
+        {
+            first_node[ix_node] = node;
+        }
+        else
+        {
+            last_node[ix_node]->next[ix_node] = node;
+        }
+
+        last_node[ix_node] = node;
+        node->next[ix_node] = NULL;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node *
+find_or_create_child_node(dm_node * parent, char code_digit, StringInfo soundex)
+{
+    dm_node   **nodes;
+    dm_node    *node;
+    int            i;
+
+    for (nodes = parent->children, i = 0; (node = nodes[i]); i++)
+    {
+        if (node->code_digit == code_digit)
+        {
+            /* Found existing child node. Skip completed nodes. */
+            return node->soundex_length < DM_CODE_DIGITS ? node : NULL;
+        }
+    }
+
+    /* Create new child node. */
+    Assert(i < DM_MAX_ALTERNATE_CODES);
+    node = palloc(sizeof(dm_node));
+    nodes[i] = node;
+
+    *node = start_node;
+    memcpy(node->soundex, parent->soundex, sizeof(parent->soundex));
+    node->soundex_length = parent->soundex_length;
+    node->soundex[node->soundex_length++] = code_digit;
+    node->code_digit = code_digit;
+    node->next_code_index = node->prev_code_index;
+
+    if (node->soundex_length < DM_CODE_DIGITS)
+    {
+        return node;
+    }
+    else
+    {
+        /* Append completed soundex code to soundex string. */
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+        return NULL;
+    }
+}
+
+
+/* Update node for next code digit(s). */
+static void
+update_node(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node,
+            int letter_no, int prev_code_index, int next_code_index,
+            const char *next_code_digits, int digit_no,
+            StringInfo soundex)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_child_node(node, next_code_digit, soundex);
+        if (node)
+        {
+            initialize_node(node, letter_no);
+            dirty_nodes[num_dirty_nodes++] = node;
+        }
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(first_node, last_node, dirty_nodes[i], ix_node,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no,
+                        soundex);
+        }
+        else
+        {
+            /* Add incomplete leaf node to linked list. */
+            set_leaf(first_node, last_node, dirty_nodes[i], ix_node);
+        }
+    }
+}
+
+
+/* Update soundex tree leaf nodes. */
+static void
+update_leaves(dm_node * first_node[2], int *ix_node, int letter_no,
+              const dm_codes * codes, const dm_codes * next_codes,
+              StringInfo soundex)
+{
+    int            i,
+                j,
+                code_index;
+    dm_node    *node,
+               *last_node[2];
+    const        dm_code *code,
+               *next_code;
+    int            ix_node_next = (*ix_node + 1) & 1;    /* Alternating index: 0, 1 */
+
+    /* Initialize for new linked list of leaves. */
+    first_node[ix_node_next] = NULL;
+    last_node[ix_node_next] = NULL;
+
+    /* Process all nodes. */
+    for (node = first_node[*ix_node]; node; node = node->next[*ix_node])
+    {
+        /* One or two alternate code sequences. */
+        for (i = 0; i < 2 && (code = codes[i]) && code[0][0]; i++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (j = 0; j < 2 && (next_code = next_codes[j]) && next_code[0][0]; j++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                update_node(first_node, last_node, node, ix_node_next,
+                            letter_no, prev_code_index, code_index,
+                            code[code_index], 0,
+                            soundex);
+            }
+        }
+    }
+
+    *ix_node = ix_node_next;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !                             ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(unsigned char *str, int *ix)
+{
+    /* Substitute character for skipped code points. */
+    const char    na = '\x1a';
+    pg_wchar    c;
+
+    /* Decode UTF-8 character to ISO 10646 code point. */
+    str += *ix;
+    c = utf8_to_unicode(str);
+    *ix += pg_utf_mblen(str);
+
+    if (c >= (unsigned char) '[' && c <= (unsigned char) ']')
+    {
+        /* ASCII characters [, \, and ] are reserved for Ą, Ę, and Ţ/Ț. */
+        return na;
+    }
+    else if (c < 0x60)
+    {
+        /* Non-lowercase ASCII character. */
+        return c;
+    }
+    else if (c < 0x100)
+    {
+        /* ISO-8859-1 code point, converted to upper-case ASCII. */
+        return iso8859_1_to_ascii_upper[c - 0x60];
+    }
+    else
+    {
+        /* Conversion of non-ASCII characters in the coding chart. */
+        switch (c)
+        {
+            case 0x0104:
+            case 0x0105:
+                /* Ą/ą */
+                return '[';
+            case 0x0118:
+            case 0x0119:
+                /* Ę/ę */
+                return '\\';
+            case 0x0162:
+            case 0x0163:
+            case 0x021A:
+            case 0x021B:
+                /* Ţ/ţ or Ț/ț */
+                return ']';
+            default:
+                return na;
+        }
+    }
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char((unsigned char *) str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static const dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    const        dm_letter *letters;
+    const        dm_codes *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static int
+daitch_mokotoff_coding(char *word, StringInfo soundex)
+{
+    int            i = 0;
+    int            letter_no = 0;
+    int            ix_node = 0;
+    const        dm_codes *codes,
+               *next_codes;
+    dm_node    *first_node[2],
+               *node;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return 0;
+    }
+
+    /* Starting point. */
+    first_node[ix_node] = palloc(sizeof(dm_node));
+    *first_node[ix_node] = start_node;
+
+    /*
+     * Loop until either the word input is exhausted, or all generated soundex
+     * codes are completed to six digits.
+     */
+    while (codes && first_node[ix_node])
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        update_leaves(first_node, &ix_node, letter_no,
+                      codes, next_codes ? next_codes : end_codes,
+                      soundex);
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Append all remaining (incomplete) soundex codes. */
+    for (node = first_node[ix_node]; node; node = node->next[ix_node])
+    {
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string at the final space. */
+    soundex->len--;
+    soundex->data[soundex->len] = '\0';
+
+    return 1;
+}
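
A note on the adjacent-digit rule that the header comment of daitch_mokotoff.c contrasts with dmsoundex.php ("BESST" vs. "BEST"): the following is a minimal standalone sketch, not code from the patch. It assumes that one code string per letter sequence has already been chosen (so there is no branching into alternatives), and the names sketch_collapse_and_pad and SKETCH_CODE_DIGITS are hypothetical. It only illustrates that the first digit of a code is dropped when it repeats the last digit of the previous letter's code, that "X" (not coded) separates repeats, and that the result is padded with "0".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CODE_DIGITS 6    /* same width as DM_CODE_DIGITS in the patch */

/*
 * Hypothetical helper, not part of the patch: build one six-digit code from
 * the per-letter code strings of a single alternative, e.g. {"7","X","4","43"}.
 */
void
sketch_collapse_and_pad(const char *const *codes, size_t ncodes,
                        char out[SKETCH_CODE_DIGITS + 1])
{
    size_t      len = 0;
    char        prev = '\0';    /* last digit of the previous letter's code */

    assert(codes != NULL && out != NULL);

    /* Start from the zero-padded template. */
    memset(out, '0', SKETCH_CODE_DIGITS);
    out[SKETCH_CODE_DIGITS] = '\0';

    for (size_t i = 0; i < ncodes && len < SKETCH_CODE_DIGITS; i++)
    {
        const char *code = codes[i];

        if (code[0] == '\0')
            continue;           /* ignore empty entries */

        for (size_t d = 0; code[d] != '\0' && len < SKETCH_CODE_DIGITS; d++)
        {
            char        digit = code[d];

            if (digit == 'X')
                continue;       /* not coded */
            if (d == 0 && digit == prev)
                continue;       /* repeats the previous digit: collapsed */
            out[len++] = digit;
        }

        /* An 'X' here means the next letter may repeat an earlier digit. */
        prev = code[strlen(code) - 1];
    }
}
```

With the chart codes for "BESST" (B = 7, E = X, S = 4, ST = 43) and for "BEST" (B = 7, E = X, ST = 43), both inputs give 743000, which is the value the regression test in the patch expects for 'BESST'.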
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..807b5fb8c5
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,260 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+die "Usage: $0 OUTPUT_FILE\n" if @ARGV != 1;
+my $output_file = $ARGV[0];
+
+# Open the output file
+open my $OUTPUT, '>', $output_file
+  or die "Could not open output file $output_file: $!\n";
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character ("start of a name") can never yield more than two alternate codes,
+    # and is not considered here.
+    if (@codes > 1) {
+        for my $j (1..2) {  # Codes for "before a vowel" and "any other"
+            for my $i (0..1) {  # Alternate codes
+                # Identical code digits for adjacent letters are collapsed.
+                # For each possible non-transition due to code digit
+                # collapsing, find all alternate transitions.
+                my ($present, $next) = ($codes[$i][$j], $codes[($i + 1)%2][$j]);
+                next if length($present) != 1;
+                $next = $present ne substr($next, 0, 1) ? substr($next, 0, 1) : substr($next, -1, 1);
+                $alternates{$present}{$next} = 1;
+            }
+        }
+    }
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Add alternates by following transitions to 'X' (not coded).
+my $alt_x = $alternates{"X"};
+delete $alt_x->{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+
+# Find the maximum number of alternate codes in one position.
+# Add two for any additional final code digit transitions.
+my $max_alt = (sort { $b <=> $a } (map { scalar keys %$_ } values %alternates))[0] + 2;
+
+print $OUTPUT <<EOF;
+/*
+ * Constants and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#define DM_MAX_ALTERNATE_CODES $max_alt
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print $OUTPUT "static const dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print $OUTPUT <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print $OUTPUT "static const dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print $OUTPUT "$_,\n";
+    }
+    print $OUTPUT "\t{\n\t\t'\\0'\n\t}\n";
+    print $OUTPUT "};\n";
+}
+
+hash2code($table, '');
+
+close $OUTPUT;
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# The conversion from the coding chart to the table should be
+# self-explanatory, but note the differences stated below.
+#
+# X = NC (not coded)
+#
+# The non-ASCII letters in the coding chart are coded with substitute
+# lowercase ASCII letters, which sort after the uppercase ASCII letters:
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" does not correspond to the coding chart, however
+# it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+#
+# Note that the implementation assumes that vowels are assigned code
+# 0 or 1. "J" can be either a vowel or a consonant.
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
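
For readers of the table above: each row lists three codes, for "start of a name", "before a vowel" and "any other" position, and the note in the script says that vowels are assumed to have code 0 or 1. The selection can be sketched as below. This is a simplified illustration only; sketch_row and sketch_pick_code are hypothetical names, and alternate codes (rows containing "|") are ignored here, whereas the patch handles them in update_leaves().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of one table row: start of name, before a vowel, other. */
typedef struct sketch_row
{
    const char *codes[3];
} sketch_row;

/*
 * Pick the code for the letter sequence at position letter_no (0 = first),
 * given the row of the following letter sequence, or NULL at end of input.
 */
const char *
sketch_pick_code(const sketch_row *row, int letter_no, const sketch_row *next)
{
    int         index;

    assert(row != NULL);

    if (letter_no == 0)
        index = 0;              /* start of a name */
    else if (next != NULL && next->codes[0][0] <= '1')
        index = 1;              /* next letter is a vowel (code 0 or 1) */
    else
        index = 2;              /* any other position, including the end */

    return row->codes[index];
}
```

For "Breuer", the rows B, R, EU, E, R select 7, 9, 1, X and 9 in this way, which is consistent with the expected result 791900 in the regression test.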
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..f62ddad4ee 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
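
Since the function returns all alternative codes in one space-separated string, a caller comparing two names has to check whether any code is shared. A minimal sketch of such a check in C follows; it assumes the output format shown in the expected results above (six-digit codes joined by single spaces), and sketch_codes_overlap is a hypothetical helper that is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CODE_DIGITS 6    /* digits per code, as in the patch */

/*
 * Return true if the two space-separated code lists have at least one
 * six-digit code in common.
 */
bool
sketch_codes_overlap(const char *a, const char *b)
{
    size_t      la,
                lb;

    assert(a != NULL && b != NULL);

    la = strlen(a);
    lb = strlen(b);

    /* Each code occupies six digits plus one separating space. */
    for (size_t i = 0; i + SKETCH_CODE_DIGITS <= la; i += SKETCH_CODE_DIGITS + 1)
    {
        for (size_t j = 0; j + SKETCH_CODE_DIGITS <= lb; j += SKETCH_CODE_DIGITS + 1)
        {
            if (memcmp(a + i, b + j, SKETCH_CODE_DIGITS) == 0)
                return true;
        }
    }

    return false;
}
```

Using the expected results above, "567000 467000" (Ceniow) and "467000" (Tsenyuv) overlap, while "798600" (Berlin) and "793000" (Freud) do not.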
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..32d8260383
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
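
The special characters tested above work because read_char() maps them onto the ASCII characters '[', '\' and ']', which directly follow 'Z' and therefore fit in the same lookup table. A reduced sketch of that mapping is given below. It is an illustration under assumptions: sketch_fold_code_point and SKETCH_NA are hypothetical names, the input is an already decoded Unicode code point, and the ISO-8859-1 accent folding table from the patch is left out, so a character such as 'ü' is simply skipped here rather than folded to 'U'.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NA '\x1a'        /* substitute for skipped code points */

/* Fold a Unicode code point to the uppercase ASCII range used for lookup. */
char
sketch_fold_code_point(uint32_t c)
{
    /* '[', '\' and ']' are reserved for the special letters below. */
    if (c >= (uint32_t) '[' && c <= (uint32_t) ']')
        return SKETCH_NA;

    /* Lowercase ASCII letters become uppercase. */
    if (c >= (uint32_t) 'a' && c <= (uint32_t) 'z')
        return (char) (c - (uint32_t) 'a' + (uint32_t) 'A');

    /* Any other ASCII character is passed through unchanged. */
    if (c < 0x80)
        return (char) c;

    switch (c)
    {
        case 0x0104:
        case 0x0105:
            return '[';         /* A with ogonek, upper/lower case */
        case 0x0118:
        case 0x0119:
            return '\\';        /* E with ogonek, upper/lower case */
        case 0x0162:
        case 0x0163:
        case 0x021A:
        case 0x021B:
            return ']';         /* T with cedilla or comma below */
        default:
            return SKETCH_NA;   /* accent folding omitted in this sketch */
    }
}
```

Characters outside 'A' to ']' are then skipped by the caller, as read_valid_char() does in the patch.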
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/meson.build b/contrib/fuzzystrmatch/meson.build
index e6d06149ce..73178794c2 100644
--- a/contrib/fuzzystrmatch/meson.build
+++ b/contrib/fuzzystrmatch/meson.build
@@ -1,7 +1,16 @@
 fuzzystrmatch_sources = files(
-  'fuzzystrmatch.c',
+  'daitch_mokotoff.c',
   'dmetaphone.c',
+  'fuzzystrmatch.c',
+)
+
+daitch_mokotoff_h = custom_target('daitch_mokotoff',
+  input: 'daitch_mokotoff_header.pl',
+  output: 'daitch_mokotoff.h',
+  command: [perl, '@INPUT@', '@OUTPUT@'],
 )
+generated_sources += daitch_mokotoff_h
+fuzzystrmatch_sources += daitch_mokotoff_h

 if host_system == 'windows'
   fuzzystrmatch_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
@@ -11,6 +20,7 @@ endif

 fuzzystrmatch = shared_module('fuzzystrmatch',
   fuzzystrmatch_sources,
+  include_directories: include_directories('.'),
   kwargs: contrib_mod_args,
 )
 contrib_targets += fuzzystrmatch
@@ -19,6 +29,7 @@ install_data(
   'fuzzystrmatch.control',
   'fuzzystrmatch--1.0--1.1.sql',
   'fuzzystrmatch--1.1.sql',
+  'fuzzystrmatch--1.1--1.2.sql',
   kwargs: contrib_data_args,
 )

@@ -29,6 +40,7 @@ tests += {
   'regress': {
     'sql': [
       'fuzzystrmatch',
+      'fuzzystrmatch_utf8',
     ],
   },
 }
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   Any alternative soundex codes are separated by spaces, which makes the returned
+   text suited for use in Full Text Search, see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>
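
A note on consuming the result: the documentation above says that alternative soundex codes are separated by space. Outside of full text search, two results could be compared by checking whether they share at least one code. The sketch below assumes that matching rule; dm_codes_overlap() is a hypothetical helper and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: returns true if the two
 * space-separated lists of soundex codes have at least one code in common.
 */
static bool
dm_codes_overlap(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a)
    {
        /* Length of the current code in list a. */
        size_t      alen = strcspn(a, " ");

        for (const char *p = b; *p;)
        {
            /* Length of the current code in list b. */
            size_t      plen = strcspn(p, " ");

            if (alen > 0 && alen == plen && memcmp(a, p, alen) == 0)
                return true;

            /* Advance past this code and any separating spaces. */
            p += plen;
            p += strspn(p, " ");
        }

        a += alen;
        a += strspn(a, " ");
    }

    return false;
}
```

For example, "364000 464000" (the expected output for 'ţamas') overlaps with "464000", but not with "564000 540000" (the expected output for 'gąszczu').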

Re: daitch_mokotoff module

From
Dag Lem
Date:
Dag Lem <dag@nimrod.no> writes:

> I noticed that the Meson builds failed in Cfbot, the updated patch adds
> a missing "include_directories" line to meson.build.
>

This should hopefully fix the last Cfbot failures by excluding
daitch_mokotoff.h from headerscheck and cpluspluscheck.

Best regards

Dag Lem
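
For readers of the patch: the header comment in daitch_mokotoff.c notes that "BESST" must yield 743000, the same as "BEST", because identical code digits for adjacent letters are collapsed digit by digit. A minimal standalone sketch of that rule follows. It is not the patch's tree-based code; dm_join() is a hypothetical helper, and the per-letter codes (B = 7, E = not coded, S = 4, ST = 43) are written out by hand as an assumption rather than looked up in the coding chart.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DM_CODE_DIGITS 6

/*
 * Hypothetical helper, not part of the patch: join per-letter code strings
 * into one soundex code padded with '0' to six digits. An 'X' marks a
 * letter that is not coded.
 */
static void
dm_join(const char *const *codes, size_t ncodes, char out[DM_CODE_DIGITS + 1])
{
    size_t      len = 0;
    char        last = '\0';    /* Last code digit seen, or 'X' */

    assert(codes != NULL && out != NULL);

    /* Start from the all-zero template so that short codes are padded. */
    memset(out, '0', DM_CODE_DIGITS);
    out[DM_CODE_DIGITS] = '\0';

    for (size_t i = 0; i < ncodes && len < DM_CODE_DIGITS; i++)
    {
        for (size_t k = 0; codes[i][k] && len < DM_CODE_DIGITS; k++)
        {
            char        d = codes[i][k];

            /*
             * Skip letters that are not coded, and skip the first digit of
             * a code when it repeats the last digit of the previous letter.
             */
            if (d == 'X' || (k == 0 && d == last))
            {
                last = d;
                continue;
            }

            out[len++] = d;
            last = d;
        }
    }
}
```

With the codes {"7", "X", "4", "43"} for BESST and {"7", "X", "43"} for BEST, both calls fill the buffer with 743000. Comparing whole codes ("4" against "43") instead of single digits is what would give 744300.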

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+    rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..d4ad95c283
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,596 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   which are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+#define DM_CODE_DIGITS 6
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    const struct dm_letter *letters;    /* List of possible successive letters */
+    const        dm_codes *codes;    /* Code sequence(s) for complete sequence */
+};
+
+typedef struct dm_letter dm_letter;
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *children[DM_MAX_ALTERNATE_CODES + 1];
+    /* Next node in linked list. Alternating index for each iteration. */
+    struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int    daitch_mokotoff_coding(char *word, StringInfo soundex);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string;
+    StringInfoData soundex;
+    text       *retval;
+    MemoryContext old_ctx,
+                tmp_ctx;
+
+    tmp_ctx = AllocSetContextCreate(CurrentMemoryContext,
+                                    "daitch_mokotoff temporary context",
+                                    ALLOCSET_DEFAULT_SIZES);
+    old_ctx = MemoryContextSwitchTo(tmp_ctx);
+
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    initStringInfo(&soundex);
+
+    if (!daitch_mokotoff_coding(string, &soundex))
+    {
+        /* No encodable characters in input. */
+        MemoryContextSwitchTo(old_ctx);
+        MemoryContextDelete(tmp_ctx);
+        PG_RETURN_NULL();
+    }
+
+    string = pg_any_to_server(soundex.data, soundex.len, PG_UTF8);
+    MemoryContextSwitchTo(old_ctx);
+    retval = cstring_to_text(string);
+    MemoryContextDelete(tmp_ctx);
+
+    PG_RETURN_TEXT_P(retval);
+}
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .children = {NULL},
+    .next = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static const dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+
+        if (first_node[ix_node] == NULL)
+        {
+            first_node[ix_node] = node;
+        }
+        else
+        {
+            last_node[ix_node]->next[ix_node] = node;
+        }
+
+        last_node[ix_node] = node;
+        node->next[ix_node] = NULL;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node *
+find_or_create_child_node(dm_node * parent, char code_digit, StringInfo soundex)
+{
+    dm_node   **nodes;
+    dm_node    *node;
+    int            i;
+
+    for (nodes = parent->children, i = 0; (node = nodes[i]); i++)
+    {
+        if (node->code_digit == code_digit)
+        {
+            /* Found existing child node. Skip completed nodes. */
+            return node->soundex_length < DM_CODE_DIGITS ? node : NULL;
+        }
+    }
+
+    /* Create new child node. */
+    Assert(i < DM_MAX_ALTERNATE_CODES);
+    node = palloc(sizeof(dm_node));
+    nodes[i] = node;
+
+    *node = start_node;
+    memcpy(node->soundex, parent->soundex, sizeof(parent->soundex));
+    node->soundex_length = parent->soundex_length;
+    node->soundex[node->soundex_length++] = code_digit;
+    node->code_digit = code_digit;
+    node->next_code_index = node->prev_code_index;
+
+    if (node->soundex_length < DM_CODE_DIGITS)
+    {
+        return node;
+    }
+    else
+    {
+        /* Append completed soundex code to soundex string. */
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+        return NULL;
+    }
+}
+
+
+/* Update node for next code digit(s). */
+static void
+update_node(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node,
+            int letter_no, int prev_code_index, int next_code_index,
+            const char *next_code_digits, int digit_no,
+            StringInfo soundex)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_child_node(node, next_code_digit, soundex);
+        if (node)
+        {
+            initialize_node(node, letter_no);
+            dirty_nodes[num_dirty_nodes++] = node;
+        }
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(first_node, last_node, dirty_nodes[i], ix_node,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no,
+                        soundex);
+        }
+        else
+        {
+            /* Add incomplete leaf node to linked list. */
+            set_leaf(first_node, last_node, dirty_nodes[i], ix_node);
+        }
+    }
+}
+
+
+/* Update soundex tree leaf nodes. */
+static void
+update_leaves(dm_node * first_node[2], int *ix_node, int letter_no,
+              const dm_codes * codes, const dm_codes * next_codes,
+              StringInfo soundex)
+{
+    int            i,
+                j,
+                code_index;
+    dm_node    *node,
+               *last_node[2];
+    const        dm_code *code,
+               *next_code;
+    int            ix_node_next = (*ix_node + 1) & 1;    /* Alternating index: 0, 1 */
+
+    /* Initialize for new linked list of leaves. */
+    first_node[ix_node_next] = NULL;
+    last_node[ix_node_next] = NULL;
+
+    /* Process all nodes. */
+    for (node = first_node[*ix_node]; node; node = node->next[*ix_node])
+    {
+        /* One or two alternate code sequences. */
+        for (i = 0; i < 2 && (code = codes[i]) && code[0][0]; i++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (j = 0; j < 2 && (next_code = next_codes[j]) && next_code[0][0]; j++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                update_node(first_node, last_node, node, ix_node_next,
+                            letter_no, prev_code_index, code_index,
+                            code[code_index], 0,
+                            soundex);
+            }
+        }
+    }
+
+    *ix_node = ix_node_next;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !                             ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(unsigned char *str, int *ix)
+{
+    /* Substitute character for skipped code points. */
+    const char    na = '\x1a';
+    pg_wchar    c;
+
+    /* Decode UTF-8 character to ISO 10646 code point. */
+    str += *ix;
+    c = utf8_to_unicode(str);
+    *ix += pg_utf_mblen(str);
+
+    if (c >= (unsigned char) '[' && c <= (unsigned char) ']')
+    {
+        /* ASCII characters [, \, and ] are reserved for Ą, Ę, and Ţ/Ț. */
+        return na;
+    }
+    else if (c < 0x60)
+    {
+        /* Non-lowercase ASCII character. */
+        return c;
+    }
+    else if (c < 0x100)
+    {
+        /* ISO-8859-1 code point, converted to upper-case ASCII. */
+        return iso8859_1_to_ascii_upper[c - 0x60];
+    }
+    else
+    {
+        /* Conversion of non-ASCII characters in the coding chart. */
+        switch (c)
+        {
+            case 0x0104:
+            case 0x0105:
+                /* Ą/ą */
+                return '[';
+            case 0x0118:
+            case 0x0119:
+                /* Ę/ę */
+                return '\\';
+            case 0x0162:
+            case 0x0163:
+            case 0x021A:
+            case 0x021B:
+                /* Ţ/ţ or Ț/ț */
+                return ']';
+            default:
+                return na;
+        }
+    }
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char((unsigned char *) str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static const dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    const        dm_letter *letters;
+    const        dm_codes *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static int
+daitch_mokotoff_coding(char *word, StringInfo soundex)
+{
+    int            i = 0;
+    int            letter_no = 0;
+    int            ix_node = 0;
+    const        dm_codes *codes,
+               *next_codes;
+    dm_node    *first_node[2],
+               *node;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return 0;
+    }
+
+    /* Starting point. */
+    first_node[ix_node] = palloc(sizeof(dm_node));
+    *first_node[ix_node] = start_node;
+
+    /*
+     * Loop until either the word input is exhausted, or all generated soundex
+     * codes are completed to six digits.
+     */
+    while (codes && first_node[ix_node])
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        update_leaves(first_node, &ix_node, letter_no,
+                      codes, next_codes ? next_codes : end_codes,
+                      soundex);
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Append all remaining (incomplete) soundex codes. */
+    for (node = first_node[ix_node]; node; node = node->next[ix_node])
+    {
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string at the final space. */
+    soundex->len--;
+    soundex->data[soundex->len] = '\0';
+
+    return 1;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..807b5fb8c5
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,260 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+die "Usage: $0 OUTPUT_FILE\n" if @ARGV != 1;
+my $output_file = $ARGV[0];
+
+# Open the output file
+open my $OUTPUT, '>', $output_file
+  or die "Could not open output file $output_file: $!\n";
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character ("start of a name") can never yield more than two alternate codes,
+    # and is not considered here.
+    if (@codes > 1) {
+        for my $j (1..2) {  # Codes for "before a vowel" and "any other"
+            for my $i (0..1) {  # Alternate codes
+                # Identical code digits for adjacent letters are collapsed.
+                # For each possible non-transition due to code digit
+                # collapsing, find all alternate transitions.
+                my ($present, $next) = ($codes[$i][$j], $codes[($i + 1)%2][$j]);
+                next if length($present) != 1;
+                $next = $present ne substr($next, 0, 1) ? substr($next, 0, 1) : substr($next, -1, 1);
+                $alternates{$present}{$next} = 1;
+            }
+        }
+    }
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Add alternates by following transitions to 'X' (not coded).
+my $alt_x = $alternates{"X"};
+delete $alt_x->{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+
+# Find the maximum number of alternate codes in one position.
+# Add two for any additional final code digit transitions.
+my $max_alt = (sort { $b <=> $a } (map { scalar keys %$_ } values %alternates))[0] + 2;
+
+print $OUTPUT <<EOF;
+/*
+ * Constants and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#define DM_MAX_ALTERNATE_CODES $max_alt
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print $OUTPUT "static const dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print $OUTPUT <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print $OUTPUT "static const dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print $OUTPUT "$_,\n";
+    }
+    print $OUTPUT "\t{\n\t\t'\\0'\n\t}\n";
+    print $OUTPUT "};\n";
+}
+
+hash2code($table, '');
+
+close $OUTPUT;
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# The conversion from the coding chart to the table should be self
+# explanatory, but note the differences stated below.
+#
+# X = NC (not coded)
+#
+# The non-ASCII letters in the coding chart are coded with substitute
+# lowercase ASCII letters, which sort after the uppercase ASCII letters:
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" does not correspond to the coding chart, however
+# it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+#
+# Note that the implementation assumes that vowels are assigned code
+# 0 or 1. "J" can be either a vowel or a consonant.
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..f62ddad4ee 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..32d8260383
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/meson.build b/contrib/fuzzystrmatch/meson.build
index e6d06149ce..73178794c2 100644
--- a/contrib/fuzzystrmatch/meson.build
+++ b/contrib/fuzzystrmatch/meson.build
@@ -1,7 +1,16 @@
 fuzzystrmatch_sources = files(
-  'fuzzystrmatch.c',
+  'daitch_mokotoff.c',
   'dmetaphone.c',
+  'fuzzystrmatch.c',
+)
+
+daitch_mokotoff_h = custom_target('daitch_mokotoff',
+  input: 'daitch_mokotoff_header.pl',
+  output: 'daitch_mokotoff.h',
+  command: [perl, '@INPUT@', '@OUTPUT@'],
 )
+generated_sources += daitch_mokotoff_h
+fuzzystrmatch_sources += daitch_mokotoff_h

 if host_system == 'windows'
   fuzzystrmatch_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
@@ -11,6 +20,7 @@ endif

 fuzzystrmatch = shared_module('fuzzystrmatch',
   fuzzystrmatch_sources,
+  include_directories: include_directories('.'),
   kwargs: contrib_mod_args,
 )
 contrib_targets += fuzzystrmatch
@@ -19,6 +29,7 @@ install_data(
   'fuzzystrmatch.control',
   'fuzzystrmatch--1.0--1.1.sql',
   'fuzzystrmatch--1.1.sql',
+  'fuzzystrmatch--1.1--1.2.sql',
   kwargs: contrib_data_args,
 )

@@ -29,6 +40,7 @@ tests += {
   'regress': {
     'sql': [
       'fuzzystrmatch',
+      'fuzzystrmatch_utf8',
     ],
   },
 }
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   Any alternative soundex codes are separated by space, which makes the returned
+   text suited for use in Full Text Search, see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index b393f2a2ea..17a37aee85 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -119,6 +119,9 @@ do
     test "$f" = src/include/common/unicode_nonspacing_table.h && continue
     test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue

+    # Also not meant to be included standalone.
+    test "$f" = contrib/fuzzystrmatch/daitch_mokotoff.h && continue
+
     # We can't make these Bison output files compilable standalone
     # without using "%code require", which old Bison versions lack.
     # parser/gram.h will be included by parser/gramparse.h anyway.
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 2a39856f88..24aacd1239 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -114,6 +114,9 @@ do
     test "$f" = src/include/common/unicode_nonspacing_table.h && continue
     test "$f" = src/include/common/unicode_east_asian_fw_table.h && continue

+    # Also not meant to be included standalone.
+    test "$f" = contrib/fuzzystrmatch/daitch_mokotoff.h && continue
+
     # We can't make these Bison output files compilable standalone
     # without using "%code require", which old Bison versions lack.
     # parser/gram.h will be included by parser/gramparse.h anyway.

Re: daitch_mokotoff module

From
Dag Lem
Date:
Dag Lem <dag@nimrod.no> writes:

> Hi Ian,
>
> Ian Lawrence Barwick <barwick@gmail.com> writes:
>

[...]

>> I see you provided some feedback on
>> https://commitfest.postgresql.org/36/3468/,
>> though the patch seems to have not been accepted (but not conclusively
>> rejected either). If you still have the chance to review another patch
>> (or more) it would be much appreciated, as there's quite a few piling up.
>> Things like documentation or small improvements to client applications
>> are always a good place to start. Reviews can be provided at any time,
>> there's no need to wait for the next CommitFest.
>>
>
> OK, I'll try to find another patch to review.
>

I have scanned through all the patches in Commitfest 2023-01 with status
"Needs review", and it is difficult to find something which I can
meaningfully review.

The only thing I felt qualified to comment (or nit-pick?) on was
https://commitfest.postgresql.org/41/4071/

If something else should turn up which could be reviewed by someone
without intimate knowledge of PostgreSQL internals, then don't hesitate
to ask.

As for the Daitch-Mokotoff patch, the review by Andres Freund was very
helpful in order to improve the extension and to make it more idiomatic
- hopefully it is now a bit closer to being included.


Best regards

Dag Lem



Re: daitch_mokotoff module

From
Andres Freund
Date:
On 2022-12-22 14:27:54 +0100, Dag Lem wrote:
> This should hopefully fix the last Cfbot failures, by exclusion of
> daitch_mokotoff.h from headerscheck and cpluspluscheck.

Btw, you can do the same tests as cfbot in your own repo by enabling CI
in a github repo. See src/tools/ci/README



Re: daitch_mokotoff module

From
Alvaro Herrera
Date:
On 2022-Dec-22, Dag Lem wrote:

> This should hopefully fix the last Cfbot failures, by exclusion of
> daitch_mokotoff.h from headerscheck and cpluspluscheck.

Hmm, maybe it'd be better to move the typedefs to the .h file instead.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)



Re: daitch_mokotoff module

From
Alvaro Herrera
Date:
I wonder why do you have it return the multiple alternative codes as a
space-separated string.  Maybe an array would be more appropriate.  Even
on your documented example use, the first thing you do is split it on
spaces.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: daitch_mokotoff module

From
Alvaro Herrera
Date:
On 2022-Dec-23, Alvaro Herrera wrote:

> I wonder why do you have it return the multiple alternative codes as a
> space-separated string.  Maybe an array would be more appropriate.  Even
> on your documented example use, the first thing you do is split it on
> spaces.

I tried downloading a list of surnames from here
https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
pasted that in a text file and \copy'ed it into a table.  Then I ran
this query

select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
from apellidos
group by daitch_mokotoff(a)
order by count(*) desc;

so I have a first entry like this

string_agg      │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
daitch_mokotoff │ 784000
count           │ 18

but then I have a bunch of other entries with the same code 784000 as
alternative codes,

string_agg      │ Velazco
daitch_mokotoff │ 784500 784000
count           │ 1

string_agg      │ Palacio
daitch_mokotoff │ 785000 784000
count           │ 1

I suppose I need to group these together somehow, and it would make more
sense to do that if the values were arrays.
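
As a rough illustration of that grouping rule, the following C sketch treats two names as candidates for the same group when their space-separated code lists share at least one code. The helper names dm_list_contains and dm_codes_overlap are hypothetical, and this is only one possible matching rule, not something the patch implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the token tok[0..len-1] occurs in the space-separated list. */
bool
dm_list_contains(const char *list, const char *tok, size_t len)
{
	const char *p = list;

	while (*p != '\0')
	{
		size_t		n;

		p += strspn(p, " ");	/* skip separators */
		n = strcspn(p, " ");	/* length of the next token */
		if (n > 0 && n == len && strncmp(p, tok, len) == 0)
			return true;
		p += n;
	}
	return false;
}

/*
 * Return true if the two space-separated code lists have any code in
 * common, e.g. "784500 784000" and "784000".
 */
bool
dm_codes_overlap(const char *a, const char *b)
{
	const char *p = a;

	assert(a != NULL && b != NULL);

	while (*p != '\0')
	{
		size_t		n;

		p += strspn(p, " ");
		n = strcspn(p, " ");
		if (n > 0 && dm_list_contains(b, p, n))
			return true;
		p += n;
	}
	return false;
}
```

Under this rule "Velazco" (784500 784000) would land in the same group as the 784000 names above. With an array-valued result the same test could presumably be written in SQL with the array overlap operator instead.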


If I scroll a bit further down and choose, say, 794000 (a relatively
popular one), then I have this

string_agg      │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras Peraza Peres Perez Porras Varas Veras
daitch_mokotoff │ 794000
count           │ 14

and looking for that code in the result I also get these three

string_agg      │ Barca Barco Parco
daitch_mokotoff │ 795000 794000
count           │ 3

string_agg      │ Borja
daitch_mokotoff │ 790000 794000
count           │ 1

string_agg      │ Borjas
daitch_mokotoff │ 794000 794400
count           │ 1

and then I see that I should also search for possible matches in codes
795000, 790000 and 794400, so that gives me

string_agg      │ Baria Baro Barrio Barro Berra Borra Feria Para Parra Perea Vera
daitch_mokotoff │ 790000
count           │ 11

string_agg      │ Barriga Borge Borrego Burgo Fraga
daitch_mokotoff │ 795000
count           │ 5

string_agg      │ Borjas
daitch_mokotoff │ 794000 794400
count           │ 1

which look closely related (compare "Veras" in the first set to "Vera" in
the latter set).  If you ignore that pseudo-match, you're likely to miss
possible family relationships.


I suppose if I were a genealogy researcher, I would be helped by having
each of these codes behave as a separate unit, rather than me having to
split the string into the several possible contained values.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."                                      (E. Dijkstra)



Re: daitch_mokotoff module

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2022-Dec-22, Dag Lem wrote:
>> This should hopefully fix the last Cfbot failures, by exclusion of
>> daitch_mokotoff.h from headerscheck and cpluspluscheck.

> Hmm, maybe it'd be better to move the typedefs to the .h file instead.

Indeed, that sounds like exactly the wrong way to fix such a problem.
The bar for excluding stuff from headerscheck needs to be very high.

            regards, tom lane



Re: daitch_mokotoff module

From
Dag Lem
Date:
Andres Freund <andres@anarazel.de> writes:

> On 2022-12-22 14:27:54 +0100, Dag Lem wrote:
>> This should hopefully fix the last Cfbot failures, by exclusion of
>> daitch_mokotoff.h from headerscheck and cpluspluscheck.
>
> Btw, you can do the same tests as cfbot in your own repo by enabling CI
> in a github repo. See src/tools/ci/README
>

OK, thanks, I've set it up now.

Best regards,

Dag Lem



Re: daitch_mokotoff module

From
Dag Lem
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> On 2022-Dec-22, Dag Lem wrote:
>>> This should hopefully fix the last Cfbot failures, by exclusion of
>>> daitch_mokotoff.h from headerscheck and cpluspluscheck.
>
>> Hmm, maybe it'd be better to move the typedefs to the .h file instead.
>
> Indeed, that sounds like exactly the wrong way to fix such a problem.
> The bar for excluding stuff from headerscheck needs to be very high.
>

OK, I've moved enough declarations back to the generated header file
again so as to avoid excluding it from headerscheck and cpluspluscheck.

Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+    rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..e809d4a39e
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE";
+ *   these are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+#define DM_CODE_DIGITS 6
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_CODE_DIGITS + 1];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Nodes branching out from this node. */
+    struct dm_node *children[DM_MAX_ALTERNATE_CODES + 1];
+    /* Next node in linked list. Alternating index for each iteration. */
+    struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int    daitch_mokotoff_coding(char *word, StringInfo soundex);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string;
+    StringInfoData soundex;
+    text       *retval;
+    MemoryContext old_ctx,
+                tmp_ctx;
+
+    tmp_ctx = AllocSetContextCreate(CurrentMemoryContext,
+                                    "daitch_mokotoff temporary context",
+                                    ALLOCSET_DEFAULT_SIZES);
+    old_ctx = MemoryContextSwitchTo(tmp_ctx);
+
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    initStringInfo(&soundex);
+
+    if (!daitch_mokotoff_coding(string, &soundex))
+    {
+        /* No encodable characters in input. */
+        MemoryContextSwitchTo(old_ctx);
+        MemoryContextDelete(tmp_ctx);
+        PG_RETURN_NULL();
+    }
+
+    string = pg_any_to_server(soundex.data, soundex.len, PG_UTF8);
+    MemoryContextSwitchTo(old_ctx);
+    retval = cstring_to_text(string);
+    MemoryContextDelete(tmp_ctx);
+
+    PG_RETURN_TEXT_P(retval);
+}
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000 ",        /* Six digits + joining space */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .children = {NULL},
+    .next = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static const dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+
+        if (first_node[ix_node] == NULL)
+        {
+            first_node[ix_node] = node;
+        }
+        else
+        {
+            last_node[ix_node]->next[ix_node] = node;
+        }
+
+        last_node[ix_node] = node;
+        node->next[ix_node] = NULL;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node *
+find_or_create_child_node(dm_node * parent, char code_digit, StringInfo soundex)
+{
+    dm_node   **nodes;
+    dm_node    *node;
+    int            i;
+
+    for (nodes = parent->children, i = 0; (node = nodes[i]); i++)
+    {
+        if (node->code_digit == code_digit)
+        {
+            /* Found existing child node. Skip completed nodes. */
+            return node->soundex_length < DM_CODE_DIGITS ? node : NULL;
+        }
+    }
+
+    /* Create new child node. */
+    Assert(i < DM_MAX_ALTERNATE_CODES);
+    node = palloc(sizeof(dm_node));
+    nodes[i] = node;
+
+    *node = start_node;
+    memcpy(node->soundex, parent->soundex, sizeof(parent->soundex));
+    node->soundex_length = parent->soundex_length;
+    node->soundex[node->soundex_length++] = code_digit;
+    node->code_digit = code_digit;
+    node->next_code_index = node->prev_code_index;
+
+    if (node->soundex_length < DM_CODE_DIGITS)
+    {
+        return node;
+    }
+    else
+    {
+        /* Append completed soundex code to soundex string. */
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+        return NULL;
+    }
+}
+
+
+/* Update node for next code digit(s). */
+static void
+update_node(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node,
+            int letter_no, int prev_code_index, int next_code_index,
+            const char *next_code_digits, int digit_no,
+            StringInfo soundex)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_child_node(node, next_code_digit, soundex);
+        if (node)
+        {
+            initialize_node(node, letter_no);
+            dirty_nodes[num_dirty_nodes++] = node;
+        }
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(first_node, last_node, dirty_nodes[i], ix_node,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no,
+                        soundex);
+        }
+        else
+        {
+            /* Add incomplete leaf node to linked list. */
+            set_leaf(first_node, last_node, dirty_nodes[i], ix_node);
+        }
+    }
+}
+
+
+/* Update soundex tree leaf nodes. */
+static void
+update_leaves(dm_node * first_node[2], int *ix_node, int letter_no,
+              const dm_codes * codes, const dm_codes * next_codes,
+              StringInfo soundex)
+{
+    int            i,
+                j,
+                code_index;
+    dm_node    *node,
+               *last_node[2];
+    const        dm_code *code,
+               *next_code;
+    int            ix_node_next = (*ix_node + 1) & 1;    /* Alternating index: 0, 1 */
+
+    /* Initialize for new linked list of leaves. */
+    first_node[ix_node_next] = NULL;
+    last_node[ix_node_next] = NULL;
+
+    /* Process all nodes. */
+    for (node = first_node[*ix_node]; node; node = node->next[*ix_node])
+    {
+        /* One or two alternate code sequences. */
+        for (i = 0; i < 2 && (code = codes[i]) && code[0][0]; i++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (j = 0; j < 2 && (next_code = next_codes[j]) && next_code[0][0]; j++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                update_node(first_node, last_node, node, ix_node_next,
+                            letter_no, prev_code_index, code_index,
+                            code[code_index], 0,
+                            soundex);
+            }
+        }
+    }
+
+    *ix_node = ix_node_next;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !                             ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(unsigned char *str, int *ix)
+{
+    /* Substitute character for skipped code points. */
+    const char    na = '\x1a';
+    pg_wchar    c;
+
+    /* Decode UTF-8 character to ISO 10646 code point. */
+    str += *ix;
+    c = utf8_to_unicode(str);
+    *ix += pg_utf_mblen(str);
+
+    if (c >= (unsigned char) '[' && c <= (unsigned char) ']')
+    {
+        /* ASCII characters [, \, and ] are reserved for Ą, Ę, and Ţ/Ț. */
+        return na;
+    }
+    else if (c < 0x60)
+    {
+        /* Non-lowercase ASCII character. */
+        return c;
+    }
+    else if (c < 0x100)
+    {
+        /* ISO-8859-1 code point, converted to upper-case ASCII. */
+        return iso8859_1_to_ascii_upper[c - 0x60];
+    }
+    else
+    {
+        /* Conversion of non-ASCII characters in the coding chart. */
+        switch (c)
+        {
+            case 0x0104:
+            case 0x0105:
+                /* Ą/ą */
+                return '[';
+            case 0x0118:
+            case 0x0119:
+                /* Ę/ę */
+                return '\\';
+            case 0x0162:
+            case 0x0163:
+            case 0x021A:
+            case 0x021B:
+                /* Ţ/ţ or Ț/ț */
+                return ']';
+            default:
+                return na;
+        }
+    }
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char((unsigned char *) str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static const dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    const        dm_letter *letters;
+    const        dm_codes *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
+static int
+daitch_mokotoff_coding(char *word, StringInfo soundex)
+{
+    int            i = 0;
+    int            letter_no = 0;
+    int            ix_node = 0;
+    const        dm_codes *codes,
+               *next_codes;
+    dm_node    *first_node[2],
+               *node;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return 0;
+    }
+
+    /* Starting point. */
+    first_node[ix_node] = palloc(sizeof(dm_node));
+    *first_node[ix_node] = start_node;
+
+    /*
+     * Loop until either the word input is exhausted, or all generated soundex
+     * codes are completed to six digits.
+     */
+    while (codes && first_node[ix_node])
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        update_leaves(first_node, &ix_node, letter_no,
+                      codes, next_codes ? next_codes : end_codes,
+                      soundex);
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Append all remaining (incomplete) soundex codes. */
+    for (node = first_node[ix_node]; node; node = node->next[ix_node])
+    {
+        appendBinaryStringInfoNT(soundex, node->soundex, DM_CODE_DIGITS + 1);
+    }
+
+    /* Terminate string at the final space. */
+    soundex->len--;
+    soundex->data[soundex->len] = '\0';
+
+    return 1;
+}
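
The note in daitch_mokotoff.c about adjacent identical code digits ("BESST" versus "BEST") may be easier to follow with a stand-alone sketch. This is not part of the patch: dm_join_codes is a hypothetical helper. It assumes the code for each letter sequence has already been selected, so it ignores alternate codes and the start / before-vowel / other context, and shows only digit collapsing and zero padding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DM_CODE_DIGITS 6

/*
 * Hypothetical helper (not in the patch): join already-selected per-letter
 * codes into one six-digit soundex code. "X" means "not coded".
 */
static void
dm_join_codes(const char *const codes[], size_t ncodes,
              char out[DM_CODE_DIGITS + 1])
{
    size_t      len = 0;
    char        prev = '\0';    /* Last code digit of the previous letter */

    /* Pre-fill with padding zeros. */
    memset(out, '0', DM_CODE_DIGITS);
    out[DM_CODE_DIGITS] = '\0';

    for (size_t i = 0; i < ncodes; i++)
    {
        const char *code = codes[i];
        size_t      d;

        assert(code != NULL && code[0] != '\0');

        for (d = 0; code[d] != '\0'; d++)
        {
            /* Not coded: contributes no digit. */
            if (code[d] == 'X')
                continue;
            /* First digit repeats the previous letter's last digit. */
            if (d == 0 && code[d] == prev)
                continue;
            /* Digits beyond the sixth are dropped. */
            if (len < DM_CODE_DIGITS)
                out[len++] = code[d];
        }

        /* After an 'X', the next letter's first digit is never collapsed. */
        prev = code[d - 1];
    }
}
```

With the codes "7", "X", "4", "43" for B, E, S, ST, the leading 4 of "43" is collapsed into the preceding 4, which gives the same result as "7", "X", "43" for BEST. A not-coded letter between two identical digits keeps both, as in "Kleinman" (586660).
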
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..c5eb49a6ff
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,274 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+die "Usage: $0 OUTPUT_FILE\n" if @ARGV != 1;
+my $output_file = $ARGV[0];
+
+# Open the output file
+open my $OUTPUT, '>', $output_file
+  or die "Could not open output file $output_file: $!\n";
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my %alternates;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    # Find alternate code transitions for calculation of storage.
+    # The first character ("start of a name") can never yield more than two alternate codes,
+    # and is not considered here.
+    if (@codes > 1) {
+        for my $j (1..2) {  # Codes for "before a vowel" and "any other"
+            for my $i (0..1) {  # Alternate codes
+                # Identical code digits for adjacent letters are collapsed.
+                # For each possible non-transition due to code digit
+                # collapsing, find all alternate transitions.
+                my ($present, $next) = ($codes[$i][$j], $codes[($i + 1)%2][$j]);
+                next if length($present) != 1;
+                $next = $present ne substr($next, 0, 1) ? substr($next, 0, 1) : substr($next, -1, 1);
+                $alternates{$present}{$next} = 1;
+            }
+        }
+    }
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+# Add alternates by following transitions to 'X' (not coded).
+my $alt_x = $alternates{"X"};
+delete $alt_x->{"X"};
+while (my ($k, $v) = each %alternates) {
+    if (delete $v->{"X"}) {
+        for my $x (keys %$alt_x) {
+            $v->{$x} = 1;
+        }
+    }
+}
+
+# Find the maximum number of alternate codes in one position.
+# Add two for any additional final code digit transitions.
+my $max_alt = (sort { $b <=> $a } (map { scalar keys %$_ } values %alternates))[0] + 2;
+
+print $OUTPUT <<EOF;
+/*
+ * Constants and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#define DM_MAX_ALTERNATE_CODES $max_alt
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    const struct dm_letter *letters;    /* List of possible successive letters */
+    const        dm_codes *codes;    /* Code sequence(s) for complete sequence */
+};
+
+typedef struct dm_letter dm_letter;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print $OUTPUT "static const dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print $OUTPUT <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print $OUTPUT "static const dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print $OUTPUT "$_,\n";
+    }
+    print $OUTPUT "\t{\n\t\t'\\0'\n\t}\n";
+    print $OUTPUT "};\n";
+}
+
+hash2code($table, '');
+
+close $OUTPUT;
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# The conversion from the coding chart to the table should be
+# self-explanatory, but note the differences stated below.
+#
+# X = NC (not coded)
+#
+# The non-ASCII letters in the coding chart are coded with substitute
+# lowercase ASCII letters, which sort after the uppercase ASCII letters:
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" does not correspond to the coding chart, however
+# it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+#
+# Note that the implementation assumes that vowels are assigned code
+# 0 or 1. "J" can be either a vowel or a consonant.
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
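
Each row of the table above carries three codes: start of name, before a vowel, and any other position. A minimal sketch of how one of the three might be picked, following the rule used in update_leaves() that a next letter whose start-of-name code is 0 or 1 counts as a vowel. The function name dm_select_code and the small codes_* constants are assumptions made for illustration. The sketch looks at a single code triplet only and does not handle rows with two alternatives (such as "CH" or "J").

```c
#include <assert.h>
#include <string.h>

typedef char dm_code[2 + 1];    /* One or two code digits + NUL */
typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */

/* A few rows copied from the table, plus the end-of-input dummy. */
static const dm_codes codes_A = {"0", "X", "X"};
static const dm_codes codes_H = {"5", "5", "X"};
static const dm_codes codes_N = {"6", "6", "6"};
static const dm_codes codes_ST = {"2", "43", "43"};
static const dm_codes codes_end = {"X", "X", "X"};

/* Hypothetical helper: choose the code that applies in this position. */
static const char *
dm_select_code(int letter_no, const dm_codes *codes,
               const dm_codes *next_codes)
{
    int         code_index;

    assert(codes != NULL && next_codes != NULL);

    if (letter_no == 0)
        code_index = 0;         /* Start of name */
    else if ((*next_codes)[0][0] <= '1')
        code_index = 1;         /* Next letter is a vowel (code 0 or 1) */
    else
        code_index = 2;         /* Any other position, including the end */

    return (*codes)[code_index];
}
```

For example, "H" yields 5 before a vowel but is not coded before a consonant or at the end of the name, which is what the "Mannheim" and "Halberstadt" regression tests exercise.
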
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..f62ddad4ee 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ 054795
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ 791900
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ 793000
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ 587943 587433
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ 665600
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ 596740 496740
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ 595400 495400
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ 586660
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ 673950
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ 798600
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ 567000 467000
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ 467000
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ 587500 587400
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ 587400
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ 794648 746480
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ 746480
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 945755 945754 945745 945744 944755 944754 944745 944744
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ 945744
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ 079600
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+              daitch_mokotoff
+-------------------------------------------
+ 550000 540000 545000 450000 400000 440000
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ 743000
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ 710000
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ 566600
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 651900 654900 654190 654490 645190 645490 641900 644900
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 059400 054000 054940 054400 045940 045400 049400 044000
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                     daitch_mokotoff
+---------------------------------------------------------
+ 794575 794574 794750 794740 745750 745740 747500 747400
+(1 row)
+
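
Since the function returns all alternative codes as one space-separated string, a caller comparing two names presumably wants to know whether the two results share any code. A minimal client-side sketch, assuming the output format shown in the expected results above. dm_codes_overlap is a hypothetical name and is not provided by the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the space-separated code lists a and b
 * have at least one code in common, otherwise 0.
 */
static int
dm_codes_overlap(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a != '\0')
    {
        size_t      alen = strcspn(a, " ");
        const char *p = b;

        while (*p != '\0')
        {
            size_t      blen = strcspn(p, " ");

            /* Compare one code from a against one code from b. */
            if (alen > 0 && alen == blen && memcmp(a, p, alen) == 0)
                return 1;

            /* Advance to the next code in b. */
            p += blen;
            p += strspn(p, " ");
        }

        /* Advance to the next code in a. */
        a += alen;
        a += strspn(a, " ");
    }

    return 0;
}
```

Applied to the results above, 'Ceniow' (567000 467000) and 'Tsenyuv' (467000) overlap on 467000, while 'Berlin' (798600) and 'Nowy Targ' (673950) do not overlap.
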
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..32d8260383
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ 689000
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ 479000
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ 294795
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ 095600
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ 564000 540000
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+       daitch_mokotoff
+-----------------------------
+ 794640 794400 746400 744000
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ 364000 464000
+(1 row)
+
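
The last two tests give identical results because read_char() maps both Ţ/ţ (cedilla) and Ț/ț (comma below) to the same placeholder character. A stand-alone sketch of that step, using the code points from the switch in read_char(). The names dm_decode_utf8_2byte and dm_placeholder are assumptions for illustration, and the decoder handles only valid two-byte UTF-8 sequences, whereas the patch relies on utf8_to_unicode().

```c
#include <assert.h>

/* Decode a two-byte UTF-8 sequence (U+0080..U+07FF); input must be valid. */
static unsigned int
dm_decode_utf8_2byte(const unsigned char *s)
{
    assert((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80);

    return ((unsigned int) (s[0] & 0x1F) << 6) | (unsigned int) (s[1] & 0x3F);
}

/*
 * Hypothetical helper: map the non-ASCII letters of the coding chart to the
 * ASCII placeholders used for table lookup. Returns '\0' for anything else.
 */
static char
dm_placeholder(unsigned int cp)
{
    switch (cp)
    {
        case 0x0104:
        case 0x0105:
            return '[';         /* A with ogonek, either case */
        case 0x0118:
        case 0x0119:
            return '\\';        /* E with ogonek, either case */
        case 0x0162:
        case 0x0163:
        case 0x021A:
        case 0x021B:
            return ']';         /* T with cedilla or comma below */
        default:
            return '\0';
    }
}
```

Both the bytes C5 A3 (U+0163) and C8 9B (U+021B) end up as ']', so 'ţamas' and 'țamas' are looked up through the same table entry.
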
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..b9d7b229a3
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/meson.build b/contrib/fuzzystrmatch/meson.build
index 11aec733cb..abb944faaf 100644
--- a/contrib/fuzzystrmatch/meson.build
+++ b/contrib/fuzzystrmatch/meson.build
@@ -1,9 +1,18 @@
 # Copyright (c) 2022, PostgreSQL Global Development Group

 fuzzystrmatch_sources = files(
-  'fuzzystrmatch.c',
+  'daitch_mokotoff.c',
   'dmetaphone.c',
+  'fuzzystrmatch.c',
+)
+
+daitch_mokotoff_h = custom_target('daitch_mokotoff',
+  input: 'daitch_mokotoff_header.pl',
+  output: 'daitch_mokotoff.h',
+  command: [perl, '@INPUT@', '@OUTPUT@'],
 )
+generated_sources += daitch_mokotoff_h
+fuzzystrmatch_sources += daitch_mokotoff_h

 if host_system == 'windows'
   fuzzystrmatch_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
@@ -13,6 +22,7 @@ endif

 fuzzystrmatch = shared_module('fuzzystrmatch',
   fuzzystrmatch_sources,
+  include_directories: include_directories('.'),
   kwargs: contrib_mod_args,
 )
 contrib_targets += fuzzystrmatch
@@ -21,6 +31,7 @@ install_data(
   'fuzzystrmatch.control',
   'fuzzystrmatch--1.0--1.1.sql',
   'fuzzystrmatch--1.1.sql',
+  'fuzzystrmatch--1.1--1.2.sql',
   kwargs: contrib_data_args,
 )

@@ -31,6 +42,7 @@ tests += {
   'regress': {
     'sql': [
       'fuzzystrmatch',
+      'fuzzystrmatch_utf8',
     ],
   },
 }
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..08781778f8 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,101 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   Any alternative soundex codes are separated by space, which makes the returned
+   text suited for use in Full Text Search, see <xref linkend="textsearch"/> and
+   <xref linkend="functions-textsearch"/>.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_name(v_name text) RETURNS text AS $$
+  SELECT string_agg(daitch_mokotoff(n), ' ')
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', soundex_name(v_name))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT to_tsquery('simple', quote_literal(soundex_name(v_name)))
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>

Re: daitch_mokotoff module

From
Dag Lem
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> I wonder why do you have it return the multiple alternative codes as a
> space-separated string.  Maybe an array would be more appropriate.  Even
> on your documented example use, the first thing you do is split it on
> spaces.

In the example, the *input* is split on whitespace, the returned soundex
codes are not. The splitting of the input is done in order to code each
word separately. One of the stated rules of the Daitch-Mokotoff Soundex
Coding is that "When a name consists of more than one word, it is coded
as if one word", and this may not always be desired. See
https://www.avotaynu.com/soundex.htm or
https://www.jewishgen.org/InfoFiles/soundex.html for the rules.
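
As a minimal sketch of that distinction, using only a built-in function and a sample name taken from the patch's regression tests: the *input* is split into words so that each word can be coded on its own, rather than coding the whole string as if it were one word.

```sql
-- Split the input on whitespace; every resulting row is one word.
-- With the patch installed, daitch_mokotoff(n) would then be called per row.
SELECT n AS word
FROM regexp_split_to_table('Nowy Targ', '\s+') AS n;
```

This returns two rows, Nowy and Targ.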

The intended use for the Daitch-Mokotoff soundex, as for any other
soundex algorithm, is to index names (or words) on some representation
of sound, so that alike sounding names with different spellings will
match.

In PostgreSQL, the Daitch-Mokotoff Soundex and Full Text Search make
for a powerful combination to match alike sounding names. Full Text
Search (like any other free text search engine) works with documents, and
thus the Daitch-Mokotoff Soundex implementation produces documents
(words separated by spaces). As stated in the documentation: "Any
alternative soundex codes are separated by space, which makes the
returned text suited for use in Full Text Search".
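
A small illustration of why space-separated codes fit Full Text Search, assuming the stock simple configuration. The literal 795000 794000 is the result reported for Barca further down in this thread; it is typed in by hand here, not computed.

```sql
-- Each code becomes one lexeme, so a query for a single code matches the document.
SELECT to_tsvector('simple', '795000 794000')
       @@ to_tsquery('simple', '794000') AS matches;
```

The query should yield true, since 794000 is one of the two lexemes in the vector.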

Best regards,

Dag Lem



Re: daitch_mokotoff module

From
Dag Lem
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> On 2022-Dec-23, Alvaro Herrera wrote:
>

[...]

> I tried downloading a list of surnames from here
> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
> pasted that in a text file and \copy'ed it into a table.  Then I ran
> this query
>
> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
> from apellidos
> group by daitch_mokotoff(a)
> order by count(*) desc;
>
> so I have a first entry like this
>
> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
> daitch_mokotoff │ 784000
> count           │ 18
>
> but then I have a bunch of other entries with the same code 784000 as
> alternative codes,
>
> string_agg      │ Velazco
> daitch_mokotoff │ 784500 784000
> count           │ 1
>
> string_agg      │ Palacio
> daitch_mokotoff │ 785000 784000
> count           │ 1
>
> I suppose I need to group these together somehow, and it would make more
> sense to do that if the values were arrays.
>
>
> If I scroll a bit further down and choose, say, 794000 (a relatively
> popular one), then I have this
>
> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
> Peraza Peres Perez Porras Varas Veras
> daitch_mokotoff │ 794000
> count           │ 14
>
> and looking for that code in the result I also get these three
>
> string_agg      │ Barca Barco Parco
> daitch_mokotoff │ 795000 794000
> count           │ 3
>
> string_agg      │ Borja
> daitch_mokotoff │ 790000 794000
> count           │ 1
>
> string_agg      │ Borjas
> daitch_mokotoff │ 794000 794400
> count           │ 1
>
> and then I see that I should also search for possible matches in codes
> 795000, 790000 and 794400, so that gives me
>
> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
> Perea Vera
> daitch_mokotoff │ 790000
> count           │ 11
>
> string_agg      │ Barriga Borge Borrego Burgo Fraga
> daitch_mokotoff │ 795000
> count           │ 5
>
> string_agg      │ Borjas
> daitch_mokotoff │ 794000 794400
> count           │ 1
>
> which look closely related (compare "Veras" in the first to "Vera" in
> the later set.  If you ignore that pseudo-match, you're likely to miss
> possible family relationships.)
>
>
> I suppose if I were a genealogy researcher, I would be helped by having
> each of these codes behave as a separate unit, rather than me having to
> split the string into the several possible contained values.

It seems to me like you're trying to use soundex coding for something it
was never designed for.

As stated in my previous mail, soundex algorithms are designed to index
names on some representation of sound, so that alike sounding names with
different spellings will match, and as shown in the documentation
example, that is exactly what the implementation facilitates.

Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
however if I understand correctly, you want to index names by single
sounds, linking all alike sounding names to the same soundex code. I
fail to see how that is useful - if you want to find matches for a name,
you simply match against all indexed names. If you only consider one
sound, you won't find all names that match.

In any case, as explained in the documentation, the implementation is
intended to be a companion to Full Text Search, thus text is the natural
representation for the soundex codes.

BTW Vera 790000 does not match Veras 794000, because they don't sound
the same (up to the maximum soundex code length).

Best regards

Dag Lem



Re: daitch_mokotoff module

From
Dag Lem
Date:
Dag Lem <dag@nimrod.no> writes:

> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>
>> On 2022-Dec-23, Alvaro Herrera wrote:
>>
>
> [...]
>
>> I tried downloading a list of surnames from here
>> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
>> pasted that in a text file and \copy'ed it into a table.  Then I ran
>> this query
>>
>> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
>> from apellidos
>> group by daitch_mokotoff(a)
>> order by count(*) desc;
>>
>> so I have a first entry like this
>>
>> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
>> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
>> daitch_mokotoff │ 784000
>> count           │ 18
>>
>> but then I have a bunch of other entries with the same code 784000 as
>> alternative codes,
>>
>> string_agg      │ Velazco
>> daitch_mokotoff │ 784500 784000
>> count           │ 1
>>
>> string_agg      │ Palacio
>> daitch_mokotoff │ 785000 784000
>> count           │ 1
>>
>> I suppose I need to group these together somehow, and it would make more
>> sense to do that if the values were arrays.
>>
>>
>> If I scroll a bit further down and choose, say, 794000 (a relatively
>> popular one), then I have this
>>
>> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
>> Peraza Peres Perez Porras Varas Veras
>> daitch_mokotoff │ 794000
>> count           │ 14
>>
>> and looking for that code in the result I also get these three
>>
>> string_agg      │ Barca Barco Parco
>> daitch_mokotoff │ 795000 794000
>> count           │ 3
>>
>> string_agg      │ Borja
>> daitch_mokotoff │ 790000 794000
>> count           │ 1
>>
>> string_agg      │ Borjas
>> daitch_mokotoff │ 794000 794400
>> count           │ 1
>>
>> and then I see that I should also search for possible matches in codes
>> 795000, 790000 and 794400, so that gives me
>>
>> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
>> Perea Vera
>> daitch_mokotoff │ 790000
>> count           │ 11
>>
>> string_agg      │ Barriga Borge Borrego Burgo Fraga
>> daitch_mokotoff │ 795000
>> count           │ 5
>>
>> string_agg      │ Borjas
>> daitch_mokotoff │ 794000 794400
>> count           │ 1
>>
>> which look closely related (compare "Veras" in the first to "Vera" in
>> the later set.  If you ignore that pseudo-match, you're likely to miss
>> possible family relationships.)
>>
>>
>> I suppose if I were a genealogy researcher, I would be helped by having
>> each of these codes behave as a separate unit, rather than me having to
>> split the string into the several possible contained values.
>
> It seems to me like you're trying to use soundex coding for something it
> was never designed for.
>
> As stated in my previous mail, soundex algorithms are designed to index
> names on some representation of sound, so that alike sounding names with
> different spellings will match, and as shown in the documentation
> example, that is exactly what the implementation facilitates.
>
> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
> however if I understand correctly, you want to index names by single
> sounds, linking all alike sounding names to the same soundex code. I
> fail to see how that is useful - if you want to find matches for a name,
> you simply match against all indexed names. If you only consider one
> sound, you won't find all names that match.
>
> In any case, as explained in the documentation, the implementation is
> intended to be a companion to Full Text Search, thus text is the natural
> representation for the soundex codes.
>
> BTW Vera 790000 does not match Veras 794000, because they don't sound
> the same (up to the maximum soundex code length).
>

I've been sleeping on this, and perhaps the normal use case can just as
well (or better) be covered by the "@>" array operator? I originally
implemented similar functionality using another soundex algorithm more
than a decade ago, and either arrays couldn't be GIN indexed back then,
or I simply missed it. I'll have to get back to this - now it's
Christmas!

Merry Christmas!

Best regards,

Dag Lem



Re: daitch_mokotoff module

From
Alvaro Herrera
Date:
Hello

On 2022-Dec-23, Dag Lem wrote:

> It seems to me like you're trying to use soundex coding for something it
> was never designed for.

I'm not trying to use it for anything, actually.  I'm just reading the
pages your patch links to, to try and understand how this algorithm can
be best implemented in Postgres.

So I got to this page
https://www.avotaynu.com/soundex.htm
which explains that Daitch figured that it would be best if a letter
that can have two possible encodings would be encoded in both ways:

> 5. If a combination of letters could have two possible sounds, then it
> is coded in both manners. For example, the letters ch can have a soft
> sound such as in Chicago or a hard sound as in Christmas.

which I understand as meaning that a single name returns two possible
encodings, which is why these three names
 Barca Barco Parco
have two possible encodings
 795000 and 794000
which is what your algorithm returns.

In fact, using the word Christmas we do get alternative codes for the first
letter (either 4 or 5), precisely as in Daitch's example:

=# select daitch_mokotoff('christmas');
 daitch_mokotoff 
─────────────────
 594364 494364
(1 fila)

and if we take out the ambiguous 'ch', we get a single one:

=# select daitch_mokotoff('ristmas');
 daitch_mokotoff 
─────────────────
 943640
(1 fila)

and if we add another 'ch', we get the codes for each possibility at each
position of the ambiguous 'ch':

=# select daitch_mokotoff('christmach');
       daitch_mokotoff       
─────────────────────────────
 594365 594364 494365 494364
(1 fila)
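
The four codes above are every combination of the two alternatives at each 'ch'. A purely illustrative sketch of that combinatorics with built-in SQL follows; it is not how the C implementation works, and the fixed middle digits 9436 are simply read off the output above.

```sql
-- Two alternatives for the leading 'ch' times two for the trailing 'ch'.
SELECT first_ch.digit || '9436' || last_ch.digit AS code
FROM (VALUES ('5'), ('4')) AS first_ch(digit)
CROSS JOIN (VALUES ('5'), ('4')) AS last_ch(digit);
```

The result has four rows (in no guaranteed order): 594365, 594364, 494365 and 494364.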


So, yes, I'm proposing that we return those as array elements and that
@> is used to match them.

> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
> however if I understand correctly, you want to index names by single
> sounds, linking all alike sounding names to the same soundex code. I
> fail to see how that is useful - if you want to find matches for a name,
> you simply match against all indexed names. If you only consider one
> sound, you won't find all names that match.

Hmm, I think we're saying the same thing, but from opposite points of
view.  No, I want each name to return multiple codes, but that those
multiple codes can be treated as a multiple-value array of codes, rather
than as a single string of space-separated codes.
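
To make the difference concrete, a sketch using only built-ins, with the space-separated string reported for Barca typed in by hand: as text, the codes have to be split before they can be compared individually, while as an array each code is directly addressable.

```sql
SELECT string_to_array('795000 794000', ' ')                  AS codes,
       '794000' = ANY (string_to_array('795000 794000', ' ')) AS has_794000;
```

The first column is the array {795000,794000} and the second is true.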

> In any case, as explained in the documentation, the implementation is
> intended to be a companion to Full Text Search, thus text is the natural
> representation for the soundex codes.

Hmm, I don't agree with this point.  The numbers are representations of
the strings, but they don't necessarily have to be strings themselves.


> BTW Vera 790000 does not match Veras 794000, because they don't sound
> the same (up to the maximum soundex code length).

No, and maybe that's okay because they have different codes.  But they
are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
790000 and 794000.  (Any Spanish speaker will readily tell you that
neither Vera nor Veras is similar in any way to Borja, but D-M has
chosen to say that each of them matches one of Borja's codes.  So they
*are* related, even though indirectly, and as a genealogist you *may* be
interested in getting a match for a person called Vera when looking for
relatives of a person called Veras.  And, as a Spanish speaker, that
would make a lot of sense to me.)
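
The indirect relation can be seen by treating each code as its own unit. A sketch with the codes quoted in this thread entered by hand (no call to daitch_mokotoff is made, and the names surnames, surname and codes are only for illustration):

```sql
WITH surnames(surname, codes) AS (
    VALUES ('Vera',  ARRAY['790000']),
           ('Veras', ARRAY['794000']),
           ('Borja', ARRAY['790000', '794000']),
           ('Barca', ARRAY['795000', '794000'])
)
-- One row per (surname, code) pair, then regroup by individual code.
SELECT code, string_agg(surname, ' ' ORDER BY surname) AS matching_surnames
FROM surnames, unnest(codes) AS code
GROUP BY code
ORDER BY code;
```

Borja shows up under both 790000 (with Vera) and 794000 (with Barca and Veras), which is the link between Vera and Veras described above.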


Now, it's true that I've chosen to use Spanish names for my silly little
experiment.  Maybe this isn't terribly useful as a practical example,
because this algorithm seems to have been designed for Jewish surnames and
perhaps not many (or not any) Jews had Spanish surnames.  I don't know;
I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew
called Josep Álvarez in his book "The Winemaker", so I guess it's not
impossible).  Anyway, I suspect if you repeat the experiment with names
of other origins, you'll find pretty much the same results apply there,
and that is the whole reason D-M returns multiple codes and not just
one.


Merry Christmas :-)

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: daitch_mokotoff module

From
Dag Lem
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

> So, yes, I'm proposing that we return those as array elements and that
> @> is used to match them.

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved on the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays; however, in
any case I must admit that text arrays seem like more natural building
blocks than space-delimited text here.

Search to perform

is the best match for Daitch-Mokotoff, however

, but
in any case I've changed it into return arrays now. I hope it is to your
liking.

>
>> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
>> however if I understand correctly, you want to index names by single
>> sounds, linking all alike sounding names to the same soundex code. I
>> fail to see how that is useful - if you want to find matches for a name,
>> you simply match against all indexed names. If you only consider one
>> sound, you won't find all names that match.
>
> Hmm, I think we're saying the same thing, but from opposite points of
> view.  No, I want each name to return multiple codes, but that those
> multiple codes can be treated as a multiple-value array of codes, rather
> than as a single string of space-separated codes.
>
>> In any case, as explained in the documentation, the implementation is
>> intended to be a companion to Full Text Search, thus text is the natural
>> representation for the soundex codes.
>
> Hmm, I don't agree with this point.  The numbers are representations of
> the strings, but they don't necessarily have to be strings themselves.
>
>
>> BTW Vera 790000 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes.  But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 790000 and 794000.  (Any Spanish speaker will readily tell you that
> neither Vera nor Veras are similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borjas' codes.  So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives to a person called Veras.  And, as a Spanish speaker, that
> would make a lot of sense to me.)
>
>
> Now, it's true that I've chosen to use Spanish names for my silly little
> experiment.  Maybe this isn't terribly useful as a practical example,
> because this algorithm seems to have been designed for Jew surnames and
> perhaps not many (or not any) Jews had Spanish surnames.  I don't know;
> I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew
> called Josep Álvarez in his book "The Winemaker", so I guess it's not
> impossible).  Anyway, I suspect if you repeat the experiment with names
> of other origins, you'll find pretty much the same results apply there,
> and that is the whole reason D-M returns multiple codes and not just
> one.
>
>
> Merry Christmas :-)

--
Dag



Re: daitch_mokotoff module

From
Dag Lem
Date:
Sorry about the latest unfinished email - don't know what key
combination I managed to hit there.

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

>
> So, yes, I'm proposing that we return those as array elements and that
> @> is used to match them.
>

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.
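
For illustration, with the codes reported earlier in the thread for Borja and Borjas typed in by hand (built-in array operators only; the column aliases are arbitrary):

```sql
SELECT ARRAY['790000', '794000'] && ARRAY['794000', '794400'] AS has_overlap,
       ARRAY['790000', '794000'] @> ARRAY['794000', '794400'] AS contains_all;
```

The first column is true because 794000 is shared; the second is false because 794400 is missing from the left-hand array.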

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved on the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays; however, in
any case I must admit that text arrays seem like more natural building
blocks than space-delimited text here.

[...]

>> BTW Vera 790000 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes.  But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 790000 and 794000.  (Any Spanish speaker will readily tell you that
> neither Vera nor Veras is similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borja's codes.  So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives of a person called Veras.  And, as a Spanish speaker, that
> would make a lot of sense to me.)

It is what it is - we can't call it Daitch-Mokotoff Soundex while
implementing something else. Having said that, one can always pre- or
postprocess to tweak the results.

Daitch-Mokotoff Soundex is known to produce false positives, but that is
in many cases not a problem.

Even though it's clearly tuned for Jewish names, the soundex algorithm
seems to work just fine for European names (we use it to match mostly
Norwegian names).


Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+    rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..2548903770
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "mb/pg_wchar.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+#define DM_CODE_DIGITS 6
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_CODE_DIGITS];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Possible nodes branching out from this node - digits 0-9. */
+    struct dm_node *children[10];
+    /* Next node in linked list. Alternating index for each iteration. */
+    struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int    daitch_mokotoff_coding(char *word, ArrayBuildState *soundex, MemoryContext tmp_ctx);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string;
+    ArrayBuildState *soundex;
+    Datum        retval;
+    MemoryContext mem_ctx,
+                tmp_ctx;
+
+    tmp_ctx = AllocSetContextCreate(CurrentMemoryContext,
+                                    "daitch_mokotoff temporary context",
+                                    ALLOCSET_DEFAULT_SIZES);
+    mem_ctx = MemoryContextSwitchTo(tmp_ctx);
+
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    soundex = initArrayResult(TEXTOID, tmp_ctx, false);
+
+    if (!daitch_mokotoff_coding(string, soundex, tmp_ctx))
+    {
+        /* No encodable characters in input. */
+        MemoryContextSwitchTo(mem_ctx);
+        MemoryContextDelete(tmp_ctx);
+        PG_RETURN_NULL();
+    }
+
+    retval = makeArrayResult(soundex, mem_ctx);
+    MemoryContextSwitchTo(mem_ctx);
+    MemoryContextDelete(tmp_ctx);
+
+    PG_RETURN_DATUM(retval);
+}
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000",        /* Six digits */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .children = {NULL},
+    .next = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static const dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+
+        if (first_node[ix_node] == NULL)
+        {
+            first_node[ix_node] = node;
+        }
+        else
+        {
+            last_node[ix_node]->next[ix_node] = node;
+        }
+
+        last_node[ix_node] = node;
+        node->next[ix_node] = NULL;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node *
+find_or_create_child_node(dm_node * parent, char code_digit, ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i = code_digit - '0';
+    dm_node   **nodes = parent->children;
+    dm_node    *node = nodes[i];
+
+    if (node)
+    {
+        /* Found existing child node. Skip completed nodes. */
+        return node->soundex_length < DM_CODE_DIGITS ? node : NULL;
+    }
+
+    /* Create new child node. */
+    node = palloc(sizeof(dm_node));
+    nodes[i] = node;
+
+    *node = start_node;
+    memcpy(node->soundex, parent->soundex, sizeof(parent->soundex));
+    node->soundex_length = parent->soundex_length;
+    node->soundex[node->soundex_length++] = code_digit;
+    node->code_digit = code_digit;
+    node->next_code_index = node->prev_code_index;
+
+    if (node->soundex_length < DM_CODE_DIGITS)
+    {
+        return node;
+    }
+    else
+    {
+        /* Append completed soundex code to soundex array. */
+        accumArrayResult(soundex,
+                         PointerGetDatum(cstring_to_text_with_len(node->soundex, DM_CODE_DIGITS)),
+                         false,
+                         TEXTOID,
+                         tmp_ctx);
+        return NULL;
+    }
+}
+
+
+/* Update node for next code digit(s). */
+static void
+update_node(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node,
+            int letter_no, int prev_code_index, int next_code_index,
+            const char *next_code_digits, int digit_no,
+            ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_child_node(node, next_code_digit, soundex, tmp_ctx);
+        if (node)
+        {
+            initialize_node(node, letter_no);
+            dirty_nodes[num_dirty_nodes++] = node;
+        }
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(first_node, last_node, dirty_nodes[i], ix_node,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no,
+                        soundex, tmp_ctx);
+        }
+        else
+        {
+            /* Add incomplete leaf node to linked list. */
+            set_leaf(first_node, last_node, dirty_nodes[i], ix_node);
+        }
+    }
+}
+
+
+/* Update soundex tree leaf nodes. */
+static void
+update_leaves(dm_node * first_node[2], int *ix_node, int letter_no,
+              const dm_codes * codes, const dm_codes * next_codes,
+              ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i,
+                j,
+                code_index;
+    dm_node    *node,
+               *last_node[2];
+    const        dm_code *code,
+               *next_code;
+    int            ix_node_next = (*ix_node + 1) & 1;    /* Alternating index: 0, 1 */
+
+    /* Initialize for new linked list of leaves. */
+    first_node[ix_node_next] = NULL;
+    last_node[ix_node_next] = NULL;
+
+    /* Process all nodes. */
+    for (node = first_node[*ix_node]; node; node = node->next[*ix_node])
+    {
+        /* One or two alternate code sequences. */
+        for (i = 0; i < 2 && (code = codes[i]) && code[0][0]; i++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (j = 0; j < 2 && (next_code = next_codes[j]) && next_code[0][0]; j++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                update_node(first_node, last_node, node, ix_node_next,
+                            letter_no, prev_code_index, code_index,
+                            code[code_index], 0,
+                            soundex, tmp_ctx);
+            }
+        }
+    }
+
+    *ix_node = ix_node_next;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !                             ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(unsigned char *str, int *ix)
+{
+    /* Substitute character for skipped code points. */
+    const char    na = '\x1a';
+    pg_wchar    c;
+
+    /* Decode UTF-8 character to ISO 10646 code point. */
+    str += *ix;
+    c = utf8_to_unicode(str);
+    *ix += pg_utf_mblen(str);
+
+    if (c >= (unsigned char) '[' && c <= (unsigned char) ']')
+    {
+        /* ASCII characters [, \, and ] are reserved for Ą, Ę, and Ţ/Ț. */
+        return na;
+    }
+    else if (c < 0x60)
+    {
+        /* Non-lowercase ASCII character. */
+        return c;
+    }
+    else if (c < 0x100)
+    {
+        /* ISO-8859-1 code point, converted to upper-case ASCII. */
+        return iso8859_1_to_ascii_upper[c - 0x60];
+    }
+    else
+    {
+        /* Conversion of non-ASCII characters in the coding chart. */
+        switch (c)
+        {
+            case 0x0104:
+            case 0x0105:
+                /* Ą/ą */
+                return '[';
+            case 0x0118:
+            case 0x0119:
+                /* Ę/ę */
+                return '\\';
+            case 0x0162:
+            case 0x0163:
+            case 0x021A:
+            case 0x021B:
+                /* Ţ/ţ or Ț/ț */
+                return ']';
+            default:
+                return na;
+        }
+    }
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char((unsigned char *) str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static const dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    const        dm_letter *letters;
+    const        dm_codes *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word. */
+static int
+daitch_mokotoff_coding(char *word, ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i = 0;
+    int            letter_no = 0;
+    int            ix_node = 0;
+    const        dm_codes *codes,
+               *next_codes;
+    dm_node    *first_node[2],
+               *node;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return 0;
+    }
+
+    /* Starting point. */
+    first_node[ix_node] = palloc(sizeof(dm_node));
+    *first_node[ix_node] = start_node;
+
+    /*
+     * Loop until either the word input is exhausted, or all generated soundex
+     * codes are completed to six digits.
+     */
+    while (codes && first_node[ix_node])
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        update_leaves(first_node, &ix_node, letter_no,
+                      codes, next_codes ? next_codes : end_codes,
+                      soundex, tmp_ctx);
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Append all remaining (incomplete) soundex codes. */
+    for (node = first_node[ix_node]; node; node = node->next[ix_node])
+    {
+        accumArrayResult(soundex,
+                         PointerGetDatum(cstring_to_text_with_len(node->soundex, DM_CODE_DIGITS)),
+                         false,
+                         TEXTOID,
+                         tmp_ctx);
+    }
+
+    return 1;
+}
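
Reviewer note (not part of the patch): the column selection done in
update_leaves() above (start of name, before a vowel, any other) can be
sketched in isolation. The following is a minimal standalone sketch, not the
patch's implementation. It assumes a hypothetical reduced chart
(simple_chart) with single letters only, one code digit per column, and no
alternate codings; the names simple_code, simple_lookup and simple_soundex
are made up for illustration. As in the patch, a letter counts as a vowel
when its first-column code is '0' or '1', and an uncoded letter ('X')
separates otherwise adjacent identical digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define DM_CODE_DIGITS 6

/* Hypothetical reduced chart entry: one code digit per column, no alternates. */
struct simple_code
{
	char		letter;
	char		codes[3];		/* start of name, before a vowel, any other */
};

/* Assumed subset of the coding chart; 'X' means "not coded". */
static const struct simple_code simple_chart[] = {
	{'A', {'0', 'X', 'X'}},
	{'B', {'7', '7', '7'}},
	{'D', {'3', '3', '3'}},
	{'E', {'0', 'X', 'X'}},
	{'F', {'7', '7', '7'}},
	{'I', {'0', 'X', 'X'}},
	{'L', {'8', '8', '8'}},
	{'N', {'6', '6', '6'}},
	{'R', {'9', '9', '9'}},
	{'U', {'0', 'X', 'X'}},
};

/* Linear lookup in the reduced chart; returns NULL for unknown letters. */
static const struct simple_code *
simple_lookup(char c)
{
	size_t		i;

	for (i = 0; i < sizeof(simple_chart) / sizeof(simple_chart[0]); i++)
	{
		if (simple_chart[i].letter == c)
			return &simple_chart[i];
	}
	return NULL;
}

/* Write a single six-digit code for word into out (at least 7 bytes). */
static void
simple_soundex(const char *word, char *out)
{
	size_t		i,
				n;
	int			len = 0;
	int			first = 1;
	char		prev_digit = '\0';

	assert(word != NULL && out != NULL);
	n = strlen(word);

	/* Pre-fill with the "0" padding, then overwrite from the left. */
	memset(out, '0', DM_CODE_DIGITS);
	out[DM_CODE_DIGITS] = '\0';

	for (i = 0; i < n && len < DM_CODE_DIGITS; i++)
	{
		const struct simple_code *cur;
		const struct simple_code *next = NULL;
		int			column;
		char		digit;

		cur = simple_lookup((char) toupper((unsigned char) word[i]));
		if (cur == NULL)
			continue;			/* letters outside the reduced chart are skipped */
		if (i + 1 < n)
			next = simple_lookup((char) toupper((unsigned char) word[i + 1]));

		/* Column 0: first letter; 1: next letter is a vowel; 2: all other. */
		if (first)
			column = 0;
		else if (next != NULL && next->codes[0] <= '1')
			column = 1;
		else
			column = 2;
		first = 0;

		/* Skip uncoded letters and digits repeating the previous one. */
		digit = cur->codes[column];
		if (digit != 'X' && digit != prev_digit)
			out[len++] = digit;
		prev_digit = digit;
	}
}
```

With this reduced chart, simple_soundex("Berlin", buf) yields "798600", the
same code as in the regression test below. Names that depend on multi-letter
sequences or alternate codings (e.g. "Kleinman", "Chernowitz") are outside
what this sketch can reproduce.
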
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..c888847ad0
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,240 @@
+#!/usr/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+die "Usage: $0 OUTPUT_FILE\n" if @ARGV != 1;
+my $output_file = $ARGV[0];
+
+# Open the output file
+open my $OUTPUT, '>', $output_file
+  or die "Could not open output file $output_file: $!\n";
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+print $OUTPUT <<EOF;
+/*
+ * Constants and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    const struct dm_letter *letters;    /* List of possible successive letters */
+    const        dm_codes *codes;    /* Code sequence(s) for complete sequence */
+};
+
+typedef struct dm_letter dm_letter;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print $OUTPUT "static const dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print $OUTPUT <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print $OUTPUT "static const dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print $OUTPUT "$_,\n";
+    }
+    print $OUTPUT "\t{\n\t\t'\\0'\n\t}\n";
+    print $OUTPUT "};\n";
+}
+
+hash2code($table, '');
+
+close $OUTPUT;
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# The conversion from the coding chart to the table should be
+# self-explanatory, but note the differences stated below.
+#
+# X = NC (not coded)
+#
+# The non-ASCII letters in the coding chart are coded with substitute
+# lowercase ASCII letters, which sort after the uppercase ASCII letters:
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" does not correspond to the coding chart, however
+# it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+#
+# Note that the implementation assumes that vowels are assigned code
+# 0 or 1. "J" can be either a vowel or a consonant.
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
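
Reviewer note (not part of the patch): the table above lists overlapping
letter sequences (e.g. S, SCH, SCHT), and read_letter() in daitch_mokotoff.c
picks the longest sequence that has a coding. The following is a minimal
sketch of that longest-match idea only. It swaps in a plain linear scan over
a hypothetical subset of the sequences rather than the generated letter tree,
and the names sequences and longest_sequence are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of the letter sequences from the table. */
static const char *const sequences[] = {
	"S", "SCH", "SCHT", "SH", "SHT", "ST", "SZ", "SZT",
};

/* Length of the longest listed sequence that is a prefix of s, or 0. */
static size_t
longest_sequence(const char *s)
{
	size_t		best = 0;
	size_t		i;

	assert(s != NULL);

	for (i = 0; i < sizeof(sequences) / sizeof(sequences[0]); i++)
	{
		size_t		len = strlen(sequences[i]);

		/* Keep the candidate only if it matches and beats the best so far. */
		if (len > best && strncmp(s, sequences[i], len) == 0)
			best = len;
	}
	return best;
}
```

For input already folded to upper-case ASCII, "SCHTEIN" matches "SCHT"
(length 4) and "SAM" matches only "S" (length 1). The generated tree in the
patch reaches the same result without rescanning the input for every
candidate.
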
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..bcb837fd6b 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ {054795}
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ {791900}
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ {793000}
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ {587943,587433}
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ {665600}
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ {596740,496740}
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ {595400,495400}
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ {586660}
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ {673950}
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ {798600}
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ {567000,467000}
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ {467000}
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ {587500,587400}
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ {587400}
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ {794648,746480}
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ {746480}
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {945755,945754,945745,945744,944755,944754,944745,944744}
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ {945744}
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ {079600}
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ {079600}
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+               daitch_mokotoff
+---------------------------------------------
+ {550000,540000,545000,450000,400000,440000}
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ {743000}
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ {710000}
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ {566600}
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {651900,654900,654190,654490,645190,645490,641900,644900}
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {059400,054000,054940,054400,045940,045400,049400,044000}
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {794575,794574,794750,794740,745750,745740,747500,747400}
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..b0dd4880ba
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ {689000}
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ {479000}
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ {294795}
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ {095600}
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ {564000,540000}
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+        daitch_mokotoff
+-------------------------------
+ {794640,794400,746400,744000}
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ {364000,464000}
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ {364000,464000}
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..d8542a781c
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text[]
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/meson.build b/contrib/fuzzystrmatch/meson.build
index 11aec733cb..abb944faaf 100644
--- a/contrib/fuzzystrmatch/meson.build
+++ b/contrib/fuzzystrmatch/meson.build
@@ -1,9 +1,18 @@
 # Copyright (c) 2022, PostgreSQL Global Development Group

 fuzzystrmatch_sources = files(
-  'fuzzystrmatch.c',
+  'daitch_mokotoff.c',
   'dmetaphone.c',
+  'fuzzystrmatch.c',
+)
+
+daitch_mokotoff_h = custom_target('daitch_mokotoff',
+  input: 'daitch_mokotoff_header.pl',
+  output: 'daitch_mokotoff.h',
+  command: [perl, '@INPUT@', '@OUTPUT@'],
 )
+generated_sources += daitch_mokotoff_h
+fuzzystrmatch_sources += daitch_mokotoff_h

 if host_system == 'windows'
   fuzzystrmatch_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
@@ -13,6 +22,7 @@ endif

 fuzzystrmatch = shared_module('fuzzystrmatch',
   fuzzystrmatch_sources,
+  include_directories: include_directories('.'),
   kwargs: contrib_mod_args,
 )
 contrib_targets += fuzzystrmatch
@@ -21,6 +31,7 @@ install_data(
   'fuzzystrmatch.control',
   'fuzzystrmatch--1.0--1.1.sql',
   'fuzzystrmatch--1.1.sql',
+  'fuzzystrmatch--1.1--1.2.sql',
   kwargs: contrib_data_args,
 )

@@ -31,6 +42,7 @@ tests += {
   'regress': {
     'sql': [
       'fuzzystrmatch',
+      'fuzzystrmatch_utf8',
     ],
   },
 }
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Vowels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 382e54be91..95814b9901 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -241,4 +241,99 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(text source) returns text[]
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+   The returned text array may be used directly in a GIN index, using the
+   <literal>&&</literal> operator for matching, see <xref linkend="gin"/>.
+   For more advanced queries, Full Text Search may be used, see
+   <xref linkend="textsearch"/>, and the example below.
+  </para>
+
+  <para>
+   Example:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', string_agg(array_to_string(daitch_mokotoff(n), ' '), ' '))
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT string_agg('(' || array_to_string(daitch_mokotoff(n), '|') || ')', '&')::tsquery
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+-- Note that searches could be more efficient with the tsvector in a separate column
+-- (no recalculation on table row recheck).
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s VALUES ('John Doe');
+INSERT INTO s VALUES ('Jane Roe');
+INSERT INTO s VALUES ('Public John Q.');
+INSERT INTO s VALUES ('George Best');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+</programlisting>
+ </sect2>
+
 </sect1>
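
The regression tests in the patch above exercise two rules that are easy to picture in isolation: adjacent identical code digits are collapsed, and short codes are padded with "0" to six digits. What follows is a minimal sketch in C, assuming the per-sound digit strings are already known. `build_code` is a hypothetical helper, not part of the patch, and it ignores vowels, alternate codes and the other chart rules that the real module handles.

```c
#include <string.h>

#define DM_CODE_DIGITS 6

/*
 * Build one six-digit code from a list of per-sound digit strings.
 * Simplification (an assumption, not the patch's algorithm): the leading
 * digit of a sound is skipped when it repeats the last digit emitted.
 */
static void
build_code(const char *const *sounds, int nsounds, char out[DM_CODE_DIGITS + 1])
{
    int len = 0;

    /* Start from an all-zero template so that padding comes for free. */
    memset(out, '0', DM_CODE_DIGITS);
    out[DM_CODE_DIGITS] = '\0';

    for (int i = 0; i < nsounds && len < DM_CODE_DIGITS; i++)
    {
        const char *digits = sounds[i];

        /* Collapse a leading digit that repeats the previous one. */
        if (len > 0 && digits[0] == out[len - 1])
            digits++;

        /* Copy the remaining digits, truncating at six. */
        while (*digits && len < DM_CODE_DIGITS)
            out[len++] = *digits++;
    }
}
```

With the assumed inputs "7", "4" and "43" (standing in for B, S and ST), the helper yields 743000, which is the code the patch author cites as correct for "BESST".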

Re: daitch_mokotoff module

From
Dag Lem
Date:
Is there anything else I should do here to avoid the status being
incorrectly stuck at "Waiting for Author" again?

Best regards

Dag Lem



Re: daitch_mokotoff module

From
Alvaro Herrera
Date:
On 2023-Jan-05, Dag Lem wrote:

> Is there anything else I should do here to avoid the status being
> incorrectly stuck at "Waiting for Author" again?

Just mark it Needs Review for now.  I'll be back from vacation on Jan
11th and can have a look then (or somebody else can, perhaps.)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"



Re: daitch_mokotoff module

From
Dag Lem
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> On 2023-Jan-05, Dag Lem wrote:
>
>> Is there anything else I should do here to avoid the status being
>> incorrectly stuck at "Waiting for Author" again?
>
> Just mark it Needs Review for now.  I'll be back from vacation on Jan
> 11th and can have a look then (or somebody else can, perhaps.)

OK, done. Have a nice vacation!

Best regards

Dag Lem



Re: daitch_mokotoff module

From
Paul Ramsey
Date:
On Mon, Jan 2, 2023 at 2:03 PM Dag Lem <dag@nimrod.no> wrote:

> I also improved on the documentation example (using Full Text Search).
> AFAIK you can't make general queries like that using arrays, however in
> any case I must admit that text arrays seem like more natural building
> blocks than space delimited text here.

This is a fun addition to fuzzystrmatch.

While it's a little late in the game, I'll just put it out there:
daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
how you feel about that.

On the documentation, I found the leap directly into the tsquery
example a bit too big. Maybe start with a very simple example,

--
dm=# SELECT daitch_mokotoff('Schwartzenegger'),
            daitch_mokotoff('Swartzenegger');

 daitch_mokotoff | daitch_mokotoff
-----------------+-----------------
 {479465}        | {479465}
--

Then transition into a more complex example that illustrates the GIN
index technique you mention in the text, but do not show:

--
CREATE TABLE dm_gin (source text, dm text[]);

INSERT INTO dm_gin (source) VALUES
    ('Swartzenegger'),
    ('John'),
    ('James'),
    ('Steinman'),
    ('Steinmetz');

UPDATE dm_gin SET dm = daitch_mokotoff(source);

CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);

SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
--

And only then go into the tsearch example. Incidentally, what does the
tsearch approach provide that the simple GIN approach does not?
Ideally explain that briefly before launching into the example. With
all the custom functions and so on it's a little involved, so maybe if
there's not a huge win in using that approach drop it entirely?

ATB,
P



Re: daitch_mokotoff module

From
Dag Lem
Date:
Paul Ramsey <pramsey@cleverelephant.ca> writes:

> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem <dag@nimrod.no> wrote:
>
>> I also improved on the documentation example (using Full Text Search).
>> AFAIK you can't make general queries like that using arrays, however in
>> any case I must admit that text arrays seem like more natural building
>> blocks than space delimited text here.
>
> This is a fun addition to fuzzystrmatch.

I'm glad to hear it! :-)

>
> While it's a little late in the game, I'll just put it out there:
> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
> how you feel about that.

I chose the name in order to follow the naming of the other functions in
fuzzystrmatch, which as far as I can tell are given the name which each
algorithm is known by.

Personally I don't think it's worth it to deviate from the naming of the
other functions just to avoid typing a few characters, and I certainly
don't think daitch_mokotoff is any harder to get right than
levenshtein_less_equal ;-)

So, if I were to decide, I wouldn't change the name of the function.
However I'm obviously not calling the shots on what goes into PostgreSQL
- perhaps someone else would like to weigh in on this?

>
> On the documentation, I found the leap directly into the tsquery
> example a bit too big. Maybe start with a very simple example,
>
> --
> dm=# SELECT daitch_mokotoff('Schwartzenegger'),
>             daitch_mokotoff('Swartzenegger');
>
>  daitch_mokotoff | daitch_mokotoff
> -----------------+-----------------
>  {479465}        | {479465}
> --
>
> Then transition into a more complex example that illustrates the GIN
> index technique you mention in the text, but do not show:
>
> --
> CREATE TABLE dm_gin (source text, dm text[]);
>
> INSERT INTO dm_gin (source) VALUES
>     ('Swartzenegger'),
>     ('John'),
>     ('James'),
>     ('Steinman'),
>     ('Steinmetz');
>
> UPDATE dm_gin SET dm = daitch_mokotoff(source);
>
> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);
>
> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
> --

Sure, I can do that. You don't think this much example text will be
TL;DR?

>
> And only then go into the tsearch example. Incidentally, what does the
> tsearch approach provide that the simple GIN approach does not?

The example shows how to do a simultaneous match on first AND last
names, where the first and last names (any number of names) are stored
in the same indexed column, and the order of the names in the index and
the search term does not matter.

If you were to use the GIN "&&" operator, you would get a match if
either the first OR the last name matches. If you were to use the GIN
"@>" operator, you would *not* get a match if the search term contains
more soundex codes than the indexed name.

E.g. this yields a correct match:
SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson');

While this yields a false positive:
SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && (daitch_mokotoff('John') || daitch_mokotoff('Doe'));

And this yields a false negative:
SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> (daitch_mokotoff('John') || daitch_mokotoff('Jameson'));

This may be explained better by simply showing the output of
soundex_tsvector and soundex_tsquery:

SELECT soundex_tsvector('John Yamson');
         soundex_tsvector         
----------------------------------
 '160000':1 '164600':3 '460000':2

SELECT soundex_tsquery('John Jameson');
                  soundex_tsquery                  
---------------------------------------------------
 ( '160000' | '460000' ) & ( '164600' | '464600' )
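
The difference between the three matching strategies described above can also be sketched outside the database. The following C fragment is only an illustration. The helper names are hypothetical, the codes for 'John', 'Yamson' and 'Jameson' are taken from the soundex_tsvector and soundex_tsquery output above, and "300000" for 'Doe' is an assumption.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_ALTS 4

/* Alternative codes for one name in the search term (an OR group). */
struct name_group
{
    const char *alts[MAX_ALTS];
    int         nalts;
};

/* True if the code set holds the given code. */
static bool
has_code(const char *const *set, int n, const char *code)
{
    for (int i = 0; i < n; i++)
        if (strcmp(set[i], code) == 0)
            return true;
    return false;
}

/* Like the array "&&" operator: the sets share at least one code. */
static bool
overlaps(const char *const *a, int na, const char *const *b, int nb)
{
    for (int i = 0; i < nb; i++)
        if (has_code(a, na, b[i]))
            return true;
    return false;
}

/* Like the array "@>" operator: a holds every code in b. */
static bool
contains_all(const char *const *a, int na, const char *const *b, int nb)
{
    for (int i = 0; i < nb; i++)
        if (!has_code(a, na, b[i]))
            return false;
    return true;
}

/* Like the tsquery: every name must match through at least one alternative. */
static bool
matches_all_groups(const char *const *indexed, int n,
                   const struct name_group *groups, int ngroups)
{
    for (int g = 0; g < ngroups; g++)
        if (!overlaps(indexed, n, groups[g].alts, groups[g].nalts))
            return false;
    return true;
}
```

With the indexed codes {160000, 164600, 460000} for 'John Yamson':

- `overlaps` accepts a 'John Doe' search term because the 'John' codes alone are shared. This is the false positive.
- `contains_all` rejects 'John Jameson' because 464600 is not indexed. This is the false negative.
- `matches_all_groups` accepts 'John Jameson' and rejects 'John Doe'.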

> Ideally explain that briefly before launching into the example. With
> all the custom functions and so on it's a little involved, so maybe if
> there's not a huge win in using that approach drop it entirely?

I believe this functionality is quite useful, and that it's actually
what's called for in many situations. So, I'd rather not drop this
example.

>
> ATB,
> P
>

Best regards,

Dag Lem



Re: daitch_mokotoff module

From
Paul Ramsey
Date:

> On Jan 12, 2023, at 7:30 AM, Dag Lem <dag@nimrod.no> wrote:
>
> Paul Ramsey <pramsey@cleverelephant.ca> writes:
>
>> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem <dag@nimrod.no> wrote:
>>
>>> I also improved on the documentation example (using Full Text Search).
>>> AFAIK you can't make general queries like that using arrays, however in
>>> any case I must admit that text arrays seem like more natural building
>>> blocks than space delimited text here.
>>
>> This is a fun addition to fuzzystrmatch.
>
> I'm glad to hear it! :-)
>
>>
>> While it's a little late in the game, I'll just put it out there:
>> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
>> how you feel about that.
>
> I chose the name in order to follow the naming of the other functions in
> fuzzystrmatch, which as far as I can tell are given the name which each
> algorithm is known by.
>
> Personally I don't think it's worth it to deviate from the naming of the
> other functions just to avoid typing a few characters, and I certainly
> don't think daitch_mokotoff is any harder to get right than
> levenshtein_less_equal ;-)

Good points :)

>
>>
>> On the documentation, I found the leap directly into the tsquery
>> example a bit too big. Maybe start with a very simple example,
>>
>> --
>> dm=# SELECT daitch_mokotoff('Schwartzenegger'),
>>            daitch_mokotoff('Swartzenegger');
>>
>> daitch_mokotoff | daitch_mokotoff
>> -----------------+-----------------
>> {479465}        | {479465}
>> --
>>
>> Then transition into a more complex example that illustrates the GIN
>> index technique you mention in the text, but do not show:
>>
>> --
>> CREATE TABLE dm_gin (source text, dm text[]);
>>
>> INSERT INTO dm_gin (source) VALUES
>>    ('Swartzenegger'),
>>    ('John'),
>>    ('James'),
>>    ('Steinman'),
>>    ('Steinmetz');
>>
>> UPDATE dm_gin SET dm = daitch_mokotoff(source);
>>
>> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);
>>
>> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
>> --
>
> Sure, I can do that. You don't think this much example text will be
> TL;DR?

I can only speak for myself, but examples are the meat of documentation
learning, so as long as they come with enough explanatory context to be
legible it's worth having them, IMO.

>
>>
>> And only then go into the tsearch example. Incidentally, what does the
>> tsearch approach provide that the simple GIN approach does not?
>
> The example shows how to do a simultaneous match on first AND last
> names, where the first and last names (any number of names) are stored
> in the same indexed column, and the order of the names in the index and
> the search term does not matter.
>
> If you were to use the GIN "&&" operator, you would get a match if
> either the first OR the last name matches. If you were to use the GIN
> "@>" operator, you would *not* get a match if the search term contains
> more soundex codes than the indexed name.
>
> E.g. this yields a correct match:
> SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson');
>
> While this yields a false positive:
> SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && (daitch_mokotoff('John') || daitch_mokotoff('Doe'));
>
> And this yields a false negative:
> SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> (daitch_mokotoff('John') || daitch_mokotoff('Jameson'));
>
> This may be explained better by simply showing the output of
> soundex_tsvector and soundex_tsquery:
>
> SELECT soundex_tsvector('John Yamson');
>         soundex_tsvector
> ----------------------------------
> '160000':1 '164600':3 '460000':2
>
> SELECT soundex_tsquery('John Jameson');
>                  soundex_tsquery
> ---------------------------------------------------
> ( '160000' | '460000' ) & ( '164600' | '464600' )
>
>> Ideally explain that briefly before launching into the example. With
>> all the custom functions and so on it's a little involved, so maybe if
>> there's not a huge win in using that approach drop it entirely?
>
> I believe this functionality is quite useful, and that it's actually
> what's called for in many situations. So, I'd rather not drop this
> example.

Sounds good

P

>
>>
>> ATB,
>> P
>>
>
> Best regards,
>
> Dag Lem




Re: daitch_mokotoff module

From
Dag Lem
Date:
Paul Ramsey <pramsey@cleverelephant.ca> writes:

>> On Jan 12, 2023, at 7:30 AM, Dag Lem <dag@nimrod.no> wrote:
>> 

[...]

>> 
>> Sure, I can do that. You don't think this much example text will be
>> TL;DR?
>
> I can only speak for myself, but examples are the meat of
> documentation learning, so as long as they come with enough
> explanatory context to be legible it's worth having them, IMO.
>

I have updated the documentation, hopefully it is more accessible now.

I also corrected documentation for the other functions in fuzzystrmatch
(function name and argtype in the wrong order).

Crossing fingers that someone will eventually change the status to
"Ready for Committer" :-)

Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+    rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..2548903770
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag@nimrod.no>
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adopted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "mb/pg_wchar.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+#define DM_CODE_DIGITS 6
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_CODE_DIGITS];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Possible nodes branching out from this node - digits 0-9. */
+    struct dm_node *children[10];
+    /* Next node in linked list. Alternating index for each iteration. */
+    struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int    daitch_mokotoff_coding(char *word, ArrayBuildState *soundex, MemoryContext tmp_ctx);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string;
+    ArrayBuildState *soundex;
+    Datum        retval;
+    MemoryContext mem_ctx,
+                tmp_ctx;
+
+    tmp_ctx = AllocSetContextCreate(CurrentMemoryContext,
+                                    "daitch_mokotoff temporary context",
+                                    ALLOCSET_DEFAULT_SIZES);
+    mem_ctx = MemoryContextSwitchTo(tmp_ctx);
+
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    soundex = initArrayResult(TEXTOID, tmp_ctx, false);
+
+    if (!daitch_mokotoff_coding(string, soundex, tmp_ctx))
+    {
+        /* No encodable characters in input. */
+        MemoryContextSwitchTo(mem_ctx);
+        MemoryContextDelete(tmp_ctx);
+        PG_RETURN_NULL();
+    }
+
+    retval = makeArrayResult(soundex, mem_ctx);
+    MemoryContextSwitchTo(mem_ctx);
+    MemoryContextDelete(tmp_ctx);
+
+    PG_RETURN_DATUM(retval);
+}
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000",        /* Six digits */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .children = {NULL},
+    .next = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static const dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+
+        if (first_node[ix_node] == NULL)
+        {
+            first_node[ix_node] = node;
+        }
+        else
+        {
+            last_node[ix_node]->next[ix_node] = node;
+        }
+
+        last_node[ix_node] = node;
+        node->next[ix_node] = NULL;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node *
+find_or_create_child_node(dm_node * parent, char code_digit, ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i = code_digit - '0';
+    dm_node   **nodes = parent->children;
+    dm_node    *node = nodes[i];
+
+    if (node)
+    {
+        /* Found existing child node. Skip completed nodes. */
+        return node->soundex_length < DM_CODE_DIGITS ? node : NULL;
+    }
+
+    /* Create new child node. */
+    node = palloc(sizeof(dm_node));
+    nodes[i] = node;
+
+    *node = start_node;
+    memcpy(node->soundex, parent->soundex, sizeof(parent->soundex));
+    node->soundex_length = parent->soundex_length;
+    node->soundex[node->soundex_length++] = code_digit;
+    node->code_digit = code_digit;
+    node->next_code_index = node->prev_code_index;
+
+    if (node->soundex_length < DM_CODE_DIGITS)
+    {
+        return node;
+    }
+    else
+    {
+        /* Append completed soundex code to soundex array. */
+        accumArrayResult(soundex,
+                         PointerGetDatum(cstring_to_text_with_len(node->soundex, DM_CODE_DIGITS)),
+                         false,
+                         TEXTOID,
+                         tmp_ctx);
+        return NULL;
+    }
+}
+
+
+/* Update node for next code digit(s). */
+static void
+update_node(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node,
+            int letter_no, int prev_code_index, int next_code_index,
+            const char *next_code_digits, int digit_no,
+            ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_child_node(node, next_code_digit, soundex, tmp_ctx);
+        if (node)
+        {
+            initialize_node(node, letter_no);
+            dirty_nodes[num_dirty_nodes++] = node;
+        }
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(first_node, last_node, dirty_nodes[i], ix_node,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no,
+                        soundex, tmp_ctx);
+        }
+        else
+        {
+            /* Add incomplete leaf node to linked list. */
+            set_leaf(first_node, last_node, dirty_nodes[i], ix_node);
+        }
+    }
+}
+
+
+/* Update soundex tree leaf nodes. */
+static void
+update_leaves(dm_node * first_node[2], int *ix_node, int letter_no,
+              const dm_codes * codes, const dm_codes * next_codes,
+              ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i,
+                j,
+                code_index;
+    dm_node    *node,
+               *last_node[2];
+    const        dm_code *code,
+               *next_code;
+    int            ix_node_next = (*ix_node + 1) & 1;    /* Alternating index: 0, 1 */
+
+    /* Initialize for new linked list of leaves. */
+    first_node[ix_node_next] = NULL;
+    last_node[ix_node_next] = NULL;
+
+    /* Process all nodes. */
+    for (node = first_node[*ix_node]; node; node = node->next[*ix_node])
+    {
+        /* One or two alternate code sequences. */
+        for (i = 0; i < 2 && (code = codes[i]) && code[0][0]; i++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (j = 0; j < 2 && (next_code = next_codes[j]) && next_code[0][0]; j++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                update_node(first_node, last_node, node, ix_node_next,
+                            letter_no, prev_code_index, code_index,
+                            code[code_index], 0,
+                            soundex, tmp_ctx);
+            }
+        }
+    }
+
+    *ix_node = ix_node_next;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !                             ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(unsigned char *str, int *ix)
+{
+    /* Substitute character for skipped code points. */
+    const char    na = '\x1a';
+    pg_wchar    c;
+
+    /* Decode UTF-8 character to ISO 10646 code point. */
+    str += *ix;
+    c = utf8_to_unicode(str);
+    *ix += pg_utf_mblen(str);
+
+    if (c >= (unsigned char) '[' && c <= (unsigned char) ']')
+    {
+        /* ASCII characters [, \, and ] are reserved for Ą, Ę, and Ţ/Ț. */
+        return na;
+    }
+    else if (c < 0x60)
+    {
+        /* Non-lowercase ASCII character. */
+        return c;
+    }
+    else if (c < 0x100)
+    {
+        /* ISO-8859-1 code point, converted to upper-case ASCII. */
+        return iso8859_1_to_ascii_upper[c - 0x60];
+    }
+    else
+    {
+        /* Conversion of non-ASCII characters in the coding chart. */
+        switch (c)
+        {
+            case 0x0104:
+            case 0x0105:
+                /* Ą/ą */
+                return '[';
+            case 0x0118:
+            case 0x0119:
+                /* Ę/ę */
+                return '\\';
+            case 0x0162:
+            case 0x0163:
+            case 0x021A:
+            case 0x021B:
+                /* Ţ/ţ or Ț/ț */
+                return ']';
+            default:
+                return na;
+        }
+    }
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char((unsigned char *) str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static const dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    const        dm_letter *letters;
+    const        dm_codes *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word. */
+static int
+daitch_mokotoff_coding(char *word, ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i = 0;
+    int            letter_no = 0;
+    int            ix_node = 0;
+    const        dm_codes *codes,
+               *next_codes;
+    dm_node    *first_node[2],
+               *node;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return 0;
+    }
+
+    /* Starting point. */
+    first_node[ix_node] = palloc(sizeof(dm_node));
+    *first_node[ix_node] = start_node;
+
+    /*
+     * Loop until either the word input is exhausted, or all generated soundex
+     * codes are completed to six digits.
+     */
+    while (codes && first_node[ix_node])
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        update_leaves(first_node, &ix_node, letter_no,
+                      codes, next_codes ? next_codes : end_codes,
+                      soundex, tmp_ctx);
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Append all remaining (incomplete) soundex codes. */
+    for (node = first_node[ix_node]; node; node = node->next[ix_node])
+    {
+        accumArrayResult(soundex,
+                         PointerGetDatum(cstring_to_text_with_len(node->soundex, DM_CODE_DIGITS)),
+                         false,
+                         TEXTOID,
+                         tmp_ctx);
+    }
+
+    return 1;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..c888847ad0
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,240 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2021 Finance Norway
+# Author: Dag Lem <dag@nimrod.no>
+#
+# Permission to use, copy, modify, and distribute this software and its
+# documentation for any purpose, without fee, and without a written agreement
+# is hereby granted, provided that the above copyright notice and this
+# paragraph and the following two paragraphs appear in all copies.
+#
+# IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+# DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+# DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+# POSSIBILITY OF SUCH DAMAGE.
+#
+# THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+# AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+# ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+# PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+die "Usage: $0 OUTPUT_FILE\n" if @ARGV != 1;
+my $output_file = $ARGV[0];
+
+# Open the output file
+open my $OUTPUT, '>', $output_file
+  or die "Could not open output file $output_file: $!\n";
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+print $OUTPUT <<EOF;
+/*
+ * Constants and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem <dag\@nimrod.no>
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    const struct dm_letter *letters;    /* List of possible successive letters */
+    const        dm_codes *codes;    /* Code sequence(s) for complete sequence */
+};
+
+typedef struct dm_letter dm_letter;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print $OUTPUT "static const dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print $OUTPUT <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print $OUTPUT "static const dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print $OUTPUT "$_,\n";
+    }
+    print $OUTPUT "\t{\n\t\t'\\0'\n\t}\n";
+    print $OUTPUT "};\n";
+}
+
+hash2code($table, '');
+
+close $OUTPUT;
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# The conversion from the coding chart to the table should be self
+# explanatory, but note the differences stated below.
+#
+# X = NC (not coded)
+#
+# The non-ASCII letters in the coding chart are coded with substitute
+# lowercase ASCII letters, which sort after the uppercase ASCII letters:
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" does not correspond to the coding chart, however
+# it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+#
+# Note that the implementation assumes that vowels are assigned code
+# 0 or 1. "J" can be either a vowel or a consonant.
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..bcb837fd6b 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ {054795}
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ {791900}
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ {793000}
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ {587943,587433}
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ {665600}
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ {596740,496740}
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ {595400,495400}
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ {586660}
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ {673950}
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ {798600}
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ {567000,467000}
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ {467000}
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ {587500,587400}
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ {587400}
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ {794648,746480}
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ {746480}
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {945755,945754,945745,945744,944755,944754,944745,944744}
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ {945744}
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ {079600}
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ {079600}
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+               daitch_mokotoff
+---------------------------------------------
+ {550000,540000,545000,450000,400000,440000}
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ {743000}
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ {710000}
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ {566600}
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {651900,654900,654190,654490,645190,645490,641900,644900}
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {059400,054000,054940,054400,045940,045400,049400,044000}
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {794575,794574,794750,794740,745750,745740,747500,747400}
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..b0dd4880ba
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ {689000}
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ {479000}
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ {294795}
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ {095600}
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ {564000,540000}
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+        daitch_mokotoff
+-------------------------------
+ {794640,794400,746400,744000}
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ {364000,464000}
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ {364000,464000}
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..d8542a781c
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text[]
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/meson.build b/contrib/fuzzystrmatch/meson.build
index 6f424be6d4..3ff84eb531 100644
--- a/contrib/fuzzystrmatch/meson.build
+++ b/contrib/fuzzystrmatch/meson.build
@@ -1,9 +1,18 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group

 fuzzystrmatch_sources = files(
-  'fuzzystrmatch.c',
+  'daitch_mokotoff.c',
   'dmetaphone.c',
+  'fuzzystrmatch.c',
+)
+
+daitch_mokotoff_h = custom_target('daitch_mokotoff',
+  input: 'daitch_mokotoff_header.pl',
+  output: 'daitch_mokotoff.h',
+  command: [perl, '@INPUT@', '@OUTPUT@'],
 )
+generated_sources += daitch_mokotoff_h
+fuzzystrmatch_sources += daitch_mokotoff_h

 if host_system == 'windows'
   fuzzystrmatch_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
@@ -13,6 +22,7 @@ endif

 fuzzystrmatch = shared_module('fuzzystrmatch',
   fuzzystrmatch_sources,
+  include_directories: include_directories('.'),
   kwargs: contrib_mod_args,
 )
 contrib_targets += fuzzystrmatch
@@ -21,6 +31,7 @@ install_data(
   'fuzzystrmatch.control',
   'fuzzystrmatch--1.0--1.1.sql',
   'fuzzystrmatch--1.1.sql',
+  'fuzzystrmatch--1.1--1.2.sql',
   kwargs: contrib_data_args,
 )

@@ -31,6 +42,7 @@ tests += {
   'regress': {
     'sql': [
       'fuzzystrmatch',
+      'fuzzystrmatch_utf8',
     ],
   },
 }
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 5dedbd8f7a..9e7faf6f47 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -104,10 +104,10 @@ SELECT * FROM s WHERE difference(s.nm, 'john') > 2;
   </indexterm>

 <synopsis>
-levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
-levenshtein(text source, text target) returns int
-levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d) returns int
-levenshtein_less_equal(text source, text target, int max_d) returns int
+levenshtein(source text, target text, ins_cost int, del_cost int, sub_cost int) returns int
+levenshtein(source text, target text) returns int
+levenshtein_less_equal(source text, target text, ins_cost int, del_cost int, sub_cost int, max_d int) returns int
+levenshtein_less_equal(source text, target text, max_d int) returns int
 </synopsis>

   <para>
@@ -177,7 +177,7 @@ test=# SELECT levenshtein_less_equal('extensive', 'exhaustive', 4);
   </indexterm>

 <synopsis>
-metaphone(text source, int max_output_length) returns text
+metaphone(source text, max_output_length int) returns text
 </synopsis>

   <para>
@@ -220,8 +220,8 @@ test=# SELECT metaphone('GUMBO', 4);
   </indexterm>

 <synopsis>
-dmetaphone(text source) returns text
-dmetaphone_alt(text source) returns text
+dmetaphone(source text) returns text
+dmetaphone_alt(source text) returns text
 </synopsis>

   <para>
@@ -241,4 +241,155 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(source text) returns text[]
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+  </para>
+
+  <para>
+   Examples:
+  </para>
+
+<programlisting>
+SELECT daitch_mokotoff('George');
+ daitch_mokotoff
+-----------------
+ {595000}
+
+SELECT daitch_mokotoff('John');
+ daitch_mokotoff
+-----------------
+ {160000,460000}
+
+SELECT daitch_mokotoff('Bierschbach');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {794575,794574,794750,794740,745750,745740,747500,747400}
+
+SELECT daitch_mokotoff('Schwartzenegger');
+ daitch_mokotoff
+-----------------
+ {479465}
+</programlisting>
+
+  <para>
+   For matching of single names, the returned text array can be matched
+   directly using the <literal>&&</literal> operator. A GIN index may
+   be used for efficiency, see <xref linkend="gin"/> and the example
+   below:
+  </para>
+
+<programlisting>
+CREATE TABLE s (nm text);
+
+INSERT INTO s (nm) VALUES
+  ('Schwartzenegger'),
+  ('John'),
+  ('James'),
+  ('Steinman'),
+  ('Steinmetz');
+
+CREATE INDEX ix_s_dm ON s USING gin (daitch_mokotoff(nm)) WITH (fastupdate = off);
+
+SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Swartzenegger');
+SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Jane');
+SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Jens');
+</programlisting>
+
+  <para>
+   For indexing and matching of any number of names in any order, Full Text
+   Search may be used. See <xref linkend="textsearch"/> and the example below:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', string_agg(array_to_string(daitch_mokotoff(n), ' '), ' '))
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT string_agg('(' || array_to_string(daitch_mokotoff(n), '|') || ')', '&')::tsquery
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s (nm) VALUES
+  ('John Doe'),
+  ('Jane Roe'),
+  ('Public John Q.'),
+  ('George Best'),
+  ('John Yamson');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('Jameson John');
+</programlisting>
+
+  <para>
+   Note that if it is desired to avoid recalculation of soundex codes on GIN
+   table row recheck, an index on a separate column may be used instead of an
+   index on an expression. A stored generated column may be used for this, see
+   <xref linkend="ddl-generated-columns"/>.
+  </para>
+
+ </sect2>
+
 </sect1>

Re: daitch_mokotoff module

From
Dag Lem
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> On 2023-Jan-05, Dag Lem wrote:
>
>> Is there anything else I should do here, to avoid the status being
>> incorrectly stuck at "Waiting for Author" again.
>
> Just mark it Needs Review for now.  I'll be back from vacation on Jan
> 11th and can have a look then (or somebody else can, perhaps.)

Paul Ramsey had a few comments in the meantime, and based on this I
have produced (yet another) patch, with improved documentation.

However it's still not marked as "Ready for Committer" - can you please
take a look again?

Best regards

Dag Lem



Re: daitch_mokotoff module

From
Dag Lem
Date:
Hi Paul,

I just went by to check the status of the patch, and I noticed that
you've added yourself as reviewer earlier - great!

Please tell me if there is anything I can do to help bring this across
the finish line.

Best regards,

Dag Lem



Re: daitch_mokotoff module

From
Paul Ramsey
Date:

> On Feb 7, 2023, at 6:47 AM, Dag Lem <dag@nimrod.no> wrote:
>
> I just went by to check the status of the patch, and I noticed that
> you've added yourself as reviewer earlier - great!
>
> Please tell me if there is anything I can do to help bring this across
> the finish line.

Honestly, I had set it to Ready for Committer, but then I went to run
regression one more time and my regression blew up. I found I couldn't
enable the UTF tests without things failing. And I don't blame you! I
think my installation is probably out-of-alignment in some way, but I
didn't want to flip the Ready flag without having run everything
through to completion, so I flipped it back. Also, are the UTF tests
enabled by default? It wasn't clear to me that they were?

P


Re: daitch_mokotoff module

From
Tomas Vondra
Date:
On 2/7/23 18:08, Paul Ramsey wrote:
> 
> 
>> On Feb 7, 2023, at 6:47 AM, Dag Lem <dag@nimrod.no> wrote:
>>
>> I just went by to check the status of the patch, and I noticed that
>> you've added yourself as reviewer earlier - great!
>>
>> Please tell me if there is anything I can do to help bring this across
>> the finish line.
> 
> Honestly, I had set it to Ready for Committer, but then I went to run
> regression one more time and my regression blew up. I found I couldn't
> enable the UTF tests without things failing. And I don't blame you! I
> think my installation is probably out-of-alignment in some way, but I
> didn't want to flip the Ready flag without having run everything
> through to completion, so I flipped it back. Also, are the UTF tests
> enabled by default? It wasn't clear to me that they were?
>
The utf8 tests are enabled depending on the encoding returned by
getdatabaseencoding(). Systems with other encodings will simply use the
alternate .out file. And it works perfectly fine for me.
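
For readers unfamiliar with this convention, the logic can be modelled
roughly as follows. This is only a standalone sketch under stated
assumptions, not code from the patch or from the regression driver:
both function names are hypothetical, and the encoding check in the
real test is done by the test script itself, not in C.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the test script's check on the value
 * returned by getdatabaseencoding(): the UTF-8 test body only runs
 * when the database encoding is UTF8.
 */
static bool
utf8_tests_enabled(const char *database_encoding)
{
    return strcmp(database_encoding, "UTF8") == 0;
}

/*
 * Hypothetical model of the comparison step: a test passes if its
 * actual output equals any one of the candidate expected outputs
 * (the regular .out file or an alternate one).
 */
static bool
matches_any_expected(const char *actual,
                     const char *const *expected, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(actual, expected[i]) == 0)
            return true;
    }
    return false;
}
```

Under this model, a database with a non-UTF8 encoding produces the
short "skipped" output, which matches the alternate expected file, so
the test still passes without exercising the UTF-8 cases.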

IMHO it's ready for committer.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: daitch_mokotoff module

From
Alvaro Herrera
Date:
On 2023-Jan-17, Dag Lem wrote:

> + * Daitch-Mokotoff Soundex
> + *
> + * Copyright (c) 2021 Finance Norway
> + * Author: Dag Lem <dag@nimrod.no>

Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
Global Development Group".  Is it okay to use that, and update the year
to 2023?  (Note that answering "no" very likely means your patch is not
a candidate for inclusion.)  Also, we tend not to have "Author:" lines.

> + * Permission to use, copy, modify, and distribute this software and its
> + * documentation for any purpose, without fee, and without a written agreement
> + * is hereby granted, provided that the above copyright notice and this
> + * paragraph and the following two paragraphs appear in all copies.
> + *
> + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
> + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
> + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
> + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
> + * POSSIBILITY OF SUCH DAMAGE.
> + *
> + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
> + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
> + * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
> + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
> + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

We don't keep a separate copyright statement in the file; rather we
assume that all files are under the PostgreSQL license, which is in the
COPYRIGHT file at the top of the tree.  Changing it thus has the side
effect that these disclaim notes refer to the University of California
rather than "the Author".  IANAL.


I think we should add SPDX markers to all the files we distribute:
/* SPDX-License-Identifier: PostgreSQL */

https://spdx.dev/ids/
https://spdx.org/licenses/PostgreSQL.html

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)



Re: daitch_mokotoff module

From
Dag Lem
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> On 2/7/23 18:08, Paul Ramsey wrote:
>> 
>> 
>>> On Feb 7, 2023, at 6:47 AM, Dag Lem <dag@nimrod.no> wrote:
>>>
>>> I just went by to check the status of the patch, and I noticed that
>>> you've added yourself as reviewer earlier - great!
>>>
>>> Please tell me if there is anything I can do to help bring this across
>>> the finish line.
>> 
>> Honestly, I had set it to Ready for Committer, but then I went to
>> run regression one more time and my regression blew up. I found I
>> couldn't enable the UTF tests without things failing. And I don't
>> blame you! I think my installation is probably out-of-alignment in
>> some way, but I didn't want to flip the Ready flag without having
>> run everything through to completion, so I flipped it back. Also,
>> are the UTF tests enabled by default? It wasn't clear to me that
>> they were?
>> 
> The utf8 tests are enabled depending on the encoding returned by
> getdatabaseencoding(). Systems with other encodings will simply use the
> alternate .out file. And it works perfectly fine for me.
>
> IMHO it's ready for committer.
>
>
> regards

Yes, the UTF-8 tests follow the current best practice as has been
explained to me earlier. The following patch exemplifies this:

https://github.com/postgres/postgres/commit/c2e8bd27519f47ff56987b30eb34a01969b9a9e8


Best regards,

Dag Lem



Re: daitch_mokotoff module

From
Dag Lem
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

> On 2023-Jan-17, Dag Lem wrote:
>
>> + * Daitch-Mokotoff Soundex
>> + *
>> + * Copyright (c) 2021 Finance Norway
>> + * Author: Dag Lem <dag@nimrod.no>
>
> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
> Global Development Group".  Is it okay to use that, and update the year
> to 2023?  (Note that answering "no" very likely means your patch is not
> a candidate for inclusion.)  Also, we tend not to have "Author:" lines.
>

You'll have to forgive me for not knowing about this rule:

  grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL

In any case, I have checked with the copyright owner, and it would be OK
to assign the copyright to "PostgreSQL Global Development Group".

To avoid going back and forth with patches, how do you propose that the
sponsor and the author of the contributed module should be credited?
Would something like this be acceptable?

/*
 * Daitch-Mokotoff Soundex
 *
 * Copyright (c) 2023, PostgreSQL Global Development Group
 *
 * This module was sponsored by Finance Norway / Trafikkforsikringsforeningen
 * and implemented by Dag Lem <dag@nimrod.no>
 *
 ...

[...]

>
> We don't keep a separate copyright statement in the file; rather we
> assume that all files are under the PostgreSQL license, which is in the
> COPYRIGHT file at the top of the tree.  Changing it thus has the side
> effect that these disclaim notes refer to the University of California
> rather than "the Author".  IANAL.

OK, no problem. Note that you will again find counterexamples under
contrib/ (and in some other places):

  grep -R "Permission to use" .

> I think we should add SPDX markers to all the files we distribute:
> /* SPDX-License-Identifier: PostgreSQL */
>
> https://spdx.dev/ids/
> https://spdx.org/licenses/PostgreSQL.html

As far as I can tell, this is not included in any file so far, and is
thus better left for someone else to decide and implement.

Best regards

Dag Lem



Re: daitch_mokotoff module

From
Tomas Vondra
Date:

On 2/8/23 15:31, Dag Lem wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> 
>> On 2023-Jan-17, Dag Lem wrote:
>>
>>> + * Daitch-Mokotoff Soundex
>>> + *
>>> + * Copyright (c) 2021 Finance Norway
>>> + * Author: Dag Lem <dag@nimrod.no>
>>
>> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
>> Global Development Group".  Is it okay to use that, and update the year
>> to 2023?  (Note that answering "no" very likely means your patch is not
>> a candidate for inclusion.)  Also, we tend not to have "Author:" lines.
>>
> 
> You'll have to forgive me for not knowing about this rule:
> 
>   grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL
> 
> In any case, I have checked with the copyright owner, and it would be OK
> to assign the copyright to "PostgreSQL Global Development Group".
> 

I'm not entirely sure what the rule is either, and I'm a committer. My
guess is these cases are either old and/or adding code that already
existed elsewhere (like some of the double metaphone, for example), or
maybe both. But I'd bet we'd prefer not adding more ...

> To avoid going back and forth with patches, how do you propose that the
> sponsor and the author of the contributed module should be credited?
> Would something like this be acceptable?
> 

We generally credit contributors in two ways - by mentioning them in the
commit message, and by listing them in the release notes (for individual
features).

> /*
>  * Daitch-Mokotoff Soundex
>  *
>  * Copyright (c) 2023, PostgreSQL Global Development Group
>  *
>  * This module was sponsored by Finance Norway / Trafikkforsikringsforeningen
>  * and implemented by Dag Lem <dag@nimrod.no>
>  *
>  ...
> 
> [...]
> 
>>
>> We don't keep a separate copyright statement in the file; rather we
>> assume that all files are under the PostgreSQL license, which is in the
>> COPYRIGHT file at the top of the tree.  Changing it thus has the side
>> effect that these disclaim notes refer to the University of California
>> rather than "the Author".  IANAL.
> 
> OK, no problem. Note that you will again find counterexamples under
> contrib/ (and in some other places):
> 
>   grep -R "Permission to use" .
> 
>> I think we should add SPDX markers to all the files we distribute:
>> /* SPDX-License-Identifier: PostgreSQL */
>>
>> https://spdx.dev/ids/
>> https://spdx.org/licenses/PostgreSQL.html
> 
> As far as I can tell, this is not included in any file so far, and is
> thus better left for someone else to decide and implement.
> 

I don't think Alvaro was suggesting this patch should do that. It was
more a generic comment about what the project as a whole might do.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: daitch_mokotoff module

From
Dag Lem
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

> On 2/8/23 15:31, Dag Lem wrote:
>> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> 
>>> On 2023-Jan-17, Dag Lem wrote:
>>>
>>>> + * Daitch-Mokotoff Soundex
>>>> + *
>>>> + * Copyright (c) 2021 Finance Norway
>>>> + * Author: Dag Lem <dag@nimrod.no>
>>>
>>> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
>>> Global Development Group".  Is it okay to use that, and update the year
>>> to 2023?  (Note that answering "no" very likely means your patch is not
>>> a candidate for inclusion.)  Also, we tend not to have "Author:" lines.
>>>
>> 
>> You'll have to forgive me for not knowing about this rule:
>> 
>>   grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL
>> 
>> In any case, I have checked with the copyright owner, and it would be OK
>> to assign the copyright to "PostgreSQL Global Development Group".
>> 
>
> I'm not entirely sure what the rule is either, and I'm a committer. My
> guess is these cases are either old and/or adding code that already
> existed elsewhere (like some of the double metaphone, for example), or
> maybe both. But I'd bet we'd prefer not adding more ...
>
>> To avoid going back and forth with patches, how do you propose that the
>> sponsor and the author of the contributed module should be credited?
>> Would something like this be acceptable?
>> 
>
> We generally credit contributors in two ways - by mentioning them in the
> commit message, and by listing them in the release notes (for individual
> features).
>

I'll ask again, would the proposed credits be acceptable? In this case,
the code already existed elsewhere (as in your example for double
metaphone) as a separate extension. The copyright owner is OK with
copyright assignment, however I find it quite unreasonable that proper
credits should not be given. Neither commit messages nor release notes
follow the contributed module, which is in its entirety contributed by
an external entity.

I'll also point out that in addition to credits in code all over the
place, PostgreSQL has much more prominent credits in the documentation:

  grep -ER "Author" doc/ | grep -v PostgreSQL

"Author" is even documented as a top level section in the Reference
Pages as "Author (only used in the contrib section)", see

  https://www.postgresql.org/docs/15/docguide-style.html#id-1.11.11.8.2

If there really exists some new rule which says that for new
contributions under contrib/, credits should not be allowed in any way
in either code or documentation (IANAL, but AFAIU this would be in
conflict with laws on author's moral rights in several countries), then
one would reasonably expect that you'd be upfront about this, both in
documentation, and also as the very first thing when a contribution is
first proposed for inclusion.

Best regards

Dag Lem



Re: daitch_mokotoff module

From
Andres Freund
Date:
Hi,

On 2023-02-09 10:28:36 +0100, Dag Lem wrote:
> I'll ask again, would the proposed credits be acceptable? In this case,
> the code already existed elsewhere (as in your example for double
> metaphone) as a separate extension. The copyright owner is OK with
> copyright assignment, however I find it quite unreasonable that proper
> credits should not be given.

You don't need to assign copyright; it does, however, need to be licensed
under the terms of the PostgreSQL License.


> Neither commit messages nor release notes
> follow the contributed module, which is in its entirety contributed by
> an external entity.

The problem with adding credits to source files is that it's hard to maintain
them reasonably over time. At what point has a C file been extended
sufficiently to warrant an additional author?


> I'll also point out that in addition to credits in code all over the
> place, PostgreSQL has much more prominent credits in the documentation:
>
>   grep -ER "Author" doc/ | grep -v PostgreSQL

FWIW, I'd rather remove them. In several of those the credited author has, by
now, only done a small fraction of the overall work.

They don't make much sense to me - you don't get a permanent mention in other
parts of the documentation either. Many of the binaries outside of contrib/
involved a lot more work by one individual than cases in contrib/. Lots of
backend code has a *lot* of work done by one individual, yet we don't add
authorship notes in relevant sections of the documentation.

Greetings,

Andres Freund



Re: daitch_mokotoff module

From
Dag Lem
Date:
I sincerely hope this resolves any blocking issues with copyright /
legalese / credits.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
     $(WIN32RES) \
+    daitch_mokotoff.o \
     dmetaphone.o \
     fuzzystrmatch.o

 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"

-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8

 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+    perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+    rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 0000000000..9215ab6530
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,567 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * This module was originally sponsored by Finance Norway /
+ * Trafikkforsikringsforeningen, and implemented by Dag Lem <dag@nimrod.no>
+ *
+ * The implementation of the Daitch-Mokotoff Soundex System aims at correctness
+ * and high performance, and can be summarized as follows:
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ * References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "mb/pg_wchar.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+#define DM_CODE_DIGITS 6
+
+/* Node in soundex code tree */
+struct dm_node
+{
+    int            soundex_length; /* Length of generated soundex code */
+    char        soundex[DM_CODE_DIGITS];    /* Soundex code */
+    int            is_leaf;        /* Candidate for complete soundex code */
+    int            last_update;    /* Letter number for last update of node */
+    char        code_digit;        /* Last code digit, 0 - 9 */
+
+    /*
+     * One or two alternate code digits leading to this node. If there are two
+     * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+     * back to the same node.
+     */
+    char        prev_code_digits[2];
+    /* One or two alternate code digits moving forward. */
+    char        next_code_digits[2];
+    /* ORed together code index(es) used to reach current node. */
+    int            prev_code_index;
+    int            next_code_index;
+    /* Possible nodes branching out from this node - digits 0-9. */
+    struct dm_node *children[10];
+    /* Next node in linked list. Alternating index for each iteration. */
+    struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int    daitch_mokotoff_coding(char *word, ArrayBuildState *soundex, MemoryContext tmp_ctx);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+    text       *arg = PG_GETARG_TEXT_PP(0);
+    char       *string;
+    ArrayBuildState *soundex;
+    Datum        retval;
+    MemoryContext mem_ctx,
+                tmp_ctx;
+
+    tmp_ctx = AllocSetContextCreate(CurrentMemoryContext,
+                                    "daitch_mokotoff temporary context",
+                                    ALLOCSET_DEFAULT_SIZES);
+    mem_ctx = MemoryContextSwitchTo(tmp_ctx);
+
+    string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
+    soundex = initArrayResult(TEXTOID, tmp_ctx, false);
+
+    if (!daitch_mokotoff_coding(string, soundex, tmp_ctx))
+    {
+        /* No encodable characters in input. */
+        MemoryContextSwitchTo(mem_ctx);
+        MemoryContextDelete(tmp_ctx);
+        PG_RETURN_NULL();
+    }
+
+    retval = makeArrayResult(soundex, mem_ctx);
+    MemoryContextSwitchTo(mem_ctx);
+    MemoryContextDelete(tmp_ctx);
+
+    PG_RETURN_DATUM(retval);
+}
+
+
+/* Template for new node in soundex code tree. */
+static const dm_node start_node = {
+    .soundex_length = 0,
+    .soundex = "000000",        /* Six digits */
+    .is_leaf = 0,
+    .last_update = 0,
+    .code_digit = '\0',
+    .prev_code_digits = {'\0', '\0'},
+    .next_code_digits = {'\0', '\0'},
+    .prev_code_index = 0,
+    .next_code_index = 0,
+    .children = {NULL},
+    .next = {NULL}
+};
+
+/* Dummy soundex codes at end of input. */
+static const dm_codes end_codes[2] =
+{
+    {
+        "X", "X", "X"
+    }
+};
+
+
+/* Initialize soundex code tree node for next code digit. */
+static void
+initialize_node(dm_node * node, int last_update)
+{
+    if (node->last_update < last_update)
+    {
+        node->prev_code_digits[0] = node->next_code_digits[0];
+        node->prev_code_digits[1] = node->next_code_digits[1];
+        node->next_code_digits[0] = '\0';
+        node->next_code_digits[1] = '\0';
+        node->prev_code_index = node->next_code_index;
+        node->next_code_index = 0;
+        node->is_leaf = 0;
+        node->last_update = last_update;
+    }
+}
+
+
+/* Update soundex code tree node with next code digit. */
+static void
+add_next_code_digit(dm_node * node, int code_index, char code_digit)
+{
+    /* OR in index 1 or 2. */
+    node->next_code_index |= code_index;
+
+    if (!node->next_code_digits[0])
+    {
+        node->next_code_digits[0] = code_digit;
+    }
+    else if (node->next_code_digits[0] != code_digit)
+    {
+        node->next_code_digits[1] = code_digit;
+    }
+}
+
+
+/* Mark soundex code tree node as leaf. */
+static void
+set_leaf(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node)
+{
+    if (!node->is_leaf)
+    {
+        node->is_leaf = 1;
+
+        if (first_node[ix_node] == NULL)
+        {
+            first_node[ix_node] = node;
+        }
+        else
+        {
+            last_node[ix_node]->next[ix_node] = node;
+        }
+
+        last_node[ix_node] = node;
+        node->next[ix_node] = NULL;
+    }
+}
+
+
+/* Find next node corresponding to code digit, or create a new node. */
+static dm_node *
+find_or_create_child_node(dm_node * parent, char code_digit, ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i = code_digit - '0';
+    dm_node   **nodes = parent->children;
+    dm_node    *node = nodes[i];
+
+    if (node)
+    {
+        /* Found existing child node. Skip completed nodes. */
+        return node->soundex_length < DM_CODE_DIGITS ? node : NULL;
+    }
+
+    /* Create new child node. */
+    node = palloc(sizeof(dm_node));
+    nodes[i] = node;
+
+    *node = start_node;
+    memcpy(node->soundex, parent->soundex, sizeof(parent->soundex));
+    node->soundex_length = parent->soundex_length;
+    node->soundex[node->soundex_length++] = code_digit;
+    node->code_digit = code_digit;
+    node->next_code_index = node->prev_code_index;
+
+    if (node->soundex_length < DM_CODE_DIGITS)
+    {
+        return node;
+    }
+    else
+    {
+        /* Append completed soundex code to soundex array. */
+        accumArrayResult(soundex,
+                         PointerGetDatum(cstring_to_text_with_len(node->soundex, DM_CODE_DIGITS)),
+                         false,
+                         TEXTOID,
+                         tmp_ctx);
+        return NULL;
+    }
+}
+
+
+/* Update node for next code digit(s). */
+static void
+update_node(dm_node * first_node[2], dm_node * last_node[2], dm_node * node, int ix_node,
+            int letter_no, int prev_code_index, int next_code_index,
+            const char *next_code_digits, int digit_no,
+            ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i;
+    char        next_code_digit = next_code_digits[digit_no];
+    int            num_dirty_nodes = 0;
+    dm_node    *dirty_nodes[2];
+
+    initialize_node(node, letter_no);
+
+    if (node->prev_code_index && !(node->prev_code_index & prev_code_index))
+    {
+        /*
+         * If the sound (vowel / consonant) of this letter encoding doesn't
+         * correspond to the coding index of the previous letter, we skip this
+         * letter encoding. Note that currently, only "J" can be either a
+         * vowel or a consonant.
+         */
+        return;
+    }
+
+    if (next_code_digit == 'X' ||
+        (digit_no == 0 &&
+         (node->prev_code_digits[0] == next_code_digit ||
+          node->prev_code_digits[1] == next_code_digit)))
+    {
+        /* The code digit is the same as one of the previous (i.e. not added). */
+        dirty_nodes[num_dirty_nodes++] = node;
+    }
+
+    if (next_code_digit != 'X' &&
+        (digit_no > 0 ||
+         node->prev_code_digits[0] != next_code_digit ||
+         node->prev_code_digits[1]))
+    {
+        /* The code digit is different from one of the previous (i.e. added). */
+        node = find_or_create_child_node(node, next_code_digit, soundex, tmp_ctx);
+        if (node)
+        {
+            initialize_node(node, letter_no);
+            dirty_nodes[num_dirty_nodes++] = node;
+        }
+    }
+
+    for (i = 0; i < num_dirty_nodes; i++)
+    {
+        /* Add code digit leading to the current node. */
+        add_next_code_digit(dirty_nodes[i], next_code_index, next_code_digit);
+
+        if (next_code_digits[++digit_no])
+        {
+            update_node(first_node, last_node, dirty_nodes[i], ix_node,
+                        letter_no, prev_code_index, next_code_index,
+                        next_code_digits, digit_no,
+                        soundex, tmp_ctx);
+        }
+        else
+        {
+            /* Add incomplete leaf node to linked list. */
+            set_leaf(first_node, last_node, dirty_nodes[i], ix_node);
+        }
+    }
+}
+
+
+/* Update soundex tree leaf nodes. */
+static void
+update_leaves(dm_node * first_node[2], int *ix_node, int letter_no,
+              const dm_codes * codes, const dm_codes * next_codes,
+              ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i,
+                j,
+                code_index;
+    dm_node    *node,
+               *last_node[2];
+    const        dm_code *code,
+               *next_code;
+    int            ix_node_next = (*ix_node + 1) & 1;    /* Alternating index: 0, 1 */
+
+    /* Initialize for new linked list of leaves. */
+    first_node[ix_node_next] = NULL;
+    last_node[ix_node_next] = NULL;
+
+    /* Process all nodes. */
+    for (node = first_node[*ix_node]; node; node = node->next[*ix_node])
+    {
+        /* One or two alternate code sequences. */
+        for (i = 0; i < 2 && (code = codes[i]) && code[0][0]; i++)
+        {
+            /* Coding for previous letter - before vowel: 1, all other: 2 */
+            int            prev_code_index = (code[0][0] > '1') + 1;
+
+            /* One or two alternate next code sequences. */
+            for (j = 0; j < 2 && (next_code = next_codes[j]) && next_code[0][0]; j++)
+            {
+                /* Determine which code to use. */
+                if (letter_no == 0)
+                {
+                    /* This is the first letter. */
+                    code_index = 0;
+                }
+                else if (next_code[0][0] <= '1')
+                {
+                    /* The next letter is a vowel. */
+                    code_index = 1;
+                }
+                else
+                {
+                    /* All other cases. */
+                    code_index = 2;
+                }
+
+                /* One or two sequential code digits. */
+                update_node(first_node, last_node, node, ix_node_next,
+                            letter_no, prev_code_index, code_index,
+                            code[code_index], 0,
+                            soundex, tmp_ctx);
+            }
+        }
+    }
+
+    *ix_node = ix_node_next;
+}
+
+
+/* Mapping from ISO8859-1 to upper-case ASCII */
+static const char iso8859_1_to_ascii_upper[] =
+/*
+"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
+*/
+"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !                             ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
+
+
+/* Return next character, converted from UTF-8 to uppercase ASCII. */
+static char
+read_char(unsigned char *str, int *ix)
+{
+    /* Substitute character for skipped code points. */
+    const char    na = '\x1a';
+    pg_wchar    c;
+
+    /* Decode UTF-8 character to ISO 10646 code point. */
+    str += *ix;
+    c = utf8_to_unicode(str);
+    *ix += pg_utf_mblen(str);
+
+    if (c >= (unsigned char) '[' && c <= (unsigned char) ']')
+    {
+        /* ASCII characters [, \, and ] are reserved for Ą, Ę, and Ţ/Ț. */
+        return na;
+    }
+    else if (c < 0x60)
+    {
+        /* Non-lowercase ASCII character. */
+        return c;
+    }
+    else if (c < 0x100)
+    {
+        /* ISO-8859-1 code point, converted to upper-case ASCII. */
+        return iso8859_1_to_ascii_upper[c - 0x60];
+    }
+    else
+    {
+        /* Conversion of non-ASCII characters in the coding chart. */
+        switch (c)
+        {
+            case 0x0104:
+            case 0x0105:
+                /* Ą/ą */
+                return '[';
+            case 0x0118:
+            case 0x0119:
+                /* Ę/ę */
+                return '\\';
+            case 0x0162:
+            case 0x0163:
+            case 0x021A:
+            case 0x021B:
+                /* Ţ/ţ or Ț/ț */
+                return ']';
+            default:
+                return na;
+        }
+    }
+}
+
+
+/* Read next ASCII character, skipping any characters not in [A-\]]. */
+static char
+read_valid_char(char *str, int *ix)
+{
+    char        c;
+
+    while ((c = read_char((unsigned char *) str, ix)))
+    {
+        if (c >= 'A' && c <= ']')
+        {
+            break;
+        }
+    }
+
+    return c;
+}
+
+
+/* Return sound coding for "letter" (letter sequence) */
+static const dm_codes *
+read_letter(char *str, int *ix)
+{
+    char        c,
+                cmp;
+    int            i,
+                j;
+    const        dm_letter *letters;
+    const        dm_codes *codes;
+
+    /* First letter in sequence. */
+    if (!(c = read_valid_char(str, ix)))
+    {
+        return NULL;
+    }
+    letters = &letter_[c - 'A'];
+    codes = letters->codes;
+    i = *ix;
+
+    /* Any subsequent letters in sequence. */
+    while ((letters = letters->letters) && (c = read_valid_char(str, &i)))
+    {
+        for (j = 0; (cmp = letters[j].letter); j++)
+        {
+            if (cmp == c)
+            {
+                /* Letter found. */
+                letters = &letters[j];
+                if (letters->codes)
+                {
+                    /* Coding for letter sequence found. */
+                    codes = letters->codes;
+                    *ix = i;
+                }
+                break;
+            }
+        }
+        if (!cmp)
+        {
+            /* The sequence of letters has no coding. */
+            break;
+        }
+    }
+
+    return codes;
+}
+
+
+/* Generate all Daitch-Mokotoff soundex codes for word. */
+static int
+daitch_mokotoff_coding(char *word, ArrayBuildState *soundex, MemoryContext tmp_ctx)
+{
+    int            i = 0;
+    int            letter_no = 0;
+    int            ix_node = 0;
+    const        dm_codes *codes,
+               *next_codes;
+    dm_node    *first_node[2],
+               *node;
+
+    /* First letter. */
+    if (!(codes = read_letter(word, &i)))
+    {
+        /* No encodable character in input. */
+        return 0;
+    }
+
+    /* Starting point. */
+    first_node[ix_node] = palloc(sizeof(dm_node));
+    *first_node[ix_node] = start_node;
+
+    /*
+     * Loop until either the word input is exhausted, or all generated soundex
+     * codes are completed to six digits.
+     */
+    while (codes && first_node[ix_node])
+    {
+        next_codes = read_letter(word, &i);
+
+        /* Update leaf nodes. */
+        update_leaves(first_node, &ix_node, letter_no,
+                      codes, next_codes ? next_codes : end_codes,
+                      soundex, tmp_ctx);
+
+        codes = next_codes;
+        letter_no++;
+    }
+
+    /* Append all remaining (incomplete) soundex codes. */
+    for (node = first_node[ix_node]; node; node = node->next[ix_node])
+    {
+        accumArrayResult(soundex,
+                         PointerGetDatum(cstring_to_text_with_len(node->soundex, DM_CODE_DIGITS)),
+                         false,
+                         TEXTOID,
+                         tmp_ctx);
+    }
+
+    return 1;
+}
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff_header.pl b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
new file mode 100755
index 0000000000..cc4b3a4040
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff_header.pl
@@ -0,0 +1,207 @@
+#!/bin/perl
+#
+# Generation of types and lookup tables for Daitch-Mokotoff soundex.
+#
+# Copyright (c) 2023, PostgreSQL Global Development Group
+#
+# This module was originally sponsored by Finance Norway /
+# Trafikkforsikringsforeningen, and implemented by Dag Lem <dag@nimrod.no>
+#
+
+use strict;
+use warnings;
+use utf8;
+use open IO => ':utf8', ':std';
+use Data::Dumper;
+
+die "Usage: $0 OUTPUT_FILE\n" if @ARGV != 1;
+my $output_file = $ARGV[0];
+
+# Open the output file
+open my $OUTPUT, '>', $output_file
+  or die "Could not open output file $output_file: $!\n";
+
+# Parse code table and generate tree for letter transitions.
+my %codes;
+my $table = [{}, [["","",""]]];
+while (<DATA>) {
+    chomp;
+    my ($letters, $codes) = split(/\s+/);
+    my @codes = map { [ split(/,/) ] } split(/\|/, $codes);
+
+    my $key = "codes_" . join("_or_", map { join("_", @$_) } @codes);
+    my $val = join(",\n", map { "\t{\n\t\t" . join(", ", map { "\"$_\"" } @$_) . "\n\t}" } @codes);
+    $codes{$key} = $val;
+
+    for my $letter (split(/,/, $letters)) {
+        my $ref = $table->[0];
+        # Link each character to the next in the letter combination.
+        my @c = split(//, $letter);
+        my $last_c = pop(@c);
+        for my $c (@c) {
+            $ref->{$c} //= [ {}, undef ];
+            $ref->{$c}[0] //= {};
+            $ref = $ref->{$c}[0];
+        }
+        # The sound code for the letter combination is stored at the last character.
+        $ref->{$last_c}[1] = $key;
+    }
+}
+close(DATA);
+
+print $OUTPUT <<EOF;
+/*
+ * Constants and lookup tables for Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * This file is generated by daitch_mokotoff_header.pl
+ */
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];    /* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];    /* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+    char        letter;            /* Present letter in sequence */
+    const struct dm_letter *letters;    /* List of possible successive letters */
+    const        dm_codes *codes;    /* Code sequence(s) for complete sequence */
+};
+
+typedef struct dm_letter dm_letter;
+
+/* Codes for letter sequence at start of name, before a vowel, and any other. */
+EOF
+
+for my $key (sort keys %codes) {
+    print $OUTPUT "static const dm_codes $key\[2\] =\n{\n" . $codes{$key} . "\n};\n";
+}
+
+print $OUTPUT <<EOF;
+
+/* Coding for alternative following letters in sequence. */
+EOF
+
+sub hash2code {
+    my ($ref, $letter) = @_;
+
+    my @letters = ();
+
+    my $h = $ref->[0];
+    for my $key (sort keys %$h) {
+        $ref = $h->{$key};
+        my $children = "NULL";
+        if (defined $ref->[0]) {
+            $children = "letter_$letter$key";
+            hash2code($ref, "$letter$key");
+        }
+        my $codes = $ref->[1] // "NULL";
+        push(@letters, "\t{\n\t\t'$key', $children, $codes\n\t}");
+    }
+
+    print $OUTPUT "static const dm_letter letter_$letter\[\] =\n{\n";
+    for (@letters) {
+        print $OUTPUT "$_,\n";
+    }
+    print $OUTPUT "\t{\n\t\t'\\0'\n\t}\n";
+    print $OUTPUT "};\n";
+}
+
+hash2code($table, '');
+
+close $OUTPUT;
+
+# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
+#
+# The conversion from the coding chart to the table should be self
+# explanatory, but note the differences stated below.
+#
+# X = NC (not coded)
+#
+# The non-ASCII letters in the coding chart are coded with substitute
+# lowercase ASCII letters, which sort after the uppercase ASCII letters:
+#
+# Ą => a (use '[' for table lookup)
+# Ę => e (use '\\' for table lookup)
+# Ţ => t (use ']' for table lookup)
+#
+# The rule for "UE" does not correspond to the coding chart, however
+# it is used by all other known implementations, including the one at
+# https://www.jewishgen.org/jos/jossound.htm (try e.g. "bouey").
+#
+# Note that the implementation assumes that vowels are assigned code
+# 0 or 1. "J" can be either a vowel or a consonant.
+#
+
+__DATA__
+AI,AJ,AY                0,1,X
+AU                        0,7,X
+a                        X,X,6|X,X,X
+A                        0,X,X
+B                        7,7,7
+CHS                        5,54,54
+CH                        5,5,5|4,4,4
+CK                        5,5,5|45,45,45
+CZ,CS,CSZ,CZS            4,4,4
+C                        5,5,5|4,4,4
+DRZ,DRS                    4,4,4
+DS,DSH,DSZ                4,4,4
+DZ,DZH,DZS                4,4,4
+D,DT                    3,3,3
+EI,EJ,EY                0,1,X
+EU                        1,1,X
+e                        X,X,6|X,X,X
+E                        0,X,X
+FB                        7,7,7
+F                        7,7,7
+G                        5,5,5
+H                        5,5,X
+IA,IE,IO,IU                1,X,X
+I                        0,X,X
+J                        1,X,X|4,4,4
+KS                        5,54,54
+KH                        5,5,5
+K                        5,5,5
+L                        8,8,8
+MN                        66,66,66
+M                        6,6,6
+NM                        66,66,66
+N                        6,6,6
+OI,OJ,OY                0,1,X
+O                        0,X,X
+P,PF,PH                    7,7,7
+Q                        5,5,5
+RZ,RS                    94,94,94|4,4,4
+R                        9,9,9
+SCHTSCH,SCHTSH,SCHTCH    2,4,4
+SCH                        4,4,4
+SHTCH,SHCH,SHTSH        2,4,4
+SHT,SCHT,SCHD            2,43,43
+SH                        4,4,4
+STCH,STSCH,SC            2,4,4
+STRZ,STRS,STSH            2,4,4
+ST                        2,43,43
+SZCZ,SZCS                2,4,4
+SZT,SHD,SZD,SD            2,43,43
+SZ                        4,4,4
+S                        4,4,4
+TCH,TTCH,TTSCH            4,4,4
+TH                        3,3,3
+TRZ,TRS                    4,4,4
+TSCH,TSH                4,4,4
+TS,TTS,TTSZ,TC            4,4,4
+TZ,TTZ,TZS,TSZ            4,4,4
+t                        3,3,3|4,4,4
+T                        3,3,3
+UI,UJ,UY,UE                0,1,X
+U                        0,X,X
+V                        7,7,7
+W                        7,7,7
+X                        5,54,54
+Y                        1,X,X
+ZDZ,ZDZH,ZHDZH            2,4,4
+ZD,ZHD                    2,43,43
+ZH,ZS,ZSCH,ZSH            4,4,4
+Z                        4,4,4
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
index 493c95cdfa..bcb837fd6b 100644
--- a/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch.out
@@ -65,3 +65,174 @@ SELECT dmetaphone_alt('gumbo');
  KMP
 (1 row)

+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+ daitch_mokotoff
+-----------------
+ {054795}
+(1 row)
+
+SELECT daitch_mokotoff('Breuer');
+ daitch_mokotoff
+-----------------
+ {791900}
+(1 row)
+
+SELECT daitch_mokotoff('Freud');
+ daitch_mokotoff
+-----------------
+ {793000}
+(1 row)
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+ daitch_mokotoff
+-----------------
+ {587943,587433}
+(1 row)
+
+SELECT daitch_mokotoff('Mannheim');
+ daitch_mokotoff
+-----------------
+ {665600}
+(1 row)
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+ daitch_mokotoff
+-----------------
+ {596740,496740}
+(1 row)
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+ daitch_mokotoff
+-----------------
+ {595400,495400}
+(1 row)
+
+SELECT daitch_mokotoff('Kleinman');
+ daitch_mokotoff
+-----------------
+ {586660}
+(1 row)
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+ daitch_mokotoff
+-----------------
+ {673950}
+(1 row)
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+ daitch_mokotoff
+-----------------
+ {798600}
+(1 row)
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+ daitch_mokotoff
+-----------------
+ {567000,467000}
+(1 row)
+
+SELECT daitch_mokotoff('Tsenyuv');
+ daitch_mokotoff
+-----------------
+ {467000}
+(1 row)
+
+SELECT daitch_mokotoff('Holubica');
+ daitch_mokotoff
+-----------------
+ {587500,587400}
+(1 row)
+
+SELECT daitch_mokotoff('Golubitsa');
+ daitch_mokotoff
+-----------------
+ {587400}
+(1 row)
+
+SELECT daitch_mokotoff('Przemysl');
+ daitch_mokotoff
+-----------------
+ {794648,746480}
+(1 row)
+
+SELECT daitch_mokotoff('Pshemeshil');
+ daitch_mokotoff
+-----------------
+ {746480}
+(1 row)
+
+SELECT daitch_mokotoff('Rosochowaciec');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {945755,945754,945745,945744,944755,944754,944745,944744}
+(1 row)
+
+SELECT daitch_mokotoff('Rosokhovatsets');
+ daitch_mokotoff
+-----------------
+ {945744}
+(1 row)
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+ daitch_mokotoff
+-----------------
+ {079600}
+(1 row)
+
+SELECT daitch_mokotoff('O''Brien');
+ daitch_mokotoff
+-----------------
+ {079600}
+(1 row)
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+               daitch_mokotoff
+---------------------------------------------
+ {550000,540000,545000,450000,400000,440000}
+(1 row)
+
+SELECT daitch_mokotoff('BESST');
+ daitch_mokotoff
+-----------------
+ {743000}
+(1 row)
+
+SELECT daitch_mokotoff('BOUEY');
+ daitch_mokotoff
+-----------------
+ {710000}
+(1 row)
+
+SELECT daitch_mokotoff('HANNMANN');
+ daitch_mokotoff
+-----------------
+ {566600}
+(1 row)
+
+SELECT daitch_mokotoff('MCCOYJR');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {651900,654900,654190,654490,645190,645490,641900,644900}
+(1 row)
+
+SELECT daitch_mokotoff('ACCURSO');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {059400,054000,054940,054400,045940,045400,049400,044000}
+(1 row)
+
+SELECT daitch_mokotoff('BIERSCHBACH');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {794575,794574,794750,794740,745750,745740,747500,747400}
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
new file mode 100644
index 0000000000..b0dd4880ba
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8.out
@@ -0,0 +1,61 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set client_encoding = utf8;
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+-- Accents
+SELECT daitch_mokotoff('Müller');
+ daitch_mokotoff
+-----------------
+ {689000}
+(1 row)
+
+SELECT daitch_mokotoff('Schäfer');
+ daitch_mokotoff
+-----------------
+ {479000}
+(1 row)
+
+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-----------------
+ {294795}
+(1 row)
+
+SELECT daitch_mokotoff('Éregon');
+ daitch_mokotoff
+-----------------
+ {095600}
+(1 row)
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+ daitch_mokotoff
+-----------------
+ {564000,540000}
+(1 row)
+
+SELECT daitch_mokotoff('brzęczy');
+        daitch_mokotoff
+-------------------------------
+ {794640,794400,746400,744000}
+(1 row)
+
+SELECT daitch_mokotoff('ţamas');
+ daitch_mokotoff
+-----------------
+ {364000,464000}
+(1 row)
+
+SELECT daitch_mokotoff('țamas');
+ daitch_mokotoff
+-----------------
+ {364000,464000}
+(1 row)
+
diff --git a/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
new file mode 100644
index 0000000000..37aead89c0
--- /dev/null
+++ b/contrib/fuzzystrmatch/expected/fuzzystrmatch_utf8_1.out
@@ -0,0 +1,8 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
new file mode 100644
index 0000000000..d8542a781c
--- /dev/null
+++ b/contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql
@@ -0,0 +1,8 @@
+/* contrib/fuzzystrmatch/fuzzystrmatch--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION fuzzystrmatch UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text[]
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.control b/contrib/fuzzystrmatch/fuzzystrmatch.control
index 3cd6660bf9..8b6e9fd993 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.control
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.control
@@ -1,6 +1,6 @@
 # fuzzystrmatch extension
 comment = 'determine similarities and distance between strings'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/fuzzystrmatch'
 relocatable = true
 trusted = true
diff --git a/contrib/fuzzystrmatch/meson.build b/contrib/fuzzystrmatch/meson.build
index 6f424be6d4..3ff84eb531 100644
--- a/contrib/fuzzystrmatch/meson.build
+++ b/contrib/fuzzystrmatch/meson.build
@@ -1,9 +1,18 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group

 fuzzystrmatch_sources = files(
-  'fuzzystrmatch.c',
+  'daitch_mokotoff.c',
   'dmetaphone.c',
+  'fuzzystrmatch.c',
+)
+
+daitch_mokotoff_h = custom_target('daitch_mokotoff',
+  input: 'daitch_mokotoff_header.pl',
+  output: 'daitch_mokotoff.h',
+  command: [perl, '@INPUT@', '@OUTPUT@'],
 )
+generated_sources += daitch_mokotoff_h
+fuzzystrmatch_sources += daitch_mokotoff_h

 if host_system == 'windows'
   fuzzystrmatch_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
@@ -13,6 +22,7 @@ endif

 fuzzystrmatch = shared_module('fuzzystrmatch',
   fuzzystrmatch_sources,
+  include_directories: include_directories('.'),
   kwargs: contrib_mod_args,
 )
 contrib_targets += fuzzystrmatch
@@ -21,6 +31,7 @@ install_data(
   'fuzzystrmatch.control',
   'fuzzystrmatch--1.0--1.1.sql',
   'fuzzystrmatch--1.1.sql',
+  'fuzzystrmatch--1.1--1.2.sql',
   kwargs: contrib_data_args,
 )

@@ -31,6 +42,7 @@ tests += {
   'regress': {
     'sql': [
       'fuzzystrmatch',
+      'fuzzystrmatch_utf8',
     ],
   },
 }
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
index f05dc28ffb..db05c7d6b6 100644
--- a/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch.sql
@@ -19,3 +19,48 @@ SELECT metaphone('GUMBO', 4);

 SELECT dmetaphone('gumbo');
 SELECT dmetaphone_alt('gumbo');
+
+-- Wovels
+SELECT daitch_mokotoff('Augsburg');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+
+-- The letter "H"
+SELECT daitch_mokotoff('Halberstadt');
+SELECT daitch_mokotoff('Mannheim');
+
+-- Adjacent sounds
+SELECT daitch_mokotoff('Chernowitz');
+
+-- Adjacent letters with identical adjacent code digits
+SELECT daitch_mokotoff('Cherkassy');
+SELECT daitch_mokotoff('Kleinman');
+
+-- More than one word
+SELECT daitch_mokotoff('Nowy Targ');
+
+-- Padded with "0"
+SELECT daitch_mokotoff('Berlin');
+
+-- Other examples from https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+-- Ignored characters
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+
+-- "Difficult" cases, likely to cause trouble for other implementations.
+SELECT daitch_mokotoff('CJC');
+SELECT daitch_mokotoff('BESST');
+SELECT daitch_mokotoff('BOUEY');
+SELECT daitch_mokotoff('HANNMANN');
+SELECT daitch_mokotoff('MCCOYJR');
+SELECT daitch_mokotoff('ACCURSO');
+SELECT daitch_mokotoff('BIERSCHBACH');
diff --git a/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
new file mode 100644
index 0000000000..f42c01a1bb
--- /dev/null
+++ b/contrib/fuzzystrmatch/sql/fuzzystrmatch_utf8.sql
@@ -0,0 +1,26 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+
+SELECT getdatabaseencoding() <> 'UTF8'
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+set client_encoding = utf8;
+
+-- CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
+
+-- Accents
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('Schäfer');
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Éregon');
+
+-- Special characters added at https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('gąszczu');
+SELECT daitch_mokotoff('brzęczy');
+SELECT daitch_mokotoff('ţamas');
+SELECT daitch_mokotoff('țamas');
diff --git a/doc/src/sgml/fuzzystrmatch.sgml b/doc/src/sgml/fuzzystrmatch.sgml
index 5dedbd8f7a..55f0b7a22a 100644
--- a/doc/src/sgml/fuzzystrmatch.sgml
+++ b/doc/src/sgml/fuzzystrmatch.sgml
@@ -104,10 +104,10 @@ SELECT * FROM s WHERE difference(s.nm, 'john') > 2;
   </indexterm>

 <synopsis>
-levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
-levenshtein(text source, text target) returns int
-levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int sub_cost, int max_d) returns int
-levenshtein_less_equal(text source, text target, int max_d) returns int
+levenshtein(source text, target text, ins_cost int, del_cost int, sub_cost int) returns int
+levenshtein(source text, target text) returns int
+levenshtein_less_equal(source text, target text, ins_cost int, del_cost int, sub_cost int, max_d int) returns int
+levenshtein_less_equal(source text, target text, max_d int) returns int
 </synopsis>

   <para>
@@ -177,7 +177,7 @@ test=# SELECT levenshtein_less_equal('extensive', 'exhaustive', 4);
   </indexterm>

 <synopsis>
-metaphone(text source, int max_output_length) returns text
+metaphone(source text, max_output_length int) returns text
 </synopsis>

   <para>
@@ -220,8 +220,8 @@ test=# SELECT metaphone('GUMBO', 4);
   </indexterm>

 <synopsis>
-dmetaphone(text source) returns text
-dmetaphone_alt(text source) returns text
+dmetaphone(source text) returns text
+dmetaphone_alt(source text) returns text
 </synopsis>

   <para>
@@ -241,4 +241,154 @@ test=# SELECT dmetaphone('gumbo');
 </screen>
  </sect2>

+ <sect2>
+  <title>Daitch-Mokotoff Soundex</title>
+
+  <para>
+   Compared to the American Soundex System implemented in the
+   <function>soundex</function> function, the major improvements of the
+   Daitch-Mokotoff Soundex System are:
+
+   <itemizedlist spacing="compact" mark="bullet">
+    <listitem>
+     <para>
+      Information is coded to the first six meaningful letters rather than
+      four.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      The initial letter is coded rather than kept as is.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Where two consecutive letters have a single sound, they are coded as a
+      single number.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      When a letter or combination of letters may have two different sounds,
+      it is double coded under the two different codes.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      A letter or combination of letters maps into ten possible codes rather
+      than seven.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <indexterm>
+   <primary>daitch_mokotoff</primary>
+  </indexterm>
+
+  <para>
+   The following function generates Daitch-Mokotoff soundex codes for matching
+   of similar-sounding input:
+  </para>
+
+<synopsis>
+daitch_mokotoff(source text) returns text[]
+</synopsis>
+
+  <para>
+   Since a Daitch-Mokotoff soundex code consists of only 6 digits,
+   <literal>source</literal> should preferably be a single word or name.
+  </para>
+
+  <para>
+   Examples:
+  </para>
+
+<programlisting>
+SELECT daitch_mokotoff('George');
+ daitch_mokotoff
+-----------------
+ {595000}
+
+SELECT daitch_mokotoff('John');
+ daitch_mokotoff
+-----------------
+ {160000,460000}
+
+SELECT daitch_mokotoff('Bierschbach');
+                      daitch_mokotoff
+-----------------------------------------------------------
+ {794575,794574,794750,794740,745750,745740,747500,747400}
+
+SELECT daitch_mokotoff('Schwartzenegger');
+ daitch_mokotoff
+-----------------
+ {479465}
+</programlisting>
+
+  <para>
+   For matching of single names, the returned text array can be matched
+   directly using the <literal>&&</literal> operator. A GIN index may
+   be used for efficiency, see <xref linkend="gin"/> and the example
+   below:
+  </para>
+
+<programlisting>
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_dm ON s USING gin (daitch_mokotoff(nm)) WITH (fastupdate = off);
+
+INSERT INTO s (nm) VALUES
+  ('Schwartzenegger'),
+  ('John'),
+  ('James'),
+  ('Steinman'),
+  ('Steinmetz');
+
+SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Swartzenegger');
+SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Jane');
+SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Jens');
+</programlisting>
+
+  <para>
+   For indexing and matching of any number of names in any order, Full Text
+   Search may be used. See <xref linkend="textsearch"/> and the example below:
+  </para>
+
+<programlisting>
+CREATE OR REPLACE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector AS $$
+  SELECT to_tsvector('simple', string_agg(array_to_string(daitch_mokotoff(n), ' '), ' '))
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE OR REPLACE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery AS $$
+  SELECT string_agg('(' || array_to_string(daitch_mokotoff(n), '|') || ')', '&')::tsquery
+  FROM regexp_split_to_table(v_name, '\s+') AS n
+$$ LANGUAGE sql STRICT IMMUTABLE PARALLEL SAFE;
+
+CREATE TABLE s (nm text);
+CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);
+
+INSERT INTO s (nm) VALUES
+  ('John Doe'),
+  ('Jane Roe'),
+  ('Public John Q.'),
+  ('George Best'),
+  ('John Yamson');
+
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
+SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('Jameson John');
+</programlisting>
+
+  <para>
+   Note that if it is desired to avoid recalculation of soundex codes on GIN
+   table row recheck, an index on a separate column can be used instead of an
+   index on an expression. A stored generated column may be used for this, see
+   <xref linkend="ddl-generated-columns"/>.
+  </para>
+
+ </sect2>
+
 </sect1>
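
A reader's note on the patch above: the longest-match lookup that read_letter performs over the generated letter tables can be sketched in isolation. This is only a minimal sketch under stated assumptions, not the patch's code. The names demo_letter, demo_root and demo_longest_match are hypothetical, the table is cut down to "A" and the "Z", "ZH", "ZHD", "ZHDZH" family from the coding chart, each node stores a label instead of a dm_codes entry, the top level is searched linearly rather than indexed by c - 'A', and no characters are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for the generated dm_letter table. */
typedef struct demo_letter
{
    char        letter;                 /* letter at this level; '\0' ends a list */
    const struct demo_letter *letters;  /* possible following letters, or NULL */
    const char *code;                   /* label if the sequence so far is coded */
} demo_letter;

/* Letters that may follow "ZHDZ": only 'H', completing "ZHDZH". */
static const demo_letter demo_ZHDZ[] = {{'H', NULL, "ZHDZH"}, {'\0', NULL, NULL}};

/* Letters that may follow "ZHD": 'Z', but "ZHDZ" itself has no coding. */
static const demo_letter demo_ZHD[] = {{'Z', demo_ZHDZ, NULL}, {'\0', NULL, NULL}};

/* Letters that may follow "ZH" and "Z". */
static const demo_letter demo_ZH[] = {{'D', demo_ZHD, "ZHD"}, {'\0', NULL, NULL}};
static const demo_letter demo_Z[] = {{'H', demo_ZH, "ZH"}, {'\0', NULL, NULL}};

/* First letters known to this demo. */
static const demo_letter demo_root[] = {
    {'A', NULL, "A"},
    {'Z', demo_Z, "Z"},
    {'\0', NULL, NULL}
};

/*
 * Return the label of the longest coded sequence starting at str[*ix] and
 * advance *ix past it.  Return NULL, leaving *ix untouched, if no coded
 * sequence starts there.
 */
static const char *
demo_longest_match(const char *str, size_t *ix)
{
    const demo_letter *level = demo_root;
    const char *code = NULL;
    size_t      i;

    assert(str != NULL && ix != NULL);
    assert(*ix <= strlen(str));
    i = *ix;

    while (level != NULL && str[i] != '\0')
    {
        const demo_letter *hit = NULL;
        size_t      j;

        /* Look for the current character among the allowed letters. */
        for (j = 0; level[j].letter != '\0'; j++)
        {
            if (level[j].letter == str[i])
            {
                hit = &level[j];
                break;
            }
        }
        if (hit == NULL)
            break;              /* the sequence cannot be extended */

        i++;
        if (hit->code != NULL)
        {
            code = hit->code;   /* longest coded sequence seen so far */
            *ix = i;            /* commit consumed input only at coded nodes */
        }
        level = hit->letters;   /* descend; NULL ends the walk */
    }

    return code;
}
```

With this table, the input "ZHDZA" starting at index 0 yields "ZHD" and leaves the index at 3: the walk reads ahead through the uncoded "ZHDZ" node, finds no continuation at 'A', and falls back to the last coded sequence. The next call then matches the remaining "Z" on its own. Committing the index only at coded nodes is what lets the lookahead be abandoned without any explicit backtracking.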

Re: daitch_mokotoff module

From
Dag Lem
Date:
Dag Lem <dag@nimrod.no> writes:

> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>
>> On 2/7/23 18:08, Paul Ramsey wrote:
>>> 
>>> 
>>>> On Feb 7, 2023, at 6:47 AM, Dag Lem <dag@nimrod.no> wrote:
>>>>
>>>> I just went by to check the status of the patch, and I noticed that
>>>> you've added yourself as reviewer earlier - great!
>>>>
>>>> Please tell me if there is anything I can do to help bring this across
>>>> the finish line.
>>> 
>>> Honestly, I had set it to Ready for Committer, but then I went to
>>> run regression one more time and my regression blew up. I found I
>>> couldn't enable the UTF tests without things failing. And I don't
>>> blame you! I think my installation is probably out-of-alignment in
>>> some way, but I didn't want to flip the Ready flag without having
>>> run everything through to completion, so I flipped it back. Also,
>>> are the UTF tests enabled by default? It wasn't clear to me that
>>> they were?
>>> 
>> The utf8 tests are enabled depending on the encoding returned by
>> getdatabaseencoding(). Systems with other encodings will simply use the
>> alternate .out file. And it works perfectly fine for me.
>>
>> IMHO it's ready for committer.
>>
>>
>> regards
>
> Yes, the UTF-8 tests follow the current best practice as has been
> explained to me earlier. The following patch exemplifies this:
>
> https://github.com/postgres/postgres/commit/c2e8bd27519f47ff56987b30eb34a01969b9a9e8
>
>

Can you please have a look at this again?

Best regards,

Dag Lem



Re: daitch_mokotoff module

From
Dag Lem
Date:
Dag Lem <dag@nimrod.no> writes:

> I sincerely hope this resolves any blocking issues with copyright /
> legalese / credits.
>

Can this now be considered ready for committer, so that Paul or someone
else can flip the bit?

Best regards
Dag Lem



Re: daitch_mokotoff module

From
Tomas Vondra
Date:
On 4/3/23 15:19, Dag Lem wrote:
> Dag Lem <dag@nimrod.no> writes:
> 
>> I sincerely hope this resolves any blocking issues with copyright /
>> legalese / credits.
>>
> 
> Can this now be considered ready for committer, so that Paul or someone
> else can flip the bit?
> 

Hi, I think from the technical point of view it's sound and ready for
commit. The patch stalled on the copyright/credit stuff, which is
somewhat separate and mostly non-technical aspect of patches. Sorry for
that, I'm sure it's annoying/frustrating :-(

I see the current patch has two simple lines:

 * This module was originally sponsored by Finance Norway /
 * Trafikkforsikringsforeningen, and implemented by Dag Lem

Any objections to this level of attribution in comments?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: daitch_mokotoff module

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> Hi, I think from the technical point of view it's sound and ready for
> commit. The patch stalled on the copyright/credit stuff, which is
> somewhat separate and mostly non-technical aspect of patches. Sorry for
> that, I'm sure it's annoying/frustrating :-(

> I see the current patch has two simple lines:

>  * This module was originally sponsored by Finance Norway /
>  * Trafikkforsikringsforeningen, and implemented by Dag Lem

> Any objections to this level of attribution in comments?

That seems fine to me.  I'll check this over and see if I can get
it pushed today.

            regards, tom lane



Re: daitch_mokotoff module

From
Tom Lane
Date:
I wrote:
> That seems fine to me.  I'll check this over and see if I can get
> it pushed today.

I pushed this after some mostly-cosmetic fiddling.  Most of the
buildfarm seems okay with it, but crake's perlcritic run is not:

./contrib/fuzzystrmatch/daitch_mokotoff_header.pl: I/O layer ":utf8" used at line 15, column 5.  Use ":encoding(UTF-8)" to get strict validation.  ([InputOutput::RequireEncodingWithUTF8Layer] Severity: 5)

Any suggestions on exactly how to pacify that?

            regards, tom lane



Re: daitch_mokotoff module

From
Andres Freund
Date:
Hi,

On 2023-04-07 21:13:43 -0400, Tom Lane wrote:
> I wrote:
> > That seems fine to me.  I'll check this over and see if I can get
> > it pushed today.
> 
> I pushed this after some mostly-cosmetic fiddling.  Most of the
> buildfarm seems okay with it, but crake's perlcritic run is not:
> 
> ./contrib/fuzzystrmatch/daitch_mokotoff_header.pl: I/O layer ":utf8" used at line 15, column 5.  Use ":encoding(UTF-8)" to get strict validation.  ([InputOutput::RequireEncodingWithUTF8Layer] Severity: 5)
> 
> Any suggestions on exactly how to pacify that?

You could follow its advice and replace the :utf8 with :encoding(UTF-8); that
works here. Or disable it in that piece of code with ## no critic
(RequireEncodingWithUTF8Layer). Or we could disable the warning in
perlcriticrc for all files?

Unless it's not available with old versions, using :encoding(UTF-8) seems
sensible?

Greetings,

Andres Freund



Re: daitch_mokotoff module

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-04-07 21:13:43 -0400, Tom Lane wrote:
>> I pushed this after some mostly-cosmetic fiddling.  Most of the
>> buildfarm seems okay with it, but crake's perlcritic run is not:
>>
>> ./contrib/fuzzystrmatch/daitch_mokotoff_header.pl: I/O layer ":utf8" used at line 15, column 5.  Use ":encoding(UTF-8)" to get strict validation.  ([InputOutput::RequireEncodingWithUTF8Layer] Severity: 5)

> Unless it's not available with old versions, using :encoding(UTF-8) seems
> sensible?

Yeah, that's the obvious fix, I was just wondering if people with
more perl-fu than I have see a problem with it.  But I'll go ahead
and push that for now.

            regards, tom lane



Re: daitch_mokotoff module

From
Tom Lane
Date:
I wrote:
> I pushed this after some mostly-cosmetic fiddling.  Most of the
> buildfarm seems okay with it,

Spoke too soon [1]:

make[1]: Entering directory '/home/linux1/build-farm-16-pipit/buildroot/HEAD/pgsql.build/contrib/fuzzystrmatch'
'/usr/bin/perl' daitch_mokotoff_header.pl daitch_mokotoff.h
Can't locate open.pm in @INC (you may need to install the open module) (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at daitch_mokotoff_header.pl line 15.
BEGIN failed--compilation aborted at daitch_mokotoff_header.pl line 15.
make[1]: *** [Makefile:33: daitch_mokotoff.h] Error 2

pipit appears to be running a reasonably current system (RHEL8), so
the claim that "open" is a Perl core module appears false.  We need
to rewrite this to not use that.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pipit&dt=2023-04-08%2001%3A02%3A39



Re: daitch_mokotoff module

From
Andrew Dunstan
Date:


On 2023-04-07 Fr 21:52, Tom Lane wrote:
> I wrote:
>> I pushed this after some mostly-cosmetic fiddling.  Most of the
>> buildfarm seems okay with it,
>
> Spoke too soon [1]:
>
> make[1]: Entering directory '/home/linux1/build-farm-16-pipit/buildroot/HEAD/pgsql.build/contrib/fuzzystrmatch'
> '/usr/bin/perl' daitch_mokotoff_header.pl daitch_mokotoff.h
> Can't locate open.pm in @INC (you may need to install the open module) (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at daitch_mokotoff_header.pl line 15.
> BEGIN failed--compilation aborted at daitch_mokotoff_header.pl line 15.
> make[1]: *** [Makefile:33: daitch_mokotoff.h] Error 2
>
> pipit appears to be running a reasonably current system (RHEL8), so
> the claim that "open" is a Perl core module appears false.  We need
> to rewrite this to not use that.
>
>             regards, tom lane
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pipit&dt=2023-04-08%2001%3A02%3A39



I think it is a core module (See <https://metacpan.org/pod/open>) but it appears that some packagers have separated it out for reasons that aren't entirely obvious:

andrew@emma:~ $ rpm -q -l -f /usr/share/perl5/open.pm
/usr/share/man/man3/open.3pm.gz
/usr/share/perl5/open.pm

cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: daitch_mokotoff module

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-04-07 Fr 21:52, Tom Lane wrote:
>> pipit appears to be running a reasonably current system (RHEL8), so
>> the claim that "open" is a Perl core module appears false.  We need
>> to rewrite this to not use that.

> I think it is a core module (See <https://metacpan.org/pod/open>) but it 
> appears that some packagers have separated it out for reasons that 
> aren't entirely obvious:

Hmm, yeah: on my RHEL8 workstation

$ rpm -qf /usr/share/perl5/open.pm
perl-open-1.11-421.el8.noarch

It's not exactly clear how that came to be installed, because

$ rpm -q perl-open --whatrequires
no package requires perl-open

and indeed another nearby RHEL8 machine doesn't have that package
installed at all, even though I've got it loaded up with enough
stuff for most Postgres work.  (Sadly, I'd not tested on that one.)

Anyway, I assume this is just syntactic sugar for something
we can do another way?  If it's at all fundamental, I'll have
to back the patch out.

            regards, tom lane



Re: daitch_mokotoff module

From
Tom Lane
Date:
I wrote:
> Anyway, I assume this is just syntactic sugar for something
> we can do another way?  If it's at all fundamental, I'll have
> to back the patch out.

On closer inspection, this script is completely devoid of any
need to deal in non-ASCII data at all.  So I just nuked the
"use" lines.

            regards, tom lane



Re: daitch_mokotoff module

From
Andrew Dunstan
Date:


On 2023-04-07 Fr 23:25, Tom Lane wrote:
> I wrote:
>> Anyway, I assume this is just syntactic sugar for something
>> we can do another way?  If it's at all fundamental, I'll have
>> to back the patch out.
>
> On closer inspection, this script is completely devoid of any
> need to deal in non-ASCII data at all.  So I just nuked the
> "use" lines.


Yeah.

I just spent a little while staring at the Perl code. I have to say it seems rather opaque, and the data structure seems a bit baroque. I'll try to simplify it.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: daitch_mokotoff module

From
Tom Lane
Date:
Buildfarm member hamerkop has a niggle about this patch:

c:\\build-farm-local\\buildroot\\head\\pgsql.build\\contrib\\fuzzystrmatch\\daitch_mokotoff.c : warning C4819: The file contains a character that cannot be represented in the current code page (932). Save the file in Unicode format to prevent data loss

It's complaining about the comment in

static const char iso8859_1_to_ascii_upper[] =
/*
"`abcdefghijklmnopqrstuvwxyz{|}~                                  ¡¢£¤¥¦§¨©ª«¬
®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
*/
"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~                                  !
?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";

There are some other comments with non-ASCII characters elsewhere in the
file, but I think it's mainly just the weird symbols here that might fail
to translate to encodings that are not based on ISO 8859-1.

I think we need to get rid of this warning: it's far from obvious that
it's a non-issue, and because the compiler is not at all specific about
where the issue is, people could waste a lot of time figuring that out.
In fact, it might *not* be a non-issue, if it prevents the source tree
as a whole from being processed by some tool or other.

So I propose to replace those symbols with "... random symbols ..." or
the like and see if the warning goes away.  If not, we might have to
resort to something more drastic like removing this comment altogether.
We do have non-ASCII text in comments and test cases elsewhere in the
tree, and have not had a lot of trouble with that, so I'm hoping the
letters can stay because they are useful to compare to the constant.

            regards, tom lane
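
For illustration only, here is a minimal sketch (not the code in daitch_mokotoff.c) of how a lookup string in the style of iso8859_1_to_ascii_upper can be indexed while the source file itself stays pure ASCII, which should leave nothing for a C4819-style warning to complain about. The names latin1_letters_to_ascii_upper and fold_latin1_to_ascii_upper are hypothetical. The table is deliberately reduced to the letter range 0xC0..0xFF, and mapping every byte in 0x80..0xBF to a blank is an assumption of this sketch that differs from the full table quoted above.

```c
#include <assert.h>
#include <ctype.h>

/*
 * Hypothetical reduced table: upper-case ASCII replacements for the
 * ISO 8859-1 code points 0xC0..0xFF.  Only ASCII characters appear in
 * this source; the code points are identified by number in the comments.
 */
static const char latin1_letters_to_ascii_upper[] =
	"AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDS"	/* 0xC0..0xDF */
	"AAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";	/* 0xE0..0xFF */

/* 64 table entries plus the terminating NUL. */
static_assert(sizeof(latin1_letters_to_ascii_upper) == 64 + 1,
			  "table must cover exactly 0xC0..0xFF");

/*
 * Fold one ISO 8859-1 byte to an upper-case ASCII character.
 * Assumption of this sketch: bytes 0x80..0xBF all become a blank.
 */
static char
fold_latin1_to_ascii_upper(unsigned char c)
{
	if (c >= 0xC0)
		return latin1_letters_to_ascii_upper[c - 0xC0];
	if (c < 0x80)
		return (char) toupper(c);
	return ' ';
}
```

With this sketch, fold_latin1_to_ascii_upper(0xC5) (A with ring in ISO 8859-1) yields 'A' and fold_latin1_to_ascii_upper(0xF8) (o with stroke) yields 'O'. Test inputs can be written with escapes such as "\xC5" so that no non-ASCII byte has to appear in the source file.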