An Attempt at Thai Address Recognition

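Preface

Our company's projects mainly serve users in Thailand and other Southeast Asian countries. Inspired by the automatic shipping-address recognition offered by major couriers and e-commerce platforms in China, we wanted a similar feature for Thai-language shipping addresses in Thailand. The user pastes a piece of text of the form "name + recipient phone number + shipping address + postcode", and the system analyzes it and matches it against the postcodes of the provinces (จังหวัด), counties (อำเภอ) and districts (ตำบล) that already exist in the system and were supplied by the logistics company. (อำเภอ and ตำบล are more commonly rendered in English as "district" and "subdistrict"; this article says "county" and "district" to match the identifiers in the code.) As a Chinese programmer I have no background in Thai, so the descriptions of the Thai language here may be inaccurate. The article is meant only as one reference for implementing automatic address recognition.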

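Understanding Thai addresses

Before turning to Thai addresses, consider addresses in mainland China. SF Express's domestic address recognition already lets a user paste "Chinese name + recipient phone number + shipping address" and accurately identifies the province, city and district, and it also extracts the recipient's Chinese name and phone number. The National Bureau of Statistics of China and the Ministry of Civil Affairs of the People's Republic of China publish the country's administrative divisions and division codes every year, and this official data helps with recognizing domestic addresses.

Compared with the authoritative data published in China, far less is published for Thailand. From Google and from colleagues who work on Thai translation I learned that Thai addresses, like English ones, are written in the reverse order of Chinese addresses: from the smallest unit up to the largest. An address is usually introduced with wording such as ติดต่อเรา / ที่อยู่ (contact us / address). Taking Bangkok's Suvarnabhumi Airport as an example, the address is: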

อาคารผู้โดยสาร ชั้น 6 (แถว F, ประตูทางเข้าที่ 3) 999 หมู่ 10 ถนน บางนา-ตราด ตำบลราชาเทวะ อำเภอบางพลี จังหวัดสมุทรปราการ 10540

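The structural units that appear are ชั้น 6 (6th floor), แถว F (row F) and ประตูทางเข้าที่ 3 (entrance 3). An airport is a special case; for an ordinary residence or organization this part would be the house number. The units that follow are หมู่ 10 (village no. 10), ถนน (road), ตำบล (district), อำเภอ (county) and จังหวัด (province). This is essentially how addresses outside Bangkok are written.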

Bangkok addresses are a special case. Taking the address of a Thai bank in Bangkok as an example:

333 ถ.สีลม แขวงสีลม เขตบางรัก กรุงเทพมหานคร 10500

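Here ถ. is short for ถนน (road). Administrative units are often abbreviated in written Thai addresses, using the first letter followed by a dot. Other abbreviations include จ. (จังหวัด, province), อ. (อำเภอ, county), ต. (ตำบล, district) and ซ. (ซอย, lane). The rest of the address above consists of แขวง (khwaeng, the Bangkok counterpart of ตำบล), เขต (khet, the Bangkok counterpart of อำเภอ) and กรุงเทพมหานคร (Bangkok).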
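The abbreviation rule can be captured in a small lookup table. The following is only a minimal sketch, assuming the five abbreviations listed above are the ones of interest and that the input is valid UTF-8. The function name `expandThaiAddressAbbreviations` is hypothetical and is not part of the project code, which handles abbreviations while scanning tokens instead.

```php
<?php
// Hypothetical helper, not part of the project code: expands the five
// address-unit abbreviations listed above into their full forms.
function expandThaiAddressAbbreviations(string $address): string
{
    $fullForms = [
        'จ' => 'จังหวัด', // province
        'อ' => 'อำเภอ',   // county
        'ต' => 'ตำบล',    // district
        'ถ' => 'ถนน',     // road
        'ซ' => 'ซอย',     // lane
    ];
    // Match "letter + dot" only at the start of the text or after whitespace,
    // and swallow any whitespace between the dot and the name.
    $pattern = '/(?<!\S)(จ|อ|ต|ถ|ซ)\.\s*/u';
    $expanded = preg_replace_callback(
        $pattern,
        function (array $m) use ($fullForms): string {
            return $fullForms[$m[1]];
        },
        $address
    );
    // preg_replace_callback() returns null on error (e.g. invalid UTF-8).
    return $expanded ?? $address;
}

echo expandThaiAddressAbbreviations('333 ถ.สีลม แขวงสีลม เขตบางรัก กรุงเทพมหานคร 10500'), PHP_EOL;
// 333 ถนนสีลม แขวงสีลม เขตบางรัก กรุงเทพมหานคร 10500
```

Requiring whitespace (or the start of the text) before the letter keeps titles such as น.ส untouched, at the cost of missing abbreviations that are glued to the preceding word.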

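Main approach

Before going into detail, here is the overall approach to automatic Thai address recognition.

Tokenize the Thai text and tidy up the tokens
Find the candidates for every address unit among the tokens
Postcode present: compare the address database with the candidates and pick the group with the highest similarity
Postcode unknown: find all address records in the database that match the candidates
Pick out the phone number, compute distances, and choose the nearest segment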

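Thai word segmentation

Word segmentation immediately brings machine learning to mind, but the project cannot supply a large Thai corpus for training and nobody in the company does data labeling, so building our own Thai segmenter was not an option. I then looked at cloud providers. Going through Alibaba Cloud, Tencent Cloud, Huawei Cloud, JD Cloud, Sina Cloud and Baidu Cloud in China, some offer address recognition and Chinese word segmentation, but at the time of writing only Alibaba Cloud offered segmentation for Chinese, English and Thai, and the free quota of its NLP service is enough for small and medium-sized companies.

Since the project already uses other Alibaba Cloud services, integrating the NLP service was straightforward: request signing works like the other APIs, and the whole integration went smoothly. (This really is not meant as an advertisement for Alibaba Cloud.)

Postcodes are still widely used in Thailand, so user input falls into two cases: the address either contains the destination postcode or it does not.

The first case contains the destination postcode. Take the following Thai address as an example (sensitive data has been altered).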

น.ส สมหญิง ศรีเรือง 0628888888 333หมู่1 ต.ตรมไพร อ.ศีขรภูมิ จ.สุรินทร์ 32110

Calling the Alibaba Cloud NLP general word segmentation API returns the following tokens:

{"data":[
{"id":0,"word":"น"},{"id":1,"word":"."},{"id":2,"word":"ส"},
{"id":3,"word":" "},{"id":4,"word":"สมหญิง"},{"id":5,"word":" "},
{"id":6,"word":"ศรีเรือง"},{"id":7,"word":" "},{"id":8,"word":"0628888888"},
{"id":9,"word":" "},{"id":10,"word":"333"},{"id":11,"word":"หมู่"},
{"id":12,"word":"1"},{"id":13,"word":" "},{"id":14,"word":"ต"},
{"id":15,"word":"."},{"id":16,"word":"ตรมไพร"},{"id":17,"word":" "},
{"id":18,"word":"อ"},{"id":19,"word":"."},{"id":20,"word":"ศีขรภูมิ"},
{"id":21,"word":" "},{"id":22,"word":"จ"},{"id":23,"word":"."},
{"id":24,"word":"สุรินทร์"},{"id":25,"word":" "},{"id":26,"word":"32110"}]}

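The second case does not contain the destination postcode. Take the following Thai address as an example (sensitive data has been altered).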

88/2 หมู่8 เขาชะงุ้ม โพธาราม ราชบุรี ปั้นกล่ 098-8888888

Calling the Alibaba Cloud NLP general word segmentation API returns the following tokens:

{"data":[
{"id":0,"word":"88"},{"id":1,"word":"/"},{"id":2,"word":"2"},
{"id":3,"word":" "},{"id":4,"word":"หมู่"},{"id":5,"word":"8"},
{"id":6,"word":" "},{"id":7,"word":"เขา"},{"id":8,"word":"ชะงุ้ม"},
{"id":9,"word":" "},{"id":10,"word":"โพธาราม"},{"id":11,"word":" "},
{"id":12,"word":"ราชบุรี"},{"id":13,"word":" "},{"id":14,"word":"ปั้น"},
{"id":15,"word":"กล่"},{"id":16,"word":" "},{"id":17,"word":"098-8888888"}]}

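The results show that the tokenizer may break a run of digits such as a phone number apart, and that phone numbers in Thai input may contain a hyphen "-". Consecutive digits therefore have to be joined back together, ignoring hyphens and whitespace between digits, so that 098-8888888 becomes 0988888888.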
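The digit clean-up described above could look roughly like this. It is a sketch under stated assumptions, not the project's implementation: the name `mergeDigitTokens` is hypothetical, and the sketch deliberately joins across hyphens only. Joining across spaces as well would fuse the phone number 0628888888 with the house number 333 in the first sample, so that decision is left to the caller.

```php
<?php
// Hypothetical helper: glues split digit tokens back together and drops the
// hyphens between them. Whitespace tokens end a number (see the note above).
function mergeDigitTokens(array $words): array
{
    $merged = [];
    $buffer = ''; // digits and hyphens collected for the current number
    $flush = function () use (&$merged, &$buffer) {
        if ($buffer === '') {
            return;
        }
        // A hyphen at the very end was not between digits, so keep it as a token.
        $trailing = strlen($buffer) - strlen(rtrim($buffer, '-'));
        $merged[] = str_replace('-', '', $buffer);
        if ($trailing > 0) {
            $merged[] = str_repeat('-', $trailing);
        }
        $buffer = '';
    };
    foreach ($words as $word) {
        $startsNumber = preg_match('/^\d[\d-]*$/', $word) === 1;
        $continuesNumber = $buffer !== '' && preg_match('/^[\d-]+$/', $word) === 1;
        if ($startsNumber || $continuesNumber) {
            $buffer .= $word;
            continue;
        }
        $flush();
        $merged[] = $word;
    }
    $flush();
    return $merged;
}

print_r(mergeDigitTokens(['098', '-', '888', '8888', ' ', '333']));
// yields ['0988888888', ' ', '333']
```

Because merging changes the token positions, it would have to run before any positions are recorded for later steps.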

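Finding candidates

Given only the address string, we cannot tell which of the two cases it belongs to. To cope with abbreviated address units as well, we loop over the tokens and collect candidates for the province, county, district, postcode and phone number. That is, we store the positions (array indexes) of the tokens that could be the value of a province, county, district, postcode or phone number.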

foreach ($words as $i => $word) {
    // Skip whitespace-only tokens and the dot that follows an abbreviation
    if (trim($word) === '' || $word == '.') {
        continue;
    }
    if ($word == self::PROVINCE_SHORT) { // PROVINCE_SHORT is the province abbreviation (จ)
        $usedPos = array_merge($usedPos, $this->findRelatedPart($i, $count, $words, $provinces));
        continue;
    }
    if ($word == self::COUNTY_SHORT) { // COUNTY_SHORT is the county abbreviation (อ)
        $usedPos = array_merge($usedPos, $this->findRelatedPart($i, $count, $words, $counties));
        continue;
    }
    if ($word == self::DISTRICT_SHORT) { // DISTRICT_SHORT is the district abbreviation (ต)
        $usedPos = array_merge($usedPos, $this->findRelatedPart($i, $count, $words, $districts));
        continue;
    }
    // The full unit name is a prefix (e.g. จังหวัดสมุทรปราการ), so strip it from the
    // front. mb_strlen() counts characters; strlen() would count bytes (3 per Thai character).
    if (mb_strpos($word, self::PROVINCE) === 0) {
        $provinces[$i] = mb_substr($word, mb_strlen(self::PROVINCE));
        array_push($usedPos, $i);
        continue;
    }
    if (mb_strpos($word, self::COUNTY) === 0) {
        $counties[$i] = mb_substr($word, mb_strlen(self::COUNTY));
        array_push($usedPos, $i);
        continue;
    }
    if (mb_strpos($word, self::DISTRICT) === 0) {
        $districts[$i] = mb_substr($word, mb_strlen(self::DISTRICT));
        array_push($usedPos, $i);
        continue;
    }
    if (preg_match("/^([1-9]\d{4})$/", trim($word), $matches)) { // look for a postcode
        $postcodes[$i] = $matches[0];
        array_push($usedPos, $i);
        continue;
    }
    // Look for a phone number; only the first match is kept
    if (empty($phone) && preg_match("/^(0[2-3]\d-?\d{6}|0[14-9]\d-?\d{7})$/", trim($word), $matches)) {
        $phone[$i] = $matches[0];
        array_push($usedPos, $i);
        continue;
    }
}

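When matching an abbreviated province, county or district, what we need is the name after the dot "." (for example เมือง), not the name together with the abbreviation (for example อ. เมือง). The dot may be followed by a space or a special whitespace character in the tokens, so the findRelatedPart function used above skips spaces and special whitespace, takes the first non-empty token after the dot, and puts it into the province, county or district candidate array. It returns the positions of all tokens it consumed, including the abbreviation, the name and the dot. findRelatedPart looks like this: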

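function findRelatedPart($index, $wordsCount, array $words, array &$dataset)
{
    $used = [];
    // Scan the tokens that follow the abbreviation at $index
    for ($j = $index + 1; $j < $wordsCount; $j++) {
        // 8203 is U+200B (zero-width space); it and the dot are consumed but carry no name
        if (mb_ord($words[$j]) == 8203 || $words[$j] == '.') {
            array_push($used, $j);
            continue;
        }
        // The first non-blank token is the name; ordinary spaces are skipped without being recorded
        if (!empty(trim($words[$j]))) {
            $dataset[$j] = $words[$j];
            array_push($used, $j);
            break;
        }
    }
    array_push($used, $index); // the abbreviation itself
    return $used;
}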

After this filtering, the two sample addresses above give the following results:

// Case 1: destination postcode present
$provinces = [
  24 => "สุรินทร์"
];
$counties = [
  20 => "ศีขรภูมิ"
];
$districts = [
  16 => "ตรมไพร"
];
$postcodes = [
  26 => "32110"
];
$phone = [
  8 => "0628888888"
];
$usedPos = [8,15,16,14,19,20,18,23,24,22,26];

// Case 2: destination postcode absent
$provinces = [];
$counties = [];
$districts = [];
$postcodes = [];
$phone = [
  17 => "098-8888888"
];
$usedPos = [17];

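Postcode present: comparing the address database with the candidates

Using the postcode candidates collected above, we query the address database for the matching postcode records.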

foreach ($postcodes as $postcode) {
    $postcodeRows = $addressService->findByPostCode($postcode);
    if ($postcodeRows->isNotEmpty()) {
        $options = $options->concat($postcodeRows);
    }
}

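A single Thai postcode may cover several districts, so this yields several possible districts. For each district we look up the name of its parent county and of the province two levels up, and compute their similarity against the original tokens.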

function computeSimilar($text, array $possibleTexts, array $usedPos = [])
{
    $max = -1;
    $index = -1;
    foreach ($possibleTexts as $i => $t) {
        if (in_array($i, $usedPos)) {
            continue;
        }
        $percent = 0.00;
        similar_text($text, $t, $percent);
        if ($percent > $max) {
            $max = $percent;
            $index = $i;
        }
    }
 
    return [$max, $index];
}
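One caveat about similar_text() as used above: it compares bytes, not characters. In UTF-8 every Thai character takes three bytes, and the first byte is the same for the whole Thai block, so two names with no character in common can still score above zero. A possible workaround, shown here only as a sketch, is to remap each distinct character to a single byte before comparing. The name `mbSimilarPercent` is hypothetical, and the sketch assumes valid UTF-8 input with at most 255 distinct characters per pair.

```php
<?php
// Hypothetical helper: character-level similar_text() for UTF-8 strings.
function mbSimilarPercent(string $a, string $b): float
{
    // Split both strings into code points (assumes valid UTF-8).
    $charsA = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $charsB = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    // Give every distinct character its own single-byte stand-in.
    $alphabet = array_values(array_unique(array_merge($charsA, $charsB)));
    if (count($alphabet) > 255) {
        throw new InvalidArgumentException('Too many distinct characters to remap.');
    }
    $codes = [];
    foreach ($alphabet as $i => $char) {
        $codes[$char] = chr($i + 1);
    }
    $encode = function (array $chars) use ($codes): string {
        $out = '';
        foreach ($chars as $char) {
            $out .= $codes[$char];
        }
        return $out;
    };
    $percent = 0.0;
    similar_text($encode($charsA), $encode($charsB), $percent);
    return $percent;
}

// Byte-level and character-level scores for two names that share no character.
$bytePercent = 0.0;
similar_text('กขค', 'งจฉ', $bytePercent);
var_dump($bytePercent > 0.0);                 // bool(true)
var_dump(mbSimilarPercent('กขค', 'งจฉ'));     // float(0)
```

Whether this changes the ranking in practice depends on the data; thresholds tuned against byte-level scores would need to be re-tuned if the comparison is switched.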

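The district, county and province similarities are averaged. If the average reaches the configured threshold, the district, county and province are added as a group to the array of possible final addresses.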

foreach ($options as $index => $areaOption) {
    $_usedPos = [];
    list ($provincePercent, $provIndex) = $this->computeSimilar($areaOption->province->name, $words);
    array_push($_usedPos, $provIndex);
    list ($countiesPercent, $countyIndex) = $this->computeSimilar($areaOption->county->name, $words, $_usedPos);
    array_push($_usedPos, $countyIndex);
    list ($districtsPercent, $districtIndex) = $this->computeSimilar($areaOption->district->name, $words, $_usedPos);
    array_push($_usedPos, $districtIndex);
 
    if ($provincePercent < 0) $provincePercent = 0;
    if ($countiesPercent < 0) $countiesPercent = 0;
    if ($districtsPercent < 0) $districtsPercent = 0;
    $percent = round(($provincePercent + $countiesPercent + $districtsPercent) / 3, 2);
    if ($percent >= $this->similarity) { // $this->similarity is the configured threshold
        $percents[$index] = $percent;
        $indexes[$index] = [$provIndex, $countyIndex, $districtIndex];
    }
}

The candidate addresses are then sorted by similarity in descending order, and the first record is the parsing result.

arsort($percents);
$_index = array_key_first($percents);
$_indexes = $indexes[$_index];        
$address = $options->get($_index);
 
// Keep track of the token positions that are now in use
array_push($usedPos, $_indexes[0], $_indexes[1], $_indexes[2]);
$provinces[$_indexes[0]] = $words[$_indexes[0]];
$counties[$_indexes[1]] = $words[$_indexes[1]];
$districts[$_indexes[2]] = $words[$_indexes[2]];
 
// Address parsing result
$parseResult['level'] = $address;
$parseResult['indexes'] = $_indexes;
$parseResult['similarity'] = $percents[$_index] ?? '';

The first sample address, which contains the destination postcode, is parsed as follows:

"level" => App\Fragments\AddressLevel {#1631}
    +province: App\Models\Address {#1713}
    +county: App\Models\Address {#1712}
    +district: App\Models\Address {#1711}
    +postcode: App\Models\Address {#1668}
}
"indexes" => array:3 [
  0 => 24
  1 => 20
  2 => 16
]
"similarity" => 100.0

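Postcode unknown: finding all address records that match the candidates

The previous section covered the case where the postcode is known, which is the easier one to match. When the postcode is unknown, the province, county and district candidates found earlier are matched against all provinces, counties and districts in the address database. There are far fewer provinces than counties, and generally fewer counties than districts, so we match the province by string similarity first and then compute similarity for counties and districts, working down one level at a time.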

// Match the province candidates against the provinces in the address database
$allProvinces = $addressService->findByTypeAndParentIds(Address::PROVINCE);
$possibleProvince = $this->calculateSimilarityByCompare($words, $allProvinces, $provinces);
 
// Match the county candidates against the counties in the address database
$allCounties = $addressService->findByTypeAndParentIds(
    Address::COUNTY,
    $possibleProvince->pluck('id')->toArray()
);
$possibleCounties = $this->calculateSimilarityByCompare($words, $allCounties, $counties);
 
// Match the district candidates against the districts in the address database
$allDistrict = $addressService->findByTypeAndParentIds(
    Address::DISTRICT,
    $possibleCounties->pluck('id')->toArray()
);
$possibleDistricts = $this->calculateSimilarityByCompare($words, $allDistrict, $districts);
 
// Look up the postcodes
$possiblePostcode = $addressService->findByTypeAndParentIds(
    Address::POSTCODE,
    $possibleDistricts->pluck('id')->toArray()
)->keyBy('parent_id');

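Note that when a candidate was found through an abbreviated address unit, the abbreviation has to be removed before the two strings are compared. As in the known-postcode case, when the similarity reaches the configured threshold, the district, county and province are added as a group to the array of possible final addresses.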

function calculateSimilarityByCompare($words, $dataset, $filteredOptions)
{
    $possibleOptions = collect();
    // Fall back to every token when no candidate was found for this level
    if (empty($filteredOptions)) {
        $filteredOptions = $words;
    }
    foreach ($dataset as $data) {
        // Compare the candidates (not all tokens) with each database record
        foreach ($filteredOptions as $k => $v) {
            $percent = 0.00;
            $percentShort = 0.00;
            similar_text($v, $data['name'], $percent);
            // Also compare with a prefix of the name that has as many characters as the token
            similar_text($v, mb_substr($data['name'], 0, mb_strlen($v)), $percentShort);
            if ($percentShort > $percent) {
                $percent = $percentShort;
            }

            // $this->similarityWithoutPostcode is the threshold used when the postcode is unknown
            if ($percent >= $this->similarityWithoutPostcode) {
                $possibleOptions->push([
                    'id' => $data['id'],
                    'name' => $data['name'],
                    'parent_id' => $data['parent_id'],
                    'similarity' => $percent,
                    'index' => $k
                ]);
            }
        }
    }
    return $possibleOptions;
}

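The matched province, county, district and postcode records ($possibleProvince, $possibleCounties, $possibleDistricts, $possiblePostcode) do not necessarily belong to one another, so the data has to be reorganized into consistent combinations. The resulting candidate addresses are sorted by similarity in descending order, and the first record is the parsing result. For reference: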

$possibleLocations = collect();
foreach($possibleProvince as $pp) {
    $findedCounties = $possibleCounties->where('parent_id', $pp['id']);
    foreach($findedCounties as $fc) {
        $findedDistricts = $possibleDistricts->where('parent_id', $fc['id']);
        foreach($findedDistricts as $fd) {
            $postcode = $possiblePostcode->get($fd['id']);
            $possibleLocations->push([
                'province' => ...,
                'county' => ...,
                'district' => ...,
                'postcode' => ...,
                'similarity' => round(($pp['similarity'] + $fc['similarity'] + $fd['similarity']) / 3, 2)
            ]);
        }
    }
}
 
$match = $possibleLocations->sortByDesc('similarity')->first();

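The second sample address, whose destination postcode is unknown, is parsed as follows. Note that the county and the district both resolve to token 10 (โพธาราม), while เขาชะงุ้ม, which the tokenizer split into two tokens, is not matched as a district: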

"level" => App\Fragments\AddressLevel {#19073}
    +province: App\Models\Address {#19026}
    +county: App\Models\Address {#19033}
    +district: App\Models\Address {#19034}
    +postcode: App\Models\Address {#18904}
}
"indexes" => array:3 [
  0 => 12
  1 => 10
  2 => 10
]
"similarity" => 100.0

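Picking out the phone number and computing distances

usedPos, as built while finding candidates, holds the positions of all candidates, so once the address has been found by either of the two methods above we have to recompute which tokens the matched address actually uses. To extract the name, we also need the runs of consecutive positions among the unused positions. For example 0, 5, 6, 7, 10, 14, 15, 19 becomes [[0, 0], [5, 7], [10, 10], [14, 15], [19, 19]].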

function collectConsecutive(array $positions)
{
    $consecutivePos = [];
    $tmp = $positions;
    sort($tmp, SORT_NUMERIC);
    for ($i = 0; $i < count($tmp); ) {
        $k = $pos = $tmp[$i];
        for ($j = $i; $j < count($tmp); $j++) {
            if (in_array($k + 1, $tmp)) {
                $k++;
            } else {
                array_push($consecutivePos, [$pos, $k]);
                $i = $i + ($k - $pos) + 1;
                break;
            }
        }
    }
    return $consecutivePos;
}
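The article does not show how the list of unused positions (the `$unusedPos` variable passed to collectConsecutive in the next step) is built. One plausible way, given here as an assumption rather than the project's actual code, is to subtract the used positions from all token positions. The helper name `collectUnusedPositions` is hypothetical.

```php
<?php
// Hypothetical helper: positions in $words that no matched part has claimed.
// Whitespace tokens are kept on purpose so that a name such as
// "น.ส สมหญิง ศรีเรือง" still forms one consecutive run.
function collectUnusedPositions(array $words, array $usedPos): array
{
    return array_values(array_diff(array_keys($words), $usedPos));
}

// Second sample: 18 tokens, with positions 10, 12 and 17 in use.
$words = array_fill(0, 18, 'x');
print_r(collectUnusedPositions($words, [10, 12, 17]));
// yields 0..9, 11, 13..16
```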

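From the phone number candidates found earlier we pick the phone number and compute the distance between it and each run of unused consecutive positions.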

$consecutiveUnusedPos = $this->collectConsecutive($unusedPos);
$phonePos = array_key_first($phone);
$parseResult['phone'] = str_replace("-", "", $phone[$phonePos]);
$unusedPosDistanceInPhone = $this->computeDistances($consecutiveUnusedPos, $phonePos, true);

The distances are computed as follows:

function computeDistances(array $consecutive, $destPos, bool $abs = false, $onlyDirection = '')
{
    $distances = [];
    foreach ($consecutive as $i => $cons) {
        $left = $cons[0];
        $right = $cons[1];
        if ($onlyDirection == 'left' && $destPos < $left) {
            continue;
        }
        if ($onlyDirection == 'right' && $destPos > $right) {
            continue;
        }
        $distances[$i] = $destPos > $right ? $right - $destPos : $left - $destPos;
        if ($abs) {
            $distances[$i] = abs($distances[$i]);
        }
    }
    return $distances;
}

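Finally the distances between the phone number and the unused runs are sorted in ascending order and the nearest run is taken. The tokens of that run, taken from $words, are the recipient's name, and the run is added to the used positions usedPos.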
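The selection step just described might be sketched as follows. This is an illustration under assumptions, not the project's code: the name `pickNearestSegmentText` is hypothetical, `$segments` is assumed to have the [left, right] shape returned by collectConsecutive, and `$distances` the shape returned by computeDistances with `$abs = true`.

```php
<?php
// Hypothetical helper: returns the text of the run closest to a reference
// position, given the runs and their precomputed distances.
function pickNearestSegmentText(array $words, array $segments, array $distances): string
{
    if (empty($distances)) {
        return '';
    }
    asort($distances);                      // ascending, keys preserved
    $nearest = array_key_first($distances); // requires PHP >= 7.3
    [$left, $right] = $segments[$nearest];
    $text = implode('', array_slice($words, $left, $right - $left + 1));
    return trim($text);                     // drop the surrounding space tokens
}

// Second sample: unused runs and their distances to the phone number at position 17.
$words = ['88', '/', '2', ' ', 'หมู่', '8', ' ', 'เขา', 'ชะงุ้ม', ' ',
          'โพธาราม', ' ', 'ราชบุรี', ' ', 'ปั้น', 'กล่', ' ', '098-8888888'];
$segments = [[0, 9], [11, 11], [13, 16]];
$distances = [0 => 8, 1 => 6, 2 => 1];
echo pickNearestSegmentText($words, $segments, $distances), PHP_EOL; // ปั้นกล่
```

Ties are possible: in the first sample the runs on both sides of the phone number are at distance 1. Sorting is stable as of PHP 8.0, so the earlier run wins there; on older versions the outcome of a tie is not guaranteed, and an explicit tie-break may be worth adding.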

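In the same way, we compute the remaining runs of consecutive positions and the distance between the matched district name and each of them, and take the nearest run. The tokens of that run, taken from $words, are the street address the user typed in.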

Full result for the first sample address (destination postcode present):

"level" => App\Fragments\AddressLevel {#1631}
    +province: App\Models\Address {#1713}
    +county: App\Models\Address {#1712}
    +district: App\Models\Address {#1711}
    +postcode: App\Models\Address {#1668}
}
"indexes" => array:3 [
  0 => 24
  1 => 20
  2 => 16
]
"similarity" => 100.0
"phone" => "0628888888"
"full_name" => "น.ส สมหญิง ศรีเรือง"
"address" => "333หมู่1"
"origin_text" => "น.ส สมหญิง ศรีเรือง 0628888888 333หมู่1 ต.ตรมไพร อ.ศีขรภูมิ จ.สุรินทร์ 32110"
"level_parse" => array:4 [
  "province" => "สุรินทร์"
  "county" => "ศีขรภูมิ"
  "district" => "ตรมไพร"
  "postcode" => "32110"
]

Full result for the second sample address (destination postcode unknown):

"level" => App\Fragments\AddressLevel {#19073}
    +province: App\Models\Address {#19026}
    +county: App\Models\Address {#19033}
    +district: App\Models\Address {#19034}
    +postcode: App\Models\Address {#18904}
}
"indexes" => array:3 [
  0 => 12
  1 => 10
  2 => 10
]
"similarity" => 100.0
"phone" => "0988888888"
"full_name" => "ปั้นกล่"
"address" => "88/2 หมู่8 เขาชะงุ้ม"
"origin_text" => "88/2 หมู่8 เขาชะงุ้ม โพธาราม ราชบุรี ปั้นกล่ 098-8888888"
"level_parse" => array:4 [
  "province" => "ราชบุรี"
  "county" => "โพธาราม"
  "district" => "โพธาราม"
  "postcode" => "70120"
]

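Summary

Testing with a number of Thai addresses showed that addresses containing a postcode are mostly recognized successfully and quickly. Addresses without a postcode fare much worse: matching requires scanning records in the database, so recognition is slow, and the success rate varies widely with the input and is low overall. With no postcode, lowering the similarity threshold makes the database scan cover more data and run even slower, while raising it lowers the success rate, so the threshold has to be chosen carefully.

Alibaba Cloud's multilingual word segmentation helped with the Thai text, but its Thai segmentation turned out to be less than complete. Province, county and district names in Thai need not be separated by spaces, so they may be run together in the input, and in that case the segmenter does not separate them well. If the names cannot be separated, none of the later matching can be very accurate.

This attempt was a challenge for me: with no Thai background and no experience in word segmentation, I put together this imperfect method from some reference material and prior experience. I hope the article is of some help to readers, and I welcome discussion and better solutions.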