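Erlang Thursday - ETS Data Matching

Today's Erlang Thursday moves on from the introduction to ETS and starts using it to store some data and then retrieve that data.

First we need some data in ETS, so we return to the Markov chain problem.

A Markov chain is a state machine that transitions between states based on probability rather than on specific input. A common example that most people know is predictive typing on a smartphone: the next word or letter is predicted and offered to the user, and the predictions are chosen according to how likely each one has historically been to follow the current word.

To begin, we will create a module with a function that takes a string of text and returns a list of tuples, where each tuple holds a word and the word that follows it.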

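-module(markov_words).

-export([create_word_pairs/1]).

%% Split Text on spaces, tabs and newlines, then pair every word with
%% the word that follows it. Pairs come back in reverse order of appearance.
-spec create_word_pairs(string()) -> list({string(), string()}).
create_word_pairs(Text) ->
    Words = string:tokens(Text, " \t\n"),
    create_word_pairs([], Words).

%% Stop when fewer than two words remain (this also covers empty input).
create_word_pairs(WordPairs, []) ->
    WordPairs;
create_word_pairs(WordPairs, [_Word|[]]) ->
    WordPairs;
create_word_pairs(WordPairs, [Word|Words]) ->
    [Following|_] = Words,
    UpdatedWordPairs = [{Word, Following} | WordPairs],
    create_word_pairs(UpdatedWordPairs, Words).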

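The code above takes a string of text and splits it into a list of words, using spaces, tabs, and newlines as word boundaries. From that list of words it builds a list of tuples, each made up of a word and the word that follows it; these tuples are what we will insert into our ETS table.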
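As a side sketch, the same pairing can be expressed by zipping the word list against its own tail. The module name word_pairs_sketch and the function pairs/1 are hypothetical and not part of the original post, and the sketch assumes an OTP release that provides string:lexemes/2 (OTP 20 or later). Unlike the recursive version, it returns the pairs in order of appearance.

```erlang
-module(word_pairs_sketch).
-export([pairs/1]).

%% Hypothetical helper: pair each word with its successor,
%% in the order the words appear in Text.
-spec pairs(string()) -> [{string(), string()}].
pairs(Text) ->
    case string:lexemes(Text, " \t\n") of
        [] ->
            %% No words at all, so there is nothing to pair.
            [];
        [_First | Rest] = Words ->
            %% Zip "all but the last word" with "all but the first word".
            lists:zip(lists:droplast(Words), Rest)
    end.
```

Calling word_pairs_sketch:pairs/1 on a single word or an empty string yields an empty list, since there is no following word to pair with.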

Time to open the Erlang shell and start experimenting.

First we compile our module, and then we create a variable to hold the text that will feed our Markov chain.

c(markov_words).
% {ok,markov_words}
ToTC = "It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--in short,
the period was so far like the present period,
that some of its noisiest authorities insisted on its
being received, for good or for evil, in the superlative
degree of comparison only.
There were a king with a large jaw and a queen with a
plain face, on the throne of England; there were a king
with a large jaw and a queen with a fair face,
on the throne of France. In both countries it was
clearer than crystal to the lords of the State preserves
of loaves and fishes, that things in general were
settled for ever.".

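We will create a new process and give our ETS table away to it, in case our Erlang shell crashes. An ETS table is deleted when the process that owns it exits, so handing ownership to a separate process keeps the table alive if the shell process dies.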

Fun = fun() -> receive after infinity -> ok end end.
% #Fun<erl_eval.20.54118792>
SomeProcess = spawn(Fun).
% <0.60.0>

Next we create the ETS table that will store the data for our Markov chain generator.

WordPairs = ets:new(word_pairs, [public, duplicate_bag]).
% 20498
ets:give_away(WordPairs, SomeProcess, []).
% true

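We make the table public because we want the shell process, which is no longer the table's owner, to be able to add data to it. We also set the table type to duplicate_bag.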
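The ownership hand-off can also be wrapped in a function. The following is a minimal sketch under assumptions: the module name table_holder_sketch and the function start/0 are hypothetical, and the holder process does nothing except wait, as in the shell session.

```erlang
-module(table_holder_sketch).
-export([start/0]).

%% Hypothetical helper: create a public table and hand ownership
%% to a process that only waits forever.
-spec start() -> {ets:tid(), pid()}.
start() ->
    Holder = spawn(fun() -> receive after infinity -> ok end end),
    Table = ets:new(word_pairs, [public, duplicate_bag]),
    %% give_away/3 must be called by the current owner of the table.
    true = ets:give_away(Table, Holder, []),
    {Table, Holder}.
```

After the call, ets:info(Table, owner) reports the holder's pid, and the caller can still insert into the table because it is public.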

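The duplicate_bag type is for demonstration purposes. We want to allow multiple items with the same key, since we will likely see any word multiple times, and some word pairs are quite common, so we want to capture (and keep) those duplicates.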
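To make the difference between table types concrete, here is a small sketch that inserts the same three objects into a table of a given type and counts what is stored under the key. The module name table_types_sketch and the function count_for/1 are hypothetical names chosen for illustration.

```erlang
-module(table_types_sketch).
-export([count_for/1]).

%% Hypothetical helper: insert two identical pairs plus one different
%% pair under the same key, then count the objects stored for that key.
-spec count_for(set | bag | duplicate_bag) -> non_neg_integer().
count_for(Type) ->
    Table = ets:new(sketch, [Type]),
    ets:insert(Table, {"of", "times,"}),
    ets:insert(Table, {"of", "times,"}),
    ets:insert(Table, {"of", "the"}),
    Count = length(ets:lookup(Table, "of")),
    ets:delete(Table),
    Count.
```

A set keeps one object per key, a bag keeps distinct objects per key, and a duplicate_bag keeps every inserted object, which is why it suits counting repeated word pairs.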

To make it easy to populate the table from the shell, we use a list comprehension that calls ets:insert/2 on each word-pair tuple created from our text.

[[ ets:insert(WordPairs, WordPair) || WordPair <- markov_words:create_word_pairs(ToTC)]].
% [[true,true,true,true,true,true,true,true,true,true,true,
% true,true,true,true,true,true,true,true,true,true,true,true,
% true,true,true,true,true|...]]
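As an aside, ets:insert/2 also accepts a list of objects, so the pairs could be loaded with a single call rather than one call per pair. A minimal sketch, where the module name bulk_insert_sketch and the function load/1 are hypothetical:

```erlang
-module(bulk_insert_sketch).
-export([load/1]).

%% Hypothetical helper: create a duplicate_bag table and insert
%% all of the given word pairs with one ets:insert/2 call.
-spec load([{string(), string()}]) -> ets:tid().
load(WordPairs) ->
    Table = ets:new(word_pairs, [public, duplicate_bag]),
    true = ets:insert(Table, WordPairs),
    Table.
```

Because the table is a duplicate_bag, repeated pairs in the list are all kept.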

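Now that we have some data in our ETS table, it is time to see how we can access it. To do that we start with ets:match/2, which takes the table to query and a pattern.

The pattern is an Erlang term to match against, and it may contain '_', which matches anything without binding, or pattern variables, which are atoms of the form '$N' where N is a non-negative integer. ets:match/2 returns a list with one entry per matching object; each entry is a list of the values bound to the pattern variables, ordered by variable number.

With that knowledge, we can query the word pairs to find all the words that follow of. As an ordinary Erlang pattern match it would look like {"of", Following}, but with ETS we need to use a pattern variable, which gives us {"of", '$1'}.

Let's run that in the shell.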

ets:match(WordPairs, {"of", '$1'}).
% [["loaves"],
% ["the"],
% ["France."],
% ["England;"],
% ["comparison"],
% ["its"],
% ["despair,"],
% ["hope,"],
% ["Darkness,"],
% ["Light,"],
% ["incredulity,"],
% ["belief,"],
% ["foolishness,"],
% ["wisdom,"],
% ["times,"],
% ["times,"]]

The result is a list of lists, where each inner list holds the values bound by the pattern variables, in this case just '$1'.
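The ordering of the inner lists is easier to see with more than one pattern variable. The sketch below is illustrative only: the module name match_vars_sketch and the function demo/0 are hypothetical, and it uses a throwaway table holding a single pair.

```erlang
-module(match_vars_sketch).
-export([demo/0]).

%% Hypothetical demo: run three patterns against a table holding
%% only {"it", "was"} and return the three results.
demo() ->
    Table = ets:new(sketch, [duplicate_bag]),
    ets:insert(Table, {"it", "was"}),
    %% Values come back ordered by variable number, not by position.
    InOrder = ets:match(Table, {'$1', '$2'}),
    Swapped = ets:match(Table, {'$2', '$1'}),
    %% With no pattern variables, each match yields an empty list.
    NoVars = ets:match(Table, {"it", '_'}),
    ets:delete(Table),
    {InOrder, Swapped, NoVars}.
```

In the swapped pattern '$1' binds to the second element, so its value is listed first in the result.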

For fun and exploration, let's check which words follow it in our opening text from A Tale of Two Cities.

ets:match(WordPairs, {"it", '$1'}).
% [["was"],
% ["was"],
% ["was"],
% ["was"],
% ["was"],
% ["was"],
% ["was"],
% ["was"],
% ["was"],
% ["was"]]

Just a bunch of was, which is exactly how things stand in the first two paragraphs of the book.

Then we double-check which words follow Scrooge.

ets:match(WordPairs, {"Scrooge", '$1'}).
% []

If we want the whole tuple back, we use ets:match_object/2, which returns the complete objects that satisfy the match.

ets:match_object(WordPairs, {"of", '$1'}).
% [{"of","loaves"},
% {"of","the"},
% {"of","France."},
% {"of","England;"},
% {"of","comparison"},
% {"of","its"},
% {"of","despair,"},
% {"of","hope,"},
% {"of","Darkness,"},
% {"of","Light,"},
% {"of","incredulity,"},
% {"of","belief,"},
% {"of","foolishness,"},
% {"of","wisdom,"},
% {"of","times,"},
% {"of","times,"}]

Or, in this case, we could use ets:lookup/2, which returns all of the items whose key matches.

ets:lookup(WordPairs, "of").
% [{"of","loaves"},
% {"of","the"},
% {"of","France."},
% {"of","England;"},
% {"of","comparison"},
% {"of","its"},
% {"of","despair,"},
% {"of","hope,"},
% {"of","Darkness,"},
% {"of","Light,"},
% {"of","incredulity,"},
% {"of","belief,"},
% {"of","foolishness,"},
% {"of","wisdom,"},
% {"of","times,"},
% {"of","times,"}]

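To take a brief detour from the Markov chain example: why would we choose ets:lookup/2 over ets:match_object/2, or the other way around? To answer that with an example, let's add another kind of data to our Markov chain table, a three-element tuple.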


ets:insert(WordPairs, {"of", "times,", "it"}).
% true

If we call ets:lookup/2, we get back every item stored under the given key.

ets:lookup(WordPairs, "of").
% [{"of","loaves"},
% {"of","the"},
% {"of","France."},
% {"of","England;"},
% {"of","comparison"},
% {"of","its"},
% {"of","despair,"},
% {"of","hope,"},
% {"of","Darkness,"},
% {"of","Light,"},
% {"of","incredulity,"},
% {"of","belief,"},
% {"of","foolishness,"},
% {"of","wisdom,"},
% {"of","times,"},
% {"of","times,"},
% {"of","times,","it"}]

But if we use ets:match_object/2 with a two-element tuple pattern, because we only want the word pairs, the three-element tuple is left out of the results.

ets:match_object(WordPairs, {"of", '_'}).
% [{"of","loaves"},
% {"of","the"},
% {"of","France."},
% {"of","England;"},
% {"of","comparison"},
% {"of","its"},
% {"of","despair,"},
% {"of","hope,"},
% {"of","Darkness,"},
% {"of","Light,"},
% {"of","incredulity,"},
% {"of","belief,"},
% {"of","foolishness,"},
% {"of","wisdom,"},
% {"of","times,"},
% {"of","times,"}]
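The same comparison can be reproduced in isolation with a tiny table. This is a sketch under assumptions: the module name lookup_vs_match_sketch and the function demo/0 are hypothetical, and the table holds only three objects under the key "of".

```erlang
-module(lookup_vs_match_sketch).
-export([demo/0]).

%% Hypothetical demo: compare a key lookup with a shape-based match
%% on a table that mixes two-element and three-element tuples.
-spec demo() -> {non_neg_integer(), non_neg_integer()}.
demo() ->
    Table = ets:new(sketch, [duplicate_bag]),
    ets:insert(Table, [{"of", "times,"},
                       {"of", "the"},
                       {"of", "times,", "it"}]),
    %% lookup/2 only looks at the key, so it returns all three objects.
    ByKey = ets:lookup(Table, "of"),
    %% match_object/2 also requires the tuple to have two elements.
    ByShape = ets:match_object(Table, {"of", '_'}),
    ets:delete(Table),
    {length(ByKey), length(ByShape)}.
```

So ets:lookup/2 fits when everything under a key is wanted, while ets:match_object/2 fits when only objects of a particular shape should come back.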

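Returning to the Markov chain scenario, we can start to see how we might generate some text by following the rules of a Markov chain.

We take the potential following words for a given word and pick one result uniformly at random from that list.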

PotentialChoices = ets:match(WordPairs, {"of", '$1'}).
% rand:uniform(N) returns a uniformly distributed integer from 1 to N
[NextWord] = lists:nth(rand:uniform(length(PotentialChoices)), PotentialChoices).

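We could write a function that repeats these steps until it reaches a termination condition. Examples of termination conditions would be reaching a word that has no following words; reaching a set number of words for our text; or reaching a set total length, such as one that fits in a tweet or other social network post.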
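One possible shape for such a function is sketched below. It is not from the original post: the module name markov_sketch, the function generate/3, and the MaxWords parameter are hypothetical, and it implements only the first two termination conditions (no following word, or a word-count limit).

```erlang
-module(markov_sketch).
-export([generate/3]).

%% Hypothetical generator: walk the chain from StartWord and return
%% at most MaxWords words, in the order they were chosen.
-spec generate(ets:tid(), string(), pos_integer()) -> [string()].
generate(Table, StartWord, MaxWords) when MaxWords >= 1 ->
    lists:reverse(walk(Table, StartWord, MaxWords - 1, [StartWord])).

%% Stop once the word budget is used up.
walk(_Table, _Word, 0, Acc) ->
    Acc;
walk(Table, Word, Remaining, Acc) ->
    case ets:match(Table, {Word, '$1'}) of
        [] ->
            %% The current word has no following words.
            Acc;
        Choices ->
            %% Pick one following word uniformly at random.
            [Next] = lists:nth(rand:uniform(length(Choices)), Choices),
            walk(Table, Next, Remaining - 1, [Next | Acc])
    end.
```

With the table from this post, something like string:join(markov_sketch:generate(WordPairs, "it", 20), " ") would assemble the chosen words into a line of text; the output varies from run to run because each step is random.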

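In this post we have started putting some "real" data into ETS and matching data against given patterns. Next week we will continue with this example, looking at other ways to get data out of our ETS table and into a form that may be easier to consume.

Original article: https://www.proctor-it.com/erlang-thursday-ets-data-matching/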