Erlang Thursday – More ETS Data Matching and Querying

Today's Erlang Thursday continues last week's look at getting data out of ETS.

We already have a module, markov_words, and this week we add a function to it: markov_words:create_word_triples/1.

-module(markov_words).

-export([create_word_pairs/1,
         create_word_triples/1]).

%% Split Text on whitespace and return every pair of adjacent words.
-spec create_word_pairs(string()) -> list({string(), string()}).
create_word_pairs(Text) ->
    Words = string:tokens(Text, " \t\n"),
    create_word_pairs([], Words).

%% Split Text on whitespace and return every run of three adjacent words.
-spec create_word_triples(string()) -> list({string(), string(), string()}).
create_word_triples(Text) ->
    Words = string:tokens(Text, " \t\n"),
    create_word_triples(Words, []).

%% Both helpers prepend to the accumulator, so the result is in
%% reverse text order.
create_word_pairs(WordPairs, [_Word|[]]) ->
    WordPairs;
create_word_pairs(WordPairs, [Word|Words]) ->
    [Following|_] = Words,
    UpdatedWordPairs = [{Word, Following} | WordPairs],
    create_word_pairs(UpdatedWordPairs, Words).

create_word_triples([_Word, _SecondWord | []], WordTriples) ->
    WordTriples;
create_word_triples([FirstWord | Words], WordTriples) ->
    [SecondWord, Following | _] = Words,
    UpdatedWordTriples = [{FirstWord, SecondWord, Following} | WordTriples],
    create_word_triples(Words, UpdatedWordTriples).

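We add the new function because it gives us a more refined Markov chain: when the key is a compound of two consecutive words, the words recorded as following it are a better guide to the next word than the followers of a single word.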

With the module updated, we go back to the Erlang shell, compile the module, and bind our intro text to a variable so we can start testing.

c(markov_words).
% {ok,markov_words}
ToTC = "It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--in short,
the period was so far like the present period,
that some of its noisiest authorities insisted on its
being received, for good or for evil, in the superlative
degree of comparison only.
There were a king with a large jaw and a queen with a
plain face, on the throne of England; there were a king
with a large jaw and a queen with a fair face,
on the throne of France. In both countries it was
clearer than crystal to the lords of the State preserves
of loaves and fishes, that things in general were
settled for ever.".

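This week we create a fresh ETS table, then spawn a new process and give the table away to it, so that the table survives if we type something wrong and crash the current shell process.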

MarkovWords = ets:new(markov_word_tuples, [public, duplicate_bag]).
% 16402
Fun = fun() -> receive after infinity -> ok end end.
% #Fun<erl_eval.20.54118792>
SomeProcess = spawn(Fun).
% <0.58.0>
ets:give_away(MarkovWords, SomeProcess, []).
% true
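
As a quick sanity check on the hand-off above, the following sketch asks ets:info/2 who owns a table before and after ets:give_away/3. The table and the names Tab, Holder, OwnerBefore and OwnerAfter are made up for this illustration and are not part of the original session.

```erlang
%% Throwaway table and holder process; all names here are illustrative.
Tab = ets:new(give_away_demo, [public, duplicate_bag]).
Holder = spawn(fun() -> receive after infinity -> ok end end).

%% Before the hand-off, the process that created the table owns it.
OwnerBefore = ets:info(Tab, owner).

%% Hand the table over; the third argument is gift data for the new owner.
true = ets:give_away(Tab, Holder, []).

%% After the hand-off, the owner is Holder.
OwnerAfter = ets:info(Tab, owner).
```

Because the table is created with the public option, the shell can keep reading and writing it even though it no longer owns it.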

This week, in addition to adding our word-pair tuples to ETS, we also add the new word triples to the same ETS table.

[[ ets:insert(MarkovWords, WordPair) || WordPair <- markov_words:create_word_pairs(ToTC)]].
[[ ets:insert(MarkovWords, WordTriple) || WordTriple <- markov_words:create_word_triples(ToTC)]].

Now that both the word pairs and the word triples are in the same ETS table, we can use ets:match_object/2 with a match pattern that returns only the two-tuples.

ets:match_object(MarkovWords, {"of", '$1'}).
% [{"of","loaves"},
% {"of","the"},
% {"of","France."},
% {"of","England;"},
% {"of","comparison"},
% {"of","its"},
% {"of","despair,"},
% {"of","hope,"},
% {"of","Darkness,"},
% {"of","Light,"},
% {"of","incredulity,"},
% {"of","belief,"},
% {"of","foolishness,"},
% {"of","wisdom,"},
% {"of","times,"},
% {"of","times,"}]

Or we can give it a different match pattern that returns only the three-tuples.

ets:match_object(MarkovWords, {"of", '$1', '$2'}).
% [{"of","loaves","and"},
% {"of","the","State"},
% {"of","France.","In"},
% {"of","England;","there"},
% {"of","comparison","only."},
% {"of","its","noisiest"},
% {"of","despair,","we"},
% {"of","hope,","it"},
% {"of","Darkness,","it"},
% {"of","Light,","it"},
% {"of","incredulity,","it"},
% {"of","belief,","it"},
% {"of","foolishness,","it"},
% {"of","wisdom,","it"},
% {"of","times,","it"},
% {"of","times,","it"}]
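
The two calls above differ only in the number of elements in the pattern, and that is what separates pairs from triples. Here is a minimal sketch of the same idea on a tiny table; the table contents and the names Tab, Pairs and Triples are invented for the illustration. It uses the '_' wildcard in place of '$1' and '$2', which is enough here because ets:match_object/2 returns whole objects and never uses the bindings. The results are sorted only to make the output order predictable.

```erlang
%% Small illustrative table that mixes two-tuples and three-tuples.
Tab = ets:new(arity_demo, [public, duplicate_bag]).
true = ets:insert(Tab, [{"of", "times,"},
                        {"of", "wisdom,"},
                        {"of", "times,", "it"},
                        {"of", "wisdom,", "it"}]).

%% A two-element pattern only matches two-tuples.
Pairs = lists:sort(ets:match_object(Tab, {"of", '_'})).
%% [{"of","times,"},{"of","wisdom,"}]

%% A three-element pattern only matches three-tuples.
Triples = lists:sort(ets:match_object(Tab, {"of", '_', '_'})).
%% [{"of","times,","it"},{"of","wisdom,","it"}]
```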

If we use ets:lookup/2 and pass in the key instead, we get every object stored under that key, whether it is a two-tuple or a three-tuple.

ets:lookup(MarkovWords, "of").
% [{"of","loaves"},
% {"of","the"},
% {"of","France."},
% {"of","England;"},
% {"of","comparison"},
% {"of","its"},
% {"of","despair,"},
% {"of","hope,"},
% {"of","Darkness,"},
% {"of","Light,"},
% {"of","incredulity,"},
% {"of","belief,"},
% {"of","foolishness,"},
% {"of","wisdom,"},
% {"of","times,"},
% {"of","times,"},
% {"of","loaves","and"},
% {"of","the","State"},
% {"of","France.","In"},
% {"of","England;","there"},
% {"of","comparison","only."},
% {"of","its","noisiest"},
% {"of","despair,","we"},
% {"of","hope,","it"},
% {"of","Darkness,","it"},
% {"of","Light,",[...]},
% {"of",[...],...},
% {[...],...},
% {...}|...]
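
If we do want to start from ets:lookup/2, one option is to split the mixed result ourselves. The sketch below does that with list comprehensions and tuple_size/1 on a small made-up table; Tab, All, PairsOnly and TriplesOnly are illustrative names.

```erlang
%% Small illustrative table with mixed tuple sizes under the key "of".
Tab = ets:new(lookup_demo, [public, duplicate_bag]).
true = ets:insert(Tab, [{"of", "times,"},
                        {"of", "times,", "it"},
                        {"was", "the"}]).

%% lookup/2 returns everything stored under the key, whatever its size.
All = ets:lookup(Tab, "of").

%% Separate the result by tuple size after the fact.
PairsOnly = [T || T <- All, tuple_size(T) =:= 2].
%% [{"of","times,"}]
TriplesOnly = [T || T <- All, tuple_size(T) =:= 3].
%% [{"of","times,","it"}]
```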

Unlike ets:lookup/2, ets:match_object/2 can match on any element of the tuple, not just the key.

ets:match_object(MarkovWords, {'$1', "the", '$2'}).
% [{"on","the","throne"},
% {"on","the","throne"},
% {"direct","the","other"},
% {"short,","the","period"},
% {"like","the","present"},
% {"of","the","State"},
% {"to","the","lords"},
% {"in","the","superlative"},
% {"was","the","winter"},
% {"was","the","spring"},
% {"was","the","season"},
% {"was","the","season"},
% {"was","the","epoch"},
% {"was","the","epoch"},
% {"was","the","age"},
% {"was","the","age"},
% {"was","the","worst"},
% {"was","the","best"}]

Like ets:match_object/2, ets:match/2 can match tuples this way too.

ets:match(MarkovWords, {"was", "the", '$1'}).
% [["winter"],
% ["spring"],
% ["season"],
% ["season"],
% ["epoch"],
% ["epoch"],
% ["age"],
% ["age"],
% ["worst"],
% ["best"]]
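
With more than one match variable, each inner list returned by ets:match/2 holds the bindings ordered by variable number, not by position in the pattern. A minimal sketch on a made-up two-row table (Tab and Bindings are illustrative names, and the sort is only there for a stable order):

```erlang
%% Small illustrative table.
Tab = ets:new(match_demo, [public, duplicate_bag]).
true = ets:insert(Tab, [{"was", "the", "best"},
                        {"of", "the", "State"}]).

%% '$1' is bound to the third element and '$2' to the first, so each
%% result list reads [Third, First].
Bindings = lists:sort(ets:match(Tab, {'$2', "the", '$1'})).
%% [["State","of"],["best","was"]]
```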

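Sometimes, though, we want finer-grained control over how the results come back to us, such as a flat list of elements instead of a list of nested lists of strings. Or we may have criteria that we want applied as part of selecting the data.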

Enter ets:select/2.

ets:select/2 takes a table as its first argument and a match specification as its second.

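The match specification is a list of three-tuples. In each one, the first element is a match pattern, the second is a list of guard tuples, and the last is a term describing the result for each match.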
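
To see the three parts in isolation, a match specification can be tried against a single tuple with ets:test_ms/2, without any table at all. This is a small sketch; MatchSpec, Hit and Miss are names chosen for the illustration.

```erlang
%% One three-tuple: match pattern, guard list, result body.
MatchSpec = [{{"was", "the", '$1'},   % match pattern
              [],                     % guards (none here)
              ['$1']}].               % result term for each match

%% A tuple that fits the pattern yields the result term.
Hit = ets:test_ms({"was", "the", "best"}, MatchSpec).
%% {ok,"best"}

%% A tuple that does not fit yields false.
Miss = ets:test_ms({"it", "was", "the"}, MatchSpec).
%% {ok,false}
```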

If we want a call to ets:select/2 to return the same kind of result as ets:match/2, it looks like this:

ets:select(MarkovWords, [{{"was", "the", '$1'}, [], [['$1']]}]).
% [["winter"],
% ["spring"],
% ["season"],
% ["season"],
% ["epoch"],
% ["epoch"],
% ["age"],
% ["age"],
% ["worst"],
% ["best"]]

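The second argument is a match specification list with a single element, which is made up of:
1) a match pattern, {"was", "the", '$1'}, the same one we passed to ets:match/2;
2) an empty list of guard tuples;
3) a result term, [['$1']], which lists the result formats we want; in this case we want each result in its own list.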

If we just want a flat list of words, we can change the result term of the match specification to ['$1'].

ets:select(MarkovWords, [{{"was", "the", '$1'}, [], ['$1']}]).
% ["winter","spring","season","season","epoch","epoch","age",
% "age","worst","best"]

If we want results that look more like what ets:match_object/2 returns, we can use '$_' as the result term, which stands for the whole object.

ets:select(MarkovWords, [{{"was", "the", '$1'}, [], ['$_']}]).
% [{"was","the","winter"},
% {"was","the","spring"},
% {"was","the","season"},
% {"was","the","season"},
% {"was","the","epoch"},
% {"was","the","epoch"},
% {"was","the","age"},
% {"was","the","age"},
% {"was","the","worst"},
% {"was","the","best"}]

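If we only want to match on some elements and extract the others from the tuple, we can use '$$' as the result term. It returns all the match variables in a list, ordered by the number of each match variable in the pattern.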
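
A minimal sketch of '$$' next to '$_', run on a small made-up table (Tab, Vars and Objects are illustrative names; the sort is only for a predictable order):

```erlang
%% Small illustrative table.
Tab = ets:new(select_demo, [public, duplicate_bag]).
true = ets:insert(Tab, [{"was", "the", "best"},
                        {"of", "the", "State"},
                        {"it", "was", "the"}]).

%% '$$' returns the bound match variables as a list: ['$1', '$2'].
Vars = lists:sort(ets:select(Tab, [{{'$1', "the", '$2'}, [], ['$$']}])).
%% [["of","State"],["was","best"]]

%% '$_' returns the whole matching object instead.
Objects = lists:sort(ets:select(Tab, [{{'$1', "the", '$2'}, [], ['$_']}])).
%% [{"of","the","State"},{"was","the","best"}]
```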

With ets:select/2 we can also supply more than one clause in the match specification. That lets us find all the three-tuples whose middle word is either of or the.

ets:select(MarkovWords, [{{'$1', "the", '$2'}, [], ['$_']}, {{'$1', "of", '$2'}, [], ['$_']}]).
% [{"some","of","its"},
% {"on","the","throne"},
% {"on","the","throne"},
% {"direct","the","other"},
% {"preserves","of","loaves"},
% {"throne","of","France."},
% {"throne","of","England;"},
% {"worst","of","times,"},
% {"short,","the","period"},
% {"winter","of","despair,"},
% {"degree","of","comparison"},
% {"epoch","of","incredulity,"},
% {"epoch","of","belief,"},
% {"spring","of","hope,"},
% {"like","the","present"},
% {"of","the","State"},
% {"age","of","foolishness,"},
% {"age","of","wisdom,"},
% {"best","of","times,"},
% {"season","of","Darkness,"},
% {"season","of","Light,"},
% {"to","the","lords"},
% {"in","the","superlative"},
% {"was","the","winter"},
% {"was","the","spring"},
% {"was","the",[...]},
% {"was",[...],...},
% {[...],...},
% {...}|...]
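
The same query can also be written as a single clause whose guard uses 'orelse'. This is offered as an alternative sketch rather than as the approach taken in the session above; the table contents and the names Tab, MatchSpec and Result are invented for the illustration.

```erlang
%% Small illustrative table.
Tab = ets:new(guard_demo, [public, duplicate_bag]).
true = ets:insert(Tab, [{"was", "the", "best"},
                        {"best", "of", "times,"},
                        {"it", "was", "the"}]).

%% One clause: bind the middle word, then test it in the guard.
MatchSpec = [{{'_', '$1', '_'},
              [{'orelse', {'=:=', '$1', "the"}, {'=:=', '$1', "of"}}],
              ['$_']}].

Result = lists:sort(ets:select(Tab, MatchSpec)).
%% [{"best","of","times,"},{"was","the","best"}]
```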

Using guards, we can find the third word of every three-tuple whose first element is was and whose second word sorts lexically before the third.

ets:select(MarkovWords, [{{"was", '$1', '$2'}, [{'<', '$1', '$2'}], ['$2']}]).
% ["than","winter","worst"]
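
One way to read that match specification is as a list comprehension over the table contents: the match pattern becomes the generator pattern, the guard becomes the filter, and the result term becomes the head. The sketch below compares the two on a small made-up table; Tab, Selected and Filtered are illustrative names, and ets:tab2list/1 is used here only to keep the comparison simple.

```erlang
%% Small illustrative table; only one row satisfies the guard.
Tab = ets:new(compare_demo, [public, duplicate_bag]).
true = ets:insert(Tab, [{"was", "the", "worst"},
                        {"was", "the", "age"},
                        {"was", "so", "far"},
                        {"was", "the"}]).

%% The match specification from above.
Selected = ets:select(Tab, [{{"was", '$1', '$2'}, [{'<', '$1', '$2'}], ['$2']}]).
%% ["worst"]

%% The same logic as a list comprehension over a copy of the table.
Filtered = [Third || {"was", Second, Third} <- ets:tab2list(Tab), Second < Third].
%% ["worst"]
```

The comprehension copies every object out of the table before filtering, whereas ets:select/2 only returns the results of the matching objects.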

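In this week's post we saw other ways to use ets:match/2 and ets:match_object/2 and how they get around the limitation of ets:lookup/2, which can only query by key. We also saw how ets:select/2 gives us even more powerful queries.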

Next week we will look at more ways to use ets:select/2, and at how other functions in the ets module can help us build queries that destructure data more easily.

Original article: https://www.proctor-it.com/erlang-thursday-more-ets-data-matching-and-querying/