单词查找树

2022-06-10

#algorithm

单词查找树

这种数据结构由字符串键中的所有字符构造而成，允许使用被查找键中的字符进行查找。跟各种查找树一样，也是由链接的结点组成的数据结构。如果字母表的大小是R，那么每个结点都含有R条链接，虽然一般绘制的时候是如下所示：

但是实际在程序中构造的数据结构并不完全一致，上图只是一个简化版本，因为通常来说这种结构含有大量的空链接，所以只绘制非空的链接。实际如下：

字符和键都隐式地保存在数据结构中，之所以叫隐式，是因为是用数组的索引来记录的。如sea所关联的值是2，那么上图的第一个结点中的第19条链接（s为从a开始的第19个字母）就非空，以此类推ea，最后a指向的结点存储对应的值2。

因为参数R的作用和重要性，所以将基于含有R个字符的字母表的单词查找树称为R向单词查找树。

单词查找树中的查找操作

键的尾字符对应的结点的值为空，则是一次未命中的查找，否则说明命中
查找结束于一条空链接也是一次未命中

单词查找树中的插入操作

在到达键的尾字符之前遇到了空链接，那么需要为到尾字符之间的所有字符创建一个对应的结点
再遇到空字符之前就到达了键的尾字符，那么只需要将尾字符对应的结点的值设置为键的值（无论该结点的值是否为空）

class _Node<T> {
  public val?: T
  // 256=R, _Node本来是个内部类的，这样就能使用到R了
  public next: (_Node<T> | undefined)[] = new Array<_Node<T>>(256)
}

class TrieST<T> {
  private R = 256
  private root?: _Node<T>

  constructor(data: [string, T][]) {
    data.forEach(([k, v]) => this.put(k, v))
  }

  public put(key: string, val: T) {
    this.root = this._put(key, val, 0, this.root)
  }

  private _put(key: string, val: T, d: number, x?: _Node<T>) {
    if (x === undefined) {
      x = new _Node<T>()
    }
    if (d === key.length) {
      x.val = val
      return x
    }
    const c = key.charCodeAt(d)
    x.next[c] = this._put(key, val, d + 1, x.next[c])
    return x
  }

  public get(key: string) {
    const x = this._get(key, 0, this.root)
    return x === undefined ? undefined : x.val
  }

  private _get(key: string, d: number, x?: _Node<T>): _Node<T> | undefined {
    if (x === undefined) return undefined
    if (d === key.length) return x
    const c = key.charCodeAt(d)
    return this._get(key, d + 1, x.next[c])
  }
}

查找所有键

这里用了一个collect方法，用来收集从某个结点出发的路径上的所有字符。

public keys() {
  return this.keyWithPrefix('')
}

public keyWithPrefix(pre: string) {
  const queue: string[] = []
  // 先用_get找到以pre为根节点的结点，如果这样的结点不存在也就不需要继续查找了
  this.collect(pre, queue, this._get(pre, 0, this.root))
  return queue
}

private collect(pre: string, q: string[], x?: _Node<T>) {
  if (x === undefined) return
  if (x.val !== undefined) q.push(pre)
  for (let c = 0; c < this.R; c++) {
    this.collect(pre + String.fromCharCode(c), q, x.next[c])
  }
}

通配符匹配

用一个和上面类似的方法就能实现通配符匹配，只需对collect稍作修改，因为ts不支持这种方法同名不同参的写法，所以新写了个collectWithPat。

public keysThatMatch(pat: string) {
  const q: string[] = []
  this.collectWithPat('', pat, q, this.root)
  return q
}

private collectWithPat(pre: string, pat: string, q: string[], x?: _Node<T>) {
  const d = pre.length
  if (x === undefined) return
  if (d === pat.length && x.val !== undefined) q.push(pre)
  if (d === pat.length) return

  const next = pat.charAt(d)
  for (let c = 0; c < this.R; c++) {
    if (next === '.' || next.charCodeAt(0) === c) {
      this.collectWithPat(pre + String.fromCharCode(c), pat, q, x.next[c])
    }
  }
}

最长前缀

public longestPrefixOf(s: string) {
  const len = this.search(s, 0, 0, this.root)
  return s.substring(0, len)
}

private search(s: string, d: number, length: number, x?: _Node<T>): number {
  if (x === undefined) return length
  if (x.val !== undefined) length = d // 只有找到有值的结点才更新length
  if (d === s.length) return length
  const c = s.charAt(d)
  return this.search(s, d + 1, length, x.next[c.charCodeAt(0)])
}

删除操作

从一棵单词查找树中删去一个键值对的第一步是，找到键所对应的结点并将它的值设为空。如果该结点含有一个非空的链接指向某个子结点，那么就不需要再进行其他操作。如果它所有的链接都为空，那么就这个结点就需要需要被删除掉。如果删掉它后它的父结点的所有链接也都为空，那么继续把该父结点也删除掉，依此类推。

public delete(key: string) {
  this.root = this._delete(key, 0, this.root)
}

private _delete(key: string, d: number, x?: _Node<T>) {
  if (!x) return undefined
  if (d === key.length) {
    x.val = undefined
  } else {
    const c = key.charCodeAt(d)
    x.next[c] = this._delete(key, d + 1, x.next[c])
  }

  // 上面的代码表示已经完成了key的删除工作，下面是开始递归回溯的过程
  // 如果结点有值，那么该结点需要被保留
  if (x.val) return x

  // 如果结点没有值，这个时候检查是否有其他结点，有则说明这个结点也不需要删除，否则返回undefined删除
  for (let c = 0; c < this.R; c++) {
    if (x.next[c]) return x
  }
  return undefined
}

单词查找树的性质

这种链表结构的形状和键插入或者删除顺序无关，对于任意给定的一组键，其单词查找树都是唯一的
查找或者插入一个键时，访问数组的次数最多为键的长度加1

put和get的d参数初始值为0，每次调用都会加1，当长度等于键的长度时递归调用停止。（get应该只用刚好访问键长度次数就可以了，不用加1吧？）

因此理论上来说单词查找树对于命中的查找是最理想的，我们不能奢望查找所需时间比被查找键长度成正比更好。因为无论使用什么算法和数据结构，在检查完要查找的键中所有字符之前都是无法判断是否已找到该键。从实际角度来讲，这个保证也很重要，因为它和符号表中键的数量无关。

空间

如果有N个随机键，那么一个单词查找树中的链接总数在RN到RNw之间，其中w为键的平均长度。

每一个键都有一个对应的结点保存着它关联的值，同时每个结点也含有R个链接，因此链接总数至少有RN条。如果所有键的首字母均不同，那么每个键中的每个字母有一个对应的结点，因为总数为RNw。

w较小时，接近RN
w较大时，接近RNw
因此缩小R能节省大量空间

因此底线是：不要使用上面的TrieST来处理大型字母表的大量长键。但是如果能负担得起这么需要的空间，那么这个性能是无可匹敌的。

测试用例

const data = [
  'she',
  'sells',
  'seashells',
  'by',
  'the',
  'sea',
  'shore',
  'the',
  'shells',
  'she',
  'are',
  'surely',
  'seashells',
]

const main = () => {
  const d = data.map((str, index) => [str, index] as [string, number])
  console.log(`input data:`, d)
  const st = new TrieST<number>(d)

  console.log(`keys: ${st.keys()}`)

  console.log(`get shells: ${st.get('shells')}`)
  console.log(`get shell: ${st.get('shell')}`)

  console.log(`key with prefix 'sea': ${st.keyWithPrefix('sea')}`)
  console.log(`key with prefix 'see': ${st.keyWithPrefix('see')}`)

  console.log(`key that match '.h.': ${st.keysThatMatch('.h.')}`)
  console.log(`key that match 's..': ${st.keysThatMatch('s..')}`)
  console.log(`key that match 's..l': ${st.keysThatMatch('s..l')}`)

  console.log(`longest prefix of 'shellsea': ${st.longestPrefixOf('shellsea')}`)
  console.log(`longest prefix of 'shed': ${st.longestPrefixOf('shed')}`)
  console.log(`longest prefix of 'see': ${st.longestPrefixOf('see')}`)

  console.log(`get seashells: ${st.get('seashells')}`)
  console.log(`delete seashells`)
  st.delete('seashells')
  console.log(`get seashells after delete: ${st.get('seashells')}`)
  console.log(`get sea after delete: ${st.get('sea')}`)
  st.delete('sea')
  console.log(`get sea after delete: ${st.get('sea')}`)
}

/*
input data: [
  [ 'she', 0 ],
  [ 'sells', 1 ],
  [ 'seashells', 2 ],
  [ 'by', 3 ],
  [ 'the', 4 ],
  [ 'sea', 5 ],
  [ 'shore', 6 ],
  [ 'the', 7 ],
  [ 'shells', 8 ],
  [ 'she', 9 ],
  [ 'are', 10 ],
  [ 'surely', 11 ],
  [ 'seashells', 12 ]
]
keys: are,by,sea,seashells,sells,she,shells,shore,surely,the
get shells: 8
get shell: undefined
key with prefix 'sea': sea,seashells
key with prefix 'see': 
key that match '.h.': she,the
key that match 's..': sea,she
key that match 's..l': 
longest prefix of 'shellsea': shells
longest prefix of 'shed': she
longest prefix of 'see': 
get seashells: 12
delete seashells
get seashells after delete: undefined
get sea after delete: 5
get sea after delete: undefined
*/

三向单词查找树

三向单词查找树可以避免R向单词查找树过度的空间浪费，每个结点只含有三个链接：小于、等于、大于当前字符。R向树中是用索引来隐式表示它所对应的字符，三向树中是显示保存的。

空间

N个平均长度为w的字符串构造的三向单词查找树中的链接总数在3N到3Nw之间

查找成本

在一棵由N个随机字符串构造的三向单词查找树中，查找未命中平均需要比较字符~lnN次

实现、测试

class _Node<T> {
  c: number = 0 // 该结点代表的字符
  left?: _Node<T>
  mid?: _Node<T>
  right?: _Node<T>
  val?: T
}

class TST<T> {
  private root?: _Node<T>

  constructor(data: [string, T][]) {
    data.forEach(([k, v]) => this.put(k, v))
  }

  public get(key: string) {
    const x = this._get(key, 0, this.root)
    return x?.val
  }

  private _get(key: string, d: number, x?: _Node<T>): (_Node<T> | undefined) {
    if (!x) return undefined
    const c = key.charCodeAt(d)
    if (c < x.c) return this._get(key, d, x.left)
    else if (c > x.c) return this._get(key, d, x.right)
    else if (d < key.length - 1) return this._get(key, d + 1, x.mid)
    else return x
  }

  public put(key: string, val: T) {
    this.root = this._put(key, val, 0, this.root)
  }

  private _put(key: string, val: T, d: number, x?: _Node<T>): _Node<T> {
    const c = key.charCodeAt(d)
    if (!x) {
      x = new _Node<T>()
      x.c = c
    }
    if (c < x.c) x.left = this._put(key, val, d, x.left)
    else if (c > x.c) x.right = this._put(key, val, d, x.right)
    else if (d < key.length - 1) x.mid = this._put(key, val, d + 1, x.mid)
    else x.val = val
    return x
  }
}

const data = [
  'she',
  'sells',
  'seashells',
  'by',
  'the',
  'sea',
  'shore',
  'the',
  'shells',
  'she',
  'are',
  'surely',
  'seashells',
]

const main = () => {
  const d = data.map((str, index) => [str, index] as [string, number])
  console.log(`input data:`, d)

  const st = new TST<number>(d)

  console.log(`get shells: ${st.get('shells')}`)
  console.log(`get shell: ${st.get('shell')}`)
  console.log(`get seashells: ${st.get('seashells')}`)
}

/*
input data: [
  [ 'she', 0 ],
  [ 'sells', 1 ],
  [ 'seashells', 2 ],
  [ 'by', 3 ],
  [ 'the', 4 ],
  [ 'sea', 5 ],
  [ 'shore', 6 ],
  [ 'the', 7 ],
  [ 'shells', 8 ],
  [ 'she', 9 ],
  [ 'are', 10 ],
  [ 'surely', 11 ],
  [ 'seashells', 12 ]
]
get shells: 8
get shell: undefined
get seashells: 12
*/

应该使用那种实现

如果空间足够，那么R向树速度是最快的，能在常数次字符比较内完成查找。对于大型字母表，R向树的空间可能无法满足，三向树是最佳选择，因为字符比较次数是对数级别的。散列表也很有竞争力，但是不支持像通配符匹配，前缀查找之类的操作。

本节代码：https://github.com/Yuijam/algorithms-note/tree/main/string/search_tree