字符串排序

2022-06-02

#algorithm

热身键索引计数法

有如下数据，学生以及对应的组号，需要将其按组号排序。该算法是接下来几种字符串排序算法的基础。

所谓的键索引计数法，这里的键就是组号。主要分四部进行。

频率统计

遍历数组，通过一个count数组来统计每个键出现的次数。这个一个巧妙的地方是如果键为r，那么count[r+1]++，而不是count[r]++。

频率转为起始索引

数据分类

有了存储了起始索引的count之后，就可以再遍历一遍数组，按照每个键的索引值来将键进行分类。

回写

因为会用到一个临时数组来存放上一步分类后的数据，因此需要将分类好的数据重新填回去，完成排序。

type TStudent = {
  name: string
  group: number
}

const sort = (students: TStudent[], maxGroupNum: number) => {
  const count = new Array(maxGroupNum + 1).fill(0)

  // 频率统计
  students.forEach((s) => {
    count[s.group + 1]++ // group+1是一个巧妙的地方
  })

  // 频率转换为索引
  for (let i = 0; i + 1 < count.length; i++) {
    count[i + 1] += count[i]
  }

  // 数据分类（开始排序）
  const tmp = new Array(students.length)
  students.forEach((s) => {
    tmp[count[s.group]++] = s
  })

  // 回写
  tmp.forEach((t, i) => {
    students[i] = t
  })
}

键索引计数法排序N个键为0到R-1之间的整数的元素需要访问数组11N+4R+1次

上述命题以为着键索引计数法突破了NlogN的排序算法运行时间下限。并且该算法是稳定的。

低位优先的字符串排序（LSD）

如果所有字符串长度均为w，每个字符串从右到左以每个位置的字符作为键，用键索引计数法将字符串排序w遍，这就是LSD，除非键索引计数法是稳定的，否则这种方法是行不通的。

const sort = (a: string[], w: number) => {
  const R = 256
  const N = a.length
  const aux = new Array(a.length)

  for (let i = w-1; i >= 0; i--) {
    const count = new Array(R + 1).fill(0) // 因为在统计频率的时候索引0不会访问到
    for (let j = 0; j < N; j++) {
      count[a[j].charCodeAt(i) + 1]++
    }
    for (let j = 0; j + 1 < count.length; j++) {
      count[j + 1] += count[j]
    }
    for (let j = 0; j < N; j++) {
      aux[count[a[j].charCodeAt(i)]++] = a[j]
    }
    for (let j = 0; j < N; j++) {
      a[j] = aux[j]
    }
  }
}

const data = [
  '1I34ERB',
  '1I34E21',
  '1I34F21',
  '1I34ERB',
  '2IP234Q',
  'NUIE989',
  'NU338AM',
  'UQ89P2M',
  'ZM12W89',
  'O9BVEEW',
]

sort(data, data[0].length)
data.forEach(d => console.log(d))

/*
1I34E21
1I34ERB
1I34ERB
1I34F21
2IP234Q
NU338AM
NUIE989
O9BVEEW
UQ89P2M
ZM12W89
*/

整个过程配合下面的图一起看会容易明白一些。

书上的一种证明方式：如果有两个键，他们中还没有被检查的字符都是完全相同的，那么键的不同之处就仅限于已经被检查过的字符。因为两个键已经被排序过，所以处于稳定性它们将一直保持有序。另外，如果还没被检查过的部分是不同的，那么已经检查过的字符对于两者的最终顺序没有意义，之后的某轮处理会根据高位的字符的不同来修正这对键的顺序。

对于基于R个字符的字母表的N个以长为W的字符串为键的元素，低位LSD需要访问～7WN+3WR次数组，使用额外空间与N+R成正比

虽然上面的例子是等长的字符串，但是说稍作修改也可以适配不等长的字符串。

高位优先字符串排序（MSD）

下面这个算法相对上面的LSD来说更加通用，因为可以允许字符串长度不相同的情况。顾名思义，与LSD相反，是从左到右开始扫描每个字符，或者说键。因为字符长度不一定相同，所以得处理到达字符串末尾的情况，这里是在计数的时候干脆从count数组的charCode+2开始排，而超过末尾的字符串，直接返回charCode=-1，这样-1+2=1，放在索引1的位置。在LSD中创建的是一个R + 1的count数组，这里因为多了一位，所以R+2。

class MSD {
  private R: number = 256
  private M: number = 10
  private aux: string[] = []

  private charAt(s: string, d: number) {
    return d < s.length ? s.charCodeAt(d) : -1
  }

  public sort(a: string[]) {
    const n = a.length
    this.aux = new Array(n)
    this._sort(a, 0, n - 1, 0)
  }

  private _sort(a: string[], lo: number, hi: number, d: number) {
    // 递归到死，不使用插入排序
    if (lo >= hi) return

    const count = new Array(this.R + 2).fill(0)
    for (let i = lo; i <= hi; i++) {
      count[this.charAt(a[i], d) + 2]++ // +2为了对付charAt为-1的情况
    }
    // i + 1 < this.R + 2
    for (let i = 0; i < this.R + 1; i++) {
      count[i + 1] += count[i]
    }
    for (let i = lo; i <= hi; i++) {
      this.aux[count[this.charAt(a[i], d) + 1]++] = a[i]
    }
    for (let i = lo; i <= hi; i++) {
      a[i] = this.aux[i - lo]
    }

    for (let r = 0; r < this.R; r++) {
      this._sort(a, lo + count[r], lo + count[r + 1] - 1, d + 1)
    }
  }
}

我感觉马上到！比LSD就复杂多了。这里有两个地方我好好考虑了下，一个是递归的地方lo + count[r], lo + count[r + 1] - 1，还有一个是回写的时候，a[i] = this.aux[i - lo]，这里只需要填写lo到hi的数据即可，让我没转过弯的是为什么是i - lo，其实虽然aux跟a一样长，但是因为在排序时只排了lo到hi之间的数据，且排序肯定是从索引0开始排啊，所以里面排好了的部分就是0到hi-lo之间的数据。

然后文中提到的小树组带来的开销，我不知道我有没有理解对。

这种方法能够快速地将需要排序的数组切分为较小的数组，但是这种切分也是一把双刃剑：我们肯定需要处理大量微型数组，因此必须快速处理它们，小型子数组对于高位优先字符串排序的性能至关重要。

我一开始一直在纠结切分为较小的数组，递归的时候传递的是数组a的地址和接下来的起始索引，并没有真的切分出一个新数组来不是吗？然后现在想应该不是指的真正切分创造出新数组带来的开销，而是一次递归带来的开销（创建count数组，统计频率，排序，回写），像下图红框中的两个字符串明明已经是相同的了，但是还是会一直递归下去，直到最后一个字符。

所以为了避免这种情况带来的开销，在满足一定阈值的时候切换成插入排序，插入排序的复杂度虽然是O(n2)，但是当数组有序或者近乎有序的情况下，可以达到O(n)，因为内循环只比较了一次退出了循环，近似于只执行了外循环一遍。

使用插入排序：

class MSD {
  private R: number = 256
  private M: number = 10
  private aux: string[] = []

  private charAt(s: string, d: number) {
    return d < s.length ? s.charCodeAt(d) : -1
  }
  
  public sort(a: string[]) {
    const n = a.length
    this.aux = new Array(n)
    this._sort(a, 0, n - 1, 0)
  }

  private _sort(a: string[], lo: number, hi: number, d: number) {
    // 根据阈值使用插入排序结束递归
    if (hi <= lo + this.M) {
      this.insertion(a, lo, hi, d)
      return
    }

    ...
  }

  private insertion(a: string[], lo: number, hi: number, d: number) {
    for (let i = lo; i <= hi; i++) {
      for (let j = i; j > lo && this.less(a[j], a[j - 1], d); j--) {
        this.exch(a, j, j - 1)
      }
    }
  }

  private exch(a: string[], i: number, j: number) {
    const tmp = a[i]
    a[i] = a[j]
    a[j] = tmp
  }

  private less(v: string, w: string, d: number) {
    for (let i = d; i < Math.min(v.length, w.length); i++) {
      if (v.charCodeAt(i) < w.charCodeAt(i)) return true
      if (v.charCodeAt(i) > w.charCodeAt(i)) return false
    }
    return v.length < w.length
  }
}

生成了100k和1000k个字符串，比较使用插入排序和不使用插入排序得到的排序时间（毫秒）。

	不使用	使用
100k	77	26
1000k	525	218

三向字符串快速排序

根据首字母的进行三向切分。MSD会创建大量空数组（这里我依然不理解，递归的时候传递的是数组的指针不是吗？？为啥会是创建空数组），而三向切分总是只有三个。和快速排序一样，也不需要额外的空间（递归所需要的栈除外）。

class Quick3string {

  public sort(a: string[]) {
    this._sort(a, 0, a.length - 1, 0)
  }

  private _sort(a: string[], lo: number, hi: number, d: number) {
    if (hi <= lo) return
    let lt = lo,
      gt = hi,
      i = lo + 1
    
    const v = this.charAt(a[lo], d)
    while (i <= gt) {
      const t = this.charAt(a[i], d)
      if (t < v) {
        this.exch(a, lt++, i++) 
      } else if (t > v) {
        this.exch(a, gt--, i)
      } else {
        i++
      }
    }
    // while之后是这么个感觉 ....lt....gt....
    this._sort(a, lo, lt - 1, d)
    if (v >= 0) { // 这个判断是为了防止v==-1也就是超过长度范围的情况，这个时候，就没有必要进行下一个字符排序了
      this._sort(a, lt, gt, d + 1) // 只对lt...gt这一段进行下一个字符的递归，因为这一区间的首字母都是相同的
    }
    this._sort(a, gt + 1, hi, d)
  }
}

我拿上面测过的那个100w个字符串的数据测试了下，270毫秒，看着略输加入了插入排序的MSD。

本节代码：https://github.com/Yuijam/algorithms-note/tree/main/string/sort

字符串排序

热身 键索引计数法

低位优先的字符串排序（LSD）

高位优先字符串排序（MSD）

三向字符串快速排序

热身键索引计数法