Finding the Top N Most Frequent Elements in a List with Duplicates

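Contents

- Approach 1: HashMap Counting + Full Sort
  - Implementation steps
  - Code
  - Pros and cons
- Approach 2: HashMap Counting + Min-Heap (Priority Queue)
  - Implementation steps
  - Code
  - Pros and cons
- Approach 3: Java Stream API
  - Implementation steps
  - Code
  - Pros and cons
- Complete Example
- Key Takeaways
- Approach 4: Parallel Stream
  - Implementation steps
  - Code
  - Pros and cons
- Approach 5: Bucket Sort
  - Implementation steps
  - Code
  - Pros and cons
- Approach 6: Quickselect
  - Implementation steps
  - Code (partial)
  - Pros and cons
- Approach 7: Guava Multiset (Third-Party Dependency)
  - Implementation steps
  - Code
  - Pros and cons
- II. Comparison of Approaches
- III. Summary and Recommendations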

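Top-N counting comes up often in practice, and it has also come up in interviews, so here is a detailed look at the ideas and the options.

This article walks through the common ways to find the N most frequent objects in a list in Java, with the pros and cons of each.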
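Every approach in this article starts with the same counting pass. As a minimal sketch of that pass (the class name `FrequencyCounter` and the method name `countFrequencies` are illustrative and not part of the original code), it can be written generically with `Map.merge`:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names are illustrative only.
class FrequencyCounter {

    // Counts how often each element occurs. Elements are used as HashMap keys,
    // so they must implement equals/hashCode consistently.
    static <T> Map<T, Integer> countFrequencies(List<T> items) {
        Map<T, Integer> freq = new HashMap<>();
        for (T item : items) {
            // merge stores 1 for a new key, otherwise adds 1 to the existing count
            freq.merge(item, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("apple", "banana", "apple", "orange", "banana", "apple");
        Map<String, Integer> freq = countFrequencies(list);
        System.out.println(freq.get("apple"));   // 3
        System.out.println(freq.get("banana"));  // 2
        System.out.println(freq.get("orange"));  // 1
    }
}
```

This pass is linear in the length of the list; the approaches differ only in how they pick the top N entries out of the resulting map.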

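Approach 1: HashMap Counting + Full Sort

Implementation steps:

Count the frequency of each element with a HashMap.
Copy the map entries into a list and sort it by frequency in descending order.
Take the first N entries.

Code: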

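public static List<Map.Entry<String, Integer>> topNWithSort(List<String> list, int n) {
    // Count how often each element occurs
    Map<String, Integer> freqMap = new HashMap<>();
    for (String item : list) {
        freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
    }
    // Copy the entries into a list and sort by count, descending
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());
    entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
    // Clamp n to [0, size] so that subList cannot throw, and copy the
    // result so it is not a view backed by the full list
    int limit = Math.max(0, Math.min(n, entries.size()));
    return new ArrayList<>(entries.subList(0, limit));
}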

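Pros and cons:

Pros: simple to implement, and the code is easy to follow.
Cons: the full sort costs \(O(m \log m)\), where \(m\) is the number of distinct elements, so most of the sorting work is wasted when \(m\) is large and only a few entries are needed. The relative order of elements with equal counts is unspecified, because it follows HashMap iteration order.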
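If a reproducible order among tied elements matters, one option is to add a secondary sort key. The sketch below is a variation on the full-sort approach, not the original method: it assumes ties should be broken by the element's natural (alphabetical) order, and the class name `TopNWithTieBreak` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical variation of the full-sort approach with a deterministic tie-break.
class TopNWithTieBreak {

    static List<Map.Entry<String, Integer>> topN(List<String> list, int n) {
        Map<String, Integer> freqMap = new HashMap<>();
        for (String item : list) {
            freqMap.merge(item, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());
        // Primary key: count, descending. Secondary key: element, ascending.
        entries.sort(
            Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        int limit = Math.max(0, Math.min(n, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        // "a" and "b" both occur twice; the tie-break puts "a" first.
        List<String> list = Arrays.asList("b", "a", "c", "a", "b");
        System.out.println(topN(list, 2)); // [a=2, b=2]
    }
}
```

The extra comparator does not change the \(O(m \log m)\) cost; it only fixes the order in which equally frequent elements appear.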

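Approach 2: HashMap Counting + Min-Heap (Priority Queue)

Implementation steps:

Count frequencies with a HashMap.
Iterate over the frequency map while maintaining a min-heap of at most N entries, so that the top of the heap is always the smallest count currently kept.
Output the heap contents in descending order of count.

Code: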

public static List<Map.Entry<String, Integer>> topNWithHeap(List<String> list, int n) {
    // Without this guard, heap.peek() below returns null when n <= 0
    if (n <= 0) {
        return new ArrayList<>();
    }
    // Count how often each element occurs
    Map<String, Integer> freqMap = new HashMap<>();
    for (String item : list) {
        freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
    }
    // Min-heap ordered by count, ascending
    PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
        (a, b) -> Integer.compare(a.getValue(), b.getValue())
    );
    // Walk the frequency map, keeping at most n entries in the heap
    for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {
        if (heap.size() < n) {
            heap.offer(entry);
        } else if (entry.getValue() > heap.peek().getValue()) {
            heap.poll();
            heap.offer(entry);
        }
    }
    // Copy the heap into a list and sort by count, descending
    List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
    result.sort((a, b) -> b.getValue().compareTo(a.getValue()));
    return result;
}

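Pros and cons:

Pros: the selection step costs \(O(m \log n)\), which suits large inputs where \(n \ll m\).
Cons: the heap has to be maintained by hand, so the code is slightly more involved; when \(n\) is close to \(m\) there is no gain over a full sort.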
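The heap approach is not specific to strings. As a sketch under the assumption that the elements implement `equals` and `hashCode` correctly, the same idea can be written generically; the class name `GenericTopN` is hypothetical, and the "offer, then evict the minimum" loop is a common variant of the loop shown above rather than the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical generic version of the min-heap approach.
class GenericTopN {

    static <T> List<Map.Entry<T, Integer>> topN(Collection<T> items, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        Map<T, Integer> freq = new HashMap<>();
        for (T item : items) {
            freq.merge(item, 1, Integer::sum);
        }
        // Min-heap on the count: the smallest kept count sits on top.
        Comparator<Map.Entry<T, Integer>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<T, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<T, Integer> entry : freq.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // evict the entry with the smallest count
            }
        }
        List<Map.Entry<T, Integer>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(5, 3, 5, 7, 3, 5);
        System.out.println(topN(numbers, 2)); // [5=3, 3=2]
    }
}
```

The heap never holds more than n + 1 entries, so each insertion or eviction costs \(O(\log n)\) regardless of how many distinct elements the input contains.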

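Approach 3: Java Stream API

Implementation steps:

Count frequencies with the Stream collectors groupingBy and counting.
Sort by frequency in descending order and take the first N entries.

Code: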

public static List<Map.Entry<String, Long>> topNWithStream(List<String> list, int n) {
    return list.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
        .entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(n)
        .collect(Collectors.toList());
}

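Pros and cons:

Pros: concise, in a functional style.
Cons: the implementation details are hidden. The pipeline still performs a full \(O(m \log m)\) sort and materializes the whole frequency map, which leaves little control over memory and performance, and counting() yields Long counts rather than Integer.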

Complete Example

import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TopNFrequency {

    public static void main(String[] args) {
        List<String> list = Arrays.asList("apple", "banana", "apple", "orange", "banana", "apple");
        int n = 2;

        // Approach 1: full sort
        System.out.println("HashMap + Sorting: " + topNWithSort(list, n));
        // Approach 2: min-heap
        System.out.println("HashMap + Heap: " + topNWithHeap(list, n));
        // Approach 3: Stream API
        System.out.println("Stream API: " + topNWithStream(list, n));
    }

    // Approach 1: full sort
    public static List<Map.Entry<String, Integer>> topNWithSort(List<String> list, int n) {
        Map<String, Integer> freqMap = new HashMap<>();
        for (String item : list) {
            freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        // Clamp n to [0, size] and copy, so the result is not a view of the full list
        int limit = Math.max(0, Math.min(n, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    // Approach 2: min-heap
    public static List<Map.Entry<String, Integer>> topNWithHeap(List<String> list, int n) {
        // Without this guard, heap.peek() below returns null when n <= 0
        if (n <= 0) {
            return new ArrayList<>();
        }
        Map<String, Integer> freqMap = new HashMap<>();
        for (String item : list) {
            freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
        }
        // Min-heap ordered by count, ascending
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(a.getValue(), b.getValue())
        );
        for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {
            if (heap.size() < n) {
                heap.offer(entry);
            } else if (entry.getValue() > heap.peek().getValue()) {
                heap.poll();
                heap.offer(entry);
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        return result;
    }

    // Approach 3: Stream API
    public static List<Map.Entry<String, Long>> topNWithStream(List<String> list, int n) {
        return list.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .collect(Collectors.toList());
    }
}

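Key Takeaways

A full sort suits small inputs: the code is simple, but it does more work than necessary.
A min-heap suits large inputs with a small N: \(O(m \log n)\) instead of \(O(m \log m)\).
The Stream API wins on brevity, but note that its counts are Long rather than Integer and that it still sorts every entry.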
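On the type point: if callers expect Integer counts, as the first two approaches return, one possible adjustment is to swap `Collectors.counting()` for `Collectors.summingInt(x -> 1)`. The sketch below is an illustrative variation (the class name `StreamIntCounts` is made up), and it assumes no element occurs more than `Integer.MAX_VALUE` times.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical variation of the Stream approach that produces Integer counts.
class StreamIntCounts {

    static List<Map.Entry<String, Integer>> topN(List<String> list, int n) {
        return list.stream()
            // summingInt(x -> 1) adds 1 per element, so the map values are Integer
            .collect(Collectors.groupingBy(Function.identity(), Collectors.summingInt(x -> 1)))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            // limit rejects negative arguments, so clamp n at 0
            .limit(Math.max(0, n))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("apple", "banana", "apple", "orange", "banana", "apple");
        System.out.println(topN(list, 2)); // [apple=3, banana=2]
    }
}
```

The result type now matches `topNWithSort` and `topNWithHeap`, which makes the three methods interchangeable for callers.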

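Approach 4: Parallel Stream

Implementation steps:

Use a parallel stream to speed up counting and sorting.
Collect with groupingByConcurrent, which accumulates into a ConcurrentHashMap for thread safety.

Code: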

public static List<Map.Entry<String, Long>> topNParallelStream(List<String> list, int n) {
    return list.parallelStream()
        .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()))
        .entrySet().parallelStream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(n)
        .collect(Collectors.toList());
}

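Pros and cons:

Pros: uses multiple cores, which can help on very large inputs.
Cons: parallelism adds thread-coordination overhead, so small inputs may run slower than the sequential version; skewed data (a few very hot keys) causes contention on the shared map and limits the speedup.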
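To make the role of the concurrent map visible, the counting step can also be spelled out by hand. The following is only a sketch of one way to do it, using `ConcurrentHashMap` with `LongAdder` values; it is not what `groupingByConcurrent` is documented to do internally, and the class name `ParallelCountSketch` is hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical hand-written version of the parallel counting step.
class ParallelCountSketch {

    static List<Map.Entry<String, Long>> topN(List<String> list, int n) {
        ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
        // Each worker thread increments the adder for its element;
        // computeIfAbsent creates the adder atomically on first sight.
        list.parallelStream()
            .forEach(item -> counts.computeIfAbsent(item, key -> new LongAdder()).increment());

        // Snapshot the adders into immutable entries once counting has finished.
        List<Map.Entry<String, Long>> entries = new ArrayList<>();
        counts.forEach((key, adder) ->
            entries.add(new AbstractMap.SimpleImmutableEntry<>(key, Long.valueOf(adder.sum()))));

        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        int limit = Math.max(0, Math.min(n, entries.size()));
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("apple", "banana", "apple", "orange", "banana", "apple");
        System.out.println(topN(list, 2)); // [apple=3, banana=2]
    }
}
```

`LongAdder` is designed for counters that many threads update at once, which may soften the hot-key contention mentioned above. Whether either parallel version beats the sequential one depends on the data and the machine, so it is worth measuring before adopting it.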

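Approach 5: Bucket Sort

Implementation steps:

Count frequencies and record the maximum frequency.
Create frequency buckets, where the index is a frequency and the value is the list of elements with that frequency.
Walk the buckets from high to low and collect the first N elements.

Code: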

public static List<Map.Entry<String, Integer>> topNBucketSort(List<String> list, int n) {
    Map<String, Integer> freqMap = new HashMap<>();
    int maxFreq = 0;
    for (String item : list) {
        int freq = freqMap.getOrDefault(item, 0) + 1;
        freqMap.put(item, freq);
        maxFreq = Math.max(maxFreq, freq);
    }
    // Create the buckets (index = frequency)
    List<List<String>> buckets = new ArrayList<>(maxFreq + 1);
    for (int i = 0; i <= maxFreq; i++) {
        buckets.add(new ArrayList<>());
    }
    freqMap.forEach((k, v) -> buckets.get(v).add(k));
    // Collect results from the highest frequency down
    List<Map.Entry<String, Integer>> result = new ArrayList<>();
    for (int i = maxFreq; i >= 0 && result.size() < n; i--) {
        for (String item : buckets.get(i)) {
            result.add(new AbstractMap.SimpleEntry<>(item, i));
            if (result.size() == n) break;
        }
    }
    return result;
}

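Pros and cons:

Pros: once the counts are available, selection takes \(O(m + k)\) time, where \(k\) is the maximum frequency; a good fit when frequencies fall in a narrow range.
Cons: the bucket list needs \(O(k)\) slots on top of the \(O(m)\) frequency map, so memory is wasted when the maximum frequency is very high and most buckets stay empty.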
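To see what the buckets look like, the sketch below builds them for the sample list used in this article and prints the non-empty ones. It relies on the fact that a frequency can never exceed the list length, so `list.size() + 1` buckets always suffice; the class name `BucketTrace` and the method name `buildBuckets` are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that exposes the bucket structure for inspection.
class BucketTrace {

    // Returns buckets where buckets.get(f) holds every element that occurs f times.
    static List<List<String>> buildBuckets(List<String> list) {
        Map<String, Integer> freq = new HashMap<>();
        for (String item : list) {
            freq.merge(item, 1, Integer::sum);
        }
        // No element can occur more often than the list is long.
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= list.size(); i++) {
            buckets.add(new ArrayList<>());
        }
        freq.forEach((item, count) -> buckets.get(count).add(item));
        return buckets;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("apple", "banana", "apple", "orange", "banana", "apple");
        List<List<String>> buckets = buildBuckets(list);
        for (int f = buckets.size() - 1; f >= 1; f--) {
            if (!buckets.get(f).isEmpty()) {
                System.out.println(f + " -> " + buckets.get(f));
            }
        }
        // Prints:
        // 3 -> [apple]
        // 2 -> [banana]
        // 1 -> [orange]
    }
}
```

Here seven buckets are allocated and only three are used. Sizing by the recorded maximum frequency, as the code above does, trims the unused buckets above the maximum but not the gaps below it.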

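Approach 6: Quickselect

Implementation steps:

Count frequencies and store the entries in a list.
Use quickselect to find the boundary of the N-th largest frequency.
Sort the first N entries.

Code (partial):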

public static List<Map.Entry<String, Integer>> topNQuickSelect(List<String> list, int n) {
    Map<String, Integer> freqMap = new HashMap<>();
    for (String item : list) {
        freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);
    }
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());
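    // Clamp n so that subList cannot run past the end of the list
    int k = Math.max(0, Math.min(n, entries.size()));
    quickSelect(entries, k);
    return entries.subList(0, k).stream()
        .sorted((a, b) -> b.getValue().compareTo(a.getValue()))
        .collect(Collectors.toList());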
}

private static void quickSelect(List<Map.Entry<String, Integer>> list, int n) {
    int left = 0, right = list.size() - 1;
    while (left <= right) {
        int pivotIndex = partition(list, left, right);
        if (pivotIndex == n) break;
        else if (pivotIndex < n) left = pivotIndex + 1;
        else right = pivotIndex - 1;
    }
}

private static int partition(List<Map.Entry<String, Integer>> list, int low, int high) {
    int pivotValue = list.get(high).getValue();
    int i = low;
    for (int j = low; j < high; j++) {
        if (list.get(j).getValue() > pivotValue) {
            Collections.swap(list, i, j);
            i++;
        }
    }
    Collections.swap(list, i, high);
    return i;
}

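Pros and cons:

Pros: average \(O(m)\) selection time, for cases where performance matters most.
Cons: complex to implement, with many edge cases to handle. With a fixed last-element pivot the worst case degrades to \(O(m^2)\), for example when the entries are already in descending order of count.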
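Two common refinements address those edge cases: a random pivot, which makes the worst case unlikely for any fixed input order, and a three-way partition, which handles runs of equal counts in a single pass. The sketch below combines both. It is a variant written for illustration (the names `QuickSelectSketch` and `selectTopN` are made up), not the implementation shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical quickselect variant: random pivot + three-way partition.
class QuickSelectSketch {

    // Rearranges the list so that positions [0, n) hold the n largest counts.
    // Assumes 0 <= n <= a.size().
    static void selectTopN(List<Map.Entry<String, Integer>> a, int n, Random rnd) {
        int lo = 0, hi = a.size() - 1;
        while (lo < hi) {
            int pivot = a.get(lo + rnd.nextInt(hi - lo + 1)).getValue();
            // Partition a[lo..hi] into: > pivot | == pivot | < pivot
            int gt = lo, i = lo, lt = hi;
            while (i <= lt) {
                int v = a.get(i).getValue();
                if (v > pivot) {
                    Collections.swap(a, gt++, i++);
                } else if (v < pivot) {
                    Collections.swap(a, i, lt--);
                } else {
                    i++;
                }
            }
            // Now a[lo..gt-1] > pivot, a[gt..lt] == pivot, a[lt+1..hi] < pivot.
            if (n < gt) {
                hi = gt - 1;      // boundary lies inside the "greater" block
            } else if (n > lt + 1) {
                lo = lt + 1;      // boundary lies inside the "smaller" block
            } else {
                return;           // boundary falls in or next to the "equal" block
            }
        }
    }

    static List<Map.Entry<String, Integer>> topN(List<String> list, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String item : list) {
            freq.merge(item, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        int k = Math.max(0, Math.min(n, entries.size()));
        selectTopN(entries, k, new Random());
        List<Map.Entry<String, Integer>> top = new ArrayList<>(entries.subList(0, k));
        top.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return top;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("apple", "banana", "apple", "orange", "banana", "apple");
        System.out.println(topN(list, 2)); // [apple=3, banana=2]
    }
}
```

Frequency data tends to contain many ties (for example, many elements that occur exactly once), which is why the three-way partition is worth the few extra lines here.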

Approach 7: Guava Multiset (Third-Party Dependency)

Implementation steps:

Count frequencies with Guava's Multiset.
Sort by frequency and take the first N entries.

Code:

public static List<Multiset.Entry<String>> topNGuava(List<String> list, int n) {
    Multiset<String> multiset = HashMultiset.create(list);
    return multiset.entrySet().stream()
        .sorted((a, b) -> b.getCount() - a.getCount())
        .limit(n)
        .collect(Collectors.toList());
}

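Pros and cons:

Pros: very little code, since Guava's Multiset does the counting; Guava also provides Multisets.copyHighestCountFirst for ordering a multiset by count.
Cons: requires a third-party library, so it does not fit JDK-only environments.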

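II. Comparison of Approaches

Approach	Time complexity	Space complexity	Best suited for
Full sort	\(O(m \log m)\)	\(O(m)\)	Small inputs, simplest code
Min-heap	\(O(m \log n)\)	\(O(m)\) for the frequency map plus \(O(n)\) for the heap	Large inputs with \(n \ll m\)
Stream API	\(O(m \log m)\)	\(O(m)\)	Quick development, concise code
Parallel stream	\(O(m \log m / p)\) at best, with \(p\) cores	\(O(m)\)	Multi-core machines, very large inputs
Bucket sort	\(O(m + k)\)	\(O(m + k)\)	Frequencies concentrated in a narrow range
Quickselect	\(O(m)\) (average)	\(O(m)\)	Performance-critical code where a complex implementation is acceptable
Guava Multiset	\(O(m \log m)\)	\(O(m)\)	Third-party dependencies allowed

All time complexities are for the selection step and exclude the linear counting pass over the input list. Here \(m\) is the number of distinct elements, \(n\) is the requested N, and \(k\) is the maximum frequency.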
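When swapping one approach for another, a quick cross-check helps confirm that they agree. Because tied elements may come back in a different order, the sketch below compares only the counts, using a full sort and a min-heap on the same random data. The class name, the fixed seed, and the data sizes are arbitrary choices for illustration, and n is assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical cross-check between the full-sort and min-heap selections.
class TopNCrossCheck {

    static Map<String, Integer> count(List<String> list) {
        Map<String, Integer> freq = new HashMap<>();
        for (String item : list) {
            freq.merge(item, 1, Integer::sum);
        }
        return freq;
    }

    // Top n counts via a full sort of all counts.
    static List<Integer> topCountsBySort(Map<String, Integer> freq, int n) {
        List<Integer> counts = new ArrayList<>(freq.values());
        counts.sort(Comparator.reverseOrder());
        return new ArrayList<>(counts.subList(0, Math.min(n, counts.size())));
    }

    // Top n counts via a min-heap that never holds more than n + 1 values.
    static List<Integer> topCountsByHeap(Map<String, Integer> freq, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int c : freq.values()) {
            heap.offer(c);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<Integer> counts = new ArrayList<>(heap);
        counts.sort(Comparator.reverseOrder());
        return counts;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            data.add("key" + rnd.nextInt(500));
        }
        Map<String, Integer> freq = count(data);
        System.out.println(topCountsBySort(freq, 10).equals(topCountsByHeap(freq, 10))); // true
    }
}
```

The same harness can be extended with timing code to compare the approaches on realistic data, which is a more reliable guide than the asymptotic figures alone.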