[Big Data] MapReduce Java API Programming Practice and Applicable Scenarios

news2026/4/13 4:47:39

1. Preface

2. MapReduce programming example

3. Applicable scenarios for MapReduce

1. Preface

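This article is part of the author's big data series. Earlier articles covered, in order, an overview of big data, distributed file systems, distributed databases, and the core concepts and working principles of the MapReduce compute engine.

Picking up where the previous article left off, this one covers MapReduce programming in practice and the scenarios MapReduce is suited to. The Hadoop version is still 3.1.3, the same as in the earlier articles.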

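2. MapReduce programming example

This article again uses the classic word count example: counting how many times each word occurs. Writing a MapReduce program mostly comes down to writing a map function and a reduce function. The MapReduce Java API provides the Mapper and Reducer base classes; you extend them and put your own business logic in the overridden map and reduce methods.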

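Dependency (the main function later in this section also uses Configuration from hadoop-common and DistributedFileSystem from hadoop-hdfs-client; if your build does not pull these in transitively, add them at the same version):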

<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapreduce-client-core</artifactId>
   <version>3.1.3</version>
</dependency>

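The map function:

The map phase reads data from the distributed file system HDFS, splits each input line into words, and emits a (word, 1) pair for every word as the raw material for counting. So the map function only has to do two things, tokenize and emit: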

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;

public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    // Reused across calls so that no new object is allocated per token.
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line on whitespace and emit (word, 1) for each token.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

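key is the key of each input record. With the default text input format it is the byte offset of the line within the file. It is declared as Object here, although in practice the type is LongWritable. value is the text of the line itself.

context is the handle through which the map function emits its output key-value pairs.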

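The new StringTokenizer(...) step is where the text is tokenized: the line is split on whitespace, and each call to nextToken() returns the next word.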
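To make the tokenization step concrete, here is a minimal sketch that runs without Hadoop. TokenizeSketch and mapLine are hypothetical names introduced only for illustration; the sketch mirrors the loop in the map function and collects the pairs as strings instead of writing them to a context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class for illustration; not part of the Hadoop API.
class TokenizeSketch {
    // Returns the "word<TAB>1" pairs that the map loop would emit for one input line.
    static List<String> mapLine(String line) {
        List<String> emitted = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on space, tab,
        // newline, carriage return and form feed, and skips repeated delimiters.
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            emitted.add(itr.nextToken() + "\t1");
        }
        return emitted;
    }

    public static void main(String[] args) {
        for (String pair : mapLine("hello world hello")) {
            System.out.println(pair);
        }
    }
}
```

Note that this kind of splitting does not strip punctuation or normalize case, so "Hello," and "hello" would be counted as different words. Whether that matters depends on the input data.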

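IntWritable is Hadoop's wrapper class for int. Its main purpose is to make the int serializable, since the value has to be transferred over the network.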
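The idea behind such a serializable wrapper can be sketched with the JDK alone. MiniIntWritable below is a hypothetical stand-in, not the real Hadoop class; it only assumes the same two-method shape as Hadoop's Writable interface, write(DataOutput) and readFields(DataInput), and shows an int making a round trip through a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for IntWritable, written against the JDK only.
class MiniIntWritable {
    private int value;

    MiniIntWritable() { }

    MiniIntWritable(int value) { this.value = value; }

    int get() { return value; }

    // Serialize: write the int as 4 big-endian bytes.
    void write(DataOutput out) throws IOException { out.writeInt(value); }

    // Deserialize: overwrite this object's state from the stream.
    void readFields(DataInput in) throws IOException { value = in.readInt(); }

    static byte[] toBytes(MiniIntWritable w) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        w.write(new DataOutputStream(buffer));
        return buffer.toByteArray();
    }

    static MiniIntWritable fromBytes(byte[] bytes) throws IOException {
        MiniIntWritable w = new MiniIntWritable();
        w.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return w;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = toBytes(new MiniIntWritable(42));
        System.out.println(bytes.length + " bytes, value " + fromBytes(bytes).get());
    }
}
```

Because readFields overwrites the state of an existing object, one instance can be reused for many records, which is why Hadoop code commonly keeps a single writable object around instead of allocating a new one per record.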

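The reduce function:

The shuffle that precedes reduce is carried out automatically by the framework, so we only need to write the reduce function itself.

The input to the reduce function is the shuffled <key, Iterable<value>> pair, that is, one key together with all the values emitted for that key. context is used to write the output.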

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reused output value; context.write serializes it immediately.
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all counts that were grouped under this key.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
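The flow from map through shuffle to reduce can be imitated in a single JVM, which may help when reasoning about what each stage sees. The sketch below is only an in-memory approximation under simplifying assumptions: LocalWordCountSketch and its method names are hypothetical, a TreeMap stands in for the sorted grouping that shuffle performs, and nothing is partitioned or sent over a network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-JVM imitation of the word count data flow.
class LocalWordCountSketch {
    // Map + shuffle stand-in: emit a 1 per token and group the 1s by word, keys sorted.
    static Map<String, List<Integer>> mapAndShuffle(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                grouped.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    // Reduce stand-in: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        // Grouped input to reduce: {hadoop=[1], hello=[1, 1], world=[1]}
        System.out.println(mapAndShuffle(lines));
        // Final counts: {hadoop=1, hello=2, world=1}
        System.out.println(reduce(mapAndShuffle(lines)));
    }
}
```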

The main function:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

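public class MapReduceTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address; adjust to your own cluster.
        conf.set("fs.defaultFS", "hdfs://192.168.31.10:9000");
        // DistributedFileSystem lives in hadoop-hdfs-client, which must be on the classpath.
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(MapReduceTest.class); // locate the job jar via the jar that contains this class
        job.setMapperClass(MyMapper.class);
        // Summation is associative and commutative, so the reducer can double as the combiner.
        job.setCombinerClass(MyReducer.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input/input1.txt"));
        // The output directory must not exist yet, otherwise job submission fails.
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output"));
        // Block until the job finishes and report success or failure through the exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}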
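The driver registers MyReducer as the combiner as well as the reducer. A rough way to see why that is safe for word count: summing partial sums gives the same total as summing every 1 directly. The sketch below checks this with plain Java; CombinerSketch and the two "map task" lists are hypothetical and only stand in for the per-task outputs of a real job.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why summation works as a combiner.
class CombinerSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Counts for one word, as emitted by two hypothetical map tasks.
        List<Integer> fromMapTask1 = Arrays.asList(1, 1, 1);
        List<Integer> fromMapTask2 = Arrays.asList(1, 1);

        // Without a combiner: the reducer receives all five 1s.
        int direct = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner: each map task pre-sums locally, then the reducer sums the partial results.
        int combined = sum(Arrays.asList(sum(fromMapTask1), sum(fromMapTask2)));

        System.out.println(direct + " == " + combined);
    }
}
```

The same trick would not work for an operation such as averaging, where combining partial averages does not generally give the overall average. In such cases the combiner needs its own logic or should be left out.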