Spark installation and deployment
Install Scala on all nodes. Installing Scala requires openjdk-8-jre (if the current user does not have sudo privileges, add it to the sudo group). Taking Ubuntu 22.04 LTS as an example:
$ sudo apt update
$ sudo apt-get install openjdk-8-jre-headless -y (on Red Hat-family Linux the package may be named java-1.8.0-openjdk-headless)
Install Scala (strictly speaking optional, since current Spark releases bundle their own Scala); download it from:
https://www.scala-lang.org/download/
$ sudo dpkg -i scala-2.13.10.deb
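To confirm the JDK (and Scala, if the deb package was installed) are available, check the versions; the exact version strings will vary:
$ java -version
$ scala -version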
$ wget http://XXX/spark-3.4.0-bin-hadoop3-scala2.13.tgz
$ tar -zxvf spark-3.4.0-bin-hadoop3-scala2.13.tgz
Set the environment variables:
$ vi .bashrc
export SPARK_HOME=~/spark-3.4.0-bin-hadoop3-scala2.13
export PATH=$PATH:$SPARK_HOME/bin
$ source .bashrc
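As a quick sanity check that the new variables are in effect, the version banner should report Spark 3.4.0 and Scala 2.13:
$ echo $SPARK_HOME
$ spark-submit --version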
Configure Spark
$ cd $SPARK_HOME/conf
$ cp workers.template workers
The workers file lists every node in the Spark cluster (the start scripts will ssh into each of them; see the check after the list):
$ vi workers
node01
node02
node03
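The start-all.sh script used later starts a worker on each host listed above over ssh, so passwordless ssh from the master to every node is assumed to be configured already (the same requirement as for Hadoop). A quick check, using node02 as an example:
$ ssh node02 hostname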
$ cp spark-env.sh.template spark-env.sh
Check the current environment variables:
$ echo $JAVA_HOME
$ echo $HADOOP_HOME
Configure the Spark environment variables:
$ vi spark-env.sh
export JAVA_HOME=/home/hadoop/jdk1.8.0_361
export HADOOP_HOME=/home/hadoop/hadoop-3.3.5
export HADOOP_CONF_DIR=/home/hadoop/hadoop-3.3.5/etc/hadoop/
export SCALA_HOME=/usr/share/scala
export SPARK_MASTER_HOST=node01
export SPARK_PID_DIR=/home/hadoop/spark-3.4.0-bin-hadoop3-scala2.13/data
export SPARK_LOCAL_DIR=/home/hadoop/spark-3.4.0-bin-hadoop3-scala2.13
export SPARK_EXECUTOR_MEMORY=512M
export SPARK_WORKER_MEMORY=2G
Configure the Spark defaults file:
$ cp spark-defaults.conf.template spark-defaults.conf
$ vi spark-defaults.conf
spark.master spark://node01:7077
Copy the configured Spark directory to the other nodes (skip this for a single-node setup):
$ cd
$ scp -r spark-3.4.0-bin-hadoop3-scala2.13 node02:~/
$ scp -r spark-3.4.0-bin-hadoop3-scala2.13 node03:~/
Also scp the environment variable file to the other nodes:
$ scp .bashrc node02:~/
$ scp .bashrc node03:~/
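Optionally, confirm on another node that the variables are picked up (the explicit source is needed because a non-interactive ssh session may not read ~/.bashrc):
$ ssh node02 'source ~/.bashrc; echo $SPARK_HOME'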
Start the Spark services:
$ $SPARK_HOME/sbin/start-all.sh
Check the running Java processes:
$ jps
The master node should show both a Master and a Worker process; the worker nodes show only a Worker process.
Visit http://node01:8080/ to open the Spark master web UI.
Run the bundled example program:
$ $SPARK_HOME/bin/run-example SparkPi 10
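The same example can also be submitted explicitly with spark-submit against the standalone master. This is only a sketch: the examples jar name below should match this distribution, but verify it under $SPARK_HOME/examples/jars/ before running:
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://node01:7077 \
    $SPARK_HOME/examples/jars/spark-examples_2.13-3.4.0.jar 10   # check the exact jar name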
Run spark-shell (the Scala interactive shell):
$ spark-shell
scala> val textFile=sc.textFile("file:///home/hadoop/spark-3.4.0-bin-hadoop3-scala2.13/README.md")
scala> textFile.count()
scala> :quit
Upload a file to HDFS:
$ hdfs dfs -put /home/hadoop/spark-3.4.0-bin-hadoop3-scala2.13/README.md
$ hdfs dfs -ls
Run pyspark (the Python interactive shell):
$ pyspark
>>> lines=sc.textFile("file:///home/hadoop/spark-3.4.0-bin-hadoop3-scala2.13/README.md")
>>> lines.count()
>>> lines=sc.textFile("README.md")
>>> lines.count()
>>> quit()
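The same kind of job can be run non-interactively with spark-submit. As a sketch, the word-count example shipped with Spark can be used (verify the script path under $SPARK_HOME/examples/src/main/python/ first):
$ spark-submit --master spark://node01:7077 \
    $SPARK_HOME/examples/src/main/python/wordcount.py \
    file:///home/hadoop/spark-3.4.0-bin-hadoop3-scala2.13/README.md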
Install and run Jupyter (a tool for developing and debugging Python programs; the instructor uses it in class):
$ sudo apt install python3-pip
$ sudo apt install jupyter-core
$ pip3 install jupyter
$ jupyter notebook --ip=node01
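To use pyspark from inside a notebook, one common approach is to let pyspark launch Jupyter as its driver. PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are standard PySpark environment variables; this assumes jupyter is on the PATH:
$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=node01'
$ pyspark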
Spark on YARN configuration and troubleshooting
$ hdfs dfs -mkdir /hadoop
$ hdfs dfs -mkdir /hadoop/spark_jars
$ hdfs dfs -put $SPARK_HOME/jars/* /hadoop/spark_jars/
$ cd $SPARK_HOME/conf/
$ vi spark-defaults.conf
spark.yarn.jars hdfs://node01:9000/hadoop/spark_jars/
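A quick check that the jars are where spark.yarn.jars points (the hdfs://node01:9000 prefix assumes the fs.defaultFS used by the Hadoop setup; adjust if it differs):
$ hdfs dfs -ls hdfs://node01:9000/hadoop/spark_jars/ | head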
$ cd $HADOOP_HOME/etc/hadoop
$ vi yarn-site.xml
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
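The pmem/vmem checks are enforced by every NodeManager, so the modified yarn-site.xml should also be copied to the other nodes before restarting (paths assume the same layout as above):
$ scp yarn-site.xml node02:~/hadoop-3.3.5/etc/hadoop/
$ scp yarn-site.xml node03:~/hadoop-3.3.5/etc/hadoop/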
Restart the services after making these changes.
Stop the cluster services:
$ $SPARK_HOME/sbin/stop-all.sh
$ stop-yarn.sh
$ stop-dfs.sh
Start the cluster services:
$ start-dfs.sh
$ start-yarn.sh
$ $SPARK_HOME/sbin/start-all.sh
$ pyspark --master=yarn
>>> lines=sc.textFile("file:///home/hadoop/spark-3.4.0-bin-hadoop3-scala2.13/README.md")
>>> lines.count()
125
>>> quit()
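Batch jobs can be submitted to YARN in the same way. A sketch using the pi.py example bundled with Spark (verify the path under $SPARK_HOME/examples/src/main/python/):
$ spark-submit --master yarn --deploy-mode cluster \
    $SPARK_HOME/examples/src/main/python/pi.py 10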
After starting Spark, spark-shell or pyspark may print a warning:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Set the following in the environment variables:
$ vi ~/.bashrc
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/
And set the following in the Spark config file:
$ vi $SPARK_HOME/conf/spark-env.sh
export LD_LIBRARY_PATH=$JAVA_LIBRARY_PATH
If that does not take effect, add the following to the environment variables instead:
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/:$LD_LIBRARY_PATH
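Whether the native Hadoop library can actually be loaded can be checked with Hadoop's built-in diagnostic:
$ hadoop checknative -a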
If Hive is installed, spark-sql can also be used.
For the error ModuleNotFoundError: No module named 'py4j':
add the following to the environment variables:
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH
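After re-sourcing ~/.bashrc, the fix can be verified by importing pyspark directly from python3 (py4j is picked up from the zip on PYTHONPATH):
$ source ~/.bashrc
$ python3 -c 'import pyspark; print(pyspark.__version__)'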