Simple XML File Parsing in Java with DOM

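Preface
Goals of this article
- Preview of the target result
  - Sample file
  - Test results
Java classes for parsing XML files
- DocumentBuilderFactory
- DocumentBuilder
- Document
- NodeList
- Node
A quick look at the flow
- Warm-up exercise
- Partial structure of NodeList and Node
  - About #text
Utility class
Testing
- Test code and notes
- Output
Summary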

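Preface

Some third-party interfaces return an XML file, and the data inside naturally has to be extracted. So today let's look at how to parse an XML file and pull the data we want out of it.

Goals of this article

Get to know the classes used for basic XML handling
Understand the flow of parsing an XML file
Write a simple utility class for parsing XML files

Preview of the target result

Sample file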

<?xml version="1.0" encoding="UTF-8"?>
<info>
	<id>189845AUS485_25848</id>
	<sendTime>2022-11-17 14:24:15</sendTime>
	<content>测试信息</content>
	<student>
		<name>张三</name>
		<sex>1</sex>
		<grade>25</grade>
	</student>
	
	<student>
		<name>张二</name>
		<sex>0</sex>
		<grade>28</grade>
	</student>
	
	<other>
		<phone>18168485624</phone>
		<email>3561548659@qq.com</email>
		<director>蔡元培</director>
	</other>
</info>

Test results

============获取文件下的所有键值对、如果有重复的则只返回最后一个============
phone--18168485624
director--蔡元培
sex--0
grade--28
name--张二
id--189845AUS485_25848
content--测试信息
email--3561548659@qq.com
sendTime--2022-11-17 14:24:15
============获取other标签下的键值对============
phone--18168485624
director--蔡元培
email--3561548659@qq.com
============获取文件里student标签的所有数据并转为对应类的List============
Student{name='张三', sex=1, grade=25.0}
Student{name='张二', sex=0, grade=28.0}

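Java classes for parsing XML files

Note: the excerpts below are only part of each class's documentation comment. For the full details, see the JavaDoc.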

DocumentBuilderFactory

Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents.

In short, this factory is the entry point: it hands the application a parser that turns an XML document into a DOM object tree.
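Since the XML in this article comes from third-party interfaces, it can be worth configuring the factory before asking it for a parser. The following is a minimal sketch and not part of the original article: the class and method names are hypothetical, and the disallow-doctype-decl feature URI is the one understood by the JDK's built-in (Xerces-based) parser, so other parser implementations may reject it.

```java
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.StringReader;

class SafeFactorySketch {

    // Hypothetical helper: a factory whose parsers refuse documents that contain a DOCTYPE
    static DocumentBuilderFactory newHardenedFactory() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Assumption: feature URI of the JDK's default parser implementation
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory;
    }

    // Returns true if the XML parses, false if the parser rejects it
    static boolean parses(String xml) {
        try {
            Document doc = newHardenedFactory().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc != null;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser setup or input failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<info><id>1</id></info>"));
        System.out.println(parses("<!DOCTYPE info [<!ENTITY x \"y\">]><info>&x;</info>"));
    }
}
```

Whether this hardening is needed depends on how much the XML source is trusted; the rest of the article uses the factory with its default settings.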

DocumentBuilder

Defines the API to obtain DOM Document instances from an XML document. Using this class, an application programmer can obtain a Document from XML.

This is the parser itself: give it XML and it returns a Document.

Document

Since Document was just mentioned, let's look at that class's comment as well.

The Document interface represents the entire HTML or XML document. Conceptually, it is the root of the document tree, and provides the primary access to the document's data.

That is, a Document stands for the whole file and is the root from which all of its data is reached.

NodeList

The NodeList interface provides the abstraction of an ordered collection of nodes, without defining or constraining how this collection is implemented.

A NodeList is an ordered collection of nodes; how that collection is implemented is deliberately left open.
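One practical consequence is that NodeList is not a java.util.List and cannot be used in a for-each loop; it only offers getLength() and item(int). A small sketch of copying it into a List follows. The class and helper names are hypothetical and the XML string is made up for the demo.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class NodeListSketch {

    // Copies a NodeList into a java.util.List so it works with for-each and streams
    static List<Node> toList(NodeList nodeList) {
        List<Node> nodes = new ArrayList<>(nodeList.getLength());
        for (int i = 0; i < nodeList.getLength(); i++) {
            nodes.add(nodeList.item(i));
        }
        return nodes;
    }

    // Demo helper: names of the direct children of the root element
    static List<String> childNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            for (Node child : toList(doc.getDocumentElement().getChildNodes())) {
                names.add(child.getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childNames("<info><id>1</id><content>x</content></info>"));
    }
}
```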

Node

The Node interface is the primary datatype for the entire Document Object Model. It represents a single node in the document tree.

Everything in the document tree is a Node.

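A quick look at the flow

The class names DocumentBuilderFactory, DocumentBuilder, and Document already reveal how the three relate:

The factory is used to obtain a DocumentBuilder, and the DocumentBuilder is used to obtain a Document.
The Document, in turn, represents an XML or HTML file.

NodeList is clearly a collection of Node objects, and Node is the primary data type of a Document. So all we need to do is get a NodeList from the Document and then read the data of the individual Nodes.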
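The chain described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name, the firstText helper, and the inline XML string are hypothetical, and the XML is parsed from a string rather than a file so the sketch is self-contained.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

class DomFlowSketch {

    // Hypothetical helper: text of the first element with the given tag name, or null if absent
    static String firstText(String xml, String tagName) {
        try {
            // Factory -> DocumentBuilder -> Document
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Document -> NodeList -> Node
            NodeList nodes = doc.getElementsByTagName(tagName);
            Node first = nodes.item(0); // item() returns null when the index is out of range
            return first == null ? null : first.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<info><id>189845AUS485_25848</id><content>demo</content></info>";
        System.out.println(firstText(xml, "id"));
    }
}
```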

Warm-up exercise

Take the following XML file as an example

<?xml version="1.0" encoding="UTF-8"?>
<info>
	<id>189845AUS485_25848</id>
	<sendTime>2022-11-17 14:24:15</sendTime>
	<content>测试信息</content>
	<student>
		<name>张三</name>
		<sex>1</sex>
		<grade>25</grade>
	</student>
	
	<student>
		<name>张二</name>
		<sex>0</sex>
		<grade>28</grade>
	</student>
	
	<other>
		<phone>18168485624</phone>
		<email>3561548659@qq.com</email>
		<director>蔡元培</director>
	</other>
</info>

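Now let's try to get all the information for the first student under the student tag. There is no single right answer; anything that prints all of 张三's information will do.

Here is my approach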

package com.xml;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.IOException;

/**
 * Warm-up: print every field of the first student in the sample file.
 *
 * @author 三文鱼先生
 * @date 2022/11/21
 **/
public class EasyTest {
    public static void main(String[] args) {
        // Backslashes must be escaped in a Java string literal;
        // "\f" and "\t" would otherwise be read as form feed and tab
        String path = "E:\\file\\test.xml";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            // The XML can be supplied as a File
            File file = new File(path);
            // Parse the file into a Document object
            Document doc = db.parse(file);
            // Get the NodeList of all elements with the given tag name
            NodeList list = doc.getElementsByTagName("student");
            // Prints how many student elements were found
            System.out.println("info子节点有 " + list.getLength());
            // Get the first student node
            Node firstStu = list.item(0);
            // Get all of its child nodes; both data and hierarchy are represented as nodes
            NodeList stuNodes = firstStu.getChildNodes();
            // Prints the child-node count, which includes the whitespace text nodes
            System.out.println("第一个学生节点的属性个数为：" + stuNodes.getLength());
            // Iterate over all child nodes and print each node's index and name
            System.out.println("学生的所有节点为：");
            for (int i = 0; i < stuNodes.getLength(); i++) {
                Node propertyNode = stuNodes.item(i);
                System.out.println("第" + i + "节点的名称为：" + propertyNode.getNodeName());
                // A child without children of its own is a line-break (whitespace) text node: skip it
                if (propertyNode.getChildNodes().getLength() == 0) {
                    continue;
                }
                NodeList propertyNodeList = propertyNode.getChildNodes();
                for (int j = 0; j < propertyNodeList.getLength(); j++) {
                    // Set a breakpoint here to inspect the children of a data node;
                    // usually there is a single child, and it holds the data
                    Node dataNode = propertyNodeList.item(j);
                }
                // A node that carries data contains a text node, normally as its first child
                Node valueNode = propertyNode.getFirstChild();
                String value = valueNode.getNodeValue();
                // Get the node's tag name
                String nodeName = propertyNode.getNodeName();
                // Prints "tag name--value"
                System.out.print("\t值:" + nodeName + "--" + value + "\t");
                System.out.println("");
            }
        } catch (ParserConfigurationException | IOException | SAXException e) {
            e.printStackTrace();
        }
    }
}

The output is:

info子节点有 2
第一个学生节点的属性个数为：7
学生的所有节点为：
第0节点的名称为：#text
第1节点的名称为：name
	值:name--张三	
第2节点的名称为：#text
第3节点的名称为：sex
	值:sex--1	
第4节点的名称为：#text
第5节点的名称为：grade
	值:grade--25	
第6节点的名称为：#text

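We analyze this output below.

Partial structure of NodeList and Node

Here is a look at some of the underlying data structures.
Find the following line and inspect the structure of list in the debugger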

    NodeList list = doc.getElementsByTagName("student");

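In list, locate rootNode, that is, the root node.
In the list of all nodes obtained for the student tag, the structure contains an fNodeName array.

This one is easy to understand: it holds all the tag names.
The fNodeValue array further down is more interesting.

It does contain values, but not only the values of the tags: it also contains strings such as \n\t.

\n\t can be understood like this: in a Document, the line break and the indentation between tags are themselves treated as a node. I will call such a node a line-break node here; it reflects the nesting level.

The more \t characters follow the \n, the deeper the level.
For example, inside the student tag above, the line-break node in front of each child tag is a \n followed by two \t characters.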
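These line-break nodes can be made visible without a debugger. The sketch below prints the name and the escaped value of each child of the first element with a given tag. The class and helper names are hypothetical, and the XML is a shortened, ASCII-only variant of the sample file.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class WhitespaceSketch {

    // Makes line breaks and tabs visible; element nodes have a null value
    static String visible(String value) {
        return value == null ? "null" : value.replace("\n", "\\n").replace("\t", "\\t");
    }

    // One "name -> value" line per child of the first element with the given tag name
    static List<String> describeChildren(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getElementsByTagName(tagName).item(0).getChildNodes();
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                lines.add(child.getNodeName() + " -> " + visible(child.getNodeValue()));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<info>\n\t<student>\n\t\t<name>Tom</name>\n\t</student>\n</info>";
        for (String line : describeChildren(xml, "student")) {
            System.out.println(line);
        }
    }
}
```

For this input the children are a text node holding \n\t\t, the name element, and a text node holding \n\t in front of the closing student tag.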

About #text

In the output of the code above, several of the nodes (nodes 0, 2, 4, and 6) are named #text.
Yet the first student tag in our XML file looks like this

	<student>
		<name>张三</name>
		<sex>1</sex>
		<grade>25</grade>
	</student>

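So what is #text? Recall the fNodeValue array mentioned above, which stores node values. In a Document, such a value is itself represented as a node: a text node whose name is #text and whose value is the text, that is, the pair #text -- value.

The line-break nodes (the ones whose value looks like \n\t) are text nodes as well, so their name is also #text.

A #text node therefore plays one of two roles: it holds the value of the enclosing tag, or it marks the end of a branch.

The first role is easy to understand: it is simply the value of the node.
The name node, for example, contains exactly one child, #text -- 张三, which says that the tag name has the value 张三.

The second role means that the node has no children of its own. For any node named #text, getChildNodes() returns a NodeList of length 0, which is what the code above relies on to skip the line-break nodes.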
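Instead of testing for an empty child list, the whitespace nodes can also be skipped by node type. The following is a sketch of that alternative, not the article's method: the class and helper names are hypothetical, and it uses Node.ELEMENT_NODE and getTextContent() from the standard org.w3c.dom API.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

class ElementChildrenSketch {

    // Tag name -> text for every element child of the first element with the given tag name
    static Map<String, String> fieldsOfFirst(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getElementsByTagName(tagName).item(0).getChildNodes();
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Text nodes (including the whitespace between tags) are not elements
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.put(child.getNodeName(), child.getTextContent());
                }
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<info>\n\t<student>\n\t\t<name>Tom</name>\n\t\t<sex>1</sex>\n\t</student>\n</info>";
        System.out.println(fieldsOfFirst(xml, "student"));
    }
}
```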

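Utility class

Suppose you need the name and value of every node in an XML file. How would you do it? With loops? And what if the nesting depth is not known in advance?

It is called a DOM tree, and since it is a tree, the natural choice is a tree traversal: with a Stack, with a Queue, or recursively.

In the utility class below, pay attention to the recursion I use; it should make the code easier to follow.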
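For comparison with the recursive approach, here is a sketch of the stack-based variant mentioned above. It is an illustration only: the class and method names are hypothetical, and it identifies leaf tags by node type rather than by child count. As in the utility class, a repeated tag name keeps the last value.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

class IterativeTraversalSketch {

    // Depth-first traversal with an explicit stack; collects tag name -> text for leaf elements
    static Map<String, String> collectLeaves(Node root) {
        Map<String, String> result = new LinkedHashMap<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            boolean hasElementChild = false;
            NodeList children = node.getChildNodes();
            // Push in reverse so that children are popped in document order
            for (int i = children.getLength() - 1; i >= 0; i--) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    hasElementChild = true;
                    stack.push(child);
                }
            }
            // An element without element children is a leaf tag that carries data
            if (!hasElementChild && node.getNodeType() == Node.ELEMENT_NODE) {
                result.put(node.getNodeName(), node.getTextContent());
            }
        }
        return result;
    }

    // Demo helper: parse a string and traverse the whole document
    static Map<String, String> collectLeaves(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return collectLeaves(doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<info><id>1</id><student><name>A</name></student>"
                + "<student><name>B</name></student></info>";
        System.out.println(collectLeaves(xml));
    }
}
```

Swapping the stack for a queue (offer and poll instead of push and pop) would turn this into a breadth-first traversal.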

package xml;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * A simple utility class for reading information out of XML files.
 *
 * @author 三文鱼先生
 * @date 2022/11/26
 **/
public class XmlParseUtil {
    /**
     * Builds the Document for the file at the given path.
     *
     * @param path file path
     * @return the parsed Document, or null if the file cannot be read or parsed
     */
    public static Document getDocument(String path) {
        Document document = null;
        // Get a factory instance
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            // Get a DocumentBuilder
            DocumentBuilder db = dbf.newDocumentBuilder();
            // Parse the file into its Document object
            document = db.parse(new File(path));
        } catch (ParserConfigurationException | IOException | SAXException e) {
            e.printStackTrace();
        }
        return document;
    }

    /**
     * Builds the Document for the given input stream.
     *
     * @param inputStream stream containing the XML
     * @return the parsed Document, or null if the stream cannot be read or parsed
     */
    public static Document getDocument(InputStream inputStream) {
        Document document = null;
        // Get a factory instance
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            // Get a DocumentBuilder
            DocumentBuilder db = dbf.newDocumentBuilder();
            // Parse the stream into its Document object
            document = db.parse(inputStream);
        } catch (ParserConfigurationException | IOException | SAXException e) {
            e.printStackTrace();
        }
        return document;
    }

    /**
     * Collects every key-value pair in the file. If a tag name occurs more than once,
     * the last occurrence wins.
     *
     * @param path file path
     * @return map of tag name to text value, or null if the file could not be parsed
     */
    public static Map<String , String> getNotRepeatInfo(String path) {
        Document document = getDocument(path);
        if(document == null)
            return null;
        Map<String , String> map = new HashMap<>();
        // The Document itself can be treated as the root node
        NodeList rootList = document.getChildNodes();
        for(int i = 0; i < rootList.getLength(); i++) {
            getNodeInfo(rootList.item(i) , map ,false);
        }
        return map;
    }

    /**
     * Recursively collects all key-value pairs below a node.
     *
     * @param node the node to start from
     * @param map map that receives the key-value pairs
     * @param needRoot if true, the key is the full path from the root instead of the tag name
     * @return the same map that was passed in
     */
    public  static Map<String , String> getNodeInfo(Node node ,
                                                    Map<String , String> map,
                                                    boolean needRoot
    ) {
        // Child nodes of the current node
        NodeList childNodes = node.getChildNodes();
        if (childNodes.getLength() == 0) {
            // No children: nothing to collect (for example a line-break text node)
            return map;
        } else if (childNodes.getLength() == 1) {
            // Exactly one child: treat it as the text node holding the data and store its value
            String value = node.getFirstChild().getNodeValue();
            if(needRoot)
                // Strip the "#document." prefix so that only the tags visible in the file remain
                map.put(getParentName(node , "").replace("#document." , "")
                        ,value);
            else
                map.put(node.getNodeName() , value);
            return map;
        } else {
            // Several children: this is a parent node, so recurse into every child
            for (int i = 0; i < childNodes.getLength(); i++) {
                getNodeInfo(childNodes.item(i), map , needRoot);
            }
            return map;
        }
    }

    /**
     * Builds the full dotted path of the given node, starting at "#document".
     *
     * @param node the node whose path is wanted
     * @param str initial string; may be null or ""
     * @return the path, for example "#document.info.student.name"
     */
    public static String getParentName(Node node , String str) {
        // Reached the root
        if (node.getNodeName().equals("#document")) {
            return "#document";
        }
        String path = node.getNodeName();
        str = getParentName(node.getParentNode() , str) + "." + path;
        return str;
    }

    /**
     * Collects all key-value pairs below a node, keyed by tag name.
     *
     * @param node the node to start from
     * @param map map that receives the key-value pairs
     * @return the same map that was passed in
     */
    public  static Map<String , String> getNodeInfo(Node node ,
                                                    Map<String , String> map
    ) {
        // Full paths are not needed by default
        return  getNodeInfo(node , map , false);
    }

    /**
     * Collects all key-value pairs of the first tag with the given name.
     *
     * @param filePath file path
     * @param tagName tag name
     * @param map map that receives the key-value pairs
     * @return the same map that was passed in
     */
    public  static Map<String , String> getNodeInfo(
            String filePath , String tagName ,Map<String , String> map
    ) {

        return  getNodeInfo(filePath , tagName , map , 0);
    }

    /**
     * Collects all key-value pairs of the tag with the given name at the given position.
     *
     * @param filePath file path
     * @param tagName tag name
     * @param map map that receives the key-value pairs
     * @param index zero-based index among the tags with that name
     * @return the same map that was passed in; unchanged if the file or the tag is missing
     */
    public  static Map<String , String> getNodeInfo(
            String filePath , String tagName ,Map<String , String> map ,int index
    ) {
        Document document = getDocument(filePath);
        // getDocument returns null when parsing fails
        if(document == null)
            return map;
        // item() returns null when the index is out of range
        Node node = document.getElementsByTagName(tagName).item(index);
        if(node == null)
            return map;
        // Full paths are not needed by default
        return  getNodeInfo(node , map , false);
    }

    /**
     * Converts every node in the list into an object of the given class.
     * Each child tag is written to the field of the same name through its setter.
     *
     * @param nodeList the nodes to convert
     * @param cs the target class; it needs a no-argument constructor and public setters
     * @return list of converted objects; empty if nodeList is empty
     */
    @SuppressWarnings("unchecked")
    public static <T> List<T> getObjectToList(NodeList nodeList , Class<?> cs) {
        // Return an empty list rather than null so callers can iterate safely
        List<T> list = new ArrayList<>();
        // Field name -> field type of the target class
        Map<String , Class<?>> paramsMap = getMapWithParams(cs);
        for(int i = 0; i < nodeList.getLength(); i++) {
            Node itemNode = nodeList.item(i);
            NodeList childrenNodeList = itemNode.getChildNodes();
            try {
                // Class.newInstance() is deprecated since Java 9; go through the constructor
                T e = (T) cs.getDeclaredConstructor().newInstance();
                // Walk through all child tags
                for(int j = 0; j < childrenNodeList.getLength(); j++) {
                    Node paramNode = childrenNodeList.item(j);
                    if(paramNode.getChildNodes().getLength() > 0) {
                        String paramName = paramNode.getNodeName();
                        Class<?> paramType = paramsMap.get(paramName);
                        // Skip tags that have no matching field in the class
                        if(paramType == null)
                            continue;
                        // Read the raw text, then convert it to the field's type
                        Object paramValue = changeValueType(
                                paramNode.getFirstChild().getNodeValue() , paramType);
                        // Call the setter
                        cs.getMethod(getSetterMethodName(paramName) , paramType)
                                .invoke(e , paramValue);
                    }
                }
                list.add(e);
            }catch (Exception exception) {
                exception.printStackTrace();
            }

        }
        return list;
    }

    /**
     * Reads all tags with the given name from the file and converts them into a list of objects.
     *
     * @param filePath file path
     * @param tagName tag name
     * @param cs the target class
     * @return list of converted objects
     */
    public static <T> List<T> getObjectToList(String filePath , String tagName , Class<?> cs) {
        return getObjectToList(getDocument(filePath).getElementsByTagName(tagName) , cs);
    }

    /**
     * Builds a map from field name to field type for the given class.
     *
     * @param cs the class to inspect
     * @return map of declared field name to field type
     */
    private static Map<String , Class<?>> getMapWithParams(Class<?> cs) {
        Field[] fields = cs.getDeclaredFields();
        Map<String , Class<?>> classMap = new HashMap<>(16);
        for (Field field : fields) {
            classMap.put(field.getName() , field.getType());
        }
        return classMap;
    }

    /**
     * Builds the setter method name for a field.
     *
     * @param param field name, for example "name"
     * @return setter name, for example "setName"
     */
    private static String getSetterMethodName(String param) {
        // Upper-case the first letter. Character.toUpperCase is safer than subtracting 32,
        // which is only correct for ASCII letters
        return "set" + Character.toUpperCase(param.charAt(0)) + param.substring(1);
    }

    /**
     * Converts a value to the given type. Only a few wrapper classes are supported;
     * anything else is returned unchanged.
     *
     * @param o the value to convert
     * @param cs the target type
     * @return the converted value
     */
    private static Object changeValueType(Object o , Class<?> cs) {
        // Convert to the matching wrapper type
        if(Integer.class.equals(cs))
            return Integer.valueOf(o.toString());
        else if(Double.class.equals(cs))
            return Double.valueOf(o.toString());
        else if(Float.class.equals(cs))
            return Float.valueOf(o.toString());
        // Default: leave the value as it is (a String)
        return o;
    }
}

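Testing

Test code and notes

The tests below use the same file as above. They are not exhaustive and cover only some of the commonly used methods. If you are interested, try out the rest of the utility class yourself.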

public static void main(String[] args) {

        // Heading: all key-value pairs in the file; for duplicates only the last one is returned
        System.out.println("============获取文件下的所有键值对、如果有重复的则只返回最后一个============");
        // Get all distinct key-value pairs in the file; for duplicate tag names the last one wins
        Map<String , String> map = XmlParseUtil.getNotRepeatInfo("E:\\file\\test.xml");
        for (Map.Entry<String, String> stringStringEntry : map.entrySet()) {
            System.out.println(stringStringEntry.getKey() + "--" + stringStringEntry.getValue());
        }


        // Heading: key-value pairs under the other tag
        System.out.println("============获取other标签下的键值对============");
        HashMap<String , String> nodeMap = new HashMap<>();
        XmlParseUtil.getNodeInfo("E:\\file\\test.xml" , "other" , nodeMap);
        for (Map.Entry<String, String> entry : nodeMap.entrySet()) {
            System.out.println(entry.getKey() + "--" + entry.getValue());
        }

        // Heading: all student tags in the file, converted to a List of the matching class
        System.out.println("============获取文件里student标签的所有数据并转为对应类的List============");
        List<Student> list = XmlParseUtil.getObjectToList("E:\\file\\test.xml" ,"student", Student.class );
        for (Student student : list) {
            System.out.println(student.toString());
        }

    }
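The last test maps the student tags onto a Student class that the article does not show. Judging from the printed output and from changeValueType handling only wrapper types, it presumably looks something like the sketch below. The field types, accessors, and toString format are assumptions inferred from that output, not the author's actual class.

```java
// Assumed shape of the Student class used by the test; declare it public
// if it lives in a different package than XmlParseUtil
class Student {
    private String name;
    // Wrapper types, because changeValueType only converts Integer, Double, and Float
    private Integer sex;
    private Double grade;

    // getObjectToList needs a no-argument constructor
    public Student() {
    }

    public String getName() {
        return name;
    }

    // getObjectToList looks up public setters named set + field name
    public void setName(String name) {
        this.name = name;
    }

    public Integer getSex() {
        return sex;
    }

    public void setSex(Integer sex) {
        this.sex = sex;
    }

    public Double getGrade() {
        return grade;
    }

    public void setGrade(Double grade) {
        this.grade = grade;
    }

    @Override
    public String toString() {
        return "Student{name='" + name + "', sex=" + sex + ", grade=" + grade + "}";
    }
}
```

With primitive int or double fields the setter call would fail, since changeValueType would hand the setter a String.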

Output

============获取文件下的所有键值对、如果有重复的则只返回最后一个============
phone--18168485624
director--蔡元培
sex--0
grade--28
name--张二
id--189845AUS485_25848
content--测试信息
email--3561548659@qq.com
sendTime--2022-11-17 14:24:15
============获取other标签下的键值对============
phone--18168485624
director--蔡元培
email--3561548659@qq.com
============获取文件里student标签的所有数据并转为对应类的List============
Student{name='张三', sex=1, grade=25.0}
Student{name='张二', sex=0, grade=28.0}