Highlighting and jumping to given PDF text segments with pdf.js

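Rewriting the pdf.js class
- Implementing the basic display requirement
- Implementing the highlight feature
  - Analysis of the search feature
  - Segment data processing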

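Rewriting the pdf.js class

Requirement: a PDF file is parsed into multiple segments. Each segment must be displayable, and clicking a segment must highlight the corresponding content in the source PDF and jump to it.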

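pdf.js documentation (Chinese translation)
https://gitcode.gitcode.host/docs-cn/pdf.js-docs-cn/getting_started/index.html
https://github.com/mozilla/pdf.js
The documentation is not detailed enough. The hard part of working with pdf.js is the documentation.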

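Implementing the basic display requirement

pdf.js is a JavaScript library developed by Mozilla that displays PDF documents in a web browser. It renders PDF documents onto HTML5 Canvas elements and uses JavaScript to control rendering and interaction. With pdf.js, PDF documents can be read on the web without installing Adobe Reader or any other PDF reader on the computer. pdf.js is free, open-source software, and it is easy to use and modify.

pdf.js / src is the API layer of pdf.js.

pdf.js / web is the viewer layer. It builds the UI on top of the API layer, including lazy loading of PDF pages, page switching, zooming, text search, opening local files, sidebar navigation, printing, and so on.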

Prebuilt version
├── build/
│   ├── pdf.js                             - display layer
│   ├── pdf.js.map                         - display layer's source map
│   ├── pdf.worker.js                      - core layer
│   └── pdf.worker.js.map                  - core layer's source map
├── web/
│   ├── cmaps/                             - character maps (required by core)
│   ├── compressed.tracemonkey-pldi-09.pdf - PDF file for testing purposes
│   ├── debugger.js                        - helpful debugging features
│   ├── images/                            - images for the viewer and annotation icons
│   ├── locale/                            - translation files
│   ├── viewer.css                         - viewer style sheet
│   ├── viewer.html                        - viewer layout
│   ├── viewer.js                          - viewer layer
│   └── viewer.js.map                      - viewer layer's source map
└── LICENSE


Source version
├── docs/                                  - website source code
├── examples/                              - simple usage examples
├── extensions/                            - browser extension source code
├── external/                              - third party code
├── l10n/                                  - translation files
├── src/
│   ├── core/                              - core layer
│   ├── display/                           - display layer
│   ├── shared/                            - shared code between the core and display layers
│   ├── interfaces.js                      - interface definitions for the core/display layers
│   ├── pdf.*.js                           - wrapper files for bundling
│   └── worker_loader.js                   - used for developer builds to load worker files
├── test/                                  - unit, font and reference tests
├── web/                                   - viewer layer
├── LICENSE
├── README.md
├── gulpfile.js                            - build scripts/logic
├── package-lock.json                      - pinned dependency versions
└── package.json                           - package definition and dependencies

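Plenty of people have already written about the display feature, so I won't cover it here. See this article instead:
pdf.js使用全教程 (a complete pdf.js usage tutorial)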

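Implementing the highlight feature

The segment text produced by the backend should be the same as the text the front end gets from pdf.js, that is, fragments such as ai \n mind\n 是一款\n. So the backend segments can be matched against the front-end text, using the start index and the length of each match. pdf.js implements its own search highlighting in the same way.

Segment data format: [segment 1, segment 2] (see the notes further down)
Remove the other viewer features (hide them by adding the hidden class name directly in the pdfview.html file).
Segment rendering can follow the PDF search feature: from the segment data and the text that pdf.js extracts, compute data with the same structure as the search-highlight data, then let the native pdf.js rendering draw it.
Segment positioning can follow the PDF search feature as well.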
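
The matching idea above can be sketched as follows. This is a minimal illustration, not the author's implementation: `stripWithMap` and `locateSegment` are hypothetical helper names, and the sketch assumes that ignoring whitespace is enough to make the two texts comparable. It returns a start index and a length relative to the page text, the same shape that the search feature stores per match.

```javascript
// Hypothetical helper: remove whitespace, remembering where each kept
// character came from in the original string.
function stripWithMap(text) {
  const chars = [];
  const map = []; // map[k] = index in `text` of the k-th kept character
  for (let i = 0; i < text.length; i++) {
    if (!/\s/.test(text[i])) {
      chars.push(text[i]);
      map.push(i);
    }
  }
  return { stripped: chars.join(""), map };
}

// Hypothetical helper: find `segment` (backend text) inside the page text
// (the joined pdf.js text items), ignoring whitespace differences.
function locateSegment(textItems, segment) {
  const pageText = textItems.join("");
  const page = stripWithMap(pageText);
  const needle = stripWithMap(segment).stripped;
  if (needle.length === 0) return null;
  const at = page.stripped.indexOf(needle);
  if (at === -1) return null;
  const start = page.map[at];
  const last = page.map[at + needle.length - 1];
  // Start index and length are expressed in the ORIGINAL page text.
  return { start, length: last - start + 1 };
}

const pageItems = ["Ai", "Mind ", "is a", " tool"];
console.log(locateSegment(pageItems, "AiMind\nis a"));
// { start: 0, length: 11 }
```

Mapping back to the original offsets matters because the highlighter works on the unmodified text of each div, not on the stripped copy.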

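Analysis of the search feature

findController.pageMatches: for page n, the start index of each match, relative to that page's text content

findController.pageMatchesLength: for page n, the length of each matched string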

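this._convertMatches turns these two arrays into a list of matches, each with a begin and an end position of the form { divIdx, offset }.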
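
To make the two data shapes concrete, here is a standalone sketch of the same conversion. It mirrors the loop of `_convertMatches` as pasted later in this post, but as a free function (`convertMatches` is a name chosen for this example) that takes the text items as a parameter. The sample strings are made up.

```javascript
// Standalone sketch of the conversion done by _convertMatches:
// (start index, length) in page text -> begin/end as { divIdx, offset }.
function convertMatches(textItems, matches, matchesLength) {
  const result = [];
  let i = 0; // index of the current text item (div)
  let iIndex = 0; // offset of item i within the joined page text
  const last = textItems.length - 1;
  for (let m = 0; m < matches.length; m++) {
    let matchIdx = matches[m];
    // Advance to the item that contains the start of the match.
    while (i !== last && matchIdx >= iIndex + textItems[i].length) {
      iIndex += textItems[i].length;
      i++;
    }
    const begin = { divIdx: i, offset: matchIdx - iIndex };
    // Advance to the item that contains the end of the match.
    matchIdx += matchesLength[m];
    while (i !== last && matchIdx > iIndex + textItems[i].length) {
      iIndex += textItems[i].length;
      i++;
    }
    result.push({ begin, end: { divIdx: i, offset: matchIdx - iIndex } });
  }
  return result;
}

const sampleItems = ["Ai", "Mind ", "is a", " tool"];
// One match starting at index 0 with length 11 ("AiMind is a").
console.log(JSON.stringify(convertMatches(sampleItems, [0], [11])));
// [{"begin":{"divIdx":0,"offset":0},"end":{"divIdx":2,"offset":4}}]
```

The match starts in div 0 and ends 4 characters into div 2, so the renderer highlights the tail of div 0, all of div 1, and the first 4 characters of div 2.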


The original _updateMatches method, with logging added:

_updateMatches(reset = false) { // Clears the old highlights, then calls _renderMatches to render the new ones
  if (!this.enabled && !reset) {
    return;
  }
  const { findController, matches, pageIdx } = this;
  const { textContentItemsStr, textDivs } = this;
  let clearedUntilDivIdx = -1;
  // Restore the plain text of every div touched by the previous matches
  for (const match of matches) {
    const begin = Math.max(clearedUntilDivIdx, match.begin.divIdx);
    for (let n = begin, end = match.end.divIdx; n <= end; n++) {
      const div = textDivs[n];
      div.textContent = textContentItemsStr[n];
      div.className = "";
    }
    clearedUntilDivIdx = match.end.divIdx + 1;
  }
  if (!findController?.highlightMatches || reset) {
    return;
  }
  // Per page: start index of each match, relative to that page's text content
  console.log('findController.pageMatches', findController.pageMatches)
  // Per page: length of each matched string
  console.log('findController.pageMatchesLength', findController.pageMatchesLength)

  const pageMatches = findController.pageMatches[pageIdx] || null;

  console.log('pageMatches', pageMatches);

  const pageMatchesLength = findController.pageMatchesLength[pageIdx] || null;
  this.matches = this._convertMatches(pageMatches, pageMatchesLength);
  console.log('this.matches', this.matches)

  this._renderMatches(this.matches);
}

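Processing logic in the text_highlighter class

enable: called when the page is rendered; binds the event listeners and triggers a highlight update

_convertMatches: converts data 1 (start indexes and lengths) into data 2 (begin/end positions)

_updateMatches: clears the old highlight styles and calls _renderMatches to render the new ones

_renderMatches: renders the highlights and calls findController.scrollMatchIntoView to scroll to the target position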

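Segment data processing

Data structure
Segment data: it must carry page information, otherwise it cannot be matched to each page's TextHighlighter instance.
[
	// Segments on page 1
	{
		pageIndex: 0,
		cutInfo: [
			'content 1', 'content 2', .....
		]
	},
	// Segments on page 2
	{
		pageIndex: 1,
		cutInfo: [
			'content 1', 'content 2', .....
		]
	}

]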
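
Since each page has its own highlighter instance, it can help to index the segment list by page before doing any matching. The following is a small sketch under that assumption. `indexSegmentsByPage` is a hypothetical helper, and the field names `pageIndex` and `cutInfo` are taken from the structure above.

```javascript
// Hypothetical helper: group segment texts by page so that the highlighter
// for page N can look up its own segments directly.
function indexSegmentsByPage(segments) {
  const byPage = new Map();
  for (const { pageIndex, cutInfo } of segments) {
    const list = byPage.get(pageIndex) ?? [];
    list.push(...cutInfo);
    byPage.set(pageIndex, list);
  }
  return byPage;
}

const segments = [
  { pageIndex: 0, cutInfo: ["content 1", "content 2"] },
  { pageIndex: 1, cutInfo: ["content 3"] },
];
const byPage = indexSegmentsByPage(segments);
console.log(byPage.get(0)); // [ 'content 1', 'content 2' ]
console.log(byPage.get(5)); // undefined (no segments on that page)
```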

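Notes:

1. Process the text with normalizeUnicode. For example, ﬁ is a single character (a ligature) that the front end and the backend may parse differently. Run the text from both sides through the normalizeUnicode function exposed by the pdf.js API; after processing it becomes the two characters f and i.

2. Filter out whitespace: the front-end and backend text may differ in spaces, line breaks, and other whitespace (for example, the front end collapses several consecutive spaces into one), so these characters must be filtered out before comparing. (The regular expression I use is /\s|\u0000|\./g.)

3. The backend segment data and the text the front end extracts from the PDF do not agree exactly, so the rendered highlight cannot always line up perfectly.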
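
Notes 1 and 2 can be combined into one normalization step applied to both texts before comparing them. The sketch below is an assumption-laden illustration: it uses the built-in `String.prototype.normalize("NFKC")` as a stand-in for `normalizeUnicode` from pdf.js (the two are not guaranteed to behave identically), and `normalizeForMatching` is a hypothetical name.

```javascript
// Stand-in for pdf.js normalizeUnicode (assumption): NFKC normalization
// also expands compatibility ligatures such as U+FB01 into "fi".
function normalizeForMatching(text) {
  // Remove whitespace, NUL characters, and periods, as in note 2.
  return text.normalize("NFKC").replace(/\s|\u0000|\./g, "");
}

console.log(normalizeForMatching("\uFB01 le  name.\n")); // "filename"

// Pitfall: an unescaped dot matches (almost) any character,
// so the whole string is wiped out.
console.log("abc".replace(/\s|\u0000|./g, "")); // ""
```

The period must be escaped as `\.` in the pattern. Without the backslash the expression removes every character, not just periods.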

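Processing steps

1. Convert the segment data into per-page data 2 (begin/end positions), named pagesMatches, and add custom markers to each entry.

2. Register an updatePagesMatches event in text_highlighter to receive and store pagesMatches, then call _updateMatches to re-render.

3. Modify _updateMatches in text_highlighter: (a) clear the previous highlights; (b) merge page n's pagesMatches with the data 2 produced by the search, generating new data 2 that carries the custom markers, so that search keeps working after the segments have been highlighted.

4. Call _renderMatches to render, adding the extended fields to the HTML elements and applying styles. The highlight rendering is now complete.

5. Locating the highlight:

Add an extended field isSelected to the pagesMatches data.
Scroll the target page into view with PDFViewerApplication.pdfViewer.currentPageNumber = n.

During rendering in _renderMatches, use isSelected together with the search's selected state to decide which HTML element to scroll to.

Call findController.scrollMatchIntoView to scroll.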
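
Steps 1 and 5 can be sketched as two small pure functions. This is a minimal illustration under stated assumptions: `buildPagesMatches` and `selectSection` are hypothetical names, the begin/end ranges are taken as already computed, and the `section-N` class name only imitates the naming used in the test data at the end of this post.

```javascript
// Step 1 (sketch): attach the custom markers to precomputed ranges.
// rangesByPage: { [pageIndex]: [{ sectionIndex, begin, end }] }
function buildPagesMatches(rangesByPage) {
  const pagesMatches = {};
  for (const [pageIndex, ranges] of Object.entries(rangesByPage)) {
    pagesMatches[pageIndex] = ranges.map((range) => ({
      ...range,
      className: `section-${range.sectionIndex}`,
      isSelected: false,
    }));
  }
  return pagesMatches;
}

// Step 5 (sketch): mark one section as selected and report its page number.
// Returns a NEW object, so an identity check against the previous data fails.
function selectSection(pagesMatches, sectionIndex) {
  let pageNumber = null;
  const next = {};
  for (const [pageIndex, list] of Object.entries(pagesMatches)) {
    next[pageIndex] = list.map((match) => {
      const isSelected = match.sectionIndex === sectionIndex;
      if (isSelected && pageNumber === null) {
        pageNumber = Number(pageIndex) + 1; // page numbers are 1-based
      }
      return { ...match, isSelected };
    });
  }
  return { pagesMatches: next, pageNumber };
}

const built = buildPagesMatches({
  0: [{ sectionIndex: 0, begin: { divIdx: 0, offset: 0 }, end: { divIdx: 0, offset: 2 } }],
  1: [{ sectionIndex: 1, begin: { divIdx: 3, offset: 0 }, end: { divIdx: 3, offset: 5 } }],
});
const selection = selectSection(built, 1);
console.log(selection.pageNumber); // 2
// In the viewer one would then set pdfViewer.currentPageNumber to
// selection.pageNumber and dispatch "updatePagesMatches" with
// { pagesMatches: selection.pagesMatches }.
```

Returning a fresh object matters for the class below: its updatePagesMatches handler only re-arms the scroll-to-selection flag when the incoming pagesMatches is not the same object as the stored one.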

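// The modified TextHighlighter class, pasted in full
// Module-level state shared by all page instances. These declarations are not
// shown in the original paste; they are assumed here so the class is self-contained.
let defaultPagesMatches = null;
let defaultPagesMatchesIsFocus = false;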
class TextHighlighter {
  constructor({
    findController,
    eventBus,
    pageIndex
  }) {
    this.findController = findController;
    this.matches = [];
    this.eventBus = eventBus;
    this.pageIdx = pageIndex;
    this._onUpdateTextLayerMatches = null;
    this.textDivs = null;
    this.textContentItemsStr = null;
    this.enabled = false;

    // Create and register _onUpdatePagesMatches if it does not exist yet
    if (!this._onUpdatePagesMatches) {
      this._onUpdatePagesMatches = (evt) => {
        if (evt.pagesMatches !== defaultPagesMatches) {
          defaultPagesMatches = evt.pagesMatches;
          defaultPagesMatchesIsFocus = true;
          sessionStorage.removeItem("pdfFindBar");
        }
        this._updateMatches(false);
      };
      this.eventBus._on("updatePagesMatches", this._onUpdatePagesMatches);
    }
  }
  setTextMapping(divs, texts) {
    this.textDivs = divs;
    this.textContentItemsStr = texts;
  }
  enable() { // Called when the page is rendered: binds the event listeners and triggers a highlight update
    console.log('enable')
    if (!this.textDivs || !this.textContentItemsStr) {
      throw new Error("Text divs and strings have not been set.");
    }
    if (this.enabled) {
      throw new Error("TextHighlighter is already enabled.");
    }
    console.log('page rendered ------------');
    this.enabled = true;
    if (!this._onUpdateTextLayerMatches) {
      this._onUpdateTextLayerMatches = evt => {
        if (evt.pageIndex === this.pageIdx || evt.pageIndex === -1) {
          this._updateMatches();
        }
      };
      this.eventBus._on("updatetextlayermatches", this._onUpdateTextLayerMatches);
    }
    if (!this._onUpdatePagesMatches) {
      this._onUpdatePagesMatches = (evt) => {
        if (evt.pagesMatches !== defaultPagesMatches) {
          defaultPagesMatches = evt.pagesMatches;
          defaultPagesMatchesIsFocus = true;
          sessionStorage.removeItem("pdfFindBar");
        }
        this._updateMatches(false);
      };
      this.eventBus._on("updatePagesMatches", this._onUpdatePagesMatches);
    }
    this._updateMatches();
  }
  disable() {
    if (!this.enabled) {
      return;
    }
    console.log('disable')
    this.enabled = false;
    if (this._onUpdateTextLayerMatches) {
      this.eventBus._off("updatetextlayermatches", this._onUpdateTextLayerMatches);
      this._onUpdateTextLayerMatches = null;
    }
    // Remove the listener when the highlighter is disabled
    if (this._onUpdatePagesMatches) {
      this.eventBus._off("updatePagesMatches", this._onUpdatePagesMatches);
      this._onUpdatePagesMatches = null;
    }
    this._updateMatches(true);
  }
  _convertMatches(matches, matchesLength) { // Converts start-index/length data into begin/end format
    if (!matches) {
      return [];
    }
    const {
      textContentItemsStr
    } = this;
    let i = 0
    let iIndex = 0;
    const end = textContentItemsStr.length - 1;
    const result = [];

    try {
      for (let m = 0, mm = matches.length; m < mm; m++) {
        let matchIdx = matches[m];
        while (i !== end && matchIdx >= iIndex + textContentItemsStr[i].length) {
          iIndex += textContentItemsStr[i].length;
          i++;
        }
        if (i === textContentItemsStr.length) {
          console.error("Could not find a matching mapping");
        }
        const match = {
          begin: {
            divIdx: i,
            offset: matchIdx - iIndex
          }
        };
        matchIdx += matchesLength[m];
        while (i !== end && matchIdx > iIndex + textContentItemsStr[i].length) {
          iIndex += textContentItemsStr[i].length;
          i++;
        }
        match.end = {
          divIdx: i,
          offset: matchIdx - iIndex
        };
        result.push(match);
      }
    } catch (e) {
      // Report the failure instead of stopping in the debugger
      console.error("_convertMatches failed", e);
    }
    return result;
  }


  _renderMatches(matches) {
    // Early exit if there is nothing to render.
    if (matches.length === 0) {
      return;
    }
    const isPagesMatch = sessionStorage.getItem("pdfFindBar") !== "pdfFindBar";
    const { textContentItemsStr, textDivs, findController, pageIdx } = this;
    if (!textDivs?.length) {
      return;
    }

    const isSelectedPage = findController?.selected
      ? pageIdx === findController.selected.pageIdx
      : true;
    const selectedMatchIdx = findController?.selected?.matchIdx ?? 0;
    // const highlightAll = !options ? findController.state.highlightAll : true;
    const highlightAll = true;
    let prevEnd = null;
    const infinity = {
      divIdx: -1,
      offset: undefined,
    };

    function beginText(begin, className, styles) {
      const divIdx = begin.divIdx;
      if (!textDivs[divIdx]) {
        return;
      }
      textDivs[divIdx].textContent = "";
      return appendTextToDiv(divIdx, 0, begin.offset, className, styles);
    }

    function appendTextToDiv(divIdx, fromOffset, toOffset, className, styles) {
      let div = textDivs[divIdx];
      if (!div) {
        return;
      }
      if (div.nodeType === Node.TEXT_NODE) {
        const span = document.createElement("span");
        div.before(span);
        span.append(div);
        textDivs[divIdx] = span;
        div = span;
      }
      const content = textContentItemsStr[divIdx].substring(
        fromOffset,
        toOffset
      );
      const node = document.createTextNode(content);
      if (className) {
        const span = document.createElement("span");
        if (styles && span) {
          for (let p in styles) {
            span.style[p] = styles[p];
          }
        }
        span.className = `${className} appended`;
        span.append(node);

        div.append(span);
        return className.includes("selected") ? span.offsetLeft : 0;
      }
      div.append(node);
      return 0;
    }

    let i0 = selectedMatchIdx,
      i1 = i0 + 1;
    if (highlightAll) {
      i0 = 0;
      i1 = matches.length;
    } else if (!isSelectedPage) {
      // Not highlighting all and this isn't the selected page, so do nothing.
      return;
    }

    let lastDivIdx = -1;
    let lastOffset = -1;
    let selectedElement;
    let findIndex = -1;
    for (let i = i0; i < i1; i++) {
      const match = matches[i];
      const begin = match.begin;
      if (begin.divIdx === lastDivIdx && begin.offset === lastOffset) {
        // It's possible to be in this situation if we searched for a 'f' and we
        // have a ligature 'ff' in the text. The 'ff' has to be highlighted two
        // times.
        continue;
      }
      lastDivIdx = begin.divIdx;
      lastOffset = begin.offset;

      const end = match.end;

      if (match.sectionIndex === undefined) {
        findIndex += 1;
      }
      const isSelected = isPagesMatch
        ? match.isSelected
        : isSelectedPage && findIndex === selectedMatchIdx;
      const highlightSuffix = " " + match.className;
      let selectedLeft = 0;

      // Match inside new div.
      if (!prevEnd || begin.divIdx !== prevEnd.divIdx) {
        // If there was a previous div, then add the text at the end.
        if (prevEnd !== null) {
          appendTextToDiv(prevEnd.divIdx, prevEnd.offset, infinity.offset);
        }
        // Clear the divs and set the content until the starting point.
        beginText(begin);
      } else {
        appendTextToDiv(prevEnd.divIdx, prevEnd.offset, begin.offset);
      }

      if (begin.divIdx === end.divIdx) {
        selectedLeft = appendTextToDiv(
          begin.divIdx,
          begin.offset,
          end.offset,
          "highlight" + highlightSuffix,
          match.styles
        );
      } else {
        selectedLeft = appendTextToDiv(
          begin.divIdx,
          begin.offset,
          infinity.offset,
          "highlight begin" + highlightSuffix,
          match.styles
        );
        for (let n0 = begin.divIdx + 1, n1 = end.divIdx; n0 < n1; n0++) {
          if (textDivs[n0]) {
            if (match.styles) {
              for (let p in match.styles) {
                textDivs[n0].style[p] = match.styles[p];
              }
            }
            textDivs[n0].className = "highlight middle" + highlightSuffix;
          }
        }
        beginText(end, "highlight end" + highlightSuffix, match.styles);
      }
      prevEnd = end;
      if (!selectedElement && isSelected) {
        let divIdx = begin.divIdx;
        while (!textContentItemsStr[divIdx] && divIdx <= end.divIdx) {
          divIdx++;
        }
        const div = textDivs[divIdx];
        let isOut = false;
        // The element used for positioning must lie inside the visible area
        try{
          const divStyle = div.style;
          let textLayerNode = div;
          while (
            !textLayerNode.classList.contains("textLayer") &&
            textLayerNode.parentElement
          ) {
            textLayerNode = textLayerNode.parentElement;
          }

          let left = parseFloat(divStyle.left.match(/\d+/g)[0] || "0");
          let top = parseFloat(divStyle.top.match(/\d+/g)[0] || "0");
          if (
            (divStyle.left.includes("%") && left >= 100) ||
            (divStyle.top.includes("%") && top >= 100)
          ) {
            isOut = true;
          }
          if (textLayerNode.classList.contains("textLayer")) {
            let width = parseFloat(textLayerNode.style.width.match(/\d+/g)[0] || "0");
            let height = parseFloat(textLayerNode.style.height.match(/\d+/g)[0] || "0");
            if (
              (divStyle.left.includes("px") && left > width) ||
              (divStyle.top.includes("px") && top > height)
            ) {
              isOut = true;
            }
          }
        } catch(e) {
          console.error(e)
        }

        if (!isOut && defaultPagesMatchesIsFocus && isPagesMatch) {
          selectedElement = div;
        } else if (!isOut && !isPagesMatch) {
          selectedElement = div;
        }
      }
    }
    if (selectedElement && findController) {
      findController.scrollMatchIntoView({
        element: selectedElement,
        selectedLeft: 0,
        pageIndex: pageIdx,
        matchIndex: selectedMatchIdx,
      });
      defaultPagesMatchesIsFocus = false;
    }

    if (prevEnd) {
      appendTextToDiv(prevEnd.divIdx, prevEnd.offset, infinity.offset);
    }
  }

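  /**
   * Merges `matches` into `baseMatchs`, splitting any base match that a
   * new match overlaps.
   * @returns the merged matches, with empty ranges removed
   */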
  _merageMatches(baseMatchs, matches) {
    while (matches.length) {
      const match = matches[0];
      const beginIndex = baseMatchs.findIndex((item) => {
        return (
          (item.begin.divIdx < match.begin.divIdx ||
            (item.begin.divIdx === match.begin.divIdx &&
              item.begin.offset <= match.begin.offset)) &&
          (item.end.divIdx > match.begin.divIdx ||
            (item.end.divIdx === match.begin.divIdx &&
              item.end.offset > match.begin.offset))
        );
      });
      const endIndex = baseMatchs.findIndex((item) => {
        return (
          (item.begin.divIdx < match.end.divIdx ||
            (item.begin.divIdx === match.end.divIdx &&
              item.begin.offset <= match.end.offset)) &&
          (item.end.divIdx > match.end.divIdx ||
            (item.end.divIdx === match.end.divIdx &&
              item.end.offset >= match.end.offset))
        );
      });
      if (endIndex === -1 && beginIndex === -1) {
        baseMatchs.push({
          ...match,
        });
        matches.shift();
        continue;
      }
      if (endIndex !== -1 && beginIndex !== -1 && endIndex !== beginIndex) {
        baseMatchs[beginIndex].end = { ...match.begin };
        baseMatchs[endIndex].begin = { ...match.end };
        baseMatchs.splice(beginIndex + 1, endIndex - beginIndex - 1, match);
        matches.shift();
        continue;
      }
      if (endIndex !== -1 && beginIndex !== -1 && endIndex === beginIndex) {
        baseMatchs.splice(
          beginIndex,
          1,
          {
            ...baseMatchs[beginIndex],
            end: { ...match.begin },
          },
          match,
          {
            ...baseMatchs[beginIndex],
            begin: { ...match.end },
          }
        );
        matches.shift();
        continue;
      }
      if (endIndex !== -1 && beginIndex === -1) {
        baseMatchs[endIndex].begin = { ...match.end };
        baseMatchs.splice(0, 0, match);
        matches.shift();
        continue;
      }
      if (endIndex === -1 && beginIndex !== -1) {
        baseMatchs[beginIndex].end = { ...match.begin };
        baseMatchs.push(match);
        matches.shift();
        continue;
      }
      matches.shift();
      console.log("unhandled case", endIndex, beginIndex);
    }
    baseMatchs.sort((a, b) => {
      // Order by div first, then by offset within the same div
      return a.begin.divIdx - b.begin.divIdx || a.begin.offset - b.begin.offset;
    });
    return baseMatchs.filter((item) => {
      if (item.begin.divIdx === item.end.divIdx && item.begin.offset === item.end.offset) {
        return false
      }
      return true
    })
  }
  _updateMatches(reset = false) { // Clears the old highlights, then calls _renderMatches to render the new ones
    if (!this.enabled && !reset) {
      return;
    }

    // this.pageIdx is the index of the current page
    const { findController, pageIdx } = this;
    const { textContentItemsStr, textDivs, matches = [] } = this;

    // Clear the previous matches
    for (let i = 0, ii = matches.length; i < ii; i++) {
      const match = matches[i];
      const begin = match.begin.divIdx;
      for (let n = begin, end = match.end.divIdx; n <= end; n++) {
        const div = textDivs[n];
        div.textContent = textContentItemsStr[n];
        div.className = "";
      }
    }
    // console.log('defaultPagesMatches',defaultPagesMatches);

    let sectionMatches = [...(defaultPagesMatches?.[this.pageIdx] || [])];
    if (findController?.highlightMatches && !reset) {
      const pageMatches = findController.pageMatches[pageIdx] || null;
      const pageMatchesLength =
        findController.pageMatchesLength[pageIdx] || null;
      const findMatches = this._convertMatches(pageMatches, pageMatchesLength);
      console.log('findMatches', findMatches);
      const selectedMatchIdx = findController.selected.matchIdx;
      pageIdx === findController.selected.pageIdx &&
        findMatches[selectedMatchIdx] &&
        (findMatches[selectedMatchIdx].className = "selected");
      this.matches = this._merageMatches(sectionMatches, findMatches);
      console.log('this.matches', this.matches);
      this._renderMatches(this.matches || []);
      return;
    }
    console.log('sectionMatches', sectionMatches);

    this.matches = sectionMatches;
    this._renderMatches(sectionMatches || []);
    
  }
}

Usage:
function handleTest(evt: any) {
  const str = 'AiMind\n文档库'

  // window[0] is the window of the first iframe, which hosts the pdf.js viewer
  const viewerWindow = window[0] as any
  const PDFViewerApplication = viewerWindow.PDFViewerApplication
  // Scroll the target page into view (page numbers are 1-based)
  PDFViewerApplication.pdfViewer.currentPageNumber = 1
  const eventBus = PDFViewerApplication.eventBus
  // Normalize the segment text the same way on both sides before matching
  const normalizeUnicode = viewerWindow.pdfjsLib.normalizeUnicode
  const unicodeHandledStr = normalizeUnicode(str)
  const regex = /\s|\u0000|\./g
  const regHandledStr = unicodeHandledStr.replace(regex, ' ')
  // Not used yet: the test below passes hard-coded ranges instead
  const matchShortStr = regHandledStr.split(' ')
  // Hard-coded data for testing
  const testData = {
    0: [
      {
        sectionIndex: 0,
        className: 'section-0 section-color-0',
        begin: { divIdx: 0, offset: 0 },
        end: { divIdx: 0, offset: 2 }
      },
      {
        sectionIndex: 1,
        className: 'section-1 section-color-0',
        begin: { divIdx: 3, offset: 0 },
        end: { divIdx: 3, offset: 50 }
      },
    ]
  }
  // Tell pdf.js to render the highlights
  eventBus.dispatch('updatePagesMatches', { pagesMatches: testData })
}