Web Page Main-Content Extraction Algorithms: the Line-Block Distribution Algorithm & Readability


Background

When crawling search engines such as Baidu, Sogou, or Bing, the body text of the detail pages comes from many different sources and cannot be matched with a single set of generic rules, so a dedicated extraction algorithm is needed.


This article introduces two main-content extraction algorithms: the line-block distribution algorithm and Readability.

Line-Block Distribution Algorithm

Algorithm Flow

(Figure: flow chart of the line-block distribution algorithm)

Algorithm Rationale

  • Each line of HTML expresses a complete unit of meaning;
  • the lines of body text sit physically close together;
  • a line of body text consists mostly of prose;
  • a line of body text contains a relatively large amount of non-tag text;
  • within a line of body text, hyperlink text accounts for only a small share of the length.

Algorithm Characteristics

  • Independent of whether the HTML is well-formed;
  • no DOM tree needs to be built, and the algorithm is independent of HTML tags;
  • computing the line-block distribution function alone is enough to extract the main text (see the minimal sketch after this list);
  • the tag-stripped text is scanned only once, so processing is efficient;
  • link farms and ad blocks are easy to discard;
  • good extensibility: generic extraction is statistical, and individual sites can be patched with rules, combining statistics with rules.
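
To make the rationale concrete, here is a minimal sketch of the distribution function on its own, separate from the full sample below. It assumes the input has already been stripped of tags (as filter_tags does in the sample code); the helper names block_distribution and densest_span are illustrative, and the block width of 3 and threshold of 86 are the empirical values used in the sample, not fixed constants of the algorithm:

import re


def block_distribution(text, block_width=3):
    """Non-whitespace character count for each window of block_width lines."""
    lines = [re.sub(r'\s+', '', line) for line in text.split('\n')]
    return [sum(len(lines[j]) for j in range(i, i + block_width))
            for i in range(len(lines) - block_width)]


def densest_span(text, block_width=3, threshold=86):
    """Return (start, end) line indices of the first run of dense blocks."""
    dist = block_distribution(text, block_width)
    start = next((i for i, v in enumerate(dist) if v > threshold), None)
    if start is None:
        return None
    end = next((i for i in range(start, len(dist)) if dist[i] == 0), len(dist))
    return start, end

Plotted over the line index, this function rises sharply where the article body starts and collapses to zero at boilerplate, which is exactly what the start/end detection in the sample code below exploits.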

Sample Code

# -*- encoding: utf-8 -*-
import re
import requests
from collections import Counter
from bs4 import BeautifulSoup


def get_html(url):
    """Fetch the HTML of a page."""
    try:
        obj = requests.get(url)
        if obj.status_code == 200:
            # Decode the raw bytes explicitly to avoid mojibake in Chinese pages
            return str(obj.content, 'utf-8')
        return None
    except (requests.RequestException, UnicodeDecodeError):
        return None


def filter_tags(html_str, flag):
    """Strip unwanted tags from an HTML string.
    :param html_str: HTML source
    :param flag: if True, remove all remaining tags as well
    :return: the filtered HTML string
    """
    html_str = re.sub(r'(?is)<!DOCTYPE.*?>', '', html_str)
    html_str = re.sub(r'(?is)<!--.*?-->', '', html_str)  # remove HTML comments
    html_str = re.sub(r'(?is)<script.*?>.*?</script>', '', html_str)  # remove JavaScript
    html_str = re.sub(r'(?is)<style.*?>.*?</style>', '', html_str)  # remove CSS
    html_str = re.sub(r'(?is)<a\s.*?>.*?</a>', '', html_str)  # remove links with their anchor text
    html_str = re.sub(r'(?is)<li[^nk].*?>.*?</li>', '', html_str)  # remove <li> but not <link>
    # html_str = re.sub(r'&.{2,5};|&#.{2,5};', '', html_str)  # optionally strip HTML entities
    if flag:
        html_str = re.sub(r'(?is)<.*?>', '', html_str)  # remove every remaining tag
    return html_str


def extract_text_by_block(html_str):
    """Extract the main text by line-block text density.
    :param html_str: page source
    :return: body text
    """
    html = filter_tags(html_str, True)
    lines = html.split('\n')
    block_width = 3  # lines per block
    threshold = 86   # minimum characters per block for it to count as content
    index_distribution = []
    for i in range(0, len(lines) - block_width):
        word_num = 0
        for j in range(i, i + block_width):
            line = re.sub(r'\s+', '', lines[j])
            word_num += len(line)
        index_distribution.append(word_num)
    start_index = -1
    end_index = -1
    bool_start = False
    bool_end = False
    article_content = []
    for i in range(0, len(index_distribution) - block_width):
        if index_distribution[i] > threshold and not bool_start:
            # A content region starts at a dense block followed by non-empty blocks
            if index_distribution[i + 1] != 0 or index_distribution[i + 2] != 0 or index_distribution[i + 3] != 0:
                bool_start = True
                start_index = i
                continue
        if bool_start:
            if index_distribution[i] == 0 or index_distribution[i + 1] == 0:
                end_index = i
                bool_end = True
        if bool_end:
            tmp = []
            for index in range(start_index, end_index + 1):
                line = lines[index]
                if len(line.strip()) < 5:
                    continue
                tmp.append(line.strip() + '\n')
            tmp_str = ''.join(tmp)
            # Reset the state machine before filtering, so a skipped block
            # does not prevent detection of the next content region
            bool_start = False
            bool_end = False
            if u"Copyright" in tmp_str or u"版权所有" in tmp_str:
                continue
            article_content.append(tmp_str)
    return ''.join(article_content)


def extract_text_by_tag(html_str, article):
    """Locate the density-extracted text inside the full page and return the
    text of its parent tag, which improves extraction accuracy.
    :param html_str: page HTML
    :param article: text extracted by block density
    :return: body text
    """
    lines = filter_tags(html_str, False)
    soup = BeautifulSoup(lines, 'lxml')
    p_list = soup.find_all('p')
    p_in_article = []
    for p in p_list:
        if p.text.strip() in article:
            p_in_article.append(p.parent)
    # The parent containing the most matching <p> tags is taken as the article node
    top_parent, _ = Counter(p_in_article).most_common(1)[0]
    article_soup = BeautifulSoup(str(top_parent), 'lxml')
    return remove_space(article_soup.text)


def remove_space(text):
    """Remove whitespace characters from a string."""
    return re.sub(r'[\t\r\n\f]', '', text)


def extract(url):
    """Extract the main text of a page.
    :param url: page URL
    :return: body text
    """
    html_str = get_html(url)
    if html_str is None:
        return None
    article_temp = extract_text_by_block(html_str)
    try:
        article = extract_text_by_tag(html_str, article_temp)
    except Exception:
        # Fall back to the density-based result if tag matching fails
        article = article_temp
    return article


if __name__ == '__main__':
    url = 'http://www.eeo.com.cn/2020/0215/376405.shtml'
    text = extract(url)
    print(text)

Readability

About Readability

Readability is a distinctive "read it later" web bookmarking service. Besides letting you save an article when you come across one you like, its biggest feature is that it automatically and intelligently strips the unimportant elements from a page and re-typesets it, presenting only the clean body text for a better reading experience. In addition to plugins for the major browsers, it offers iOS/Android/Kindle apps, so saved articles can be synced to a phone and read comfortably anywhere.

How It Works

Readability walks the DOM and reassembles the page content by weighting tags and common text patterns up or down. Let's take a quick look at how the algorithm is implemented. First, it defines a series of regular expressions:

regexps: {
        unlikelyCandidates:    /combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup|tweet|twitter/i,
        okMaybeItsACandidate:  /and|article|body|column|main|shadow/i,
        positive:              /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i,
        negative:              /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget/i,
        extraneous:            /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single/i,
        divToPElements:        /<(a|blockquote|dl|div|img|ol|p|pre|table|ul)/i,
        replaceBrs:            /(<br[^>]*>[ \n\r\t]*){2,}/gi,
        replaceFonts:          /<(\/?)font[^>]*>/gi,
        trim:                  /^\s+|\s+$/g,
        normalize:             /\s{2,}/g,
        killBreaks:            /(<br\s*\/?>(\s|&nbsp;?)*){1,}/g,
        videos:                /http:\/\/(www\.)?(youtube|vimeo)\.com/i,
        skipFootnoteLink:      /^\s*(\[?[a-z0-9]{1,2}\]?|^|edit|citation needed)\s*$/i,
        nextLink:              /(next|weiter|continue|>([^\|]|$)|»([^\|]|$))/i, // Match: next, continue, >, >>, » but not >|, »| as those usually mean last.
        prevLink:              /(prev|earl|old|new|<|«)/i
},

As you can see, both tags and text have groups that raise or lower the score. The whole content analysis is implemented in the grabArticle function. It starts by iterating over the nodes:

for(var nodeIndex = 0; (node = allElements[nodeIndex]); nodeIndex+=1)

Elements that do not look like content are then removed:

if (stripUnlikelyCandidates) 
{
    var unlikelyMatchString = node.className + node.id;
    if (
        (
            unlikelyMatchString.search(readability.regexps.unlikelyCandidates) !== -1 &&
            unlikelyMatchString.search(readability.regexps.okMaybeItsACandidate) === -1 &&
            node.tagName !== "BODY"
        )
    )
    {
        dbg("Removing unlikely candidate - " + unlikelyMatchString);
        node.parentNode.removeChild(node);
        nodeIndex-=1;
        continue;
    }               
}

After replacing DIVs with P tags, it iterates over the target nodes and scores them:

var candidates = [];
for (var pt=0; pt < nodesToScore.length; pt+=1) {
    var parentNode      = nodesToScore[pt].parentNode;
    var grandParentNode = parentNode ? parentNode.parentNode : null;
    var innerText       = readability.getInnerText(nodesToScore[pt]);

    if(!parentNode || typeof(parentNode.tagName) === 'undefined') {
        continue;
    }

    /* If this paragraph is less than 25 characters, don't even count it. */
    if(innerText.length < 25) {
        continue; }

    /* Initialize readability data for the parent. */
    if(typeof parentNode.readability === 'undefined') {
        readability.initializeNode(parentNode);
        candidates.push(parentNode);
    }

    /* Initialize readability data for the grandparent. */
    if(grandParentNode && typeof(grandParentNode.readability) === 'undefined' && typeof(grandParentNode.tagName) !== 'undefined') {
        readability.initializeNode(grandParentNode);
        candidates.push(grandParentNode);
    }

    var contentScore = 0;

    /* Add a point for the paragraph itself as a base. */
    contentScore+=1;

    /* Add points for any commas within this paragraph */
    contentScore += innerText.split(',').length;
    
    /* For every 100 characters in this paragraph, add another point. Up to 3 points. */
    contentScore += Math.min(Math.floor(innerText.length / 100), 3);
    
    /* Add the score to the parent. The grandparent gets half. */
    parentNode.readability.contentScore += contentScore;

    if(grandParentNode) {
        grandParentNode.readability.contentScore += contentScore/2;             
    }
}

Finally, the content is reassembled based on the scores:

var articleContent        = document.createElement("DIV");
if (isPaging) {
    articleContent.id     = "readability-content";
}
var siblingScoreThreshold = Math.max(10, topCandidate.readability.contentScore * 0.2);
var siblingNodes          = topCandidate.parentNode.childNodes;


for(var s=0, sl=siblingNodes.length; s < sl; s+=1) {
    var siblingNode = siblingNodes[s];
    var append      = false;

    /**
     * Fix for odd IE7 Crash where siblingNode does not exist even though this should be a live nodeList.
     * Example of error visible here: http://www.esquire.com/features/honesty0707
    **/
    if(!siblingNode) {
        continue;
    }

    dbg("Looking at sibling node: " + siblingNode + " (" + siblingNode.className + ":" + siblingNode.id + ")" + ((typeof siblingNode.readability !== 'undefined') ? (" with score " + siblingNode.readability.contentScore) : ''));
    dbg("Sibling has score " + (siblingNode.readability ? siblingNode.readability.contentScore : 'Unknown'));

    if(siblingNode === topCandidate)
    {
        append = true;
    }

    var contentBonus = 0;
    /* Give a bonus if sibling nodes and top candidates have the exact same classname */
    if(siblingNode.className === topCandidate.className && topCandidate.className !== "") {
        contentBonus += topCandidate.readability.contentScore * 0.2;
    }

    if(typeof siblingNode.readability !== 'undefined' && (siblingNode.readability.contentScore+contentBonus) >= siblingScoreThreshold)
    {
        append = true;
    }
    
    if(siblingNode.nodeName === "P") {
        var linkDensity = readability.getLinkDensity(siblingNode);
        var nodeContent = readability.getInnerText(siblingNode);
        var nodeLength  = nodeContent.length;
        
        if(nodeLength > 80 && linkDensity < 0.25)
        {
            append = true;
        }
        else if(nodeLength < 80 && linkDensity === 0 && nodeContent.search(/\.( |$)/) !== -1)
        {
            append = true;
        }
    }

    if(append) {
        dbg("Appending node: " + siblingNode);

        var nodeToAppend = null;
        if(siblingNode.nodeName !== "DIV" && siblingNode.nodeName !== "P") {
            /* We have a node that isn't a common block level element, like a form or td tag. Turn it into a div so it doesn't get filtered out later by accident. */
            
            dbg("Altering siblingNode of " + siblingNode.nodeName + ' to div.');
            nodeToAppend = document.createElement("DIV");
            try {
                nodeToAppend.id = siblingNode.id;
                nodeToAppend.innerHTML = siblingNode.innerHTML;
            }
            catch(er) {
                dbg("Could not alter siblingNode to div, probably an IE restriction, reverting back to original.");
                nodeToAppend = siblingNode;
                s-=1;
                sl-=1;
            }
        } else {
            nodeToAppend = siblingNode;
            s-=1;
            sl-=1;
        }
        
        /* To ensure a node does not interfere with readability styles, remove its classnames */
        nodeToAppend.className = "";

        /* Append sibling and subtract from our list because it removes the node when you append to another node */
        articleContent.appendChild(nodeToAppend);
    }
}

Sample Code

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Readability {

    private static final String CONTENT_SCORE = "readabilityContentScore";

    private final Document mDocument;
    private String mBodyCache;

    public Readability(String html) {
        super();
        mDocument = Jsoup.parse(html);
    }

    public Readability(String html, String baseUri) {
        super();
        mDocument = Jsoup.parse(html, baseUri);
    }

    public Readability(File in, String charsetName, String baseUri)
            throws IOException {
        super();
        mDocument = Jsoup.parse(in, charsetName, baseUri);
    }

    public Readability(URL url, int timeoutMillis) throws IOException {
        super();
        mDocument = Jsoup.parse(url, timeoutMillis);
    }

    public Readability(Document doc) {
        super();
        mDocument = doc;
    }

    // @formatter:off
    /**
     * Runs readability.
     * 
     * Workflow: 
     * 1. Prep the document by removing script tags, css, etc. 
     * 2. Build readability's DOM tree. 
     * 3. Grab the article content from the current dom tree. 
     * 4. Replace the current DOM tree with the new one. 
     * 5. Read peacefully.
     * 
     * @param preserveUnlikelyCandidates
     */
    // @formatter:on
    private void init(boolean preserveUnlikelyCandidates) {
        if (mDocument.body() != null && mBodyCache == null) {
            mBodyCache = mDocument.body().html();
        }

        prepDocument();

        /* Build readability's DOM tree */
        Element overlay = mDocument.createElement("div");
        Element innerDiv = mDocument.createElement("div");
        Element articleTitle = getArticleTitle();
        Element articleContent = grabArticle(preserveUnlikelyCandidates);

        /**
         * If we attempted to strip unlikely candidates on the first run
         * through, and we ended up with no content, that may mean we stripped
         * out the actual content so we couldn't parse it. So re-run init while
         * preserving unlikely candidates to have a better shot at getting our
         * content out properly.
         */
        if (isEmpty(getInnerText(articleContent, false))) {
            if (!preserveUnlikelyCandidates) {
                mDocument.body().html(mBodyCache);
                init(true);
                return;
            } else {
                articleContent
                        .html("<p>Sorry, readability was unable to parse this page for content.</p>");
            }
        }

        /* Glue the structure of our document together. */
        innerDiv.appendChild(articleTitle);
        innerDiv.appendChild(articleContent);
        overlay.appendChild(innerDiv);

        /* Clear the old HTML, insert the new content. */
        mDocument.body().html("");
        mDocument.body().prependChild(overlay);
    }

    /**
     * Runs readability.
     */
    public final void init() {
        init(false);
    }

    /**
     * Get the combined inner HTML of all matched elements.
     * 
     * @return
     */
    public final String html() {
        return mDocument.html();
    }

    /**
     * Get the combined outer HTML of all matched elements.
     * 
     * @return
     */
    public final String outerHtml() {
        return mDocument.outerHtml();
    }

    /**
     * Get the article title as an H1. Currently just uses document.title, we
     * might want to be smarter in the future.
     * 
     * @return
     */
    protected Element getArticleTitle() {
        Element articleTitle = mDocument.createElement("h1");
        articleTitle.html(mDocument.title());
        return articleTitle;
    }

    /**
     * Prepare the HTML document for readability to scrape it. This includes
     * things like stripping javascript, CSS, and handling terrible markup.
     */
    protected void prepDocument() {
        /**
         * In some cases a body element can't be found (if the HTML is totally
         * hosed for example) so we create a new body node and append it to the
         * document.
         */
        if (mDocument.body() == null) {
            mDocument.appendElement("body");
        }

        /* Remove all scripts */
        Elements elementsToRemove = mDocument.getElementsByTag("script");
        for (Element script : elementsToRemove) {
            script.remove();
        }

        /* Remove all stylesheets */
        elementsToRemove = getElementsByTag(mDocument.head(), "link");
        for (Element styleSheet : elementsToRemove) {
            if ("stylesheet".equalsIgnoreCase(styleSheet.attr("rel"))) {
                styleSheet.remove();
            }
        }

        /* Remove all style tags in head */
        elementsToRemove = mDocument.getElementsByTag("style");
        for (Element styleTag : elementsToRemove) {
            styleTag.remove();
        }

        /* Turn all double br's into p's */
        /*
         * TODO: this is pretty costly as far as processing goes. Maybe optimize
         * later.
         */
        mDocument.body().html(
                mDocument.body().html()
                        .replaceAll(Patterns.REGEX_REPLACE_BRS, "</p><p>")
                        .replaceAll(Patterns.REGEX_REPLACE_FONTS, "<$1span>"));
    }

    /**
     * Prepare the article node for display. Clean out any inline styles,
     * iframes, forms, strip extraneous &lt;p&gt; tags, etc.
     * 
     * @param articleContent
     */
    private void prepArticle(Element articleContent) {
        cleanStyles(articleContent);
        killBreaks(articleContent);

        /* Clean out junk from the article content */
        clean(articleContent, "form");
        clean(articleContent, "object");
        clean(articleContent, "h1");
        /**
         * If there is only one h2, they are probably using it as a header and
         * not a subheader, so remove it since we already have a header.
         */
        if (getElementsByTag(articleContent, "h2").size() == 1) {
            clean(articleContent, "h2");
        }
        clean(articleContent, "iframe");

        cleanHeaders(articleContent);

        /*
         * Do these last as the previous stuff may have removed junk that will
         * affect these
         */
        cleanConditionally(articleContent, "table");
        cleanConditionally(articleContent, "ul");
        cleanConditionally(articleContent, "div");

        /* Remove extra paragraphs */
        Elements articleParagraphs = getElementsByTag(articleContent, "p");
        for (Element articleParagraph : articleParagraphs) {
            int imgCount = getElementsByTag(articleParagraph, "img").size();
            int embedCount = getElementsByTag(articleParagraph, "embed").size();
            int objectCount = getElementsByTag(articleParagraph, "object")
                    .size();

            if (imgCount == 0 && embedCount == 0 && objectCount == 0
                    && isEmpty(getInnerText(articleParagraph, false))) {
                articleParagraph.remove();
            }
        }

        try {
            articleContent.html(articleContent.html().replaceAll(
                    "(?i)<br[^>]*>\\s*<p", "<p"));
        } catch (Exception e) {
            dbg("Cleaning innerHTML of breaks failed. This is an IE strict-block-elements bug. Ignoring.",
                    e);
        }
    }

    /**
     * Initialize a node with the readability object. Also checks the
     * className/id for special names to add to its score.
     * 
     * @param node
     */
    private static void initializeNode(Element node) {
        node.attr(CONTENT_SCORE, Integer.toString(0));

        String tagName = node.tagName();
        if ("div".equalsIgnoreCase(tagName)) {
            incrementContentScore(node, 5);
        } else if ("pre".equalsIgnoreCase(tagName)
                || "td".equalsIgnoreCase(tagName)
                || "blockquote".equalsIgnoreCase(tagName)) {
            incrementContentScore(node, 3);
        } else if ("address".equalsIgnoreCase(tagName)
                || "ol".equalsIgnoreCase(tagName)
                || "ul".equalsIgnoreCase(tagName)
                || "dl".equalsIgnoreCase(tagName)
                || "dd".equalsIgnoreCase(tagName)
                || "dt".equalsIgnoreCase(tagName)
                || "li".equalsIgnoreCase(tagName)
                || "form".equalsIgnoreCase(tagName)) {
            incrementContentScore(node, -3);
        } else if ("h1".equalsIgnoreCase(tagName)
                || "h2".equalsIgnoreCase(tagName)
                || "h3".equalsIgnoreCase(tagName)
                || "h4".equalsIgnoreCase(tagName)
                || "h5".equalsIgnoreCase(tagName)
                || "h6".equalsIgnoreCase(tagName)
                || "th".equalsIgnoreCase(tagName)) {
            incrementContentScore(node, -5);
        }

        incrementContentScore(node, getClassWeight(node));
    }

    /**
     * Using a variety of metrics (content score, classname, element types),
     * find the content that is most likely to be the stuff a user wants to read.
     * Then return it wrapped up in a div.
     * 
     * @param preserveUnlikelyCandidates
     * @return
     */
    protected Element grabArticle(boolean preserveUnlikelyCandidates) {
        /**
         * First, node prepping. Trash nodes that look cruddy (like ones with
         * the class name "comment", etc), and turn divs into P tags where they
         * have been used inappropriately (as in, where they contain no other
         * block level elements.)
         * 
         * Note: Assignment from index for performance. See
         * http://www.peachpit.com/articles/article.aspx?p=31567&seqNum=5 TODO:
         * Shouldn't this be a reverse traversal?
         **/
        for (Element node : mDocument.getAllElements()) {
            /* Remove unlikely candidates */
            if (!preserveUnlikelyCandidates) {
                String unlikelyMatchString = node.className() + node.id();
                Matcher unlikelyCandidatesMatcher = Patterns.get(
                        Patterns.RegEx.UNLIKELY_CANDIDATES).matcher(
                        unlikelyMatchString);
                Matcher maybeCandidateMatcher = Patterns.get(
                        Patterns.RegEx.OK_MAYBE_ITS_A_CANDIDATE).matcher(
                        unlikelyMatchString);
                if (unlikelyCandidatesMatcher.find()
                        && !maybeCandidateMatcher.find()
                        && !"body".equalsIgnoreCase(node.tagName())) {
                    node.remove();
                    dbg("Removing unlikely candidate - " + unlikelyMatchString);
                    continue;
                }
            }

            /*
             * Turn all divs that don't have children block level elements into
             * p's
             */
            if ("div".equalsIgnoreCase(node.tagName())) {
                Matcher matcher = Patterns
                        .get(Patterns.RegEx.DIV_TO_P_ELEMENTS).matcher(
                                node.html());
                if (!matcher.find()) {
                    dbg("Alternating div to p: " + node);
                    try {
                        node.tagName("p");
                    } catch (Exception e) {
                        dbg("Could not alter div to p, probably an IE restriction, reverting back to div.",
                                e);
                    }
                }
            }
        }

        /**
         * Loop through all paragraphs, and assign a score to them based on how
         * content-y they look. Then add their score to their parent node.
         * 
         * A score is determined by things like number of commas, class names,
         * etc. Maybe eventually link density.
         **/
        Elements allParagraphs = mDocument.getElementsByTag("p");
        ArrayList<Element> candidates = new ArrayList<Element>();

        for (Element node : allParagraphs) {
            Element parentNode = node.parent();
            Element grandParentNode = parentNode.parent();
            String innerText = getInnerText(node, true);

            /*
             * If this paragraph is less than 25 characters, don't even count
             * it.
             */
            if (innerText.length() < 25) {
                continue;
            }

            /* Initialize readability data for the parent. */
            if (!parentNode.hasAttr("readabilityContentScore")) {
                initializeNode(parentNode);
                candidates.add(parentNode);
            }

            /* Initialize readability data for the grandparent. */
            if (!grandParentNode.hasAttr("readabilityContentScore")) {
                initializeNode(grandParentNode);
                candidates.add(grandParentNode);
            }

            int contentScore = 0;

            /* Add a point for the paragraph itself as a base. */
            contentScore++;

            /* Add points for any commas within this paragraph */
            contentScore += innerText.split(",").length;

            /*
             * For every 100 characters in this paragraph, add another point. Up
             * to 3 points.
             */
            contentScore += Math.min(Math.floor(innerText.length() / 100), 3);

            /* Add the score to the parent. The grandparent gets half. */
            incrementContentScore(parentNode, contentScore);
            incrementContentScore(grandParentNode, contentScore / 2);
        }

        /**
         * After we've calculated scores, loop through all of the possible
         * candidate nodes we found and find the one with the highest score.
         */
        Element topCandidate = null;
        for (Element candidate : candidates) {
            /**
             * Scale the final candidates score based on link density. Good
             * content should have a relatively small link density (5% or less)
             * and be mostly unaffected by this operation.
             */
            scaleContentScore(candidate, 1 - getLinkDensity(candidate));

            dbg("Candidate: (" + candidate.className() + ":" + candidate.id()
                    + ") with score " + getContentScore(candidate));

            if (topCandidate == null
                    || getContentScore(candidate) > getContentScore(topCandidate)) {
                topCandidate = candidate;
            }
        }

        /**
         * If we still have no top candidate, just use the body as a last
         * resort. We also have to copy the body node so it is something we can
         * modify.
         */
        if (topCandidate == null
                || "body".equalsIgnoreCase(topCandidate.tagName())) {
            topCandidate = mDocument.createElement("div");
            topCandidate.html(mDocument.body().html());
            mDocument.body().html("");
            mDocument.body().appendChild(topCandidate);
            initializeNode(topCandidate);
        }

        /**
         * Now that we have the top candidate, look through its siblings for
         * content that might also be related. Things like preambles, content
         * split by ads that we removed, etc.
         */
        Element articleContent = mDocument.createElement("div");
        articleContent.attr("id", "readability-content");
        int siblingScoreThreshold = Math.max(10,
                (int) (getContentScore(topCandidate) * 0.2f));
        Elements siblingNodes = topCandidate.parent().children();
        for (Element siblingNode : siblingNodes) {
            boolean append = false;

            dbg("Looking at sibling node: (" + siblingNode.className() + ":"
                    + siblingNode.id() + ")" + " with score "
                    + getContentScore(siblingNode));

            if (siblingNode == topCandidate) {
                append = true;
            }

            if (getContentScore(siblingNode) >= siblingScoreThreshold) {
                append = true;
            }

            if ("p".equalsIgnoreCase(siblingNode.tagName())) {
                float linkDensity = getLinkDensity(siblingNode);
                String nodeContent = getInnerText(siblingNode, true);
                int nodeLength = nodeContent.length();

                if (nodeLength > 80 && linkDensity < 0.25f) {
                    append = true;
                } else if (nodeLength < 80 && linkDensity == 0.0f
                        && nodeContent.matches(".*\\.( |$).*")) {
                    append = true;
                }
            }

            if (append) {
                dbg("Appending node: " + siblingNode);

                /*
                 * Append sibling and subtract from our list because it removes
                 * the node when you append to another node
                 */
                articleContent.appendChild(siblingNode);
                continue;
            }
        }

        /**
         * So we have all of the content that we need. Now we clean it up for
         * presentation.
         */
        prepArticle(articleContent);

        return articleContent;
    }

    /**
     * Get the inner text of a node - cross browser compatibly. This also strips
     * out any excess whitespace to be found.
     * 
     * @param e
     * @param normalizeSpaces
     * @return
     */
    private static String getInnerText(Element e, boolean normalizeSpaces) {
        String textContent = e.text().trim();

        if (normalizeSpaces) {
            textContent = textContent.replaceAll(Patterns.REGEX_NORMALIZE, "");
        }

        return textContent;
    }

    /**
     * Get the number of times a string s appears in the node e.
     * 
     * @param e
     * @param s
     * @return
     */
    private static int getCharCount(Element e, String s) {
        if (s == null || s.length() == 0) {
            s = ",";
        }
        return getInnerText(e, true).split(s).length;
    }

    /**
     * Remove the style attribute on every e and under.
     * 
     * @param e
     */
    private static void cleanStyles(Element e) {
        if (e == null) {
            return;
        }

        Element cur = e.children().first();

        // Remove any root styles, if we're able.
        if (!"readability-styled".equals(e.className())) {
            e.removeAttr("style");
        }

        // Go until there are no more child nodes
        while (cur != null) {
            // Remove style attributes
            if (!"readability-styled".equals(cur.className())) {
                cur.removeAttr("style");
            }
            cleanStyles(cur);
            cur = cur.nextElementSibling();
        }
    }

    /**
     * Get the density of links as a percentage of the content. This is the
     * amount of text that is inside a link divided by the total text in the
     * node.
     * 
     * @param e
     * @return
     */
    private static float getLinkDensity(Element e) {
        Elements links = getElementsByTag(e, "a");
        int textLength = getInnerText(e, true).length();
        float linkLength = 0.0F;
        for (Element link : links) {
            linkLength += getInnerText(link, true).length();
        }
        return linkLength / textLength;
    }

    /**
     * Get an elements class/id weight. Uses regular expressions to tell if this
     * element looks good or bad.
     * 
     * @param e
     * @return
     */
    private static int getClassWeight(Element e) {
        int weight = 0;

        /* Look for a special classname */
        String className = e.className();
        if (!isEmpty(className)) {
            Matcher negativeMatcher = Patterns.get(Patterns.RegEx.NEGATIVE)
                    .matcher(className);
            Matcher positiveMatcher = Patterns.get(Patterns.RegEx.POSITIVE)
                    .matcher(className);
            if (negativeMatcher.find()) {
                weight -= 25;
            }
            if (positiveMatcher.find()) {
                weight += 25;
            }
        }

        /* Look for a special ID */
        String id = e.id();
        if (!isEmpty(id)) {
            Matcher negativeMatcher = Patterns.get(Patterns.RegEx.NEGATIVE)
                    .matcher(id);
            Matcher positiveMatcher = Patterns.get(Patterns.RegEx.POSITIVE)
                    .matcher(id);
            if (negativeMatcher.find()) {
                weight -= 25;
            }
            if (positiveMatcher.find()) {
                weight += 25;
            }
        }

        return weight;
    }

    /**
     * Remove extraneous break tags from a node.
     * 
     * @param e
     */
    private static void killBreaks(Element e) {
        e.html(e.html().replaceAll(Patterns.REGEX_KILL_BREAKS, "<br />"));
    }

    /**
     * Clean a node of all elements of type "tag". (Unless it's a youtube/vimeo
     * video. People love movies.)
     * 
     * @param e
     * @param tag
     */
    private static void clean(Element e, String tag) {
        Elements targetList = getElementsByTag(e, tag);
        boolean isEmbed = "object".equalsIgnoreCase(tag)
                       || "embed".equalsIgnoreCase(tag)
                       || "iframe".equalsIgnoreCase(tag);

        for (Element target : targetList) {
            Matcher matcher = Patterns.get(Patterns.RegEx.VIDEO).matcher(
                    target.outerHtml());
            if (isEmbed && matcher.find()) {
                continue;
            }
            target.remove();
        }
    }

    /**
     * Clean an element of all tags of type "tag" if they look fishy. "Fishy" is
     * an algorithm based on content length, classnames, link density, number of
     * images & embeds, etc.
     * 
     * @param e
     * @param tag
     */
    private void cleanConditionally(Element e, String tag) {
        Elements tagsList = getElementsByTag(e, tag);

        /**
         * Gather counts for other typical elements embedded within. Traverse
         * backwards so we can remove nodes at the same time without affecting
         * the traversal.
         * 
         * TODO: Consider taking into account original contentScore here.
         */
        for (Element node : tagsList) {
            int weight = getClassWeight(node);

            dbg("Cleaning Conditionally (" + node.className() + ":" + node.id()
                    + ")" + getContentScore(node));

            if (weight < 0) {
                node.remove();
            } else if (getCharCount(node, ",") < 10) {
                /**
                 * If there are not very many commas, and the number of
                 * non-paragraph elements is more than paragraphs or other
                 * ominous signs, remove the element.
                 */
                int p = getElementsByTag(node, "p").size();
                int img = getElementsByTag(node, "img").size();
                int li = getElementsByTag(node, "li").size() - 100;
                int input = getElementsByTag(node, "input").size();

                int embedCount = 0;
                Elements embeds = getElementsByTag(node, "embed");
                for (Element embed : embeds) {
                    if (!Patterns.get(Patterns.RegEx.VIDEO)
                            .matcher(embed.absUrl("src")).find()) {
                        embedCount++;
                    }
                }

                float linkDensity = getLinkDensity(node);
                int contentLength = getInnerText(node, true).length();
                boolean toRemove = false;

                if (img > p) {
                    toRemove = true;
                } else if (li > p && !"ul".equalsIgnoreCase(tag)
                        && !"ol".equalsIgnoreCase(tag)) {
                    toRemove = true;
                } else if (input > Math.floor(p / 3)) {
                    toRemove = true;
                } else if (contentLength < 25 && (img == 0 || img > 2)) {
                    toRemove = true;
                } else if (weight < 25 && linkDensity > 0.2f) {
                    toRemove = true;
                } else if (weight > 25 && linkDensity > 0.5f) {
                    toRemove = true;
                } else if ((embedCount == 1 && contentLength < 75)
                        || embedCount > 1) {
                    toRemove = true;
                }

                if (toRemove) {
                    node.remove();
                }
            }
        }
    }

    /**
     * Clean out spurious headers from an Element. Checks things like classnames
     * and link density.
     * 
     * @param e
     */
    private static void cleanHeaders(Element e) {
        for (int headerIndex = 1; headerIndex < 7; headerIndex++) {
            Elements headers = getElementsByTag(e, "h" + headerIndex);
            for (Element header : headers) {
                if (getClassWeight(header) < 0
                        || getLinkDensity(header) > 0.33f) {
                    header.remove();
                }
            }
        }
    }

    /**
     * Print debug logs
     * 
     * @param msg
     */
    protected void dbg(String msg) {
        dbg(msg, null);
    }

    /**
     * Print debug logs with stack trace
     * 
     * @param msg
     * @param t
     */
    protected void dbg(String msg, Throwable t) {
        System.out.println(msg + (t != null ? ("\n" + t.getMessage()) : "")
                + (t != null ? ("\n" + t.getStackTrace()) : ""));
    }

    private static class Patterns {
        private static Pattern sUnlikelyCandidatesRe;
        private static Pattern sOkMaybeItsACandidateRe;
        private static Pattern sPositiveRe;
        private static Pattern sNegativeRe;
        private static Pattern sDivToPElementsRe;
        private static Pattern sVideoRe;
        private static final String REGEX_REPLACE_BRS = "(?i)(<br[^>]*>[ \n\r\t]*){2,}";
        private static final String REGEX_REPLACE_FONTS = "(?i)<(\\/?)font[^>]*>";
        /* Java has String.trim() */
        // private static final String REGEX_TRIM = "^\\s+|\\s+$";
        private static final String REGEX_NORMALIZE = "\\s{2,}";
        private static final String REGEX_KILL_BREAKS = "(<br\\s*\\/?>(\\s|&nbsp;?)*){1,}";

        public enum RegEx {
            UNLIKELY_CANDIDATES, OK_MAYBE_ITS_A_CANDIDATE, POSITIVE, NEGATIVE, DIV_TO_P_ELEMENTS, VIDEO;
        }

        public static Pattern get(RegEx re) {
            switch (re) {
            case UNLIKELY_CANDIDATES: {
                if (sUnlikelyCandidatesRe == null) {
                    sUnlikelyCandidatesRe = Pattern
                            .compile(
                                    "combx|comment|disqus|foot|header|menu|meta|nav|rss|shoutbox|sidebar|sponsor",
                                    Pattern.CASE_INSENSITIVE);
                }
                return sUnlikelyCandidatesRe;
            }
            case OK_MAYBE_ITS_A_CANDIDATE: {
                if (sOkMaybeItsACandidateRe == null) {
                    sOkMaybeItsACandidateRe = Pattern.compile(
                            "and|article|body|column|main",
                            Pattern.CASE_INSENSITIVE);
                }
                return sOkMaybeItsACandidateRe;
            }
            case POSITIVE: {
                if (sPositiveRe == null) {
                    sPositiveRe = Pattern
                            .compile(
                                    "article|body|content|entry|hentry|page|pagination|post|text",
                                    Pattern.CASE_INSENSITIVE);
                }
                return sPositiveRe;
            }
            case NEGATIVE: {
                if (sNegativeRe == null) {
                    sNegativeRe = Pattern
                            .compile(
                                    "combx|comment|contact|foot|footer|footnote|link|media|meta|promo|related|scroll|shoutbox|sponsor|tags|widget",
                                    Pattern.CASE_INSENSITIVE);
                }
                return sNegativeRe;
            }
            case DIV_TO_P_ELEMENTS: {
                if (sDivToPElementsRe == null) {
                    sDivToPElementsRe = Pattern.compile(
                            "<(a|blockquote|dl|div|img|ol|p|pre|table|ul)",
                            Pattern.CASE_INSENSITIVE);
                }
                return sDivToPElementsRe;
            }
            case VIDEO: {
                if (sVideoRe == null) {
                    sVideoRe = Pattern.compile(
                            "http:\\/\\/(www\\.)?(youtube|vimeo)\\.com",
                            Pattern.CASE_INSENSITIVE);
                }
                return sVideoRe;
            }
            }
            return null;
        }
    }

    /**
     * Reads the content score.
     * 
     * @param node
     * @return
     */
    private static int getContentScore(Element node) {
        try {
            return Integer.parseInt(node.attr(CONTENT_SCORE));
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /**
     * Increase or decrease the content score for an Element by an
     * increment/decrement.
     * 
     * @param node
     * @param increment
     * @return
     */
    private static Element incrementContentScore(Element node, int increment) {
        int contentScore = getContentScore(node);
        contentScore += increment;
        node.attr(CONTENT_SCORE, Integer.toString(contentScore));
        return node;
    }

    /**
     * Scales the content score for an Element with a factor of scale.
     * 
     * @param node
     * @param scale
     * @return
     */
    private static Element scaleContentScore(Element node, float scale) {
        int contentScore = getContentScore(node);
        contentScore *= scale;
        node.attr(CONTENT_SCORE, Integer.toString(contentScore));
        return node;
    }

    /**
     * Jsoup's Element.getElementsByTag(Element e) includes e itself, which is
     * different from W3C standards. This utility function is exclusive of the
     * Element e.
     * 
     * @param e
     * @param tag
     * @return
     */
    private static Elements getElementsByTag(Element e, String tag) {
        Elements es = e.getElementsByTag(tag);
        es.remove(e);
        return es;
    }

    /**
     * Helper utility to determine whether a given String is empty.
     * 
     * @param s
     * @return
     */
    private static boolean isEmpty(String s) {
        return s == null || s.length() == 0;
    }

}

Readability is instantiated through one of the constructors it provides:

Readability readability = new Readability(html); // String
Readability readability = new Readability(url, timeoutMillis); // URL

Content extraction starts by running:

readability.init();

The output is clean, readable HTML. It can be retrieved with:

String cleanHtml = readability.outerHtml();
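
As a side note for Python users: the same Arc90 algorithm also has a Python port, the readability-lxml package. A minimal sketch of the equivalent flow, assuming the package has been installed with pip install readability-lxml:

import requests
from readability import Document  # pip install readability-lxml

html = requests.get('http://www.eeo.com.cn/2020/0215/376405.shtml').text
doc = Document(html)
print(doc.title())    # extracted article title
print(doc.summary())  # cleaned article body as HTML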
