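This article is an original work by 大侠 (AhcaoZhu); please credit the source when reposting.
Link: https://blog.csdn.net/Ahcao2008
charset_normalizer at a glance: charset normalization, "the real first universal charset detector"; collected reference material and notes (complete)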
- 🧊Summary
- 🧊Module diagram
- 🧊Class relationship diagram
- 🧊Full module expansion
- ☘️【charset_normalizer】
- 🔵Statistics
- 🔵Constants
- 🌿list
- 🔵Modules
- 🌿2 logging
- 🌿3 charset_normalizer.assets
- 🌿4 charset_normalizer.constant
- 🌿5 charset_normalizer.md__mypyc
- 🌿6 charset_normalizer.utils
- 🌿7 charset_normalizer.md
- 🌿8 charset_normalizer.models
- 🌿9 charset_normalizer.cd
- 🌿10 charset_normalizer.api
- 🌿11 charset_normalizer.legacy
- 🌿12 charset_normalizer.version
- 🔵Functions
- 🌿13 from_bytes(sequences: bytes, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches
- 🌿14 from_fp(fp: <class 'BinaryIO'>, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches
- 🌿15 from_path(path: 'PathLike[Any]', steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches
- 🌿16 detect(byte_str: bytes, should_rename_legacy: bool = False, **kwargs: Any) -> Dict[str, Union[str, float, NoneType]]
- 🌿17 set_logging_handler(name: str = 'charset_normalizer', level: int = 20, format_string: str = '%(asctime)s | %(levelname)s | %(message)s') -> None
- 🔵Classes
- 🌿18 charset_normalizer.models.CharsetMatch
- property
- method
- 18 add_submatch(self, other: "CharsetMatch") -> None:
- 19 output(self, encoding: str = "utf_8") -> bytes:
- 🌿19 charset_normalizer.models.CharsetMatches
- method
- 1 append(self, item: CharsetMatch) -> None:
- 2 best(self) -> Optional["CharsetMatch"]:
- 3 first(self) -> Optional["CharsetMatch"]:
- ☘️【logging】
- ☘️【charset_normalizer.assets】
- ☘️【charset_normalizer.constant】
- ☘️【charset_normalizer.md__mypyc】
- ☘️【charset_normalizer.utils】
- ☘️【charset_normalizer.md】
- ☘️【charset_normalizer.models】
- ☘️【charset_normalizer.cd】
- ☘️【charset_normalizer.api】
- ☘️【charset_normalizer.legacy】
- ☘️【charset_normalizer.version】
- ☘️【importlib】
- ☘️【unicodedata】
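🧊Summary
- This article covers Python's charset_normalizer module (charset normalization, "the real first universal charset detector"): its functions, its classes, and the methods and properties of those classes.
- The content was extracted by code, machine-translated with AI, and proofread by hand.
- It is intended as a dictionary-style reference and is part of a series; further installments will follow. [Original: AhcaoZhu大侠]
🧊Module diagram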
charset_normalizer
charset_normalizer.assets
charset_normalizer.constant
charset_normalizer.md__mypyc
charset_normalizer.utils
◆unicodedata
charset_normalizer.md
charset_normalizer.models
charset_normalizer.cd
charset_normalizer.api
charset_normalizer.legacy
charset_normalizer.version
🧊Class relationship diagram
◆object
charset_normalizer.md.MessDetectorPlugin
charset_normalizer.md.ArchaicUpperLowerPlugin
charset_normalizer.md.CjkInvalidStopPlugin
charset_normalizer.md.SuperWeirdWordPlugin
charset_normalizer.md.SuspiciousDuplicateAccentPlugin
charset_normalizer.md.SuspiciousRange
charset_normalizer.md.TooManyAccentuatedPlugin
charset_normalizer.md.TooManySymbolOrPunctuationPlugin
charset_normalizer.md.UnprintablePlugin
charset_normalizer.models.CharsetMatch
charset_normalizer.models.CharsetMatches
charset_normalizer.models.CliDetectionResult
◆unicodedata.UCD
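🧊Full module expansion
☘️【charset_normalizer】
charset_normalizer, fullname=charset_normalizer, file=charset_normalizer\__init__.py
Charset Normalizer: the real first universal charset detector.
A library that helps you read text from an unknown charset encoding.
Motivated by chardet, this package tries to solve the problem by taking a new approach.
All IANA character set names for which the Python core library provides codecs are supported.
Basic usage: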
>>> from charset_normalizer import from_bytes
>>> results = from_bytes('Bсеки човек има право на образование. Oбразованието!'.encode('utf_8'))
>>> best_guess = results.best()
>>> str(best_guess)
'Bсеки човек има право на образование. Oбразованието!'
Other methods and usages are available; see the [full documentation](https://github.com/Ousret/charset_normalizer).
:copyright: (c) 2021 by Ahmed TAHRI
:license: MIT, see LICENSE for more details.
🔵Statistics
No. | Category | Count |
---|---|---|
4 | str | 6 |
5 | tuple | 1 |
6 | list | 2 |
8 | dict | 1 |
9 | module | 11 |
10 | class | 2 |
11 | function | 5 |
13 | residual | 2 |
14 | system | 11 |
16 | all | 30 |
🔵Constants
🌿list
1 VERSION ['3', '1', '0']
🔵Modules
🌿2 logging
logging, fullname=logging, file=logging\__init__.py
🌿3 charset_normalizer.assets
assets, fullname=charset_normalizer.assets, file=charset_normalizer\assets\__init__.py
🌿4 charset_normalizer.constant
constant, fullname=charset_normalizer.constant, file=charset_normalizer\constant.py
🌿5 charset_normalizer.md__mypyc
md__mypyc, fullname=charset_normalizer.md__mypyc, file=charset_normalizer\md__mypyc.cp37-win_amd64.pyd
🌿6 charset_normalizer.utils
utils, fullname=charset_normalizer.utils, file=charset_normalizer\utils.py
🌿7 charset_normalizer.md
md, fullname=charset_normalizer.md, file=charset_normalizer\md.cp37-win_amd64.pyd
🌿8 charset_normalizer.models
models, fullname=charset_normalizer.models, file=charset_normalizer\models.py
🌿9 charset_normalizer.cd
cd, fullname=charset_normalizer.cd, file=charset_normalizer\cd.py
🌿10 charset_normalizer.api
api, fullname=charset_normalizer.api, file=charset_normalizer\api.py
🌿11 charset_normalizer.legacy
legacy, fullname=charset_normalizer.legacy, file=charset_normalizer\legacy.py
🌿12 charset_normalizer.version
version, fullname=charset_normalizer.version, file=charset_normalizer\version.py
Exposes the version.
🔵Functions
🌿13 from_bytes(sequences: bytes, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches
from_bytes(sequences: bytes, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches, module=charset_normalizer.api, line:33 at site-packages\charset_normalizer\api.py
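Given a raw bytes sequence, returns the best possible charsets usable to render str objects.
If there are no results, that is a strong indicator that the source is binary rather than text.
By default, the process extracts 5 blocks of 512 bytes each to assess the mess and coherence of the given sequence, and it gives up on a particular code page once 20% mess has been measured. These criteria can be customized at will (steps, chunk_size, threshold).
The preemptive behaviour does not replace the traditional detection workflow: it prioritizes a particular code page but never takes it for granted. It can improve performance.
You may want to focus on some code pages and/or ignore others; use cp_isolation and cp_exclusion for that purpose.
This function strips the SIG from the payload/sequence every time, except for UTF-16 and UTF-32.
By default the library sets up no handler other than the NullHandler. If you set the 'explain' toggle to True, it alters the logger configuration to add a StreamHandler suitable for debugging.
A custom logging format and handler can be set manually.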
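The tuning options described above can be sketched as follows. This is a minimal example that assumes charset_normalizer is installed; the sample text is the one from the package docstring, and the pair of candidate code pages passed to cp_isolation is only an illustrative choice.

```python
from charset_normalizer import from_bytes

text = "Bсеки човек има право на образование. Oбразованието!"
payload = text.encode("utf_8")

# Defaults written out explicitly: 5 chunks of 512 bytes, give up at 20% mess.
# cp_isolation limits the search to the listed code pages (illustrative choice).
results = from_bytes(
    payload,
    steps=5,
    chunk_size=512,
    threshold=0.2,
    cp_isolation=["utf_8", "cp1251"],
)
best = results.best()

if best is None:
    # No match at all: the payload is most likely binary, not text.
    decoded = None
else:
    decoded = str(best)
print(decoded)

# A payload that starts with a UTF-8 signature (SIG/BOM): the signature is
# detected and reported through the byte_order_mark property.
with_sig = from_bytes(text.encode("utf_8_sig")).best()
has_sig = with_sig is not None and with_sig.byte_order_mark
print(has_sig)
```

Narrowing the candidates with cp_isolation mainly makes sense when the possible encodings are known in advance; otherwise the defaults are usually a reasonable starting point.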
🌿14 from_fp(fp: <class 'BinaryIO'>, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches
from_fp(fp: <class 'BinaryIO'>, steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches, module=charset_normalizer.api, line:500 at site-packages\charset_normalizer\api.py
Same as the from_bytes function, but takes a file pointer that is already open.
Does not close the file pointer.
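A small sketch of from_fp, assuming charset_normalizer is installed. An in-memory io.BytesIO object stands in for a real binary file here; that substitution is an assumption made to keep the example self-contained.

```python
import io

from charset_normalizer import from_fp

text = "Bсеки човек има право на образование. Oбразованието!"

# Any binary file object works; BytesIO stands in for a file opened with "rb".
fp = io.BytesIO(text.encode("utf_8"))

best = from_fp(fp).best()
decoded = str(best) if best is not None else None
print(decoded)

# from_fp leaves the file pointer open, so closing it is up to the caller.
still_open = not fp.closed
fp.close()
```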
🌿15 from_path(path: 'PathLike[Any]', steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches
from_path(path: 'PathLike[Any]', steps: int = 5, chunk_size: int = 512, threshold: float = 0.2, cp_isolation: Union[List[str], NoneType] = None, cp_exclusion: Union[List[str], NoneType] = None, preemptive_behaviour: bool = True, explain: bool = False, language_threshold: float = 0.1) -> charset_normalizer.models.CharsetMatches, module=charset_normalizer.api, line:528 at site-packages\charset_normalizer\api.py
Same as the from_bytes function, but with one extra step:
it opens and reads the given file path in binary mode.
May raise IOError.
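The following sketch shows from_path on a temporary file, including the IOError case mentioned above. It assumes charset_normalizer is installed; the file names sample.txt and missing.txt are hypothetical.

```python
import os
import tempfile

from charset_normalizer import from_path

text = "Bсеки човек има право на образование. Oбразованието!"

with tempfile.TemporaryDirectory() as tmp:
    # Write a small UTF-8 file to detect (hypothetical file name).
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as handle:
        handle.write(text.encode("utf_8"))

    best = from_path(path).best()
    decoded = str(best) if best is not None else None

    # A path that does not exist raises IOError (an alias of OSError in Python 3).
    error_name = None
    try:
        from_path(os.path.join(tmp, "missing.txt"))
    except IOError as exc:
        error_name = type(exc).__name__

print(decoded)
print(error_name)
```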
🌿16 detect(byte_str: bytes, should_rename_legacy: bool = False, **kwargs: Any) -> Dict[str, Union[str, float, NoneType]]
detect(byte_str: bytes, should_rename_legacy: bool = False, **kwargs: Any) -> Dict[str, Union[str, float, NoneType]], module=charset_normalizer.legacy, line:8 at site-packages\charset_normalizer\legacy.py
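Detects the encoding of the given byte string.
It should be mostly backward-compatible. Encoding names match Chardet's own spelling whenever possible
(except for encoding names that Chardet does not support). This function is deprecated and is meant to ease migrating a project; consult the documentation for further information.
Removal is not planned.
:param byte_str: The byte sequence to examine.
:param should_rename_legacy: Whether to rename legacy encodings to their more modern equivalents.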
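For projects migrating from chardet, a call to detect might look like the sketch below. It assumes charset_normalizer is installed; the result is a chardet-style dictionary, and the example only relies on its encoding, language and confidence keys.

```python
from charset_normalizer import detect

text = "Bсеки човек има право на образование. Oбразованието!"
payload = text.encode("utf_8")

# Returns a chardet-style dict rather than a CharsetMatches container.
result = detect(payload)
print(result["encoding"], result["language"], result["confidence"])

# The reported name can be passed straight to bytes.decode when it is not None.
encoding = result["encoding"]
decoded = payload.decode(encoding) if encoding else None
print(decoded)
```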
🌿17 set_logging_handler(name: str = 'charset_normalizer', level: int = 20, format_string: str = '%(asctime)s | %(levelname)s | %(message)s') -> None
set_logging_handler(name: str = 'charset_normalizer', level: int = 20, format_string: str = '%(asctime)s | %(levelname)s | %(message)s') -> None, module=charset_normalizer.utils, line:348 at site-packages\charset_normalizer\utils.py
🔵Classes
🌿18 charset_normalizer.models.CharsetMatch
CharsetMatch, charset_normalizer.models.CharsetMatch, module=charset_normalizer.models, line:10 at site-packages\charset_normalizer\models.py
property
1 alphabets=<property object at 0x000001C836607868> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
2 bom=<property object at 0x000001C8365ED048> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
3 byte_order_mark=<property object at 0x000001C8365ED0E8> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
4 chaos=<property object at 0x000001C8365ED2C8> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
5 coherence=<property object at 0x000001C8365ED368> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
6 could_be_from_charset=<property object at 0x000001C8366078B8> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
7 encoding=<property object at 0x000001C8365E7EA8> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
8 encoding_aliases=<property object at 0x000001C8365E7F48> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
9 fingerprint=<property object at 0x000001C836607908> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
10 has_submatch=<property object at 0x000001C836607818> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
11 language=<property object at 0x000001C8365ED228> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
12 languages=<property object at 0x000001C8365ED188> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
13 multi_byte_usage=<property object at 0x000001C8365E7D18> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
14 percent_chaos=<property object at 0x000001C8365ED408> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
15 percent_coherence=<property object at 0x000001C8365ED4A8> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
16 raw=<property object at 0x000001C8365ED548> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
17 submatch=<property object at 0x000001C8366077C8> kind:property type:property class:<class 'charset_normalizer.models.CharsetMatch'>
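The listing above gives only property names, so the sketch below reads a few of them from a detection result to show what they hold in practice. It assumes charset_normalizer is installed; the short descriptions in the comments are my reading of the property names, not text from the source.

```python
from charset_normalizer import from_bytes

text = "Bсеки човек има право на образование. Oбразованието!"
payload = text.encode("utf_8")
best = from_bytes(payload).best()

summary = {}
if best is not None:
    summary = {
        "encoding": best.encoding,            # detected encoding name
        "aliases": best.encoding_aliases,     # other names for that encoding
        "language": best.language,            # most probable language
        "chaos": best.chaos,                  # measured mess, as a ratio
        "coherence": best.coherence,          # language coherence, as a ratio
        "percent_chaos": best.percent_chaos,  # the same mess, as a percentage
        "has_bom": best.bom,                  # whether a BOM/SIG was found
        "raw_is_input": best.raw == payload,  # raw keeps the original bytes
    }

for key, value in summary.items():
    print(f"{key}: {value}")
```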
method
18 add_submatch(self, other: "CharsetMatch") -> None:
kind=method class=CharsetMatch objtype=function line:77 at …\site-packages\charset_normalizer\models.py
19 output(self, encoding: str = "utf_8") -> bytes:
kind=method class=CharsetMatch objtype=function line:203 at …\site-packages\charset_normalizer\models.py
Returns the payload re-encoded as bytes using the given target encoding.
Defaults to UTF-8. Any errors are simply ignored by the encoder, not replaced.
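A typical use of output is converting a detected payload to another encoding. The sketch below assumes charset_normalizer is installed; cp1251 is chosen as the target only because it can represent the Cyrillic sample text.

```python
from charset_normalizer import from_bytes

text = "Bсеки човек има право на образование. Oбразованието!"
best = from_bytes(text.encode("utf_8")).best()

utf8_bytes = b""
cp1251_bytes = b""
if best is not None:
    # With no argument the payload is re-encoded as UTF-8.
    utf8_bytes = best.output()
    # Any codec known to Python can be given as the target encoding.
    cp1251_bytes = best.output(encoding="cp1251")

print(len(utf8_bytes), len(cp1251_bytes))
```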
🌿19 charset_normalizer.models.CharsetMatches
CharsetMatches, charset_normalizer.models.CharsetMatches, module=charset_normalizer.models, line:222 at site-packages\charset_normalizer\models.py
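Container holding every CharsetMatch item, ordered by default from the most probable to the least probable.
Acts like a list (it is iterable) but does not implement all of the related methods.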
method
1 append(self, item: CharsetMatch) -> None:
kind=method class=CharsetMatches objtype=function line:254 at …\site-packages\charset_normalizer\models.py
Inserts a single match. It is placed so that the sort order is preserved.
It may be inserted as a submatch.
2 best(self) -> Optional["CharsetMatch"]:
kind=method class=CharsetMatches objtype=function line:274 at …\site-packages\charset_normalizer\models.py
Simply returns the first match.
Strictly equivalent to matches[0].
3 first(self) -> Optional["CharsetMatch"]:
kind=method class=CharsetMatches objtype=function line:282 at …\site-packages\charset_normalizer\models.py
Redundant method; it calls best().
Kept for backward-compatibility (BC) reasons.
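The list-like behaviour of CharsetMatches described above can be sketched as follows, assuming charset_normalizer is installed. The example uses only iteration, len, indexing, best() and first().

```python
from charset_normalizer import from_bytes

text = "Bсеки човек има право на образование. Oбразованието!"
results = from_bytes(text.encode("utf_8"))

# The container supports len(), truth testing, indexing and iteration.
count = len(results)
best = results.best()

# best() returns the first item, and first() simply delegates to best().
same_as_index = bool(results) and results[0] is best
same_as_first = results.first() is best

# Matches are iterated from the most probable to the least probable.
encodings = [match.encoding for match in results]
print(count, encodings)
```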
☘️【logging】
logging, fullname=logging, file=logging\__init__.py
☘️【charset_normalizer.assets】
assets, fullname=charset_normalizer.assets, file=charset_normalizer\assets\__init__.py
☘️【charset_normalizer.constant】
constant, fullname=charset_normalizer.constant, file=charset_normalizer\constant.py
☘️【charset_normalizer.md__mypyc】
md__mypyc, fullname=charset_normalizer.md__mypyc, file=charset_normalizer\md__mypyc.cp37-win_amd64.pyd
☘️【charset_normalizer.utils】
utils, fullname=charset_normalizer.utils, file=charset_normalizer\utils.py
☘️【charset_normalizer.md】
md, fullname=charset_normalizer.md, file=charset_normalizer\md.cp37-win_amd64.pyd
☘️【charset_normalizer.models】
models, fullname=charset_normalizer.models, file=charset_normalizer\models.py
☘️【charset_normalizer.cd】
cd, fullname=charset_normalizer.cd, file=charset_normalizer\cd.py
☘️【charset_normalizer.api】
api, fullname=charset_normalizer.api, file=charset_normalizer\api.py
☘️【charset_normalizer.legacy】
legacy, fullname=charset_normalizer.legacy, file=charset_normalizer\legacy.py
☘️【charset_normalizer.version】
version, fullname=charset_normalizer.version, file=charset_normalizer\version.py
☘️【importlib】
importlib, fullname=importlib, file=importlib\__init__.py
☘️【unicodedata】
unicodedata, fullname=unicodedata, file=unicodedata.pyd