1 PyDictObject
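In C++, the STL map is implemented on a red-black tree (a balanced binary search tree), so lookup takes O(log n) time.
In Python, PyDictObject is implemented on a hash table (hash function), so lookup takes O(1) time on average; the worst case, when many keys collide, is O(n).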
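1.1 Hash table
Problem: hash collisions, i.e. several elements produce the same hash value (or map to the same slot).
Solutions:
(1) Separate chaining
(2) Open addressing, for example quadratic probing. CPython uses open addressing, but its probe sequence is not purely quadratic: the next slot is i = (5*i + perturb + 1) mod m, where perturb starts as the hash value and is shifted right by 5 bits (PERTURB_SHIFT) at every step.
Quadratic probing finds the next free slot by adding an offset that is a quadratic function of the probe count. Let the table size be m and let the hash function h(x) give an element's initial position. On a collision, the element is placed at (h(x) + i^2) mod m, where i is the probe step. A two-sided variant tries the offsets +1^2, -1^2, +2^2, -2^2, ... in turn to find the next available slot.
Starting from one slot, further slots are reached one after another, forming a "collision probe chain". Physically removing an element in the middle of a chain would break it and make the elements behind it unreachable, so deletion is lazy (pseudo-deletion): the slot is marked as a dummy instead of being emptied.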
1.2 Definitions
typedef struct {
/* Cached hash code of me_key. */
Py_hash_t me_hash; // Hash of me_key, cached so it does not have to be recomputed on every lookup
PyObject *me_key;
PyObject *me_value; /* This field is only meaningful for combined tables */
} PyDictKeyEntry;
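The four states in an entry's life cycle
- Unused: me_key and me_value are both NULL (the state of an entry right after initialization).
- Active: the entry stores a key-value pair.
- Dummy: the entry has been pseudo-deleted. In older CPython versions me_key pointed to a special dummy object; in the compact layout shown below, the slot in dk_indices holds DKIX_DUMMY instead.
- Pending: key != NULL and value == NULL (split tables only); the key has not yet been inserted into the split table.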
/* The ma_values pointer is NULL for a combined table
* or points to an array of PyObject* for a split table*/
typedef struct {
PyObject_HEAD
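Py_ssize_t ma_used; /* Number of items in the dictionary */
#ifdef Py_BUILD_CORE
uint64_t ma_version_tag; /* Dictionary version; changes whenever the dictionary is modified */
#else
Py_DEPRECATED(3.12) uint64_t ma_version_tag; /* Dictionary version; changes whenever the dictionary is modified */
#endif
/* If ma_values is NULL, the table is combined: keys and values are both stored in ma_keys. */
/* If ma_values is not NULL, the table is split: keys are stored in ma_keys and values in ma_values. */
PyDictKeysObject *ma_keys; /* The hash table that actually stores the data; two storage layouts exist */
PyDictValues *ma_values; // NULL for a combined table, the values array for a split table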
} PyDictObject;
struct _dictkeysobject {
Py_ssize_t dk_refcnt; // Reference count
/* Size of the hash table (dk_indices). It must be a power of 2. */
uint8_t dk_log2_size; // log2 of the number of slots in the hash table (dk_indices)
/* Size of the hash table (dk_indices) by bytes. */
uint8_t dk_log2_index_bytes; // log2 of the size of dk_indices in bytes
/* Kind of keys */
uint8_t dk_kind; // Kind of keys stored in this table
/* Version number -- Reset to 0 by any modification to keys */
uint32_t dk_version; // Version number
/* Number of usable entries in dk_entries. */
Py_ssize_t dk_usable; // Number of entries in dk_entries that are still available
/* Number of used entries in dk_entries. */
Py_ssize_t dk_nentries; // Number of entries in dk_entries already in use (Py_ssize_t is 8 bytes on 64-bit builds)
/* Actual hash table of dk_size entries. It holds indices in dk_entries,
or DKIX_EMPTY(-1) or DKIX_DUMMY(-2).
Indices must be: 0 <= indice < USABLE_FRACTION(dk_size).
The size in bytes of an indice depends on dk_size:
- 1 byte if dk_size <= 0xff (char*)
- 2 bytes if dk_size <= 0xffff (int16_t*)
- 4 bytes if dk_size <= 0xffffffff (int32_t*)
- 8 bytes otherwise (int64_t*)
Dynamically sized, SIZEOF_VOID_P is minimum. */
char dk_indices[]; /* Index array; each element is 1, 2, 4 or 8 bytes wide depending on dk_size (see above) */
/* "PyDictKeyEntry or PyDictUnicodeEntry dk_entries[USABLE_FRACTION(DK_SIZE(dk))];" array follows:
see the DK_ENTRIES() macro */
};
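The size rule in the dk_indices comment can be written out as a small Python helper. This is only a restatement of the thresholds quoted in that comment; the function names index_entry_bytes and indices_total_bytes are made up for illustration and do not exist in CPython.

```python
def index_entry_bytes(dk_size):
    """Width in bytes of one dk_indices element for a table of dk_size slots."""
    if dk_size <= 0xFF:
        return 1        # char
    if dk_size <= 0xFFFF:
        return 2        # int16_t
    if dk_size <= 0xFFFFFFFF:
        return 4        # int32_t
    return 8            # int64_t


def indices_total_bytes(dk_log2_size):
    """Total size of dk_indices in bytes, given log2 of the slot count."""
    dk_size = 1 << dk_log2_size
    return dk_size * index_entry_bytes(dk_size)


for log2_size in (3, 8, 16):
    print(log2_size, indices_total_bytes(log2_size))
```

A small dict with 8 slots therefore needs only 8 bytes for its index array, which is why keeping the indices separate from the entries saves memory.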
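1.3 Storage layout in Python 3.6+
- For a key-value pair, compute inx = hash(key) % num, where num is the length of the index table (a power of 2, so this equals hash(key) & (num - 1)). The index table stores offsets into the entries array.
- indices[inx] holds the offset in entries at which the record (hash value = hash(key), key, value) is stored.
- If that slot is already occupied, the collision-resolution strategy is used to find the next free index slot.
- Lookup follows the same steps and compares the stored hash and key (not the value) with the requested key to decide whether the entry is the one wanted.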
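The steps above can be sketched in Python as a sparse index table plus a dense entries list. This is a simplified model under several assumptions: ToyCompactDict is a hypothetical class, it supports only insert and lookup (no delete, no resize), and it reuses the DKIX_EMPTY sentinel and the 2/3 usable fraction from CPython only to keep the names familiar.

```python
DKIX_EMPTY = -1   # sentinel quoted in the dk_indices comment


class ToyCompactDict:
    """Insert and lookup only; fixed size, for illustration."""

    def __init__(self, log2_size=3):
        size = 1 << log2_size
        self.mask = size - 1
        self.usable = (size << 1) // 3       # at most 2/3 of the slots are used
        self.indices = [DKIX_EMPTY] * size   # sparse: slot -> offset in entries
        self.entries = []                    # dense: (hash, key, value) records

    def _slots(self, h):
        """Yield index-table slots in probe order for hash h."""
        h &= (1 << 64) - 1                   # treat the hash as unsigned
        perturb = h
        slot = h & self.mask
        while True:
            yield slot
            perturb >>= 5
            slot = (slot * 5 + perturb + 1) & self.mask

    def _lookup(self, key, h):
        """Return (slot, offset); offset is DKIX_EMPTY when key is absent."""
        for slot in self._slots(h):
            ix = self.indices[slot]
            if ix == DKIX_EMPTY:
                return slot, DKIX_EMPTY
            e_hash, e_key, _ = self.entries[ix]
            # Compare identity first, then cached hash, then equality.
            if e_key is key or (e_hash == h and e_key == key):
                return slot, ix

    def __setitem__(self, key, value):
        h = hash(key)
        slot, ix = self._lookup(key, h)
        if ix != DKIX_EMPTY:
            self.entries[ix] = (h, key, value)   # overwrite in place
            return
        if len(self.entries) >= self.usable:
            raise RuntimeError("toy table full (a real dict would resize)")
        self.indices[slot] = len(self.entries)   # record the new offset
        self.entries.append((h, key, value))

    def __getitem__(self, key):
        _, ix = self._lookup(key, hash(key))
        if ix == DKIX_EMPTY:
            raise KeyError(key)
        return self.entries[ix][2]


d = ToyCompactDict()
d["a"] = 1
d["b"] = 2
d["a"] = 3
print([k for _, k, _ in d.entries])   # keys appear in insertion order
```

Because new records are always appended to entries, iterating over entries yields keys in insertion order, which is how this layout gives dict its ordering guarantee.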
---------------To be continued if necessary---------------------------------------------------
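2 Interpreter
2.1 Components
Compiler: compiles source code to bytecode. (.pyc files are produced when a .py file is imported or when the compileall or py_compile modules are used; the built-in compile() returns an in-memory code object and does not write a file by itself.)
Virtual machine: executes the bytecode.
Execution environment: dictionary objects that hold the variables created dynamically at run time, i.e. the mapping from variable names to values.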
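The three parts can be observed from Python itself. The sketch below uses only the built-ins compile() and exec() and the standard dis module; the source string and the variable names are arbitrary examples.

```python
import dis

source = "x = 1\ny = x + 2\n"

# Compiler: source text -> code object (bytecode plus static data).
code = compile(source, "<example>", "exec")

# Execution environment: a plain dict mapping names to values.
namespace = {}

# Virtual machine: executes the bytecode against that namespace.
exec(code, namespace)

print(type(code).__name__)
print(namespace["x"], namespace["y"])

# Show the bytecode instructions the virtual machine executed.
dis.dis(code)
```

After exec() returns, the names x and y exist only as keys of the namespace dict, which is the name-to-value mapping described above.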
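2.2 Script execution flow
1. Load and link the required modules.
2. Compile the source code into a PyCodeObject and keep it in memory, so the virtual machine can execute it without parsing the source again.
Note: how are bytecode and PyCodeObject related?
A PyCodeObject holds static information such as strings, constants, and the operations (bytecode). At run time this information lives in the PyCodeObject; it can be serialized to a .pyc file.
3. Read the instructions from memory and execute them (done by the virtual machine).
The compiler and the virtual machine both live in the Python DLL (python .dll).
4. Depending on how the code was loaded, the PyCodeObject may also be written to disk as a .pyc file. This happens for imported modules, not for the script that is run directly. (.pyo files were removed in Python 3.5; optimized bytecode now uses .pyc files with an opt- tag in the name.)
5. The next time the module is imported, Python first checks whether a matching, up-to-date .pyc file exists. If so, the code object is loaded from it instead of being recompiled.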
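To see where the cached bytecode ends up, the following sketch compiles a throwaway module with the standard py_compile module and compares the result with importlib.util.cache_from_source(). The module name demo_mod and its content are arbitrary; py_compile is used here as a stand-in for what the import system does automatically.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo_mod.py")
    with open(src, "w", encoding="utf-8") as f:
        f.write("VALUE = 42\n")

    # Path where the import system would look for cached bytecode.
    expected = importlib.util.cache_from_source(src)

    # Compile explicitly and write the .pyc file.
    written = py_compile.compile(src, doraise=True)

    pyc_exists = os.path.exists(written)
    same_path = os.path.abspath(written) == os.path.abspath(expected)

print(os.path.basename(written))
print(pyc_exists, same_path)
```

The file name contains the interpreter tag (for example cpython-3xx), so bytecode from different Python versions can sit side by side in the same cache directory.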
2.3 PyCodeObject
struct PyCodeObject{ \
PyObject_VAR_HEAD \
\
/* Note only the following fields are used in hash and/or comparisons \
* \
* - co_name \
* - co_argcount \
* - co_posonlyargcount \
* - co_kwonlyargcount \
* - co_nlocals \
* - co_stacksize \
* - co_flags \
* - co_firstlineno \
* - co_consts \
* - co_names \
* - co_localsplusnames \
* This is done to preserve the name and line number for tracebacks \
* and debuggers; otherwise, constant de-duplication would collapse \
* identical functions/lambdas defined on different lines. \
*/ \
\
/* These fields are set with provided values on new code objects. */ \
\
/* The hottest fields (in the eval loop) are grouped here at the top. */ \
PyObject *co_consts; /* list (constants used) */ \
PyObject *co_names; /* list of strings (names used) */ \
PyObject *co_exceptiontable; /* Byte string encoding exception handling \
table */ \
int co_flags; /* CO_..., see below */ \
\
/* The rest are not so impactful on performance. */ \
int co_argcount; /* #arguments, except *args */ \
int co_posonlyargcount; /* #positional only arguments */ \
int co_kwonlyargcount; /* #keyword only arguments */ \
int co_stacksize; /* #entries needed for evaluation stack */ \
int co_firstlineno; /* first source line number */ \
\
/* redundant values (derived from co_localsplusnames and \
co_localspluskinds) */ \
int co_nlocalsplus; /* number of local + cell + free variables */ \
int co_framesize; /* Size of frame in words */ \
int co_nlocals; /* number of local variables */ \
int co_ncellvars; /* total number of cell variables */ \
int co_nfreevars; /* number of free variables */ \
uint32_t co_version; /* version number */ \
\
PyObject *co_localsplusnames; /* tuple mapping offsets to names */ \
PyObject *co_localspluskinds; /* Bytes mapping to local kinds (one byte \
per variable) */ \
PyObject *co_filename; /* unicode (where it was loaded from) */ \
PyObject *co_name; /* unicode (name, for reference) */ \
PyObject *co_qualname; /* unicode (qualname, for reference) */ \
PyObject *co_linetable; /* bytes object that holds location info */ \
PyObject *co_weakreflist; /* to support weakrefs to code objects */ \
_PyCoCached *_co_cached; /* cached co_* attributes */ \
uint64_t _co_instrumentation_version; /* current instrumentation version */ \
_PyCoMonitoringData *_co_monitoring; /* Monitoring data */ \
int _co_firsttraceable; /* index of first traceable instruction */ \
/* Scratch space for extra data relating to the code object. \
Type is a void* to keep the format private in codeobject.c to force \
people to go through the proper APIs. */ \
void *co_extra; \
char co_code_adaptive[(SIZE)]; \
}
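1. Each namespace (code block) corresponds to one PyCodeObject.
2. Classes, functions, and modules each have their own namespace, and these can be nested: the code object of an inner block is stored in the co_consts of the enclosing block's code object.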
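The nesting can be checked by compiling a small module and walking co_consts recursively. The sample source and the helper walk() are illustrative only; types.CodeType is the Python-level type of PyCodeObject.

```python
import types

source = '''
def outer():
    def inner():
        return 1
    return inner

class C:
    def method(self):
        pass
'''

module_code = compile(source, "<example>", "exec")


def walk(code, depth=0, out=None):
    """Collect (depth, co_name) for a code object and all nested ones."""
    out = [] if out is None else out
    out.append((depth, code.co_name))
    for const in code.co_consts:
        if isinstance(const, types.CodeType):   # a nested code block
            walk(const, depth + 1, out)
    return out


tree = walk(module_code)
for depth, name in tree:
    print("  " * depth + name)
```

The module, the two functions, the class body, and the method each show up as a separate code object, with inner ones reachable only through the co_consts of their parent.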
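2.4 pyc files
A pyc file = magic number (identifies the Python version that produced it) + the last-modification time of the source file (used on the next load to detect whether the source has changed) + the serialized PyCodeObject. Since Python 3.7 the header is 16 bytes: a 4-byte magic number, a 4-byte flags field, and then either the source mtime and source size (4 bytes each) or an 8-byte hash of the source.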
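The layout can be verified by writing a .pyc file and reading it back. The sketch assumes Python 3.7 or later (16-byte header) and forces timestamp-based invalidation so that the mtime and size fields are present; demo_mod is an arbitrary module name.

```python
import importlib.util
import marshal
import os
import py_compile
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo_mod.py")
    with open(src, "w", encoding="utf-8") as f:
        f.write("VALUE = 42\n")

    pyc_path = py_compile.compile(
        src,
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    with open(pyc_path, "rb") as f:
        data = f.read()
    src_stat = os.stat(src)

magic = data[:4]                                   # identifies the bytecode version
(flags,) = struct.unpack("<I", data[4:8])          # 0 means timestamp-based
mtime, size = struct.unpack("<II", data[8:16])     # source mtime and source size
code = marshal.loads(data[16:])                    # the serialized code object

print(magic == importlib.util.MAGIC_NUMBER)
print(flags, mtime, size)
print(type(code).__name__, code.co_names)
```

The body after the header is produced by the marshal module, so marshal.loads() turns it back into a code object that exec() can run directly.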
2.5 How a pyc file is created (writing the PyCodeObject to a file)
-------------------------To be written