Nanolog起步笔记-10-log解压过程(4)寻找meta续2

Nanolog起步笔记-10-log解压过程4寻找meta续2

写在前面
- 重新开始trace
readDictionaryFragment
- 读取meta头部
- 读入每个记录
- createMicroCode
- - 读入头部，和文件名
- 切分format字符串
- - PrintFragment
后记

写在前面

前面的工作，已做打下令人有信心的基础。

重新开始trace

之前我们起步就看到了 metadata ，显然这前就已加载了。
所以，只需要重走一遍代码，就能得到了。
目前的代码相当简练。走过几步后看到：

在这里插入图片描述
显然我们接近了。
这个第一个得到的，就是我们想要的。
先得到类型：

type=73
结转为
struct UnknownHeader {
uint8_t entryType:2;
uint8_t other:6;
};
得到：
LOG_MSGS_OR_DIC = 1,

然后我们详细分析readDictionaryFragment.

readDictionaryFragment

bool
Log::Decoder::readDictionaryFragment(FILE *fd) {

读取meta头部

在这里我们稍作停留：
在这里插入图片描述
能够看到一些信息。特别是，DictionaryFragment 的定义

    /**
     * A DictionaryFragment contains a partial mapping of unique identifiers to
     * static log information on disk. Following this structure is one or more
     * CompressedLogInfo. The order in which these log infos appear determine
     * its unique identifier (i.e. by order of appearance starting at 0).
     */
    NANOLOG_PACK_PUSH
    struct DictionaryFragment {
        // Byte representation of an EntryType::LOG_MSG_OR_DIC to indicate
        // the start of a dictionary fragment.
        uint32_t entryType:2;

        // Number of bytes for this fragment (including all CompressedLogInfo)
        uint32_t newMetadataBytes:30;

        // Total number of FormatMetadata encountered so far in the log
        // including this fragment (used as a sanity check only).
        uint32_t totalMetadataEntries;
    };
    NANOLOG_PACK_POP

翻译一下：
DictionaryFragment 包含唯一标识符到磁盘上的静态日志信息的部分映射。此结构后面是一个或多个CompressedLogInfo。
这些日志信息的出现顺序,取决于其唯一标识符（即按从 0开始的出现顺序）。

这是比较容易理解的。因为logid或者fragment_id，就是这个id。也就是我们要找的meta record 的id.

在这里插入图片描述
这里得到了上图的信息。
最明显的是：(df).totalMetadataEntries=2
这与预期一致，因为我们的确，只有两次写入log。

读入每个记录

    /**
     * Stores the static log information associated with a log message on disk.
     * Following this structure are the filename and format string.
     */
    NANOLOG_PACK_PUSH
    struct CompressedLogInfo {
        // LogLevel severity of the original log invocation
        uint8_t severity;

        // File line number in which the original log invocation appeared
        uint32_t linenum;

        // Length of the filename that is associated with this log invocation
        // and follows after this structure.
        uint16_t filenameLength;

        // Length of the format string that is associated with this log
        // invocation and comes after filename.
        uint16_t formatStringLength;
    };
    NANOLOG_PACK_POP

CompressedLogInfo的信息很明确。
首先，它没有数组之类的信息的位置。
其次，4个信息，简单明了:级别；行号，文件名长，format长。
所以，它是每个meta的record的头部。并不是全部的信息。

读入过程如下：

        CompressedLogInfo cli;
        newBytesRead += fread(&cli, 1, sizeof(CompressedLogInfo), fd);
       newBytesRead += fread(filename, 1, cli.filenameLength, fd);
        newBytesRead += fread(format, 1, cli.formatStringLength, fd);

然后入队和创建：

        fmtId2metadata.push_back(endOfRawMetadata);
        fmtId2fmtString.push_back(format);
        createMicroCode(&endOfRawMetadata,
                            format,
                            filename,
                            cli.linenum,
                            cli.severity);

createMicroCode

读入头部，和文件名

在这里插入图片描述
跳过头部：*microCode += sizeof(FormatMetadata);
*microCode = stpcpy(*microCode, filename) + 1;

     /**
     * Describes a unique log message within the user sources. The order in
     * which this structure appears in the log file determines the associated
     * logId. Following this structure are the source filename and
     * PrintFragments required for this message.
     */

    NANOLOG_PACK_PUSH
    struct FormatMetadata {
        // Number of nibbles in the dynamic data stream used to pack() arguments
        uint8_t numNibbles;

        // Number of PrintFragments following this data structure
        uint8_t numPrintFragments;

        // Log level of the LOG statement in the original source file
        uint8_t logLevel;

        // Line number of the LOG statement in the original source file
        uint32_t lineNumber;

        // Number of bytes in filename[] (including the null character)
        uint16_t filenameLength;

        // Filename for the original source file containing the LOG statement
        char filename[];
    };
    NANOLOG_PACK_POP

FormatMetadata中的信息，是前面的CompressedLogInfo 超集。因为二者作用域不同。CompressedLogInfo 更加倾向于是序列化之后的域中。
FormatMetadata相当于内存中的数据域。
一般而言元数据在程序中，有三层：最上层的有函数的类级；没有函数，但有指针的内存级；用于存盘的序列化级，最低级，要么没有指针，要么指针是转义的（例如转为相对偏移量，从文件的某个位置，一般是零）。

切分format字符串

看到这个
定义正则表达式，准备切分format
我们知道要切分字符串了。
首先，定位到下一个’%’
啊，这里感慨一下，连boost中的正则表达式，也成为std的一部分了。
过正则匹配split后，(这里注意google的工程师，手工将字符串定位到指定位置，然后再进行正则匹配，而不是直接用正则表达式进行查找，可见他们对效率的追求达到了极致)。
在这里插入图片描述
得到
type=NanoLogInternal::Log::const_char_ptr_t
代码如下，这些代码很繁琐，但实际上，如果写底层工程，复用最多的反而是这些底层代码。


/**
 * Parses the <length> and <specifier> components of a printf format sub-string
 * according to http://www.cplusplus.com/reference/cstdio/printf/ and returns
 * a corresponding FormatType.
 *
 * \param length
 *      Length component of the printf format string
 * \param specifier
 *      Specifier component of the printf format string
 * @return
 *      The FormatType corresponding to the length and specifier. A value of
 *      MAX_FORMAT_TYPE is returned in case of error.
 */
static NanoLogInternal::Log::FormatType
getFormatType(std::string length, char specifier)
{
    using namespace NanoLogInternal::Log;

    // Signed Integers
    if (specifier == 'd' || specifier == 'i') {
        if (length.empty())
            return int_t;

        if (length.size() == 2) {
            if (length[0] == 'h') return signed_char_t;
            if (length[0] == 'l') return long_long_int_t;
        }

        switch(length[0]) {
            case 'h': return short_int_t;
            case 'l': return long_int_t;
            case 'j': return intmax_t_t;
            case 'z': return size_t_t;
            case 't': return ptrdiff_t_t;
            default : break;
        }
    }

    // Unsigned integers
    if (specifier == 'u' || specifier == 'o'
            || specifier == 'x' || specifier == 'X')
    {
        if (length.empty())
            return unsigned_int_t;

        if (length.size() == 2) {
            if (length[0] == 'h') return unsigned_char_t;
            if (length[0] == 'l') return unsigned_long_long_int_t;
        }

        switch(length[0]) {
            case 'h': return unsigned_short_int_t;
            case 'l': return unsigned_long_int_t;
            case 'j': return uintmax_t_t;
            case 'z': return size_t_t;
            case 't': return ptrdiff_t_t;
            default : break;
        }
    }

    // Strings
    if (specifier == 's') {
        if (length.empty()) return const_char_ptr_t;
        if (length[0] == 'l') return const_wchar_t_ptr_t;
    }

    // Pointer
    if (specifier == 'p') {
        if (length.empty()) return const_void_ptr_t;
    }


    // Floating points
    if (specifier == 'f' || specifier == 'F'
            || specifier == 'e' || specifier == 'E'
            || specifier == 'g' || specifier == 'G'
            || specifier == 'a' || specifier == 'A')
    {
        if (length.size() == 1 && length[0] == 'L' )
            return long_double_t;
        else
            return double_t;
    }

    if (specifier == 'c') {
        if (length.empty()) return int_t;
        if (length[0] == 'l') return wint_t_t;
    }

    fprintf(stderr, "Attempt to decode format specifier failed: %s%c\r\n",
            length.c_str(), specifier);
    return MAX_FORMAT_TYPE;
}

从我们看到，其实数据是紧密排列的：
即当前文件名"main.cc"的’\0’之后，紧跟着格式信息。
在这里插入图片描述

稍等啊，这里我有些偏离当前正在分析的问题的主题了，基实我要分析的是
在这里插入图片描述
然后根据formatString，自动换算出一个唯一标识符，作为CTF meta的event的name.
当然，也不完全偏离，解析数据时，还是要走现在的部分。

PrintFragment

    /**
     * Describes how to interpret the dynamic log stream and partially
     * reconstruct the original log message.
     */
    NANOLOG_PACK_PUSH
    struct PrintFragment {
        // The type of the argument to pull from the dynamic buffer to the
        // partial format string (formatFragment)
        uint8_t argType:5;

        // Indicates that the fragment requires a dynamic width/precision
        // argument in addition to one required by the format specifier.
        bool hasDynamicWidth:1;
        bool hasDynamicPrecision:1;

        //TODO(syang0) is this necessary? The format fragment is null-terminated
        // Length of the format fragment
        uint16_t fragmentLength;

        // A fragment of the original LOG statement that contains at most
        // one format specifier.
        char formatFragment[];
    };
    NANOLOG_PACK_POP

这里稍作笔记：
在这里插入图片描述 ((NanoLogInternal::Log::PrintFragment *)0x7fffb70a2021)->fragmentLength=42
这里的42是指从format头到现在的位置。
可以，想象，下一次memory copy就短得多了，因为这次，包括了当前这个%号的所有的format字符串。

本轮最后，更新了父node的信息：
在这里插入图片描述
之后是循环遍历完后面的3个%，略。

后记

暂时本系列告一段落的。占用了大概有3天的时间。
后面我需要准备开始转为CTF的meta的工作。
在开始之前，需要先研究一下我们自己的魔改版本的nanolog，具体细节有哪些。
还要看一下业务的代码，哪些位置，是我关注的。
我们做性能分析时，是划格子，涉及到的probe点，不是越多越好，正相反，每个主动对象，我们只需要关注收跳（TTI），和每slot的所有的task的generate,push to Queuq， scheduled, finish。
以及，在processing过程中，被OS重新调度该线程的情况。

后面的工作，许多内容，与本系列关系不太大了。
当然，个人还是有所收获。特别是，以后改为内存形态的log，以及定位当前log在解压时，有错误发生的bug，有了足够的信心和了解。