Go 语言规范学习（1）

文章目录

- Introduction
- Notation
- - - 示例（Go 语言的 `if` 语句）：
- Source code representation
- - Characters
  - - 例子：变量名可以是中文
  - Letters and digits
- Lexical elements
- - Comments
  - Tokens
  - Semicolons
  - - 例子：查看程序所有的token
  - Identifiers
  - Keywords
  - Operators and punctuation【运算符和标点符号】
  - Integer literals
  - Floating-point literals
  - Imaginary literals
  - Rune literals
  - String literals
- Constants

Language version go1.24 (Dec 30, 2024)

Introduction

Go is a general-purpose language designed with systems programming in mind. It is strongly typed and garbage-collected and has explicit support for concurrent programming. Programs are constructed from packages, whose properties allow efficient management of dependencies.

The syntax is compact and simple to parse, allowing for easy analysis by automatic tools such as integrated development environments.

Notation

The syntax is specified using a variant of Extended Backus-Naur Form (EBNF):

Syntax      = { Production } .
Production  = production_name "=" [ Expression ] "." .
Expression  = Term { "|" Term } .
Term        = Factor { Factor } .
Factor      = production_name | token [ "…" token ] | Group | Option | Repetition .
Group       = "(" Expression ")" .
Option      = "[" Expression "]" .
Repetition  = "{" Expression "}" .

Productions are expressions constructed from terms and the following operators, in increasing precedence:

|   alternation
()  grouping
[]  option (0 or 1 times)
{}  repetition (0 to n times)

Lowercase production names are used to identify lexical (terminal【终结符】) tokens.

Non-terminals【非终结符】 are in CamelCase.

Lexical tokens are enclosed in double quotes "" or back quotes ````.

在形式语言理论（Formal Language Theory）和编译原理中，产生式（Production） 是描述语言语法规则的表达式，它定义了如何从一个符号（非终结符）推导出其他符号（终结符或非终结符）的组合。产生式是 上下文无关文法（Context-Free Grammar, CFG） 的核心组成部分。

一个产生式通常写成：

非终结符 = 符号序列 .

其中：

非终结符（Non-terminal）：表示语法结构的抽象概念（如 Expression、Statement）。
符号序列：由终结符（Token）或其他非终结符组成的序列。
. 表示产生式的结束。

示例（Go 语言的 `if` 语句）：

IfStmt = "if" [ SimpleStmt ";" ] Expression Block [ "else" ( IfStmt | Block ) ] .

IfStmt 是非终结符。
"if"、";"、"else" 是终结符（Token）。
SimpleStmt、Expression、Block 是其他非终结符。

The form a … b represents the set of characters from a through b as alternatives.

The horizontal ellipsis … is also used elsewhere in the spec to informally denote various enumerations or code snippets that are not further specified.

The character …【单个unicode字符】 (as opposed to the three characters ...) is not a token of the Go language.

Source code representation

Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points. For simplicity, this document will use the unqualified term character to refer to a Unicode code point in the source text.

Go 不会对源代码进行 Unicode 规范化（Canonicalization）。这意味着：

组合字符序列（如 é 的两种表示方式）会被视为不同的代码点（Code Points）：
1. 单一代码点：é（U+00E9，带重音的拉丁小写字母 e）
2. 组合序列：e（U+0065） + ́（U+0301，组合重音符号）
虽然这两种形式在显示上相同，但 Go 会将其视为 两个不同的字符序列

Each code point is distinct; for instance, uppercase and lowercase letters are different characters.

Implementation restriction: For compatibility with other tools, a compiler may disallow the NUL character (U+0000) in the source text.

Implementation restriction: For compatibility with other tools, a compiler may ignore a UTF-8-encoded byte order mark (U+FEFF) if it is the first Unicode code point in the source text. A byte order mark may be disallowed anywhere else in the source.

Characters

The following terms are used to denote specific Unicode character categories:

newline        = /* the Unicode code point U+000A */ .
unicode_char   = /* an arbitrary Unicode code point except newline */ .
unicode_letter = /* a Unicode code point categorized as "Letter" */ .
unicode_digit  = /* a Unicode code point categorized as "Number, decimal digit" */ .

In The Unicode Standard 8.0, Section 4.5 “General Category” defines a set of character categories. Go treats all characters in any of the Letter categories Lu, Ll, Lt, Lm, or Lo as Unicode letters, and those in the Number category Nd as Unicode digits.

Go 将以下 Unicode 字母类别的所有字符视为「字母」，可用于标识符（变量名、函数名等）：

Lu（Letter, uppercase）：大写字母（如 A, Ω, Д）
Ll（Letter, lowercase）：小写字母（如 a, ω, д）
Lt（Letter, titlecase）：标题字母（如某些连字字符的首字母形式，例如 ǅ）
Lm（Letter, modifier）：修饰字母（如上标字母 ʰ）
Lo（Letter, other）：其他字母（如汉字 中、日文假名 あ、希伯来字母 א）

Go 仅将以下类别的字符视为「数字」：

Nd（Number, decimal digit）：十进制数字（如 0-9、阿拉伯数字 ٠、中文数字 ０等）

例子：变量名可以是中文

package main

import "fmt"

var 名字 = "Go"  // 汉字（Lo）
const π = 3.14 // 希腊字母（Ll）
func Δx(x int) int { // 大写希腊字母（Lu）
	return x + 1
}

func main() {
	fmt.Println(Δx(2))
}

Letters and digits

The underscore character _ (U+005F) is considered a lowercase letter.

letter        = unicode_letter | "_" .
decimal_digit = "0" … "9" .
binary_digit  = "0" | "1" .
octal_digit   = "0" … "7" .
hex_digit     = "0" … "9" | "A" … "F" | "a" … "f" .

Lexical elements

Comments

Comments serve as program documentation. There are two forms:

Line comments start with the character sequence // and stop at the end of the line.
General comments start with the character sequence /* and stop with the first subsequent character sequence */.

A comment cannot start inside a rune or string literal, or inside a comment.

A general comment containing no newlines acts like a space. Any other comment acts like a newline.

Tokens

Tokens form the vocabulary of the Go language. There are four classes: identifiers, keywords, operators and punctuation, and literals. 【标识符，关键字，运算符和标点符号，字面值】

White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored except as it separates tokens that would otherwise combine into a single token.

Also, a newline or end of file may trigger the insertion of a semicolon.

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

Semicolons

The formal syntax uses semicolons ";" as terminators in a number of productions. Go programs may omit most of these semicolons using the following two rules:

When the input is broken into tokens, a semicolon is automatically inserted into the token stream immediately after a line’s final token if that token is
- an identifier
- an integer, floating-point, imaginary, rune, or string literal
- one of the keywords break, continue, fallthrough, or return
- one of the operators and punctuation ++, --, ), ], or }
To allow complex statements to occupy a single line, a semicolon may be omitted before a closing ")" or "}".

To reflect idiomatic use, code examples in this document elide semicolons using these rules.

例子：查看程序所有的token

package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	src := []byte(`
package main

import "fmt"

func main() {
	s := "Hello, tokens!"
	fmt.Println(s)
}
`)

	// 初始化 scanner
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, src, nil, scanner.ScanComments)

	// 扫描 tokens
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%-10s %-15s %q\n", fset.Position(pos), tok, lit)
	}
}

输出

example.go:2:1 package         "package"
example.go:2:9 IDENT           "main"
example.go:2:13 ;               "\n"
example.go:4:1 import          "import"
example.go:4:8 STRING          "\"fmt\""
example.go:4:13 ;               "\n"
example.go:6:1 func            "func"
example.go:6:6 IDENT           "main"
example.go:6:10 (               ""
example.go:6:11 )               ""
example.go:6:13 {               ""
example.go:7:2 IDENT           "s"
example.go:7:4 :=              ""
example.go:7:7 STRING          "\"Hello, tokens!\""
example.go:7:23 ;               "\n"
example.go:8:2 IDENT           "fmt"
example.go:8:5 .               ""
example.go:8:6 IDENT           "Println"
example.go:8:13 (               ""
example.go:8:14 IDENT           "s"
example.go:8:15 )               ""
example.go:8:16 ;               "\n"
example.go:9:1 }               ""
example.go:9:2 ;               "\n"

Identifiers

Identifiers name program entities such as variables and types. An identifier is a sequence of one or more letters and digits. The first character in an identifier must be a letter.

identifier = letter { letter | unicode_digit } .
a
_x9
ThisVariableIsExported
αβ

Some identifiers are predeclared.

预先声明的标识符如下：

Types:
	any bool byte comparable
	complex64 complex128 error float32 float64
	int int8 int16 int32 int64 rune string
	uint uint8 uint16 uint32 uint64 uintptr

Constants:
	true false iota

Zero value:
	nil

Functions:
	append cap clear close complex copy delete imag len
	make max min new panic print println real recover

Keywords

The following keywords are reserved and may not be used as identifiers.

break        default      func         interface    select
case         defer        go           map          struct
chan         else         goto         package      switch
const        fallthrough  if           range        type
continue     for          import       return       var

Operators and punctuation【运算符和标点符号】

The following character sequences represent operators (including assignment operators) and punctuation [Go 1.18]:

+    &     +=    &=     &&    ==    !=    (    )
-    |     -=    |=     ||    <     <=    [    ]
*    ^     *=    ^=     <-    >     >=    {    }
/    <<    /=    <<=    ++    =     :=    ,    ;
%    >>    %=    >>=    --    !     ...   .    :
     &^          &^=          ~

Integer literals

An integer literal is a sequence of digits representing an integer constant.

An optional prefix sets a non-decimal base: 0b or 0B for binary【二进制】, 0, 0o, or 0O for octal【八进制】, and 0x or 0X for hexadecimal 【十六进制】[Go 1.13].

A single 0 is considered a decimal【十进制】 zero.

In hexadecimal literals, letters a through f and A through F represent values 10 through 15.

For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal’s value.

int_lit        = decimal_lit | binary_lit | octal_lit | hex_lit .
decimal_lit    = "0" | ( "1" … "9" ) [ [ "_" ] decimal_digits ] .
binary_lit     = "0" ( "b" | "B" ) [ "_" ] binary_digits .
octal_lit      = "0" [ "o" | "O" ] [ "_" ] octal_digits .
hex_lit        = "0" ( "x" | "X" ) [ "_" ] hex_digits .

decimal_digits = decimal_digit { [ "_" ] decimal_digit } .
binary_digits  = binary_digit { [ "_" ] binary_digit } .
octal_digits   = octal_digit { [ "_" ] octal_digit } .
hex_digits     = hex_digit { [ "_" ] hex_digit } .

42
4_2
0600
0_600
0o600
0O600       // second character is capital letter 'O'
0xBadFace
0xBad_Face
0x_67_7a_2f_cc_40_c6
170141183460469231731687303715884105727
170_141183_460469_231731_687303_715884_105727

_42         // an identifier, not an integer literal
42_         // invalid: _ must separate successive digits
4__2        // invalid: only one _ at a time
0_xBadFace  // invalid: _ must separate successive digits

Floating-point literals

A floating-point literal is a decimal or hexadecimal representation of a floating-point constant.

A decimal floating-point literal consists of an integer part 【整数部分】(decimal digits), a decimal point【小数点】, a fractional part 【小数部分】(decimal digits), and an exponent part【指数部分】 (e or E followed by an optional sign and decimal digits).

One of the integer part or the fractional part may be elided; one of the decimal point or the exponent part may be elided. An exponent value exp scales the mantissa【尾数】 (integer and fractional part) by 10exp.

A hexadecimal floating-point literal consists of a 0x or 0X prefix, an integer part (hexadecimal digits), a radix point【基数点。在 任意进制（如二进制、八进制、十六进制）的浮点数中，用于分隔整数部分和小数部分的点符号 .】, a fractional part (hexadecimal digits), and an exponent part (p or P followed by an optional sign and decimal digits).

One of the integer part or the fractional part may be elided; the radix point may be elided as well, but the exponent part is required. (This syntax matches the one given in IEEE 754-2008 §5.12.3.) An exponent value exp scales the mantissa (integer and fractional part) by 2exp [Go 1.13].

For readability, an underscore character _ may appear after a base prefix or between successive digits; such underscores do not change the literal value.

float_lit         = decimal_float_lit | hex_float_lit .

decimal_float_lit = decimal_digits "." [ decimal_digits ] [ decimal_exponent ] |
                    decimal_digits decimal_exponent |
                    "." decimal_digits [ decimal_exponent ] .
decimal_exponent  = ( "e" | "E" ) [ "+" | "-" ] decimal_digits .

hex_float_lit     = "0" ( "x" | "X" ) hex_mantissa hex_exponent .
hex_mantissa      = [ "_" ] hex_digits "." [ hex_digits ] |
                    [ "_" ] hex_digits |
                    "." hex_digits .
hex_exponent      = ( "p" | "P" ) [ "+" | "-" ] decimal_digits .

0.
72.40
072.40       // == 72.40
2.71828
1.e+0
6.67428e-11
1E6
.25
.12345E+5
1_5.         // == 15.0
0.15e+0_2    // == 15.0

0x1p-2       // == 0.25
0x2.p10      // == 2048.0
0x1.Fp+0     // == 1.9375
0X.8p-0      // == 0.5
0X_1FFFP-16  // == 0.1249847412109375
0x15e-2      // == 0x15e - 2 (integer subtraction)

0x.p1        // invalid: mantissa has no digits
1p-2         // invalid: p exponent requires hexadecimal mantissa
0x1.5e-2     // invalid: hexadecimal mantissa requires p exponent
1_.5         // invalid: _ must separate successive digits
1._5         // invalid: _ must separate successive digits
1.5_e1       // invalid: _ must separate successive digits
1.5e_1       // invalid: _ must separate successive digits
1.5e1_       // invalid: _ must separate successive digits

Imaginary literals

An imaginary literal represents the imaginary part【虚部】 of a complex constant. It consists of an integer or floating-point literal followed by the lowercase letter i. The value of an imaginary literal is the value of the respective integer or floating-point literal multiplied by the imaginary unit i [Go 1.13]

imaginary_lit = (decimal_digits | int_lit | float_lit) "i" .

For backward compatibility, an imaginary literal’s integer part consisting entirely of decimal digits (and possibly underscores) is considered a decimal integer, even if it starts with a leading 0.

0i
0123i         // == 123i for backward-compatibility
0o123i        // == 0o123 * 1i == 83i
0xabci        // == 0xabc * 1i == 2748i
0.i
2.71828i
1.e+0i
6.67428e-11i
1E6i
.25i
.12345E+5i
0x1p-2i       // == 0x1p-2 * 1i == 0.25i

Rune literals

A rune literal represents a rune constant, an integer value identifying a Unicode code point. A rune literal is expressed as one or more characters enclosed in single quotes, as in 'x' or '\n'.

Within the quotes, any character may appear except newline and unescaped single quote. A single quoted character represents the Unicode value of the character itself, while multi-character sequences beginning with a backslash encode values in various formats.

The simplest form represents the single character within the quotes; since Go source text is Unicode characters encoded in UTF-8, multiple UTF-8-encoded bytes may represent a single integer value. For instance, the literal 'a' holds a single byte representing a literal a, Unicode U+0061, value 0x61, while 'ä' holds two bytes (0xc3 0xa4) representing a literal a-dieresis, U+00E4, value 0xe4.

Several backslash escapes allow arbitrary values to be encoded as ASCII text. There are four ways to represent the integer value as a numeric constant: \x followed by exactly two hexadecimal digits; \u followed by exactly four hexadecimal digits; \U followed by exactly eight hexadecimal digits, and a plain backslash \ followed by exactly three octal digits. In each case the value of the literal is the value represented by the digits in the corresponding base.

Although these representations all result in an integer, they have different valid ranges. Octal escapes must represent a value between 0 and 255 inclusive. Hexadecimal escapes satisfy this condition by construction. The escapes \u and \U represent Unicode code points so within them some values are illegal, in particular those above 0x10FFFF and surrogate halves.

After a backslash, certain single-character escapes represent special values:

\a   U+0007 alert or bell
\b   U+0008 backspace
\f   U+000C form feed
\n   U+000A line feed or newline
\r   U+000D carriage return
\t   U+0009 horizontal tab
\v   U+000B vertical tab
\\   U+005C backslash
\'   U+0027 single quote  (valid escape only within rune literals)
\"   U+0022 double quote  (valid escape only within string literals)

An unrecognized character following a backslash in a rune literal is illegal.

rune_lit         = "'" ( unicode_value | byte_value ) "'" .
unicode_value    = unicode_char | little_u_value | big_u_value | escaped_char .
byte_value       = octal_byte_value | hex_byte_value .
octal_byte_value = `\` octal_digit octal_digit octal_digit .
hex_byte_value   = `\` "x" hex_digit hex_digit .
little_u_value   = `\` "u" hex_digit hex_digit hex_digit hex_digit .
big_u_value      = `\` "U" hex_digit hex_digit hex_digit hex_digit
                           hex_digit hex_digit hex_digit hex_digit .
escaped_char     = `\` ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | `\` | "'" | `"` ) .
'a'
'ä'
'本'
'\t'
'\000'
'\007'
'\377'
'\x07'
'\xff'
'\u12e4'
'\U00101234'
'\''         // rune literal containing single quote character
'aa'         // illegal: too many characters
'\k'         // illegal: k is not recognized after a backslash
'\xa'        // illegal: too few hexadecimal digits
'\0'         // illegal: too few octal digits
'\400'       // illegal: octal value over 255
'\uDFFF'     // illegal: surrogate half
'\U00110000' // illegal: invalid Unicode code point

String literals

A string literal represents a string constant obtained from concatenating a sequence of characters.

There are two forms: raw string literals and interpreted string literals.

Raw string literals are character sequences between back quotes【反引号】 as in `foo`. Within the quotes, any character may appear except back quote. The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes; in particular, backslashes have no special meaning and the string may contain newlines. Carriage return characters (‘\r’) inside raw string literals are discarded from the raw string value.

Interpreted string literals are character sequences between double quotes【双引号】, as in "bar". Within the quotes, any character may appear except newline and unescaped double quote. The text between the quotes forms the value of the literal, with backslash escapes interpreted as they are in rune literals (except that \' is illegal and \" is legal), with the same restrictions. The three-digit octal (\nnn) and two-digit hexadecimal (\xnn) escapes represent individual bytes of the resulting string; all other escapes represent the (possibly multi-byte) UTF-8 encoding of individual characters. Thus inside a string literal \377 and \xFF represent a single byte of value 0xFF=255, while ÿ, \u00FF, \U000000FF and \xc3\xbf represent the two bytes 0xc3 0xbf of the UTF-8 encoding of character U+00FF.

string_lit             = raw_string_lit | interpreted_string_lit .
raw_string_lit         = "`" { unicode_char | newline } "`" .
interpreted_string_lit = `"` { unicode_value | byte_value } `"` .
`abc`                // same as "abc"
`\n
\n`                  // same as "\\n\n\\n"
"\n"
"\""                 // same as `"`
"Hello, world!\n"
"日本語"
"\u65e5本\U00008a9e"
"\xff\u00FF"
"\uD800"             // illegal: surrogate half
"\U00110000"         // illegal: invalid Unicode code point

These examples all represent the same string:

"日本語"                                 // UTF-8 input text
`日本語`                                 // UTF-8 input text as a raw literal
"\u65e5\u672c\u8a9e"                    // the explicit Unicode code points
"\U000065e5\U0000672c\U00008a9e"        // the explicit Unicode code points
"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"  // the explicit UTF-8 bytes

If the source code represents a character as two code points, such as a combining form involving an accent and a letter, the result will be an error if placed in a rune literal (it is not a single code point), and will appear as two code points if placed in a string literal.

Constants

There are boolean constants, rune constants, integer constants, floating-point constants, complex constants, and string constants.

Rune, integer, floating-point, and complex constants are collectively called numeric constants.

A constant value is represented by a rune, integer, floating-point, imaginary, or string literal, an identifier denoting a constant, a constant expression, a conversion with a result that is a constant, or the result value of some built-in functions such as min or max applied to constant arguments, unsafe.Sizeof applied to certain values, cap or len applied to some expressions, real and imag applied to a complex constant and complex applied to numeric constants.

The boolean truth values are represented by the predeclared constants true and false. The predeclared identifier iota denotes an integer constant.

In general, complex constants are a form of constant expression and are discussed in that section.

Numeric constants represent exact values of arbitrary precision and do not overflow. Consequently, there are no constants denoting the IEEE 754 negative zero, infinity, and not-a-number values.

在 Go 语言中，数值常量（Numeric Constants） 的设计非常独特，与其他编程语言（如 C、Java）中的常量或变量有本质区别。

1. 任意精度（Arbitrary Precision）

无限精度计算：
Go 的数值常量在编译时不受固定位数（如 32/64 位）限制，可以表示任意大的整数或任意小的小数，只要代码能写出来（受限于源码长度）。
精确无舍入：
计算过程中不会丢失精度（不像浮点数有 IEEE 754 舍入误差）。

2. 不会溢出（No Overflow）

自动扩展表示范围：
编译器会动态调整常量的存储方式，确保计算时永远不会溢出（直到内存耗尽）。

Constants may be typed or untyped. 【有类型的常量和无类型的常量】

Literal constants, true, false, iota, and certain constant expressions containing only untyped constant operands are untyped.

A constant may be given a type explicitly by a constant declaration or conversion, or implicitly when used in a variable declaration or an assignment statement or as an operand in an expression. It is an error if the constant value cannot be represented as a value of the respective type. If the type is a type parameter, the constant is converted into a non-constant value of the type parameter.

An untyped constant has a default type which is the type to which the constant is implicitly converted in contexts where a typed value is required, for instance, in a short variable declaration such as i := 0 where there is no explicit type. The default type of an untyped constant is bool, rune, int, float64, complex128, or string respectively, depending on whether it is a boolean, rune, integer, floating-point, complex, or string constant.

Implementation restriction: Although numeric constants have arbitrary precision in the language, a compiler may implement them using an internal representation with limited precision. That said, every implementation must:

Represent integer constants with at least 256 bits.
Represent floating-point constants, including the parts of a complex constant, with a mantissa of at least 256 bits and a signed binary exponent of at least 16 bits.
Give an error if unable to represent an integer constant precisely.
Give an error if unable to represent a floating-point or complex constant due to overflow.
Round to the nearest representable constant if unable to represent a floating-point or complex constant due to limits on precision.

These requirements apply both to literal constants and to the result of evaluating constant expressions.