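Contents

Hive Data Types and File Formats
- 1 Primitive Data Types
- 2 Collection Data Types
  - 2.1 STRUCT Example
  - 2.2 ARRAY Example
  - 2.3 MAP Example
- 3 Data Type Conversion
  - 3.1 Implicit Conversion
  - 3.2 Explicit Conversion
- 4 Text File Data Encoding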

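Hive Data Types and File Formats

Hive supports most of the primitive data types found in relational databases, and it also supports three collection data types.

1 Primitive Data Types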

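Hive data type	Java data type	Length	Example
TINYINT	byte	1-byte signed integer	20
SMALLINT	short	2-byte signed integer	20
INT	int	4-byte signed integer	20
BIGINT	long	8-byte signed integer	20
BOOLEAN	boolean	Boolean, true or false	TRUE, FALSE
FLOAT	float	Single-precision floating point	3.14159
DOUBLE	double	Double-precision floating point	3.14159
STRING	String	Sequence of characters. A character set can be specified. Single or double quotes can be used.	'now is the time', "for all good men"
TIMESTAMP		Time type
BINARY		Byte array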

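Hive's STRING type is comparable to VARCHAR in a relational database: it is a variable-length string, but it does not declare a maximum number of characters. In theory it can hold up to 2 GB of character data.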
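As a minimal sketch of how the primitive types in the table above appear in a table definition, the statement below declares one column per type. The table name test_primitive and all of its column names are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table: one column for each primitive type listed above.
create table test_primitive
(
    c_tinyint   tinyint,
    c_smallint  smallint,
    c_int       int,
    c_bigint    bigint,
    c_boolean   boolean,
    c_float     float,
    c_double    double,
    c_string    string,     -- no maximum length is declared
    c_timestamp timestamp,
    c_binary    binary
);

-- Show the column names and types that Hive recorded for the table.
describe test_primitive;
```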

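2 Collection Data Types

Hive columns support the collection data types STRUCT, MAP, and ARRAY.

Data type	Description	Syntax example
STRUCT	Similar to a struct in C: fields are accessed with dot notation. For example, if a column is of type STRUCT<first:STRING, last:STRING>, the first field is referenced as column.first.	struct('John', 'Doe'), struct<street:string, city:string>
MAP	A collection of key-value pairs whose elements are accessed by key. For example, if a column of type MAP holds the key-value pairs 'first'->'John' and 'last'->'Doe', then column['last'] returns the value 'Doe'.	map<string, int>
ARRAY	An ordered collection of elements that all have the same data type and are accessed by index. For example, if an ARRAY variable fruits holds ['apple', 'orange', 'mango'], then fruits[1] returns orange, because ARRAY indexes start at 0.	array('John', 'Doe'), array<string>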

ARRAY and MAP are similar to arrays and maps in Java, while STRUCT is similar to a struct in C: it encapsulates a set of named fields. Complex data types can be nested to any depth.
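The nesting mentioned above can be sketched as follows. The table test_nested and its columns are hypothetical; they are not part of the original examples and only illustrate how nested types are declared and accessed.

```sql
-- Hypothetical table showing nested collection types.
create table test_nested
(
    stuid    int,
    contacts map<string, array<string>>,                 -- MAP whose values are ARRAYs
    profile  struct<name:string, hobbys:array<string>>   -- STRUCT that contains an ARRAY
);

-- Accessors chain from the outside in: key then index, field then index.
select contacts['home'][0], profile.hobbys[0]
from test_nested;
```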

2.1 STRUCT Example

(1) Suppose we have the following two records. For readability, their structure is shown in JSON:

[
{
	"stuid": 1,
	"stuname":"alan",
	"score":{
		"math":98,
		"computer":89
	}
},
{
	"stuid": 2,
	"stuname":"john",
	"score":{
		"math":95,
		"computer":97
	}
}
]

(2) In the directory /root/data, create a local test file named struct.txt containing the following data:

1,alan,98_89
2,john,95_97

(3) Create the test table test_struct in Hive:

create table test_struct
(
    stuid   int,
    stuname string,
    score   struct<math:int,computer:int>
)
    row format delimited fields terminated by ','
        collection items terminated by '_'
        lines terminated by '\n';

Clause explanation:

row format delimited fields terminated by ',' -- column (field) delimiter
collection items terminated by '_' -- delimiter between the elements of a MAP, STRUCT, or ARRAY
lines terminated by '\n'; -- row (line) delimiter

(4) Next, load the text data from struct.txt into the test table test_struct:

load data local inpath '/root/data/struct.txt' into table test_struct;

(5) Query the data in test_struct:

select * from test_struct;

(6) Access the fields inside the struct:

select stuname,score.math,score.computer from test_struct;
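Struct fields are not limited to the select list. The sketch below, which assumes the test_struct table and the two sample rows loaded above, filters and sorts on a struct field; the alias math is introduced here only for illustration.

```sql
-- Filter on a struct field, then sort by the aliased field.
select stuname, score.math as math
from test_struct
where score.math > 95
order by math desc;
```

With the two sample rows, only alan (math 98) passes the filter, because john's math score is exactly 95.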

2.2 ARRAY Example

(1) Suppose we have the following two records. For readability, their structure is shown in JSON:

[
{
	"stuid": 1,
	"stuname":"alan",
	"hobbys":["music","sports"]
},
{
	"stuid": 2,
	"stuname":'john',
	"hobbys":["music","travel"]
}
]

(2) In the directory /root/data, create a local test file named array.txt containing the following data:

1,alan,music_sports
2,john,music_travel

(3) Create the test table test_array in Hive:

create table test_array
(
    stuid   int,
    stuname string,
    hobbys  array<string>
)
    row format delimited fields terminated by ','
        collection items terminated by '_'
        lines terminated by '\n';

(4) Next, load the text data from array.txt into the test table test_array:

load data local inpath '/root/data/array.txt' into table test_array;

(5) Query the data in test_array:

select * from test_array;

(6) Access the elements of the array (the first statement turns on column headers in the CLI output):

set hive.cli.print.header=true;
select stuname,hobbys[0] from test_array;
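Besides indexing, Hive has built-in functions that operate on a whole array. The following sketch assumes the test_array table created above and uses size(), array_contains(), and explode() with LATERAL VIEW; the aliases hobby_count, likes_travel, h, and hobby are chosen here for illustration.

```sql
-- size() returns the number of elements; array_contains() tests membership.
select stuname,
       size(hobbys)                     as hobby_count,
       array_contains(hobbys, 'travel') as likes_travel
from test_array;

-- explode() with LATERAL VIEW turns each array element into its own row.
select stuname, hobby
from test_array lateral view explode(hobbys) h as hobby;
```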

2.3 MAP Example

(1) Suppose we have the following two records. For readability, their structure is shown in JSON:

[
{
	"stuid": 1,
	"stuname":"alan",
	"score":{
		"math":98,
		"computer":89
	}
},
{
	"stuid": 2,
	"stuname":"john",
	"score":{
		"math":95,
		"computer":97
	}
}
]

(2) In the directory /root/data, create a local test file named map.txt containing the following data:

1,alan,math:98_computer:89
2,john,math:95_computer:97

(3) Create the test table test_map in Hive:

create table test_map
(
    stuid   int,
    stuname string,
    score   map<string,int>
)
    row format delimited fields terminated by ','
        collection items terminated by '_'
        map keys terminated by ':'
        lines terminated by '\n';

Clause explanation:

row format delimited fields terminated by ',' -- column (field) delimiter
collection items terminated by '_' -- delimiter between the elements of a MAP, STRUCT, or ARRAY
map keys terminated by ':' -- delimiter between a key and its value in a MAP
lines terminated by '\n'; -- row (line) delimiter

(4) Next, load the text data from map.txt into the test table test_map:

load data local inpath '/root/data/map.txt' into table test_map;

(5) Query the data in test_map:

set hive.cli.print.header=true;
select * from test_map;

(6) Access the values in the map by key:

select stuname,score['math'] as math,score['computer'] as computer from test_map;
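A few more map operations can be sketched against the same test_map table. The functions map_keys(), map_values(), size(), and explode() are Hive built-ins; the aliases and the key 'english' are hypothetical and used only to show what happens when a key is absent.

```sql
-- List the keys and values of each map and count its entries.
select stuname,
       map_keys(score)   as subjects,
       map_values(score) as marks,
       size(score)       as subject_count
from test_map;

-- Looking up a key that is not in the map yields NULL rather than an error.
select stuname, score['english'] as english
from test_map;

-- explode() on a MAP produces one row per entry, with a key column and a value column.
select stuname, subject, mark
from test_map lateral view explode(score) s as subject, mark;
```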

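3 Data Type Conversion

Hive's primitive data types can be converted implicitly, much like type conversion in Java. The rule is that conversion goes from a type with a narrower range to one with a wider range, or from lower precision to higher precision, so that no data or precision is lost. For example, where an expression expects BIGINT, an INT value is converted to BIGINT automatically. Hive does not convert in the opposite direction: where an expression expects INT, a BIGINT value is not converted to INT automatically, and an error is returned unless CAST is used.

3.1 Implicit Conversion

(1) Any integer type can be implicitly converted to a wider type. For example, TINYINT can be converted to INT, and INT can be converted to BIGINT.

(2) All integer types, FLOAT, and STRING can be implicitly converted to DOUBLE.

(3) TINYINT, SMALLINT, and INT can all be converted to FLOAT.

(4) BOOLEAN cannot be converted to any other type.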
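The widening rules above can be tried with a few expressions that mix operand types. This is only a sketch: the literal values are arbitrary, each operand is cast explicitly so that its starting type is unambiguous, and a Hive version that allows SELECT without a FROM clause is assumed.

```sql
-- TINYINT combined with INT: the TINYINT operand is widened to INT.
select cast(1 as tinyint) + cast(2 as int);

-- INT combined with BIGINT: the INT operand is widened to BIGINT.
select cast(1 as int) + cast(2 as bigint);

-- INT combined with DOUBLE: the INT operand is widened to DOUBLE.
select cast(1 as int) + cast(0.5 as double);
```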

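3.2 Explicit Conversion

The CAST operation performs explicit type conversion. For example, CAST('1' AS INT) converts the string '1' to the integer 1. If the conversion fails, as in CAST('X' AS INT), the expression returns NULL.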

select '2'+3,cast('2' as int)+1;
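The cases described in this section can be written out as separate statements. This is a small sketch that reuses the values from the text; the BIGINT value 20 in the last statement is arbitrary.

```sql
-- A string that holds a valid integer converts to that integer.
select cast('1' as int);

-- A string that is not a number cannot be converted; as described above, the result is NULL.
select cast('X' as int);

-- Narrowing from BIGINT to INT is never implicit, so it has to be requested with CAST.
select cast(cast(20 as bigint) as int);
```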

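4 Text File Data Encoding

Hive often stores data in uncompressed text files. To keep fields correctly separated, the choice of delimiter matters: a chosen delimiter must not appear in the data itself. By default Hive uses several control characters that rarely appear in field values.

Delimiter	Description
\n	In a text file each line is one record, so the newline character separates records
^A (Ctrl+V, then Ctrl+A)	Separates fields (columns). Can be written as the octal escape \001 in a CREATE TABLE statement
^B (Ctrl+V, then Ctrl+B)	Separates the elements of an ARRAY or STRUCT, or the key-value pairs of a MAP. Can be written as the octal escape \002 in a CREATE TABLE statement
^C (Ctrl+V, then Ctrl+C)	Separates a key from its value in a MAP. Can be written as the octal escape \003 in a CREATE TABLE statement

Below is an employees table: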

CREATE TABLE employees
(
    name         STRING,
    salary       FLOAT,
    subordinates ARRAY<STRING>,
    deductions   MAP<STRING,FLOAT>,
    address      STRUCT<street:STRING,city:STRING,state:STRING,zip:INT>
);

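Here the subordinates column is an array of strings, and the deductions column is a map of key-value pairs that records each deduction. Finally, each employee's home address is stored as a struct. The first record of the employees table looks like the line below, which uses the delimiters from the table above.

John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600

When the file is opened in vi, the delimiters are displayed as ^A, ^B, and ^C.

Note: "^A" is not typed as the two characters ^ and A. In insert mode it is entered by pressing Ctrl+V and then Ctrl+A. Likewise, pressing Ctrl+V and then Ctrl+B enters the non-printing character "^B".

The record above is clearly hard to read. Converted to the more readable JSON format, it looks like this: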

{
	"name": "John Doe",
	"salary": 100000.0,
	"subordinates": ["Mary Smith", "Todd Jones"],
	"deductions": {
		"Federal Taxes": 0.2,
		"State Taxes": 0.05,
		"Insurance": 0.1
	},
	"address": {
		"street": "1 Michigan Ave.",
		"city": "Chicago",
		"state": "IL",
		"zip": 60600
	}
}

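Users do not have to use these default delimiters and can specify others instead. The following CREATE TABLE statement specifies the delimiters explicitly: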

CREATE TABLE employees
(
    name         STRING,
    salary       FLOAT,
    subordinates ARRAY<STRING>,
    deductions   MAP<STRING,FLOAT>,
    address      STRUCT<street:STRING,city:STRING,state:STRING,zip:INT>
)
    row format delimited fields terminated by '\001'
        collection items terminated by '\002'
        map keys terminated by '\003'
        lines terminated by '\n'
    stored as textfile;

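At present the line delimiter can only be '\n'; no other character is supported. The stored as textfile clause can be omitted, because textfile is the default format. Load the data entered above into the employees table: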

load data local inpath '/root/data/employees.txt' into table employees;

View the data in the employees table:

select * from employees;
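To pull individual values out of the three collection columns rather than whole rows, the accessors from section 2 can be combined in one query. This sketch assumes the employees table and data loaded above; the column aliases are hypothetical.

```sql
-- One accessor per collection type: array index, map key, struct field.
select name,
       subordinates[0]           as first_subordinate,
       deductions['State Taxes'] as state_tax,
       address.city              as city
from employees;
```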