[Juicy-Bigdata] Hive Basic Operations

余生 · 2022-11-15 · 2025-01-08

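Introduction to Hive

Data storage in Hive

Hive stores its data on Hadoop's HDFS.
Hive has no dedicated storage format of its own; it supports TextFile, SequenceFile, and RCFile.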
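
As a rough illustration of the point above, the storage format is chosen per table with the stored as clause. The table names below (logs_text, logs_seq, logs_rc) are hypothetical and only serve to show the three formats side by side.

```sql
-- Hypothetical tables: same columns, different on-disk formats.
create table if not exists logs_text (id int, msg string) stored as textfile;
create table if not exists logs_seq  (id int, msg string) stored as sequencefile;
create table if not exists logs_rc   (id int, msg string) stored as rcfile;
```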

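Hive architecture

Database vs. data warehouse

Database: basic transaction processing (insert, delete, update, query).

Data warehouse: focused on querying.

OLTP: operational processing, known as online transaction processing. It is concerned with response time, data security, and data integrity.

OLAP: analytical processing, known as online analytical processing. It analyzes historical data organized by subject in order to support management decisions.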

HIVE SQL

DDL (Data Definition Language)

Database operations

Creating a database

create database if not exists myhive;
-- Alternatively, specify the storage location
create database myhive location '/myhive';

Modifying a database

alter database myhive set dbproperties('createtime'='20220202');

Viewing database details

desc database [extended] myhive;

Dropping a database

drop database myhive;
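
A plain drop database only succeeds on an empty database. A minimal sketch of the two variants, assuming myhive may still contain tables:

```sql
-- Fails if myhive still contains tables (restrict is the default behaviour).
drop database if exists myhive;

-- Drops the database together with every table it contains.
drop database if exists myhive cascade;
```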

Table operations

Internal (managed) tables

1.1 Creating a table
use myhive;
create table stu(id int, name string);
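Data types:
- Boolean
- tinyint: 1 byte
- smallint: 2 bytes
- int: 4 bytes
- bigint: 8 bytes
- float: 4 bytes
- double: 8 bytes
- decimal: arbitrary-precision decimal; decimal(11,2) allows at most 11 digits in total, 2 of them after the decimal point, leaving 9 for the integer part
- string
- varchar
- char: fixed length
- binary: byte array
- timestamp
- date
- interval
- Array
- Map
- Struct
- UNION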
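
To make the type list more concrete, here is a sketch of a table that mixes primitive and complex types. The table employee and all of its columns are hypothetical; the delimiters are just one reasonable choice for a text file.

```sql
-- Hypothetical table mixing primitive and complex types.
create table if not exists employee (
    id      int,
    name    string,
    salary  decimal(11,2),   -- up to 9 integer digits and 2 decimal places
    skills  array<string>,
    scores  map<string,int>,
    address struct<city:string, zip:string>
)
row format delimited
    fields terminated by '\t'
    collection items terminated by ','
    map keys terminated by ':';

-- Array elements are read by 0-based index, map values by key, struct fields with a dot.
select name, skills[0], scores['math'], address.city from employee;
```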
1.2 Creating a table with a specified field delimiter
create table if not exists stu(id int, name string) row format delimited fields terminated by '\t' stored as textfile location '/stu';
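1.3 Creating a table from query results
create table stu2 as select * from stu;
1.4 Creating a table from an existing table's structure
create table stu3 like stu2;
1.5 Showing the statement that created a table
show create table stu;

External tables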

-- Create the table
create external table stu(s_id string) row format delimited fields terminated by '\t' [location '/stu'];

-- Load data
-- [local] reads from the local file system, otherwise the path is on HDFS; [overwrite] replaces the existing data
load data [local] inpath '/export/stu.csv' [overwrite] into table stu;
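
The practical difference from an internal table shows up in the metadata and on drop. A short sketch, assuming the external table stu created above:

```sql
-- The "Table Type" field reads EXTERNAL_TABLE for an external table
-- and MANAGED_TABLE for an internal one.
describe formatted stu;

-- Dropping an external table removes only the metadata;
-- the files under its location stay on HDFS.
drop table stu;
```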

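Partitioned tables

-- Single partition column
create table score(s_id string) partitioned by (month string);
-- Alternative definition with several partition columns
create table score(s_id string) partitioned by (year string, month string, day string);

-- Load data into one partition of the table
load data local inpath '' into table score partition(year='2022',month='02',day='31');

-- List partitions
show partitions score;

-- Add partitions (several partition clauses can be given in one statement; shown for the table partitioned by month only)
alter table score add partition(month='201803') partition(month='201804');

-- Drop a partition
alter table score drop partition(month='202204');

When a table is partitioned, for example with partitioned by (day string), each subdirectory under the table's directory is one partition, and the subdirectory is named in the form day=20201123.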
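
Because each partition is its own directory, a filter on the partition column lets Hive skip the others. A minimal sketch, assuming score is the variant partitioned by (month string):

```sql
-- Only the month=201803 directory under the table's location needs to be read.
select s_id from score where month = '201803';
```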

Bucketed tables

Partitioning splits data into directories.

Bucketing splits data into files.

create table course (c_id string,c_name string) clustered by(c_id) into 3 buckets;
insert overwrite table course select * from course_common cluster by(c_id);
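
One use of buckets is sampling. The query below is a sketch that reads a single bucket of the course table defined above:

```sql
-- Read only the first of the 3 buckets, selected by hashing c_id.
select * from course tablesample(bucket 1 out of 3 on c_id);
```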

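Dropping and truncating tables

drop table score;
truncate table score;
-- After drop, the data can be recovered from the trash (when the HDFS trash is enabled), but the table structure cannot. truncate does not go through the trash, and it cannot be used to empty an external table.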

DQL (Data Query Language)

Functions

Rows to columns

-- concat_ws  collect_list  collect_set
select name, concat_ws(',',collect_list(favor))
from student_favors group by name;
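
collect_list keeps duplicate values, while collect_set removes them; neither guarantees the order of the elements. A possible variant of the query above that deduplicates and sorts the values (the alias favors is only illustrative):

```sql
-- collect_set drops duplicate values; sort_array makes the order deterministic.
select name, concat_ws(',', sort_array(collect_set(favor))) as favors
from student_favors
group by name;
```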

Columns to rows

-- split: splits a string into an array
-- explode: expands an array or map into multiple rows
-- lateral view: builds a "virtual table" from those rows
select name, favor_new from student_favors_2 lateral view
    explode(split(favors_list,',')) table1 as favor_new;
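
To see what explode does in isolation, it can be run on a literal. The second query is a sketch of lateral view outer, which may be useful when favors_list can be null; the alias item is only illustrative.

```sql
-- explode on its own: one input row becomes three output rows: a, b, c.
select explode(split('a,b,c', ',')) as item;

-- lateral view outer keeps rows whose favors_list is null,
-- returning null for favor_new instead of dropping the row.
select name, favor_new
from student_favors_2
lateral view outer explode(split(favors_list, ',')) table1 as favor_new;
```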

Window and analytic functions

-- partition by, order by
-- over (partition by ** order by **)
-- rows between ... and ... (preceding, following, current row, unbounded preceding, unbounded following)
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and
1 following) as pv5
from test_t1;
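
Two other common frames, sketched against the same test_t1 table; the aliases pv_running and pv_total are only illustrative:

```sql
-- Running total: all rows from the start of the cookieid partition up to the current row.
select cookieid, createtime, pv,
       sum(pv) over (partition by cookieid order by createtime
                     rows between unbounded preceding and current row) as pv_running
from test_t1;

-- Partition total: with no order by and no frame, every row of the cookieid is summed.
select cookieid, createtime, pv,
       sum(pv) over (partition by cookieid) as pv_total
from test_t1;
```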

Ranking

SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM test_t2;
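-- row_number(): numbers the rows sequentially in sort order
-- rank(): tied rows share a rank, and the following ranks are skipped
-- dense_rank(): tied rows share a rank, with no gaps afterwards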
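
The three functions differ only in how they treat ties, which is easiest to see side by side. The second query sketches a common use, top-N per group; the aliases rk, drk, and t are only illustrative.

```sql
-- For pv values 7, 7, 5 within one cookieid (descending order):
--   row_number gives 1, 2, 3
--   rank       gives 1, 1, 3
--   dense_rank gives 1, 1, 2
select cookieid, createtime, pv,
       row_number() over (partition by cookieid order by pv desc) as rn,
       rank()       over (partition by cookieid order by pv desc) as rk,
       dense_rank() over (partition by cookieid order by pv desc) as drk
from test_t2;

-- Top 3 rows per cookieid: the window function is computed in a subquery
-- so that its result can be filtered on.
select cookieid, createtime, pv
from (
    select cookieid, createtime, pv,
           row_number() over (partition by cookieid order by pv desc) as rn
    from test_t2
) t
where rn <= 3;
```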

Hive execution plans

-- query stands for the statement to be explained
explain query;
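
A concrete sketch, assuming the score table from the partitioning section:

```sql
-- Print the plan (stage dependencies and operator tree) without running the query.
explain
select s_id, count(*) from score group by s_id;

-- explain extended adds more detail to the plan.
explain extended
select s_id, count(*) from score group by s_id;
```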

explain dependency

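Describes the data sources that a SQL statement needs.

Use cases: quickly ruling out anomalies; clarifying which tables a query reads.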
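
A minimal sketch, again assuming score is partitioned by month:

```sql
-- The result is JSON listing the tables and partitions the query reads
-- (input_tables and input_partitions).
explain dependency
select s_id from score where month = '201803';
```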

explain authorization

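Shows the data sources the current SQL statement reads (INPUTS) and the outputs it writes (OUTPUTS), along with the current Hive user (CURRENT_USER) and the operation (OPERATION).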
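
It is invoked the same way as the other explain variants; the query below is only a placeholder that reuses the score table:

```sql
-- Lists INPUTS, OUTPUTS, CURRENT_USER and OPERATION for the statement.
explain authorization
select s_id from score where month = '201803';
```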