create database db_name;      --create a database
drop database db_name;        --drop a database
drop table table_name;        --drop a table
show databases;               --list databases
show tables;                  --list tables
use db_name;                  --switch to a database
desc table_name;              --show a table's columns
desc formatted table_name;    --show a table's columns plus detailed format information

The CREATE TABLE statement has many options; here is the official syntax:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                   -- (Note: Available in Hive 0.10.0 and later)
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]]
  [
   [ROW FORMAT row_format]
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]    -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];    -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

Complex data types:

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type       -- (Note: Available in Hive 0.7.0 and later)

Primitive data types:

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION           -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY                     -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP                  -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL                    -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE                       -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR                    -- (Note: Available in Hive 0.12.0 and later)
  | CHAR                       -- (Note: Available in Hive 0.13.0 and later)
union_type
  : UNIONTYPE < data_type, data_type, ... >   -- (Note: Available in Hive 0.7.0 and later)

Row format specification:

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
        [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char]
        [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

File formats:

file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

Table constraints:

constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE]

Hive partition management syntax:

alter table table_name add partition(col_name=col_value);
alter table table_name drop partition(col_name=col_value);

Note: for a table with multiple (nested) partition columns, ADD PARTITION must supply a value for every partition column.
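Pulling these options together, here is a hedged sketch (the table name psn_demo, its columns, and the delimiters are illustrative, not from the original) of a two-level partitioned table that uses complex types and a delimited row format, followed by the partition management statements above:

create table if not exists psn_demo (
  id int,
  name string,
  likes array<string>,
  address map<string, string>
)
partitioned by (age int, sex string)
row format delimited
  fields terminated by ','
  collection items terminated by '-'
  map keys terminated by ':'
stored as textfile;

--a table with nested partition columns needs a value for every partition column when adding
alter table psn_demo add partition(age=20, sex='man');
alter table psn_demo drop partition(age=20, sex='man');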
Hive SQL: DML statements
Data can be inserted in several ways:
Loading data files (LOAD DATA)
--load local data into a Hive table ('/root/data/data' is a local Linux path)
load data local inpath '/root/data/data' into table psn;
--load an HDFS data file into a Hive table ('/data/data' is an HDFS path)
load data inpath '/data/data' into table psn;
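If the target table is partitioned, LOAD DATA also needs an explicit PARTITION clause; a hedged sketch (psn_part and its partition column are illustrative names):

--load a local file into one static partition, replacing that partition's existing data
load data local inpath '/root/data/data' overwrite into table psn_part partition(age=20);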
Inserting data with INSERT ... SELECT
--insert the result of a query into a target table
INSERT OVERWRITE TABLE psn9 SELECT id, name FROM psn;
--read a source table once and insert selected columns into several target tables (multi-insert)
from psn
insert overwrite table psn9 select id, name
insert into table psn10 select id;
Row format / SerDe specification (used below to parse raw log lines with a regex SerDe):

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
        [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char]
        [LINES TERMINATED BY char]
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
Example:
--sample log line to be parsed: 192.168.57.4 - - [29/Feb/2019:18:14:35 +0800] "... HTTP/1.1" 304 -
--create the table, using a regex SerDe to split each log line into columns
CREATE TABLE logtbl (
  host STRING,
  identity STRING,
  t_user STRING,
  time STRING,
  request STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE;
--load the data
load data local inpath '/root/data/log' into table logtbl;
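A quick sanity check, assuming the table and data above are in place; the regex SerDe is applied at read time, so each captured group of input.regex should come back as its own column, in order:

--each capture group maps to one column of logtbl
select host, request, agent from logtbl limit 5;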
Hive dynamic partitioning and bucketing
To enable dynamic partitioning, first apply the following settings:
--enable dynamic partitioning (default: true)
set hive.exec.dynamic.partition=true;
--dynamic partition mode (default: strict, which requires at least one static partition column)
set hive.exec.dynamic.partition.mode=nonstrict;
--maximum number of dynamic partitions each MR node may create (default: 100)
set hive.exec.max.dynamic.partitions.pernode;
--maximum number of dynamic partitions all MR nodes together may create (default: 1000)
set hive.exec.max.dynamic.partitions;
--maximum number of files all MR jobs may create (default: 100000)
set hive.exec.max.created.files;
Dynamic partition insert syntax
--Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
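A hedged end-to-end sketch of a dynamic partition insert (psn21, psn22 and their columns are illustrative; assume psn21 already holds the data). The partition column values are taken from the trailing columns of the SELECT, in the order the partitions are declared:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--psn22 is partitioned by (age, sex); age and sex must be the last columns of the select
insert overwrite table psn22 partition(age, sex)
select id, name, age, sex from psn21;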
Hive bucketing
Bucketing hashes a column's values and stores different rows in different files.
Every table and every partition in Hive can be further divided into buckets.
Which bucket a row lands in is determined by the hash of the column value modulo the number of buckets.
Bucketing configuration
--make Hive enforce bucketing on insert
set hive.enforce.bucketing=true;
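A hedged sketch of creating and populating a bucketed table (psn_bucket and psn31 are illustrative names). Rows are routed to one of the 4 bucket files by hash(id) % 4, which is also what the sampling queries below rely on:

set hive.enforce.bucketing=true;
--4 buckets, hashed on id
create table psn_bucket (id int, name string, age int)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';
--bucketed tables should be populated with insert ... select, not load data,
--because load data only copies files and does not hash rows into buckets
insert into table psn_bucket select id, name, age from psn31;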
Sampling queries on bucketed tables
select * from bucket_table tablesample(bucket 1 out of 4 on col_name);
--TABLESAMPLE syntax: TABLESAMPLE(BUCKET x OUT OF y)
--x: the bucket to start sampling from
--y: must be a multiple or a factor of the table's total number of buckets
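A worked example of how x and y interact, assuming a table with 4 buckets such as the psn_bucket sketch above:

--y=2 is a factor of 4: each sample covers 4/2 = 2 buckets, starting at bucket x
--and taking every y-th bucket after it, so this reads buckets 2 and 4
select * from psn_bucket tablesample(bucket 2 out of 2 on id);
--y=8 is a multiple of 4: each sample covers 4/8 = 1/2 of a bucket, so this reads half of bucket 1
select * from psn_bucket tablesample(bucket 1 out of 8 on id);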
Hive indexes

--create an index
create index t1_index on table psn2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild
in table t1_index_table;
--as: specifies the index handler
--in table: specifies the index table; if omitted, the index is stored in default__psn2_t1_index__
create index t1_index on table psn2(name)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild;
--show indexes
show index on psn2;
--rebuild the index (an index must be rebuilt after creation before it takes effect)
ALTER INDEX t1_index ON psn2 REBUILD;
--drop the index
DROP INDEX IF EXISTS t1_index ON psn2;