魏长东

weichangdong

hive中的一些不常见函数的用法

常见的函数就不废话了,和标准sql类似,下面我们要聊到的基本是HQL里面专有的函数,

hive里面的函数大致分为如下几种:Built-in、Misc.、UDF、UDTF、UDAF

我们就挑几个标准SQL里没有,但是在HIVE SQL在做统计分析常用到的来说吧。

1、array_contains (Collection Functions)

这是内置的对集合进行操作的函数,用法举例:

create EXTERNAL table IF NOT EXISTS userInfo (id int,sex string, age int, name string, email string,sd string, ed string)  ROW FORMAT DELIMITED FIELDS 
TERMINATED BY '\t' location '/hive/dw';
 
select * from userinfo where sex='male' and (id!=1 and id !=2 and id!=3 and id!=4 and id!=5) and age < 30;
select * from (select * from userinfo where sex='male' and !array_contains(split('1,2,3,4,5',','),cast(id as string))) tb1 where tb1.age < 30;

 

2、get_json_object (Misc. Functions)

测试数据:

first {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} third
first {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":91,"type":"pear"}],"bicycle":{"price":19.952,"color":"red2"}},"email":"amy@only_for_json_udf_test.net","owner":"amy2"} third
first {"store":{"fruit":[{"weight":10,"type":"apple"},{"weight":911,"type":"pear"}],"bicycle":{"price":19.953,"color":"red3"}},"email":"amy@only_for_json_udf_test.net","owner":"amy3"} third

create external table if not exists t_json(f1 string, f2 string, f3 string) row format delimited fields TERMINATED BY ' ' location '/test/json'
select get_json_object(t_json.f2, '$.owner') from t_json;
SELECT * from t_json where get_json_object(t_json.f2, '$.store.fruit[0].weight') = 9;
SELECT get_json_object(t_json.f2, '$.non_exist_key') FROM t_json;

 

##########查询json
select  get_json_object(json_data, '$.root_path'), get_json_object(json_data, '$.country'), get_json_object(json_data, '$.count')  from  wcd_0316_cloudpkg;
 
【ok】 
select  b.* from wcd_0316_cloudpkg  lateral view json_tuple(json_data,'root_path','country','count') b as f1,f2,f3;
【ok】 
select  b.* from wcd_0316_cloudpkg a lateral view json_tuple(json_data,'root_path','country','count') b as f1,f2,f3;

 

3、parse_url_tuple

 

测试数据:

url1 http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
url3 https://www.google.com.hk/#hl=zh-CN&newwindow=1&safe=strict&q=hive+translate+example&oq=hive+translate+example&gs_l=serp.3...10174.11861.6.12051.8.8.0.0.0.0.132.883.0j7.7.0...0.0...1c.1j4.8.serp.0B9C1T_n0Hs&bav=on.2,or.&bvm=bv.44770516,d.aGc&fp=e13e41a6b9dab3f6&biw=1241&bih=589

create external table if not exists t_url(f1 string, f2 string) row format delimited fields TERMINATED BY ' ' location '/test/url';
SELECT f1, b.* FROM t_url LATERAL VIEW parse_url_tuple(f2, 'HOST', 'PATH', 'QUERY', 'QUERY:k1') b as host, path, query, query_id;

4、explode

explode 是一个 hive 内置的表生成函数:Built-in Table-Generating Functions (UDTF),主要是解决 1 to N 的问题,即它可以把一行输入拆成多行,比如一个 array 的每个元素拆成一行,作为一个虚表输出。它有如下需要注意的地方:

Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
No other expressions are allowed in SELECT
SELECT pageid, explode(adid_list) AS myCol... is not supported
UDTF's can't be nested
SELECT explode(explode(adid_list)) AS myCol... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported

从上面的原理与语法上可知,

  • select 列中不能 udtf 和其它非 udtf 列混用,

  • udtf 不能嵌套,

  • 不支持 GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY

  • 还有 select 中出现的 udtf 一定需要列别名,否则会报错:

  •  

  • SELECT explode(myCol) AS myNewCol FROM myTable;
    SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
    SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;
    
    

wcd

ateral view 是Hive中提供给UDTF的conjunction,它可以解决UDTF不能添加额外的select列的问题。当我们想对hive表中某一列进行split之后,想对其转换成1 to N的模式,即一行转多列。hive不允许我们在UDTF函数之外,再添加其它select语句。

如下,我们想将登录某个游戏的用户id放在一个字段user_ids里,对每一行数据用UDTF后输出多行。

select game_id, explode(split(user_ids,'\\[\\[\\[')) as user_id   from login_game_log  where dt='2014-05-15' ;
FAILED: Error in semantic analysis: UDTF's are not supported outside the SELECT clause, nor nested in expressions。

提示语法分析错误,UDTF不支持函数之外的select 语句,如果我们想支持怎么办呢?接下来就是Lateral View 登场的时候了。

Lateral view 其实就是用来和像类似explode这种UDTF函数联用的。lateral view 会将UDTF生成的结果放到一个虚拟表中,然后这个虚拟表(1 to N)会和输入行即每个game_id进行join 来达到连接UDTF外的select字段的目的(源表和拆分的虚表按行做行内 1 join N 的直接连接),这也是为什么 LATERAL VIEW udtf(expression) 后面需要表别名和列别名的原因。

Lateral View Syntax

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*

fromClause: FROM baseTable (lateralView)*

可以看出,可以在2个地方用Lateral view:

  • 在udtf前面用

  • 在from baseTable后面用

  •  

wcd

给定数据:

Array<int> col1     Array<string> col2

[1, 2]                       [a", "b", "c"]

[3, 4]                       [d", "e", "f"]

转换目标:

想同时把第一列和第二列拆开,类似做笛卡尔乘积。

我们可以这样写:

SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;

有一种情况,如果UDTF转换的Array是空的怎么办呢?

在Hive0.12里面会支持outer关键字,如果UDTF的结果是空,默认会被忽略输出。

如果加上outer关键字,则会像left outer join 一样,还是会输出select出的列,而UDTF的输出结果是NULL。

总结:

  • Lateral View通常和UDTF一起出现,为了解决UDTF不允许在select字段的问题。

  • Multiple Lateral View可以实现类似笛卡尔乘积。

  • Outer关键字可以把不输出的UDTF的空结果,输出成NULL,防止丢失数据。