Wednesday, July 8, 2015

Issue: The Owner of Hive Dirs/Files Created on HDFS Is Not the User Executing the Hive Command (Hive User Impersonation)

When we execute the hive command as some user 'A' from one of our gateways and then perform operations that create files/dirs under the Hive warehouse on HDFS, the owner of the newly created files/dirs is always 'supertool', the user that started the hiveserver2 (metastore) process, regardless of who user 'A' is:
###-- hiveserver2(metastore) belongs to user 'supertool' --
K1201:~>ps aux | grep -v grep | grep metastore.HiveMetaStore --color
500      30320  0.0  0.5 1209800 263548 ?      Sl   Jan28  59:29 /usr/java/jdk1.7.0_11//bin/java -Xmx10000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/workspace/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/workspace/hadoop -Dhadoop.id.str=supertool -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/workspace/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /home/workspace/hive-0.13.0-bin/lib/hive-service-0.13.0.jar org.apache.hadoop.hive.metastore.HiveMetaStore
K1201:~>cat /etc/passwd | grep 500
supertool:x:500:500:supertool:/home/supertool:/bin/bash

###-- invoke hive command in user 'withdata' and create a database and table --
114:~>whoami
withdata
114:~>hive
hive> create database test_db;
OK
Time taken: 1.295 seconds
hive> use test_db;
OK
Time taken: 0.031 seconds
hive> create table test_tbl(id int);
OK
Time taken: 0.864 seconds

###-- the newly-created database and table belongs to user 'supertool' --
114:~>hadoop fs -ls /user/supertool/hive/warehouse | grep test_db
drwxrwxr-x   - supertool    supertool             0 2015-07-08 15:13 /user/supertool/hive/warehouse/test_db.db
114:~>hadoop fs -ls /user/supertool/hive/warehouse/test_db.db
Found 1 items
drwxrwxr-x   - supertool supertool          0 2015-07-08 15:13 /user/supertool/hive/warehouse/test_db.db/test_tbl

This can be explained by Hive user impersonation. By default, HiveServer2 performs query processing as the user who submitted the query. However, if the following parameters are misconfigured, queries run as the user that the hiveserver2 process itself runs as. The correct configuration is as below:
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
  <description>Set this property to enable impersonation in Hive Server 2</description>
</property>
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
  <description>Set this property to enable Hive Metastore service impersonation in unsecure mode. In unsecure mode, setting this property to true will cause the metastore to execute DFS operations using the client's reported user and group permissions. Note that this property must be set on both the client and server sides. If the client sets it to true and the server sets it to false, the client setting will be ignored.</description>
</property>

The above settings are well explained by their own descriptions. We therefore need to correct our hive-site.xml and restart the hiveserver2 (metastore) service.

At this point I hit a puzzling problem: no matter how I changed HIVE_HOME/conf/hive-site.xml, the corresponding setting never changed at runtime. Eventually I found another hive-site.xml under the HADOOP_HOME/etc/hadoop directory, which was the copy actually taking effect. To avoid this kind of confusion, it is advisable not to put any Hive-related configuration files under the HADOOP_HOME directory. The official loading order of precedence for configuration files can be found at REFERENCE_5.
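One way to catch this kind of shadowing early is to search every configuration directory that can end up on the classpath for stray copies of hive-site.xml. A minimal sketch, assuming HIVE_HOME and HADOOP_HOME default to the paths used on this cluster:

```shell
# Look for hive-site.xml in both the Hive conf dir and the Hadoop conf
# dir; any hit outside $HIVE_HOME/conf is a candidate for shadowing.
# The default paths below match this cluster's layout (assumptions).
: "${HIVE_HOME:=/home/workspace/hive-0.13.0-bin}"
: "${HADOOP_HOME:=/home/workspace/hadoop}"
for d in "$HIVE_HOME/conf" "$HADOOP_HOME/etc/hadoop"; do
  if [ -d "$d" ]; then
    find "$d" -maxdepth 1 -name 'hive-site.xml'
  fi
done
```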

After revising HIVE_HOME/conf/hive-site.xml, the following commands confirm that the preceding problem has been addressed properly.
###-- check runtime hive parameters related with hive user impersonation --
k1227:/home/workspace/hive-0.13.0-bin>hive
hive> set system:user.name;
system:user.name=hadoop
hive> set hive.server2.enable.doAs;
hive.server2.enable.doAs=true
hive> set hive.metastore.execute.setugi;
hive.metastore.execute.setugi=true

###-- start hiveserver2(metastore) again --
k1227:/home/workspace/hive-0.13.0-bin>hive --service metastore
Starting Hive Metastore Server
15/07/08 14:28:59 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/07/08 14:28:59 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/07/08 14:28:59 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/07/08 14:28:59 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/07/08 14:28:59 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/07/08 14:28:59 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
15/07/08 14:28:59 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/07/08 14:28:59 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
15/07/08 14:28:59 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
15/07/08 14:28:59 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/workspace/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/workspace/hive-0.13.0-bin/lib/jud_test.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
^Z
[1]+  Stopped                 hive --service metastore
k1227:/home/workspace/hive-0.13.0-bin>bg 1
[1]+ hive --service metastore &
k1227:/home/workspace/hive-0.13.0-bin>ps aux | grep metastore
hadoop    6597 26.6  0.4 1161404 275564 pts/0  Sl   14:28   0:14 /usr/java/jdk1.7.0_11//bin/java -Xmx20000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/workspace/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/workspace/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/workspace/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /home/workspace/hive-0.13.0-bin/lib/hive-service-0.13.0.jar org.apache.hadoop.hive.metastore.HiveMetaStore
hadoop   11936  0.0  0.0 103248   868 pts/0    S+   14:29   0:00 grep metastore

Here, `set system:user.name` displays the current user executing the hive command, and `set [parameter]` displays that parameter's value at runtime. Alternatively, we can list all runtime parameters via the `set` command inside hive, or from the command line: `hive -e "set;" > hive_runtime_parameters.txt`.
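For scripted checks it is handy to dump the parameters once and grep the relevant keys. The file written below is a hypothetical excerpt of what `hive -e "set;"` prints, used here so the filtering step can be shown without a live cluster:

```shell
# Hypothetical excerpt of `hive -e "set;" > hive_runtime_parameters.txt`
# output; on a real gateway the dump contains hundreds of entries.
cat > hive_runtime_parameters.txt <<'EOF'
hive.server2.enable.doAs=true
hive.metastore.execute.setugi=true
system:user.name=hadoop
EOF

# Pull out just the impersonation-related keys.
grep -E '^(hive\.server2\.enable\.doAs|hive\.metastore\.execute\.setugi)=' \
  hive_runtime_parameters.txt
```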

A possible exception, 'TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9083', may be thrown when launching the metastore service. According to REFERENCE_6, this happens when another metastore (or some other service) is already occupying port 9083, the default port for the Hive metastore. Kill it beforehand:
k1227:/home/workspace/hive-0.13.0-bin>lsof -i:9083
COMMAND  PID   USER   FD   TYPE     DEVICE SIZE/OFF NODE NAME
java    3499 hadoop  236u  IPv4 3913377019      0t0  TCP *:9083 (LISTEN)
k1227:/home/workspace/hive-0.13.0-bin>kill -9 3499
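The lookup-then-kill step can be collapsed into one guarded snippet; `lsof -t` prints only the PIDs. A sketch, assuming the metastore listens on the default port 9083:

```shell
# Kill whatever is listening on the metastore port. `lsof -t` emits
# bare PIDs; the guard keeps kill from running when the port is free.
PORT=9083
pids=$(lsof -t -i:"$PORT" 2>/dev/null)
if [ -n "$pids" ]; then
  kill -9 $pids
  echo "killed: $pids"
else
  echo "port $PORT already free"
fi
```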

Now, if we create a database/table again, the owner of the corresponding HDFS files/dirs will be the user invoking the hive command.
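To verify the fix in a script, the owner column (field 3) of a `hadoop fs -ls` line can be compared against the invoking user. The listing line below is a hypothetical example of the post-fix output:

```shell
# Hypothetical post-fix listing line: the owner column now shows the
# user who ran the hive command ('withdata') rather than 'supertool'.
line='drwxrwxr-x   - withdata withdata          0 2015-07-08 16:02 /user/supertool/hive/warehouse/test_db.db'
owner=$(echo "$line" | awk '{print $3}')
echo "owner=$owner"
```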



REFERENCE:
1. Setting Up HiveServer2 - Impersonation
2. hive-default.xml.template [hive.metastore.execute.setugi]
3. Hive User Impersonation (MapR)
4. Configuring User Impersonation with Hive Authorization (Drill)
5. AdminManual Configuration - Hive [order of precedence]
6. TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9083 (Cloudera Community)



