Saturday, April 28, 2018

POST Request Timeout and Oversized Message Body Issues

Timeout Issue

1. Symptom

After the request is made with HttpClient, the sender side reports the following error:
2018-04-28T10:08:46,144 ERROR [d78a9e9c-eac5-44ad-bf1b-4c0e8d387540 HiveServer2-Handler-Pool: Thread-90] common.HookHelper: failureHookRequest fail, url:http://url/func
com.foo.analytics.auth.org.apache.http.ConnectionClosedException: Premature end of chunk coded message body: closing chunk expected
at com.foo.analytics.auth.org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:266) ~[foo-analytics-auth-1.0.0.jar:?]
at com.foo.analytics.auth.org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:225) ~[foo-analytics-auth-1.0.0.jar:?]
at com.foo.analytics.auth.org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:184) ~[foo-analytics-auth-1.0.0.jar:?]
at com.foo.analytics.auth.org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[foo-analytics-auth-1.0.0.jar:?]
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_144]
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_144]
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_144]
at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_144]
at java.io.Reader.read(Reader.java:140) ~[?:1.8.0_144]
at com.foo.analytics.auth.org.apache.http.util.EntityUtils.toString(EntityUtils.java:227) ~[foo-analytics-auth-1.0.0.jar:?]
at com.foo.analytics.auth.org.apache.http.util.EntityUtils.toString(EntityUtils.java:308) ~[foo-analytics-auth-1.0.0.jar:?]
at com.foo.analytics.auth.hive.common.HookHelper.smartEngineRequest(HookHelper.java:219) [foo-analytics-auth-1.0.0.jar:?]
at com.foo.analytics.auth.hive.Hook.HivePreSemanticAnalyzerHook.func(HivePreSemanticAnalyzerHook.java:44) [foo-analytics-auth-1.0.0.jar:?]

2. Troubleshooting

  1. First, the unique id of this request could not be found anywhere in the webserver logs, i.e. the request never reached the controller at all.
  2. Some googling showed that both HttpClient and nginx have timeout settings for requests, so both were changed as follows:
On the HttpClient side, raise the timeouts from the 60 s default to several minutes (180,000 ms, i.e. 3 minutes, in the snippet below):
PoolingHttpClientConnectionManager connManager = new PoolingHttpClientConnectionManager();
// Create socket configuration (all timeouts below are 180,000 ms = 3 minutes)
SocketConfig socketConfig = SocketConfig.custom().setSoKeepAlive(true)
        .setSoTimeout(180000).setSoReuseAddress(true).setTcpNoDelay(true).build();
// Use this socket configuration by default (it can also be set per host)
connManager.setDefaultSocketConfig(socketConfig);
// Validate pooled connections after 3 minutes of inactivity
connManager.setValidateAfterInactivity(180000);
connManager.setMaxTotal(100);
connManager.setDefaultMaxPerRoute(20);
// Create the global request configuration
RequestConfig defaultRequestConfig = RequestConfig.custom()
        .setCookieSpec(CookieSpecs.DEFAULT).setExpectContinueEnabled(true)
        .setConnectTimeout(180000).setSocketTimeout(180000)
        .setConnectionRequestTimeout(180000).build();
CloseableHttpClient hc = HttpClients.custom().setConnectionManager(connManager)
        .setDefaultRequestConfig(defaultRequestConfig).setDefaultSocketConfig(socketConfig)
        .build();
// Read the (large) message from a file and post it as a single form field
String encodedMsg = new String(Files.readAllBytes(Paths.get(args[0])), UTF_8);
UrlEncodedFormEntity entity = new UrlEncodedFormEntity(
        Lists.newArrayList(new BasicNameValuePair("message", encodedMsg)), UTF_8);
HttpPost httpPost = new HttpPost("http://url/func");
httpPost.setEntity(entity);
try (CloseableHttpResponse response = hc.execute(httpPost)) {
    // ...
}
On the nginx side, set the timeouts in the server/location block that proxies to the upstream:
proxy_connect_timeout 300s;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
client_body_timeout 300s;
send_timeout 300s;
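As a quick sanity check outside HttpClient, the same POST can be replayed with curl once the payload has been dumped to a file (payload.txt is a hypothetical name); printing the status code and total time makes a proxy-side timeout or a truncated response easy to spot:
# replay the form POST with the "message" field read from payload.txt (assumed file name)
curl -s -o /dev/null -w 'status=%{http_code} time=%{time_total}s\n' \
  --data-urlencode "message@payload.txt" \
  http://url/func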

Oversized Message Body Issue

1. Symptom

When a POST request is sent via HttpClient to Spring Boot (embedded Tomcat), everything works as long as the posted data is small; if the data is large (~3 MB), the String received in the controller is empty.

2. Troubleshooting

  1. The request goes through nginx first and is then forwarded to two webservers, so the nginx access log was checked first: the request passes through nginx without any error. nginx ruled out.
  2. The outgoing message was saved to a file to build a test case, and the request was sent directly to one specific webserver. Printing the received message and its size in the controller showed that it was empty (a minimal sketch of this controller follows the config snippets below).
  3. Some googling showed that nginx, Spring Boot and Tomcat all put a limit on the request size:
Add the following to application.properties (for a url-encoded form POST like this one, the decisive limit is Tomcat's max POST size, which defaults to 2 MB and thus explains the ~3 MB failures):
spring.servlet.multipart.max-file-size=30MB
spring.servlet.multipart.max-request-size=30MB
server.tomcat.max-http-post-size=-1
And in the corresponding nginx server/location block, add:
client_max_body_size 10M;
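For reference, the receiving side used in step 2 is essentially a plain @PostMapping that logs the size of the "message" form field; a minimal sketch (class name and endpoint path are assumptions mirroring the sender above):
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class FuncController {
    // Hypothetical endpoint matching the sender's "message" form field. Logging the
    // length makes the "empty String on large payloads" symptom easy to confirm;
    // once the limits above are raised, the logged size matches the payload size.
    @PostMapping("/func")
    public String func(@RequestParam(value = "message", defaultValue = "") String message) {
        System.out.println("received message size = " + message.length());
        return "size=" + message.length();
    }
}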

Thursday, April 26, 2018

Hive on Spark Series: hive.auto.convert.join issue

The description of the hive.auto.convert.join parameter reads as follows:
Whether Hive enables the optimization about converting common
join into mapjoin based on the input file size
With Hive on Spark, the execution plan is generated during semantic analysis. As a result, a task whose input files are relatively small but whose intermediate results blow up later can still be forcibly planned as a MapJoin, which produces errors like the following in the YARN task logs:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hive.serde2.WriteBuffers.nextBufferToWrite(WriteBuffers.java:246)
at org.apache.hadoop.hive.serde2.WriteBuffers.write(WriteBuffers.java:222)
at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$LazyBinaryKvWriter.writeKey(MapJoinBytesTableContainer.java:250)
at org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap.put(BytesBytesMultiHashMap.java:428)
at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.putRow(MapJoinBytesTableContainer.java:395)
at org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainerSerDe.loadOptimized(MapJoinTableContainerSerDe.java:200)
A representative SQL is shown below. Each subquery produces about 3.6 billion rows (the selected columns themselves are small), but joining them causes a near-Cartesian blow-up; if this is still executed as a MapJoin, it OOMs.
select
    a.dt as first_day, b.dt as rent_day
from (
    select dt, uid
    from tbl
    where dt >= '20180320'
    group by dt, uid
) a join (
    select dt, uid
    from tbl
    where dt >= '20180320'
    group by dt, uid
) b on a.uid = b.uid
limit 100;
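For queries like this one, the usual workaround is to switch the automatic conversion off for the session, or to lower the size threshold that allows it; both are standard Hive settings, and the threshold value below is only illustrative:
-- disable automatic common-join -> map-join conversion for this session
set hive.auto.convert.join=false;
-- or keep it on but shrink the data-size threshold (bytes) that permits the conversion
set hive.auto.convert.join.noconditionaltask.size=10000000;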

Tuesday, April 24, 2018

Gradle / Maven Dependency Conflict Issue - A Fundamental Approach to Locating the Conflict

Following up on "Resolving Maven Dependency Conflict Problem In Intellij", this post records the fundamental approach to debugging a dependency conflict.
A conflict typically shows up as an error like the following on the server side:
2018-04-25 11:09:08,624 [ERROR] [http-nio-8591-exec-8] (org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/].[dispatcherServlet]) - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Handler dispatch failed; nested exception is java.lang.NoSuchMethodError: com.google.protobuf.AbstractMessageLite$Builder.addAll(Ljava/lang/Iterable;Ljava/util/List;)V] with root cause
java.lang.NoSuchMethodError: com.google.protobuf.AbstractMessageLite$Builder.addAll(Ljava/lang/Iterable;Ljava/util/List;)V
...
In this example, the method com.google.protobuf.AbstractMessageLite$Builder.addAll cannot be found. Since the conflict is inside the deployed zip package, unpack it first, navigate to the directory holding all the jars, and run the following command:
ls *.jar | xargs -I %%% sh -c 'echo "current_jar=%%%"; jar tf %%%' > dep.txt
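Once dep.txt exists, a short awk pass (a sketch; the class name is taken from the NoSuchMethodError above) maps the conflicting class back to every jar that contains it:
awk -F= '/^current_jar=/ { jar = $2 } index($0, "AbstractMessageLite$Builder.class") { print jar }' dep.txt | sort -u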
The first command dumps every class of every jar into dep.txt, tagged with the jar it came from. Searching for the AbstractMessageLite class (if there is a conflict, at least two jars must contain it) shows that both hive-exec-2.3.2.jar and protobuf-java-3.5.1.jar contain it. Running javap -classpath <jar_file> 'com.google.protobuf.AbstractMessageLite$Builder' against each of the two jars gives the following partial output:
~/Downloads/kuaishou-lineage-webapp-1.0-SNAPSHOT/lib $ javap -classpath protobuf-java-3.5.1.jar 'com.google.protobuf.AbstractMessageLite$Builder'
Compiled from "AbstractMessageLite.java"
public abstract class com.google.protobuf.AbstractMessageLite$Builder<MessageType extends com.google.protobuf.AbstractMessageLite<MessageType, BuilderType>, BuilderType extends com.google.protobuf.AbstractMessageLite$Builder<MessageType, BuilderType>> implements com.google.protobuf.MessageLite$Builder {
...
protected static <T> void addAll(java.lang.Iterable<T>, java.util.Collection<? super T>);
protected static <T> void addAll(java.lang.Iterable<T>, java.util.List<? super T>);
...
}
~/Downloads/kuaishou-lineage-webapp-1.0-SNAPSHOT/lib $ javap -classpath hive-exec-2.3.2.jar 'com.google.protobuf.AbstractMessageLite$Builder'
Compiled from "AbstractMessageLite.java"
public abstract class com.google.protobuf.AbstractMessageLite$Builder<BuilderType extends com.google.protobuf.AbstractMessageLite$Builder> implements com.google.protobuf.MessageLite$Builder {
...
protected static <T> void addAll(java.lang.Iterable<T>, java.util.Collection<? super T>);
...
}
So the copy in hive-exec-2.3.2.jar is missing the List-based addAll overload that the caller expects. The fix is simply to exclude the offending dependency in Maven/Gradle so that only one copy remains on the classpath (a generic exclusion sketch follows). Khalas.
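Which artifact gets excluded from which dependency depends on which copy the application actually needs, so the sketch below only illustrates the exclusion mechanism using coordinates from this post; confirm the real dependency path first (e.g. with mvn dependency:tree -Dincludes=com.google.protobuf, or Gradle's dependencies task), and note that classes bundled directly inside another jar cannot be removed by an exclusion.
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.3.2</version>
    <exclusions>
        <!-- illustrative: keep only one protobuf copy on the classpath -->
        <exclusion>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
        </exclusion>
    </exclusions>
</dependency>
The Gradle equivalent is an exclude group: 'com.google.protobuf', module: 'protobuf-java' block on the same dependency.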

Monday, April 23, 2018

Hive on Spark Series: KryoException: Unable to find class after ADD JAR when running SQL on Spark

After getting Hive on Spark configured, the next step was to check whether Hive UDFs work properly under it:
set hive.execution.engine=spark;
add jar viewfs:///dirs/brickhouse-0.7.1-SNAPSHOT-jar-with-dependencies.jar;
create temporary function to_json AS 'brickhouse.udf.json.ToJsonUDF';
select to_json(app_metric) as tt from tbl_name where dt = '20180417' limit 10;
However, running this in yarn-cluster mode fails with the following error:
org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: brickhouse.udf.json.ToJsonUDF
Serialization trace:
genericUDF (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
colExprMap (org.apache.hadoop.hive.ql.exec.SelectOperator)
childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator)
aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)
invertedWorkGraph (org.apache.hadoop.hive.ql.plan.SparkWork)
at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:156)
at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClass(SerializationUtilities.java:181)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:118)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClassAndObject(SerializationUtilities.java:176)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:161)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:39)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClassAndObject(SerializationUtilities.java:176)
at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at org.apache.hive.com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClassAndObject(SerializationUtilities.java:176)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:161)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:39)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readClassAndObject(SerializationUtilities.java:176)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:153)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:39)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:214)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:686)
at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:206)
at org.apache.hadoop.hive.ql.exec.spark.KryoSerializer.deserialize(KryoSerializer.java:60)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:329)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: brickhouse.udf.json.ToJsonUDF
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:154)
... 47 more
Looking only at the stack trace, it seems as if the jar was never shipped to the corresponding RemoteDriver, which is why the class cannot be found.
MARK: besides the stack trace itself, another debugging angle is to read the INFO lines above it. Some failures are fail-silent, so the INFO output may already expose abnormal log content that points to the real problem.
Part of the INFO log:
18/04/23 18:51:39 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
18/04/23 18:51:39 INFO spark.SparkContext: Added JAR viewfs://hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/hive-exec-2.3.2.jar at viewfs://hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/hive-exec-2.3.2.jar with timestamp 1524480699144
18/04/23 18:51:39 INFO spark.SparkContext: Added JAR viewfs://hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/kuaishou-analytics-auth-1.0.0.jar at viewfs://hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/kuaishou-analytics-auth-1.0.0.jar with timestamp 1524480699162
18/04/23 18:51:39 INFO spark.SparkContext: Added JAR viewfs://hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/brickhouse-0.7.1-SNAPSHOT-jar-with-dependencies-2.jar at viewfs://hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/brickhouse-0.7.1-SNAPSHOT-jar-with-dependencies-2.jar with timestamp 1524480699165
18/04/23 18:51:39 INFO storage.BlockManagerMasterEndpoint: Registering block manager bjlt-h1180.sy:37446 with 7.0 GB RAM, BlockManagerId(53, bjlt-h1180.sy, 37446)
18/04/23 18:51:39 INFO client.RemoteDriver: Received job request 2d71a807-c512-4032-8c9e-71a378d3168b
18/04/23 18:51:39 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.48.74.35:59232) with ID 41
18/04/23 18:51:39 INFO spark.ExecutorAllocationManager: New executor 41 has registered (new total is 53)
18/04/23 18:51:39 INFO storage.BlockManagerMasterEndpoint: Registering block manager bjlt-h1864.sy:36387 with 7.0 GB RAM, BlockManagerId(41, bjlt-h1864.sy, 36387)
18/04/23 18:51:39 INFO client.SparkClientUtilities: Added jar[file:/media/disk3/yarn_data/usercache/dp/appcache/application_1523431310007_1301182/container_e95_1523431310007_1301182_01_000001/viewfs:/hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/hive-exec-2.3.2.jar] to classpath.
18/04/23 18:51:39 INFO client.SparkClientUtilities: Added jar[file:/media/disk3/yarn_data/usercache/dp/appcache/application_1523431310007_1301182/container_e95_1523431310007_1301182_01_000001/viewfs:/hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/brickhouse-0.7.1-SNAPSHOT-jar-with-dependencies-2.jar] to classpath.
18/04/23 18:51:39 INFO client.SparkClientUtilities: Added jar[file:/media/disk3/yarn_data/usercache/dp/appcache/application_1523431310007_1301182/container_e95_1523431310007_1301182_01_000001/viewfs:/hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/kuaishou-analytics-auth-1.0.0.jar] to classpath.
18/04/23 18:51:39 INFO client.RemoteDriver: Failed to run job 2d71a807-c512-4032-8c9e-71a378d3168b
The "Added JAR" lines logged by SparkContext carry the correct viewfs paths, but the "Added JAR" lines from SparkClientUtilities look like "file:/media/disk3/yarn_data/usercache/dp/appcache/application_1523431310007_1301182/container_e95_1523431310007_1301182_01_000001/viewfs:/hadoop-lt-cluster/tmp/hive/dp/_spark_session_dir/bc42b5c8-f183-4088-b238-2c3a75725d06/hive-exec-2.3.2.jar", so the viewfs path is clearly being mangled during resolution. Reviewing the Hive source code turns up the following snippet:
private static URL urlFromPathString(String path, Long timeStamp,
    Configuration conf, File localTmpDir) {
  URL url = null;
  try {
    if (StringUtils.indexOf(path, "file:/") == 0) {
      url = new URL(path);
    } else if (StringUtils.indexOf(path, "hdfs:/") == 0) {
      Path remoteFile = new Path(path);
      Path localFile =
          new Path(localTmpDir.getAbsolutePath() + File.separator + remoteFile.getName());
      Long currentTS = downloadedFiles.get(path);
      if (currentTS == null) {
        currentTS = -1L;
      }
      if (!new File(localFile.toString()).exists() || currentTS < timeStamp) {
        LOG.info("Copying " + remoteFile + " to " + localFile);
        FileSystem remoteFS = remoteFile.getFileSystem(conf);
        remoteFS.copyToLocalFile(remoteFile, localFile);
        downloadedFiles.put(path, timeStamp);
      }
      return urlFromPathString(localFile.toString(), timeStamp, conf, localTmpDir);
    } else {
      url = new File(path).toURL();
    }
  } catch (Exception err) {
    LOG.error("Bad URL " + path + ", ignoring path", err);
  }
  return url;
}
So this version of Hive (2.3.2) only recognizes the hdfs:/ and file:/ schemes; a viewfs path falls through to the last branch and gets turned into the broken local file URL seen above. Extending the check to cover viewfs, i.e. else if (StringUtils.indexOf(path, "hdfs:/") == 0 || StringUtils.indexOf(path, "viewfs:/") == 0) {, solves the problem (the patched branch is shown below).
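For clarity, the patched scheme check in SparkClientUtilities.urlFromPathString (only this branch changes; the download logic of the method above stays as-is):
if (StringUtils.indexOf(path, "file:/") == 0) {
  url = new URL(path);
} else if (StringUtils.indexOf(path, "hdfs:/") == 0
    || StringUtils.indexOf(path, "viewfs:/") == 0) {
  // viewfs jars now take the same copy-to-local branch as hdfs jars
  Path remoteFile = new Path(path);
  // ... unchanged download-and-recurse logic from the snippet above ...
} else {
  url = new File(path).toURL();
}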