HiveServer2 (HS2) is a service that enables clients to execute queries against Hive. HiveServer2 is the successor to HiveServer1 which has been deprecated. HS2 supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
HS2 is a single process running as a composite service, which includes the Thrift-based Hive service (TCP or HTTP) and a Jetty web server for web UI.
The usage of those layers in the HS2 implementation is described below.
The TThreadPoolServer allocates one worker thread per TCP connection. Each thread is always associated with a connection even if the connection is idle. So there is a potential performance issue resulting from a large number of threads due to a large number of concurrent connections. In the future, HS2 might switch to another server type for TCP mode, for example, TThreadedSelectorServer. Here is an article about a performance comparison between different Thrift Java servers.
The metastore can be configured as embedded (in the same process as HS2) or as a remote server (which is a Thrift-based service as well). HS2 talks to the metastore for the metadata required for query compilation.
HS2 is a single process running as a composite service, which includes the Thrift-based Hive service (TCP or HTTP) and a Jetty web server for web UI.
HS2 Architecture
The Thrift-based Hive service is the core of HS2 and responsible for servicing the Hive queries (e.g., from Beeline). Thrift is an RPC framework for building cross-platform services. Its stack consists of 4 layers: Server, Transport, Protocol, and Processor. You can find more details about the layers atThe usage of those layers in the HS2 implementation is described below.
Server
HS2 uses a TThreadPoolServer (from Thrift) for TCP mode, or a Jetty server for the HTTP mode.The TThreadPoolServer allocates one worker thread per TCP connection. Each thread is always associated with a connection even if the connection is idle. So there is a potential performance issue resulting from a large number of threads due to a large number of concurrent connections. In the future, HS2 might switch to another server type for TCP mode, for example, TThreadedSelectorServer. Here is an article about a performance comparison between different Thrift Java servers.
Transport
HTTP mode is required when a proxy is needed between the client and server (for example, for load balancing or security reasons). That is why it is supported, as well as TCP mode. You can specify the transport mode of the Thrift service through the Hive configuration property hive.server2.transport.mode.Protocol
The Protocol implementation is responsible for serialization and deserialization. HS2 is currently using TBinaryProtocol as its Thrift protocol for serialization. In the future, other protocols may be considered, such as TCompactProtocol, based on more performance evaluation.Processor
Process implementation is the application logic to handle requests. For example, the ThriftCLIService.ExecuteStatement() method implements the logic to compile and execute a Hive query.Dependencies of HS2
Metastore
Hadoop cluster
HS2 prepares physical execution plans for various execution engines (MapReduce/Tez/Spark) and submits jobs to the Hadoop cluster for execution.
Here is a sequence of API calls involved to make the first query:
JDBC Client
The JDBC driver is recommended for the client side to interact with HS2. Note that there are some use cases (e.g., Hadoop Hue) where the Thrift client is used directly and JDBC is bypassed.Here is a sequence of API calls involved to make the first query:
- The JDBC client (e.g., Beeline) creates a HiveConnection by initiating a transport connection (e.g., TCP connection) followed by an OpenSession API call to get a SessionHandle. The session is created from the server side.
- The HiveStatement is executed (following JDBC standards) and an ExecuteStatement API call is made from the Thrift client. In the API call, SessionHandle information is passed to the server along with the query information.
- The HS2 server receives the request and asks the driver (which is a CommandProcessor) for query parsing and compilation. The driver kicks off a background job that will talk to Hadoop and then immediately returns a response to the client. This is an asynchronous design of the ExecuteStatement API. The response contains an OperationHandle created from the server side.
- The client uses the OperationHandle to talk to HS2 to poll the status of the query execution.
Thank you.Well it was nice post and very helpful information onHadoop Admin Online Course
ReplyDelete