Apache HBase is a key-value store in the Hadoop ecosystem. It is built on HDFS and provides high-performance access to large volumes of data. HBase is written in Java and has native support for Java clients, but with the help of Thrift and its various language bindings, we can also access HBase from web services quite easily. This article describes how to read and write an HBase table with Python and Thrift.
Generate Thrift Class
For anyone new to Apache Thrift: it provides an IDL (Interface Description Language) in which you describe your service methods and data types, and a compiler that generates the corresponding code for different languages. For instance, a Thrift type definition like this:
struct TColumn {
Will be transformed into the following Python code:
class TColumn(object):
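The generated class is considerably longer than the struct, because it also carries serialization code. As a rough illustration of its shape only, here is a simplified stand-in. It is not the actual output of the Thrift compiler, and it assumes the struct declares the fields `family`, `qualifier` and `timestamp`:

```python
class TColumn(object):
    """Simplified stand-in for a Thrift-generated struct class.

    The real generated class also contains read()/write() methods that
    serialize the fields through a Thrift protocol object; they are
    omitted here.
    """

    def __init__(self, family=None, qualifier=None, timestamp=None):
        # Field names are assumed to match the IDL struct definition.
        self.family = family
        self.qualifier = qualifier
        self.timestamp = timestamp

    def __repr__(self):
        fields = ", ".join("%s=%r" % (k, v) for k, v in self.__dict__.items())
        return "%s(%s)" % (self.__class__.__name__, fields)

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other):
        return not (self == other)
```

The point is that every field of the struct becomes a keyword argument and an attribute, so optional fields can simply be left out.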
HBase Thrift vs Thrift2
HBase provides two versions of Thrift IDL files, and they have two main differences.
First, thrift2 mimics the data types and methods from HBase Java API, which could be more intuitive to use. For instance, constructing a Get operation in Java is:
Get get = new Get(Bytes.toBytes("rowkey"));
In thrift2, there is a corresponding TGet type:
tget = TGet(
In thrift, by contrast, we directly invoke one of the get methods:
client.getRowWithColumns(
The second difference is that thrift2 lacks the administration interfaces, like createTable, majorCompact, etc. Currently these APIs are still under development, so if you need to use them via Thrift, you will have to fall back to version one.
After deciding which version to use, we can download the hbase.thrift file and generate Python code from it. One note on the Thrift version, though: we will use Python 3.x, which is supported from Thrift 0.10 onwards, so make sure you install a suitable version. Execute the following command, and you will get several Python files.
$ thrift -gen py hbase.thrift
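By default the Thrift compiler writes its Python output into a gen-py directory under the current working directory. One way to make the generated hbase package importable is to put that directory on the module search path. The helper name below is hypothetical, and the sketch assumes the default output location:

```python
import sys
from pathlib import Path


def add_generated_code_to_path(gen_dir="gen-py"):
    """Put the Thrift output directory on sys.path so `import hbase` can resolve."""
    path = str(Path(gen_dir).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


add_generated_code_to_path()
# After this call, `from hbase import THBaseService` should resolve,
# provided gen-py/hbase/ exists in the current directory.
```

Copying the hbase directory into your project, or packaging it, works just as well; this is only the quickest option for experimenting.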
Run HBase in Standalone Mode
In case you do not have a running HBase service to test against, you can follow the quick start guide (link) to download the binaries, do some minor configuration, and then execute the following commands to start a standalone HBase server as well as the Thrift2 server.
bin/start-hbase.sh
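The Thrift server may need a few seconds before it accepts connections, which matters if you start it from a test script. The following sketch polls the port until it is reachable. The function name is hypothetical, and 9090 is assumed to be the port the Thrift server listens on (its usual default); adjust it if your setup differs.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=9090, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example use in a test script:
#     if not wait_for_port():
#         raise SystemExit("Thrift server did not come up")
```

Note that an open port only shows that the server process is listening, not that HBase itself is fully initialized.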
Then in the HBase shell, we create a test table and read / write some data.
> create "tsdata", NAME => "cf"
Connect to HBase via Thrift2
Here is the boilerplate for making a connection to the HBase Thrift server. Note that the Thrift client is not thread-safe, nor does it provide a connection pooling facility. You may choose to connect on every request, which is actually fast enough, or maintain a pool of connections yourself.
from thrift.transport import TSocket
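If you prefer to maintain a pool yourself, a minimal sketch could look like the following. It is one possible design, not part of Thrift or HBase: `ClientPool` and the `connect` callable are hypothetical names, and `connect` is assumed to open a transport and return a `(client, transport)` pair.

```python
import queue
from contextlib import contextmanager


class ClientPool:
    """A fixed-size pool of (client, transport) pairs, created on demand."""

    def __init__(self, connect, size=4):
        # `connect` is a hypothetical callable that opens a transport and
        # returns a (client, transport) pair.
        self._connect = connect
        self._slots = queue.Queue(maxsize=size)
        for _ in range(size):
            self._slots.put(None)  # None marks a slot with no open connection

    @contextmanager
    def client(self, timeout=None):
        # Blocks while all slots are in use, so each client is only ever
        # handed to one thread at a time.
        conn = self._slots.get(timeout=timeout)
        try:
            if conn is None:
                conn = self._connect()
            yield conn[0]
        except Exception:
            # The connection may be in an undefined state: close and drop it
            # so that the slot reconnects on its next use.
            if conn is not None:
                try:
                    conn[1].close()
                except Exception:
                    pass
                conn = None
            raise
        finally:
            self._slots.put(conn)
```

A request handler would then write `with pool.client() as client:` and call the service methods inside the block. Because a slot is held for the duration of the block, the non-thread-safe client is never shared between threads.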
We can test the connection with some basic operations:
from hbase.ttypes import TPut, TColumnValue, TGet
Thrift2 Data Types and Methods Overview
For a full list of the available APIs, you can look directly into the hbase.thrift or hbase/THBaseService.py files. The following is an abridged table of those data types and methods.
Data Types
Class | Description | Example |
---|---|---|
TColumn | Represents a column family or a single column. | TColumn(family='cf', qualifier='gender') |
TColumnValue | Column and its value. | TColumnValue(family='cf', qualifier='gender', value='male') |
TResult | Query result, a single row. row attribute would be None if no result is found. | TResult(row='employee_001', columnValues=[TColumnValue]) |
TGet | Query a single row. | TGet(row='employee_001', columns=[TColumn]) |
TPut | Mutate a single row. | TPut(row='employee_001', columnValues=[TColumnValue]) |
TDelete | Delete an entire row or only some columns. | TDelete(row='employee_001', columns=[TColumn]) |
TScan | Scan for multiple rows and columns. | See below. |
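Since a TResult with row set to None means "not found", it can be convenient to convert results into plain Python values in one place. The helper below is a hypothetical sketch that relies only on the attribute names from the table above; the `SimpleNamespace` objects are stand-ins for the generated classes so that the snippet runs without a server. Under Python 3 the binary fields are expected to be `bytes`, hence the `b"..."` literals.

```python
from types import SimpleNamespace


def result_to_dict(tresult):
    """Flatten a TResult-like object into {(family, qualifier): value}.

    Returns None when no row was found.
    """
    if tresult.row is None:
        return None
    return {
        (cv.family, cv.qualifier): cv.value
        for cv in (tresult.columnValues or [])
    }


# Stand-ins with the same attributes as TResult / TColumnValue.
found = SimpleNamespace(
    row=b"employee_001",
    columnValues=[
        SimpleNamespace(family=b"cf", qualifier=b"gender", value=b"male"),
    ],
)
missing = SimpleNamespace(row=None, columnValues=[])

print(result_to_dict(found))    # {(b'cf', b'gender'): b'male'}
print(result_to_dict(missing))  # None
```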
THBaseService Methods
Method Signature | Description |
---|---|
get(table: str, tget: TGet) -> TResult | Query a single row. |
getMultiple(table: str, tgets: List[TGet]) -> List[TResult] | Query multiple rows. |
put(table: str, tput: TPut) -> None | Mutate a row. |
putMultiple(table: str, tputs: List[TPut]) -> None | Mutate multiple rows. |
deleteSingle(table: str, tdelete: TDelete) -> None | Delete a row. |
deleteMultiple(table: str, tdeletes: List[TDelete]) -> None | Delete multiple rows. |
openScanner(table: str, tscan: TScan) -> int | Open a scanner, returns scannerId. |
getScannerRows(scannerId: int, numRows: int) -> List[TResult] | Get scanner rows. |
closeScanner(scannerId: int) -> None | Close a scanner. |
getScannerResults(table: str, tscan: TScan, numRows: int) -> List[TResult] | A convenient method to get scan results. |
Scan Operation Example
I wrote some example code on GitHub (link); the following shows how a Scan operation is performed.
scanner_id = client.openScanner(
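The open / fetch / close sequence can be wrapped in a generator so that the scanner is always closed, even if the caller stops iterating early or an error occurs. This is a sketch rather than the code from the repository: `scan_rows` is a hypothetical helper, and it assumes that getScannerRows returns an empty list once the scanner is exhausted.

```python
def scan_rows(client, table, tscan, batch_size=100):
    """Yield rows from a scan one at a time, fetching them in batches."""
    scanner_id = client.openScanner(table, tscan)
    try:
        while True:
            rows = client.getScannerRows(scanner_id, batch_size)
            if not rows:
                # Assumed end-of-scan signal: an empty batch.
                break
            yield from rows
    finally:
        # Runs on normal completion, on errors, and when the generator
        # is closed early by the caller.
        client.closeScanner(scanner_id)
```

With a real client this would be used as `for row in scan_rows(client, table, tscan): ...`. For small result sets, getScannerResults from the table above does the same in a single call.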
Thrift Server High Availability
There are several ways to eliminate the single point of failure of the Thrift server. You can (1) randomly select a server address on the client side and fall back to the others if a failure is detected, (2) set up a proxy facility to load balance the TCP connections, or (3) run an individual Thrift server on every client machine and let the client code connect to the local Thrift server. Usually we use the second approach, so you may consult your system administrator on that topic.
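For completeness, the first approach (client-side selection with fallback) can be sketched as follows. The function and the `connect` callable are hypothetical: `connect(host, port)` is assumed to open a connection and raise an exception when the server is unreachable.

```python
import random


def connect_with_failover(addresses, connect, rng=random):
    """Try the given (host, port) pairs in random order; return the first success."""
    candidates = list(addresses)
    rng.shuffle(candidates)  # spreads the load across servers
    last_error = None
    for host, port in candidates:
        try:
            return connect(host, port)
        except Exception as exc:
            # Broad on purpose for this sketch; narrow it to the transport
            # errors your connect function actually raises.
            last_error = exc
    raise ConnectionError(
        "no Thrift server reachable among %r" % (candidates,)
    ) from last_error
```

This only covers failures at connection time. A server that dies in the middle of a request still surfaces as an error to the caller, who then needs to reconnect and retry.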