Ji ZHANG's Blog

If I rest, I rust.

Compressed Ordinary Object Pointers in the HotSpot JVM

Original article: https://wiki.openjdk.java.net/display/HotSpot/CompressedOops

What Is an Ordinary Object Pointer?

An ordinary object pointer (oop) is a HotSpot VM term for a managed pointer to an object. It is normally the same size as a native machine pointer. Java applications and the GC subsystem track these managed pointers very carefully, so that the memory of dead objects can be reclaimed and objects can be moved (copied) when the heap is compacted.

The term ordinary object pointer is traditional in a number of VM implementations that evolved from Smalltalk and Self.

Some of these systems use the name small integer (smi) for a virtual pointer that represents a 30-bit integer; the term can also be seen in the V8 implementation, which comes from the same Smalltalk lineage.

Why Compress Them?

On an LP64 system pointers take 64 bits, while on an ILP32 system they take only 32 bits, which limits the heap to about 4 GB, not enough for many applications. On an LP64 system, an application's runtime footprint is roughly 1.5 times larger than on ILP32, simply because the pointers take more space. Memory may be cheap, but network bandwidth and cache capacity are not, so significantly enlarging the footprint just to get around the 4 GB limit is a poor trade.

On x86 chips, ILP32 mode offers only half as many usable registers as LP64 mode. SPARC is not affected; RISC chips provide plenty of registers to begin with, and LP64 mode makes even more available.

To use a compressed oop, the 32-bit value is scaled by a factor of 8 and added to a 64-bit base address to locate the object it refers to. This scheme can address four billion objects, which corresponds to about 32 GB of heap. At the same time, compressing the pointers inside data structures makes them about as compact as on an ILP32 system.

We use decoding for the conversion from a 32-bit compressed oop to a 64-bit address, and encoding for the reverse.
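The arithmetic can be written out as a small sketch. This is illustrative code, not HotSpot's implementation; the base address and the shift of 3 (scale factor 8) are assumptions:

object CompressedOops {
  val heapBase: Long = 0x800000000L    // hypothetical 64-bit heap base address

  // decode: 32-bit compressed oop -> 64-bit native address
  def decode(oop: Int): Long = heapBase + ((oop & 0xFFFFFFFFL) << 3)

  // encode: 64-bit native address -> 32-bit compressed oop
  def encode(address: Long): Int = ((address - heapBase) >> 3).toInt
}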

Spark Streaming Logging Configuration

Spark Streaming applications tend to run forever, so their log files should be handled properly to avoid filling up the server's hard drives. This article gives some practical advice on dealing with these log files, for both Spark on YARN and standalone mode.

Log4j’s RollingFileAppender

Spark uses log4j as its logging facility. The default configuration writes all logs to standard error, which is fine for batch jobs. But for streaming jobs, we'd better use a rolling file appender to cut log files by size and keep only a few recent ones. Here's an example:

log4j.rootLogger=INFO, rolling

log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8

log4j.logger.org.apache.spark=WARN
log4j.logger.org.eclipse.jetty=WARN

log4j.logger.com.anjuke.dm=${dm.logging.level}

This tells log4j to roll the log file when it reaches 50MB and to keep only the 5 most recent files. The files are saved in the /var/log/spark directory, with the filename taken from the system property dm.logging.name. We also set the logging level of our own package com.anjuke.dm according to the dm.logging.level property. Another thing to mention is that we set org.apache.spark to level WARN, so as to suppress Spark's verbose logs.
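For the standalone driver, one way to pass this file and the two system properties to the JVM is spark-submit's --driver-java-options flag. This is only a sketch; the class name, jar, and file paths below are placeholders:

spark-submit --class com.anjuke.dm.StreamingJob \
    --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties -Ddm.logging.name=myapp -Ddm.logging.level=INFO" \
    myapp.jar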

ElasticSearch Performance Tips

Recently we have been using ElasticSearch as the data backend of our recommendation API, serving both offline and online computed data to users. Thanks to ElasticSearch's rich out-of-the-box functionality, it doesn't take much trouble to set up the cluster. However, we still ran into some misuses and unwise configurations, so here's a list of ElasticSearch performance tips that we learned from practice.

Tip 1 Set Num-of-shards to Num-of-nodes

Sharding is the foundation of ElasticSearch's distribution capability. Every index is split into several shards (5 by default) which are distributed across the cluster nodes. But this capability does not come for free. Since the data being queried resides in all shards (this behaviour can be changed by routing), ElasticSearch has to run the query on every shard, fetch the results, and merge them, much like a map-reduce process. So if there are too many shards, more than the number of cluster nodes, the query will be executed more than once on the same node, and this also hurts the merge phase. On the other hand, too few shards also reduce performance, because not all nodes are utilized.

Shards come in two roles: primary shards and replica shards. A replica shard serves as a backup of its primary; when the primary goes down, a replica takes over its job. Replicas also help to improve search and get performance, since these requests can be executed on either a primary or a replica shard.

Shards can be visualized with the elasticsearch-head plugin:

The cu_docs index has two shards, 0 and 1, with number_of_replicas set to 1. Primary shard 0 (bold border) resides on server Leon, and its replica on Pris. They are green because every primary shard has enough replicas sitting on different servers, so the cluster is healthy.

Since number_of_shards of an index cannot be changed after creation (while number_of_replicas can), one should choose this config wisely. Here are some suggestions:

  1. How many nodes do you have, now and in the future? If you're sure you'll only ever have 3 nodes, set the number of shards to 2 and replicas to 1, so there'll be 4 shards across 3 nodes. If you'll add servers in the future, set the number of shards to 3, so when the cluster grows to 5 nodes there'll be 6 distributed shards. (These settings are specified when creating the index; see the sketch after this list.)
  2. How big is your index? If it's small, one shard with one replica will do.
  3. What are the read and write frequencies, respectively? If it's search heavy, set up more replicas.
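As a concrete illustration, this is how the shard and replica counts might be set at index creation time; the node address is an assumption:

curl -XPUT 'http://localhost:9200/cu_docs' -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'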

Understanding Reduce-side Join in Depth

In the book MapReduce Design Patterns, the authors describe how to implement a reduce-side join. The steps are roughly as follows:

  1. Use MultipleInputs to specify the source tables and their corresponding Mapper classes;
  2. Each Mapper emits the join field as the Key, and the record, tagged with its source table, as the Value;
  3. When the Reducer receives all records for a key, it performs two steps:
    1. Iterate over the Values and, according to the tag, put the records of each source table into one of two Lists;
    2. Iterate over the two Lists and output the join results.

A concrete implementation can be found in this piece of code. But this approach has a problem: if a single key has too many records, holding them all in Lists takes a lot of memory and, in severe cases, causes an out-of-memory error (OOM). The approach works fine for one-to-one joins, but for one-to-many and many-to-many joins it is a hidden risk. So how does Hive avoid OOM when performing a reduce-side join? There are two key points, sketched in code after the list below:

  1. While iterating over the Values, the Reducer buffers all tables except the last one in memory, and streams the last table, scanning and emitting output as it goes;
  2. If the buffered tables do not fit in memory, they are spilled to disk.
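Here is a minimal sketch of that streaming trick, assuming the values arrive sorted so that rows of the buffered table ("a") come before rows of the streamed table ("b"); the types and tag names are made up for illustration:

import scala.collection.mutable.ArrayBuffer

object ReduceSideJoinSketch {
  type Row = Seq[String]  // a row is just a sequence of column values

  // values: (sourceTag, row) pairs for one join key; emit: callback invoked for each joined pair
  def joinReduce(values: Iterator[(String, Row)])(emit: (Row, Row) => Unit): Unit = {
    val buffered = ArrayBuffer[Row]()
    for ((tag, row) <- values) {
      if (tag == "a") buffered += row                 // buffer every table except the last in memory
      else buffered.foreach(left => emit(left, row))  // stream the last table, joining on the fly
    }
  }
}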

Using git rebase to Keep History Clean

When several people work on the same branch, the history usually ends up like the left figure below, rather messy. If you want linear commits like the right figure, you need to learn how to use git rebase.

Where the “Merge branch” Commits Come From

Our workflow is: edit code → commit to the local repository → pull the remote changes → push. It is the git pull step that produces the Merge branch commits. In fact, git pull is equivalent to the two commands git fetch origin and git merge origin/master: the former fetches the remote repository into the local remote-tracking branch, and the latter merges those changes into the local branch.

There is also a crude workaround to avoid the Merge branch commits: pull first, then commit, then push. But if the remote changes again between the commit and the push, you have to pull once more, and a Merge branch commit appears again.

Using git pull --rebase

Edit code → commit → git pull --rebase → git push. In other words, git merge origin/master is replaced with git rebase origin/master, which first moves HEAD to origin/master and then re-applies your local commits one by one, so no Merge branch commit is produced. The details are covered in the further reading below.
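Put together, the workflow is just these commands (the commit message is a placeholder):

git add -A
git commit -m "some change"
git pull --rebase    # i.e. git fetch origin followed by git rebase origin/master
git push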

A Quick Start with Spark

Apache Spark is an emerging engine for fast, general-purpose, large-scale data processing. Its advantages lie in three areas:

  • General-purpose computing: it can run MapReduce-style jobs, data mining, graph computation, streaming, SQL, and other frameworks;
  • In-memory: data can be cached in memory, which is especially suitable for workloads that iterate over the same data many times;
  • Hadoop integration: it can read and write data in HDFS directly, and it runs on top of YARN.

Spark is written in Scala, and its API makes good use of the language's features. Applications can also be written in Java and Python. This article uses Scala.

Installing Spark and SBT

  • Download a pre-built package from the official site and unpack it into a directory. Pay attention to the Hadoop version it was built against; for example, to read and write data on CDH4 HDFS, download the Pre-built for CDH4 package.
  • For convenience, add spark/bin to your $PATH environment variable (a quick check of the installation follows these setup steps):
export SPARK_HOME=/path/to/spark
export PATH=$PATH:$SPARK_HOME/bin
  • The examples also use SBT, a tool for compiling and packaging Scala projects. Installing it on Linux is straightforward:
    • Download sbt-launch.jar into the $HOME/bin directory;
    • Create $HOME/bin/sbt, set its permissions to 755, with the following content:
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
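Once spark/bin is on the PATH, a simple way to verify the installation is to start spark-shell and run a small job; the file name here is just an assumption:

// inside spark-shell, where sc is the pre-created SparkContext
val lines = sc.textFile("README.md")       // any local or HDFS file
lines.filter(_.contains("Spark")).count()  // count the lines that mention "Spark"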

Building sbt Projects in an Offline Environment

When using build tools such as sbt or Maven on a corporate network, we usually set up a shared Nexus mirror, for several reasons:

  • to avoid downloading the same dependencies repeatedly and save company bandwidth;
  • the domestic network environment is poor and downloads are slow;
  • IDC servers have no outbound Internet access;
  • to publish internal modules.

sbt's dependency management is based on Ivy. Although it can consume jars from the Maven central repository directly, there are still a few things to watch out for when configuring it; one of them is pointing sbt at the internal Nexus, as sketched below.
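A minimal sketch of such a resolver entry, assuming a hypothetical internal Nexus host:

// in build.sbt (or in the settings sequence of a .scala build definition)
resolvers += "Internal Nexus" at "http://nexus.example.com/content/groups/public/"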

Handling Problematic UTF-8 Characters in MySQL

In our ETL process we load data from Hive into MySQL: first the Hive command line saves the data as text files, then MySQL's LOAD DATA statement loads them. Recently one table started to fail with the following error when loaded into MySQL:

Incorrect string value: '\xF0\x9D\x8C\x86' for column ...

It turned out that this column holds users' chat messages, which contain emoji. In UTF-8 these symbols need 4 bytes, while MySQL's utf8 charset supports at most 3 bytes per character, so the data cannot be imported.

According to the UTF-8 encoding specification, 3 bytes cover the Unicode range U+0000–U+FFFF, so we can clean the data in Hive first:

SELECT REGEXP_REPLACE(content, '[^\\u0000-\\uFFFF]', '') FROM ...

This removes the characters that need more than 3 bytes, and the data then imports into MySQL successfully.

Below are some further details and references.

Installing Shark 0.9 on CDH 4.5

Spark is an emerging big data computing platform. One of its strengths is in-memory computation, which makes it particularly suitable for algorithms that iterate many times. At the same time, it integrates well with the existing Hadoop ecosystem: it can read files on HDFS directly and run on top of YARN. As for Hive, Spark has a corresponding replacement project, Shark, which works as a drop-in replacement built directly on an existing cluster. This article briefly describes how to set up a Shark 0.9 cluster on CDH 4.5.

Preparation

  • Installation method: Spark is installed from the Parcel provided by CDH and started in standalone mode
  • Software versions
  • Basic server configuration
    • A usable package repository (such as the USTC mirror)
    • Passwordless SSH login for the root account from the master node to the worker nodes.
    • Hard-code the IPs and hostnames in /etc/hosts, or make sure DNS resolves them both forward and reverse (a sample /etc/hosts follows this list).
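For example, a minimal /etc/hosts on every node might look like this; the addresses and hostnames are made up:

192.168.1.10  master
192.168.1.11  slave1
192.168.1.12  slave2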

Use WebJars in Scalatra Project

While working on my first Scalatra project, I naturally thought of using WebJars to manage JavaScript library dependencies, since it's more convenient and seems like good practice. Although there's no official support for the Scalatra framework, the installation process is not very complex. That doesn't mean I didn't spend much time on it, though: I'm still a newbie to Scala, and there are only a few materials on this subject.

Add the WebJars Dependency to the SBT Build File

Scalatra uses a .scala build configuration file instead of .sbt, so let's add the dependency to project/build.scala. Take Dojo as an example:

object DwExplorerBuild extends Build {
  ...
  lazy val project = Project (
    ...
    settings = Defaults.defaultSettings ++ ScalatraPlugin.scalatraWithJRebel ++ scalateSettings ++ Seq(
      ...
      libraryDependencies ++= Seq(
        ...
        "org.webjars" % "dojo" % "1.9.3"
      ),
      ...
    )
  )
}

To view this dependency in Eclipse, you need to run sbt eclipse again. In the Referenced Libraries section, you can see a dojo-1.9.3.jar, and the library lies in META-INF/resources/webjars/.
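The assets inside the jar still need to be served over HTTP. As a rough sketch, one option is a Scalatra wildcard route that streams the resource from the classpath; the route path and the fixed content type here are simplifications, not necessarily the approach you would settle on:

class WebJarsServlet extends org.scalatra.ScalatraServlet {
  get("/webjars/*") {
    val path = "/META-INF/resources/webjars/" + params("splat")
    Option(getClass.getResourceAsStream(path)) match {
      case Some(stream) =>
        contentType = "application/javascript"  // simplistic; a real handler should detect the MIME type
        stream                                  // Scalatra copies the InputStream into the response
      case None => halt(404)
    }
  }
}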