Why Use Lodash When ES6 Is Available

Lodash is a well-known JavaScript utility library that makes it easy to manipulate arrays, objects, functions, strings, and more. I myself enjoy its functional way of processing collections, especially chaining and lazy evaluation. But as the ECMAScript 2015 standard (ES6) becomes widely supported by major browsers, and Babel, the JavaScript compiler that transforms ES6 code to ES5, plays a major role in today’s frontend development, it seems that most Lodash utilities can be replaced by ES6. But should we? In my opinion, Lodash will remain popular, for it still has lots of useful features that can improve the way we program.

_.map and Array#map Are Different

_.map, _.reduce, _.filter and _.forEach are frequently used functions when processing collections, and ES6 provides direct support for them:

_.map([1, 2, 3], (i) => i + 1)
_.reduce([1, 2, 3], (sum, i) => sum + i, 0)
_.filter([1, 2, 3], (i) => i > 1)
_.forEach([1, 2, 3], (i) => { console.log(i) })

// becomes
[1, 2, 3].map((i) => i + 1)
[1, 2, 3].reduce((sum, i) => sum + i, 0)
[1, 2, 3].filter((i) => i > 1)
[1, 2, 3].forEach((i) => { console.log(i) })

But Lodash’s _.map is more powerful: it works on objects, supports iteratee / predicate shorthands, offers lazy evaluation when chained, guards against null parameters, and often has better performance.
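
Here is a minimal sketch of those extras (assuming Lodash 4.x; the users sample data is made up):

const users = [{ user: 'barney', age: 36 }, { user: 'fred', age: 40 }];

_.map({ a: 1, b: 2 }, (v) => v * 2)  // works on objects => [2, 4]
_.map(users, 'user')                 // iteratee shorthand => ['barney', 'fred']
_.filter(users, { age: 36 })         // predicate shorthand => [{ user: 'barney', age: 36 }]
_.map(null, (v) => v)                // guards against a null collection => []

// when chained, Lodash can evaluate lazily: for large arrays, map runs only
// as many times as take(2) needs
_([1, 2, 3, 4]).map((i) => i + 1).take(2).value()  // => [2, 3]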

Read More

Process Python Collections with Functional Programming

I develop Spark applications in Scala, which has a very powerful collection library where functional programming is certainly key. Java 8 also introduced lambda expressions and the Stream API. In JavaScript, the Lodash library provides powerful tools for processing arrays and objects. When my primary work language changed to Python, I wondered whether it was possible to manipulate collections in an FP way, and fortunately Python already provides the syntax and tools for functional programming. Though list comprehensions are the Pythonic way to deal with collections, the ideas and concepts of FP are definitely worth learning.

Wordcount Example

Let’s first write a snippet that counts word occurrences in a paragraph, in a functional way of course.

import re
import itertools


content = """
an apple orange the grape
banana an apple melon
an orange banana apple
"""

word_matches = re.finditer(r'\S+', content)
words = map(lambda m: m.group(0), word_matches)
fruits = filter(lambda s: len(s) > 3, words)
grouped_fruits = itertools.groupby(sorted(fruits))
fruit_counts = map(lambda t: (t[0], len(list(t[1]))), grouped_fruits)
print(list(fruit_counts))

Run this example and you’ll get a list of fruits, along with their counts:

[('apple', 3), ('banana', 2), ('grape', 1), ('melon', 1), ('orange', 2)]

This example covers most aspects of processing collections in an FP style. For instance, re.finditer returns an iterator that is lazily evaluated; map and filter are used to perform transformations; the itertools module provides various functions for working with iterables; and last but not least, the lambda expression offers an easy way to define an inline anonymous function. All of them will be described in the following sections.

Read More

Difference Between Lodash _.assign and _.assignIn

In Lodash, both _.assign and _.assignIn are ways to copy source objects’ properties into a target object. According to the documentation, _.assign processes own enumerable string keyed properties, while _.assignIn processes both own and inherited source properties. There are also other companion functions like _.forOwn and _.forIn, _.has and _.hasIn. So what’s the difference between them?

In brief, the In suffix in the latter methods refers to the way the for...in loop behaves: it iterates over all enumerable properties of the object itself and those the object inherits from its constructor’s prototype. JavaScript has an inheritance mechanism called the prototype chain. When iterating an object’s properties with for...in or _.forIn, all properties that appear on the object and along its prototype chain are processed, until the prototype resolves to null. Here’s the example code taken from Lodash’s documentation:

function Foo() { this.a = 1; }
Foo.prototype.b = 2;
function Bar() { this.c = 3; }
Bar.prototype.d = 4;
_.assign({a: 0}, new Foo, new Bar); // => {a: 1, c: 3}
_.assignIn({a: 0}, new Foo, new Bar); // => {a:1, b:2, c:3, d:4}
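
The companion functions mentioned above follow the same convention. A quick sketch, reusing Foo from the snippet:

_.forOwn(new Foo, (value, key) => { console.log(key) })  // logs 'a' only
_.forIn(new Foo, (value, key) => { console.log(key) })   // logs 'a' and 'b'
_.has(new Foo, 'b')    // => false, 'b' is inherited
_.hasIn(new Foo, 'b')  // => true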

Read More

Python 2 to 3 Quick Guide

A few years ago I was programming in Python 2.7, when 3.x was not yet an option because of its backward incompatibility and the lack of support from popular third-party libraries. But now it’s safe to say Python 3 is totally ready, and here’s a list of references for those (including me) who are adopting Python 3 with a 2.x background.

  1. All Strings Are Unicode
  2. print Becomes a Function
  3. Fewer Lists, More Views
  4. Integer Division Returns Float
  5. Comparison Operators Raise TypeError
  6. Set Literal Support
  7. New String Formatting
  8. Exception Handling
  9. Global Function Changes
  10. Renaming Modules and Relative Import

All Strings Are Unicode

When dealing with non-ASCII encodings in Python 2, there are str, unicode, u'...', s.encode(), etc. In Python 3, there are only text and binary data. The former is str, strings that are always represented in Unicode; the latter is bytes, which is just a sequence of byte numbers.

  • Conversion between str and bytes:
# str to bytes
'str'.encode('UTF-8')
bytes('str', encoding='UTF-8')

# bytes to str
b'bytes'.decode('UTF-8')
str(b'bytes', encoding='UTF-8')
  • basestring is removed; use str for type checks: isinstance(s, str)
  • bytes is immutable; the corresponding mutable type is bytearray.
  • The default source file encoding is UTF-8 now.

Read More

View Spark Source in Eclipse

Reading source code is a great way to learn open-source projects. I used to read Java projects’ source code on GrepCode, for it is online and has very nice cross-reference features. As for Scala projects such as Apache Spark, though the source code can be found on GitHub, it’s quite necessary to set up an IDE to view the code more efficiently. Here’s a how-to for viewing Spark source code in Eclipse.

Install Eclipse and Scala IDE Plugin

One can download Eclipse from here. I recommend the “Eclipse IDE for Java EE Developers” package, which contains a lot of commonly used features.

Then go to Scala IDE’s official site and install the plugin through the update site or a zip archive.

Generate Project File with Maven

Spark is mainly built with Maven, so make sure you have Maven installed on your box, and download the latest Spark source code from here, unarchive it, and execute the following command:

$ mvn -am -pl core dependency:resolve eclipse:eclipse

Read More

Spark Streaming Logging Configuration

Spark Streaming applications tend to run forever, so their log files should be properly handled to avoid filling up server hard drives. This article gives some practical advice on dealing with these log files, for both Spark on YARN and standalone mode.

Log4j’s RollingFileAppender

Spark uses log4j as its logging facility. The default configuration writes all logs to standard error, which is fine for batch jobs. But for streaming jobs, we’d better use a rolling file appender to cut log files by size and keep only the several most recent ones. Here’s an example:

log4j.rootLogger=INFO, rolling

log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/${dm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8

log4j.logger.org.apache.spark=WARN
log4j.logger.org.eclipse.jetty=WARN

log4j.logger.com.shzhangji.dm=${dm.logging.level}

This tells log4j to roll the log file when it reaches 50 MB and to keep only the 5 most recent files. The files are saved in the /var/log/spark directory, with the filename taken from the system property dm.logging.name. We also set the logging level of our package com.shzhangji.dm according to the dm.logging.level property. Another thing to mention is that we set org.apache.spark to the WARN level, so as to ignore the verbose logs from Spark itself.

Read More

ElasticSearch Performance Tips

Recently we have been using ElasticSearch as a data backend of our recommendation API, to serve both offline and online computed data to users. Thanks to ElasticSearch’s rich out-of-the-box functionality, it doesn’t take much trouble to set up the cluster. However, we still encountered some misuse and unwise configurations. So here’s a list of ElasticSearch performance tips that we learned from practice.

Tip 1 Set Num-of-shards to Num-of-nodes

Sharding is the foundation of ElasticSearch’s distribution capability. Every index is split into several shards (5 by default) that are distributed across cluster nodes. But this capability does not come for free. Since the data being queried resides in all shards (this behaviour can be changed by routing), ElasticSearch has to run the query on every shard, fetch the results, and merge them, like a map-reduce process. So if there are too many shards, more than the number of cluster nodes, the query will be executed more than once on the same node, and the merge phase will also be impacted. On the other hand, too few shards will also reduce performance, for not all nodes are being utilized.

Shards come in two roles: primary and replica. A replica shard serves as a backup of its primary shard. When the primary goes down, the replica takes over its job. Replicas also help improve search and get performance, since these requests can be executed on either a primary or a replica shard.

Shards can be visualized with the elasticsearch-head plugin:

The cu_docs index has two shards, 0 and 1, with number_of_replicas set to 1. Primary shard 0 (bold bordered) resides on server Leon, and its replica on Pris. They are green because all primary shards have enough replicas sitting on different servers, so the cluster is healthy.

Since number_of_shards of an index cannot be changed after creation (while number_of_replicas can), one should choose this config wisely. Here are some suggestions:

  1. How many nodes do you have, now and in the future? If you’re sure you’ll only have 3 nodes, set the number of shards to 2 and replicas to 1, so there’ll be 4 shards across 3 nodes. If you’ll add some servers in the future, you can set the number of shards to 3, so when the cluster grows to 5 nodes, there’ll be 6 distributed shards.
  2. How big is your index? If it’s small, one shard with one replica will do.
  3. What are the read and write frequencies, respectively? If it’s search heavy, set up more replicas.
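
For instance, creating the 2-shard, 1-replica index suggested above might look like the following rough sketch, which calls the index-creation REST API from Node.js (the node URL is a placeholder, and cu_docs is just the example index from the screenshot):

// requires Node 18+ (or a browser) for the built-in fetch
const indexConfig = {
  settings: {
    number_of_shards: 2,    // fixed at creation time; 2 primaries + 1 replica each = 4 shards
    number_of_replicas: 1,  // can be raised later, e.g. for search-heavy workloads
  },
};

fetch('http://localhost:9200/cu_docs', {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(indexConfig),
}).then((res) => res.json()).then(console.log);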

Read More

Use WebJars in Scalatra Project

As I was working on my first Scalatra project, I naturally thought of using WebJars to manage JavaScript library dependencies, since it’s more convenient and seems like good practice. Though there’s no official support for the Scalatra framework, the installation process is not very complex. But that doesn’t mean I didn’t spend much time on it: I’m still a newbie to Scala, and there are only a few materials on this subject.

Add WebJars Dependency in SBT Build File

Scalatra uses a .scala build configuration file instead of .sbt, so let’s add the dependency to project/build.scala. Take Dojo for example:

object DwExplorerBuild extends Build {
  ...
  lazy val project = Project (
    ...
    settings = Defaults.defaultSettings ++ ScalatraPlugin.scalatraWithJRebel ++ scalateSettings ++ Seq(
      ...
      libraryDependencies ++= Seq(
        ...
        "org.webjars" % "dojo" % "1.9.3"
      ),
      ...
    )
  )
}

To view this dependency in Eclipse, you need to run sbt eclipse again. In the Referenced Libraries section you can then see dojo-1.9.3.jar, and the library files live under META-INF/resources/webjars/.

Read More

Generate Auto-increment Id in Map-reduce Job

In the DBMS world, it’s easy to generate a unique, auto-increment id, using MySQL’s AUTO_INCREMENT attribute on a primary key or MongoDB’s counters collection pattern. But when it comes to a distributed, parallel processing framework like Hadoop Map-reduce, it is not that straightforward. The best solution for identifying every record in such a framework is to use a UUID. But when an integer id is required, it takes some extra steps.

Solution A: Single Reducer

This is the most obvious and simplest one: just use the following code to set the number of reducers to 1:

job.setNumReduceTasks(1);

And just as obviously, there are several demerits:

  1. All mapper output will be copied to one task tracker.
  2. Only one process works on shuffle & sort.
  3. When producing output, there’s also only one process.

This is not a problem for small data sets, or at least for small mapper outputs. It is also the approach that Pig and Hive take when they need to perform a total sort. But past a certain threshold, the copy and sort phases become very slow and unacceptable.

Read More

Manage Leiningen Project Configuration

In Maven projects, we tend to use .properties files to store various configurations, and use Maven profiles to switch between development and production environments, like the following example:

# database.properties
mydb.jdbcUrl=${mydb.jdbcUrl}

<!-- pom.xml -->
<profiles>
  <profile>
    <id>development</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <mydb.jdbcUrl>jdbc:mysql://127.0.0.1:3306/mydb</mydb.jdbcUrl>
    </properties>
  </profile>
  <profile>
    <id>production</id>
    <!-- This profile could be moved to ~/.m2/settings.xml to increase security. -->
    <properties>
      <mydb.jdbcUrl>jdbc:mysql://10.0.2.15:3306/mydb</mydb.jdbcUrl>
    </properties>
  </profile>
</profiles>

As for Leiningen projects, there’s no variable substitution in the profile facility, and although in profiles we could use :resources to package production-specific files into the jar, these files actually replace the original ones instead of being merged with them. One solution is to strictly separate environment-specific configs from the others, so that the replacement works fine. But here I take another approach: manually load files from different locations, and then merge them.

Read More