Generate Auto-increment Id in Map-reduce Job

In the DBMS world, it is easy to generate a unique, auto-increment id: MySQL offers the AUTO_INCREMENT attribute on a primary key, and MongoDB has the Counters Collection pattern. But in a distributed, parallel processing framework like Hadoop Map-reduce, it is not that straightforward. The best solution for identifying every record in such a framework is to use a UUID. But when an integer id is required, it takes a few more steps.
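For reference, here is a minimal sketch of tagging each record with a UUID inside a mapper; the class name and input/output types are illustrative, not from the original post:

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that keys every record by a random UUID.
public class UuidMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // UUIDs are generated independently on each mapper,
        // so no coordination between tasks is needed.
        context.write(new Text(UUID.randomUUID().toString()), value);
    }
}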

Solution A: Single Reducer

This is the most obvious and simple one: just use the following code to set the number of reducers to 1:

job.setNumReduceTasks(1);
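Since every record then flows through that single reducer, a plain in-memory counter is enough to hand out sequential ids. A minimal sketch, with illustrative class and type choices (not from the original post):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer that assigns an incrementing id to every record.
// This is only correct because there is exactly one reducer instance.
public class AutoIncrementReducer
        extends Reducer<Text, Text, LongWritable, Text> {
    private long counter = 0;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(new LongWritable(++counter), value);
        }
    }
}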

Equally obvious are the demerits:

  1. All mapper output will be copied to one task tracker.
  2. Only one process works on the shuffle & sort.
  3. When producing the output, there is also only one process.

This is not a problem for small data sets, or at least for small mapper outputs, and it is also the approach Pig and Hive take when they need to perform a total sort. But past a certain threshold, the copy and sort phases become unacceptably slow.

Read More

Manage Leiningen Project Configuration

In Maven projects, we tend to use .properties files to store various configurations, and Maven profiles to switch between development and production environments, like the following example:

# database.properties
mydb.jdbcUrl=${mydb.jdbcUrl}
<!-- pom.xml -->
<profiles>
  <profile>
    <id>development</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <mydb.jdbcUrl>jdbc:mysql://127.0.0.1:3306/mydb</mydb.jdbcUrl>
    </properties>
  </profile>
  <profile>
    <id>production</id>
    <!-- This profile could be moved to ~/.m2/settings.xml to increase security. -->
    <properties>
      <mydb.jdbcUrl>jdbc:mysql://10.0.2.15:3306/mydb</mydb.jdbcUrl>
    </properties>
  </profile>
</profiles>
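For the ${mydb.jdbcUrl} placeholder in database.properties to actually be replaced at build time, Maven resource filtering has to be enabled, roughly like this (the resource directory shown is the Maven default):

<!-- pom.xml -->
<build>
  <resources>
    <resource>
      <directory>src/main/resources</directory>
      <!-- Enables ${...} substitution in resource files. -->
      <filtering>true</filtering>
    </resource>
  </resources>
</build>

Switching environments is then just a matter of running mvn package -P production.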

As for Leiningen projects, the profile facility has no variable substitution, and although a profile's :resources entry can pack production-specific files into the jar, those files replace the original ones instead of being merged with them. One solution is to strictly separate environment-specific configs from the rest, so that the replacement is harmless. But here I take another approach: manually load files from different locations, and then merge them, as sketched below.
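A minimal sketch of that approach, assuming edn config files; the namespace, file names, and shallow-merge strategy are illustrative, not from the original post:

(ns myapp.config
  (:require [clojure.edn :as edn]
            [clojure.java.io :as io]))

;; Read an edn config file if it exists, else return an empty map.
(defn- load-config [path]
  (let [f (io/file path)]
    (if (.exists f)
      (edn/read-string (slurp f))
      {})))

;; Later files win: defaults first, then environment-specific overrides.
(def config
  (merge (load-config "config/default.edn")
         (load-config "config/production.edn")))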

Read More