In the DBMS world, it's easy to generate a unique, auto-incrementing id: MySQL offers the AUTO_INCREMENT attribute on a primary key, and MongoDB has the Counters Collection pattern. But in a distributed, parallel processing framework like Hadoop MapReduce, it is not that straightforward. The simplest way to identify every record in such a framework is a UUID, but when an integer id is required, it takes a few more steps.
Solution A: Single Reducer
This is the most obvious and simple approach: configure the job to run with a single reducer, so that one process sees every record and can assign sequential ids.
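A minimal sketch of that configuration using the standard Hadoop `Job` API (the class name, job name, and omitted mapper/reducer wiring are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AssignIds {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "assign-ids"); // job name is illustrative
        job.setJarByClass(AssignIds.class);
        // The key line: force all mapper output through a single reducer
        // so one process sees every record and can assign sequential ids.
        job.setNumReduceTasks(1);
        // ... set mapper, reducer, input/output formats and paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```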
Equally obvious are its drawbacks:
- All mapper output is copied to a single task tracker.
- Only one process performs the shuffle & sort.
- Only one process writes the final output.
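The id assignment itself is trivial once everything flows through one process: the core of such a reducer is just an in-process counter. A self-contained plain-Java sketch of that logic (Hadoop types and job boilerplate omitted; the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SingleReducerIds {
    // Mimics the body of a single reducer: every record that arrives
    // gets the next value of one in-process counter.
    static List<String> assignIds(List<String> records) {
        List<String> out = new ArrayList<>();
        long nextId = 1; // safe only because exactly one "reducer" exists
        for (String record : records) {
            out.add(nextId++ + "\t" + record);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(assignIds(Arrays.asList("alice", "bob", "carol")));
    }
}
```

With two or more reducers this counter would no longer be unique across tasks, which is exactly why the single-reducer constraint exists.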
None of this is a problem for small data sets, or at least small mapper outputs, and it is the same approach Pig and Hive take when they need to perform a total sort. But past a certain data volume, the copy and sort phases become unacceptably slow.