Home > Tech Trivia > Sort 10GB dataset over 1GB memory

Sort 10GB dataset over 1GB memory

This is a frequently encountered problem, the solution to which is based on the concept of External Sorting.

Here’s a summary of the process involved in doing this:

  1. Of course, given that you only have 1 GB of main memory, that’s what you can read at once at any given time. So, read data in chunks of 1 GB at a time and sort using on of the conventional sorting algorithms, most likely Quicksort.
  2. Write each sorted dataset to disk (of course, assuming that you have enough disk space :)).
  3. You will end up having 10 chunks of 1 GB (10GB/1GB = 10 chunks), individually sorted, datasets which now needs to be merged into one single output file (on disk).
  4. Now read the first 100 MB (= 1 GB /10 chunks) of each sorted chunk into an input buffer in main memory and allocate the remaining to an output buffer.
  5. Perform a 10-way merge and store the results in the output buffer. If, at any given time, the output buffer is full, write it to the final sorted file, and empty it. If any of the 10 input buffers gets empty, fill it with the next 100 MB of its associated 1 GB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally — because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.
<Inspired from Wikipedia>
Categories: Tech Trivia Tags: ,
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: