How I'd build the next Google -or- How to exploit the fact that only the first 10 pages of a search result are really necessary, to do a distributed map-reduce that doesn't bottleneck on the difficult-to-parallelise 'reduce'. In a traditional reduce/merge operation, you'd combine the results from the disparate systems to create one massive dataset on one system. A common search term would produce a large number of results, which would be fine in the 'map' (parallel searching) step, but for the 'reduce' step (merging the results) it would cause quite a bottleneck. When i first heard of the map-reduce algorithm, i often wondered how search engines got around this problem. But i recently was learning about the merge algorithm, and was thinking about how it could be combined with the fact that most people never look past the first few pages of a search result, and it made sense. So this article contains my somewhat vague ramblings (another post will have real code!) about how i'd implement the next biggest search engine (Clarkson's voice) ... in the world. Lets say i've got a big dataset of things to search through. Split this dataset up into quarters: A, B, C and D. Lets have 4 servers, each responsible for searching through its own quarter. A user then comes along, and says they want to search for 'flux capacitor'. The first part of how this works is pretty straight forwards, and lovely to parallelise: each of the 4 servers searches through its own quarter, making a list of all pages that match this search. If we then got the 4 servers to send their results to one server to merge them together, we would quickly overrun the network in the data centre, so the next bit is where i think the 'magic' is. In hindsight, its kinda obvious to me now, so if you're smarter than me and think 'this has been done a million times before' please don't leave rude comments. Anyway, I digress, so lets start with an overview picture: [[posterous-content:tnddjlzGfdDtdghpgqDI]] Each of the 4 servers will then sort their results, and grab only the top 100 rows. It's as though each server did this: select top 100 * from my_quarter where contents like '%flux capacitor%' order by pagerank desc Servers B and D then send their results to A and C, respectively. Server A and C now have 100+100 rows each. They each perform a merge operation on these rows to get a sorted 200-row set, and only keep the first 100 rows of that. (A merge operation is used to combine 2 independently sorted sets into one sorted set.) Then server C sends its 100-row set to server A. Again, server A merges its own 100 rows with C's 100 rows to produce the final results. These results are then paginated, at 10 rows per page x 10 pages, to give the user his results. I think this would be immensely scalable, what are your thoughts?

Thanks for reading! And if you want to get in touch, I'd love to hear from you: chris.hulbert at gmail.