A common need when working with data, is to find the top 10 in a larger list. An easy solution is to use the built-in sort function to sort the entire list, and then just take the first 10 items. While this works, it feels wasteful. Other than knowing what is the tail, you’re not interested in the tail of the list at all.
January 2017, Bora M. Alper posted his take on this problem:
lazysort. Using python generators, he sorted the list using dropsort as needed, returning just the part of the list that was requested. While Bora’s idea is quite neat, sadly his benchmark showed that it couldn’t actually compete with using python’s built-in sort function. This is not very surprising, cpython’s
sorted function uses the state-of-the-art TimSort algorithm, and it is implemented in optimized C.
But his approach gave me inspiration. While I feel dropsort is sort of a gimmick, Bora might actually be on to something here. During my time as a game developer, I built an early-out version of quicksort, which returned as soon as the head of the list was sorted. We should be able to build a something similar using python generators.
A generator-based quicksort could look like this:
def partition(unsorted): sublist_length = len(unsorted) if sublist_length <= 1: return unsorted, () pivot = unsorted lhs =  rhs =  for idx in range(1, sublist_length): finger = unsorted[idx] if finger < pivot: lhs.append(finger) else: rhs.append(finger) lhs.append(pivot) return lhs, pivot, rhs def quicksort(unsorted): if len(unsorted) < 2: yield from unsorted else: lhs, rhs = partition(unsorted) yield from quicksort(lhs) del lhs # release memory used by lhs yield pivot yield from quicksort(rhs)
For the uninitiated: quicksort works by splitting lists in two parts: one list with all items smaller than some pivot value, one with all items larger than that pivot value, and then repeating that procedure on both lists. Splitting the lists is the task of the partition function.
When comparing my implementation to lazy sort and the built-in
sorted function, you can clearly see that while Bora’s
lazysort can’t compete, his idea is sound. (This benchmark was based on retrieving the top 10 from a list of integers).
The chart is dominated by
lazysort, obscuring the fact that partial quicksort is indeed faster than the built-in
Now that we know we can beat
sorted, let’s see why this is. A lazy sorting algorithm only has benefit when it can decide not to do some work. In quicksort’s case, not having to sort the right partition several times in a row is a huge benefit. But where’s the cut-off when sorting items?
If you just need the first few percent, lazy
quicksort can help out, but it has nothing on using
heapq, which breaks even at ~45%, and has a worst case (100% iteration) premium of only 200% over
sorted. Conclusion: while partial quicksort is an interesting gimmick, it has no added value over
I’m doubtful optimization of
quicksort will make it compete with
quicksort‘s best case (only retrieving one item) still requires performing
log(n) partitions; O(n log(log(n)) performance, and it’s worst case (retrieving all items) is still O(n log(n)). The
heapq based solution has a generalized performance of O(n + m log(n)), which is always better than
Currently support for
reverse parameters is lacking, but those are trivial to add.
Lazy Quicksort could fake an list-like interface. The current implementation only sorts the head, but the principle applies to any part of the sorted list. This is something you can’t achieve with heapq, which can only operate at the end of the lists. This could speed up computation of median values, and percentiles, although with my test data, it will probably only outperform
heapq for the 45-55 percentile, which is a really narrow niche.
All code used for blog-post is on github
In Internet discussions, showing user icons really helps to show who’s saying what. Humans need more time to read a users name than they need to recognize a picture. As the old saying goes: a picture is worth a thousand words - that’s quite an impressive compression ratio.
Many ...Read More
Celery is an asynchronous job queue. It is used to build distributed applications — application which run on (potentially) multiple hosts, but it is also useful if your application runs on a single host.
Celery distributes and schedules work by passing messages around through an AMQP broker such as RabbitMQ ...Read More
I’ve been asked to compare Python to Java quite a few times now, far more often than other combinations of languages. While most of these questions came from trolls, some of them were sincere. In one instance, business decisions were made based on my recommendation.
If anything ...Read More