Find Largest Numbers
Let's say we have a long list of unsorted numbers (potentially millions), and we want to find the M largest numbers contained in it. Implement a function find_largest(input, m) that will find and return the largest m values given an input array or file.
If the input array is empty, return None (Python) or null.
Examples
Pythonfind_largest([1,5,4,2,3], 3) # => [3, 4, 5]
Example reading from a local file:
Python# Create a new file with line-separated numbers with open('numbers.txt', 'w') as file: for n in range(0, 1000): print(str(n), file=file) # Find the largest numbers in the file with open('numbers.txt') as file: find_largest(file, 2) # => [998, 999]
What data structures would be useful for keeping track of minimum and maximum values?
If our input is very big, will we be able to store the entire list in memory?
One option is to use a min-heap and update it as you iterate through the input.
Is your solution space-efficient? What happens if input is too long to fit in memory?
Before we try to solve the task at hand, let's think about a simpler version of this problem: finding the single largest number in a list. This would be equivalent to find_largest(input, 1). Finding the maximum value in a list is not difficult; all we have to do is keep track of the largest value we've seen so far.
Python# Note: This is just for illustration, you could also use a built-in `max` method.
def find_max(input):
max_num = float('-inf') # Initialize to smallest number
for i in input:
if i > max_num:
max_num = i
return max_num
Now, how can we extend this simple idea to arbitrary values of M? One option would be to create an array max_nums of size M and store the highest numbers we've found so far. Every time we see a new number, we check if it is greater than the smallest entry in max_nums and, if so, replace it. But now this requires us to check every entry in max_nums or keep sorting the array to ensure we're replacing the correct numbers.
Fortunately, there's a data structure that takes care of this for us: the heap! Remember, a heap (or priority queue) is a tree-like structure that automatically prioritizes itself as we add and remove values. A max-heap will always return the largest value it contains, and a min-heap will always return its smallest value.
We can achieve the same idea as our max_nums array with a min-heap of size M. Every time we see a new number, we check if it is greater than the smallest number in the heap. If so, we pop() the smaller number and push() the bigger one.
Here's our final solution:
from heapq import heappush, heappop
def find_largest(input, m):
if not input:
return None
max_nums = [float('-inf')]
for i in input:
if int(i) > max_nums[0]:
if len(max_nums) >= m:
heappop(max_nums)
heappush(max_nums, int(i))
return max_numsNote: In Python, the above function works on either an array of numbers or a file pointer! In the file scenario, the for...in loop will iterate over the lines of the file without loading the entire list into memory. We use the int() method to convert any string values into numbers, and we compare the current number with max_nums[0] because this is guaranteed to be the smallest item in the heap.
Why is this the optimal solution?
You may be thinking, 'Why not just sort the array and take the top M numbers, or use a max-heap instead of a min-heap?'
Good question! It's important to discuss alternatives like these during an interview. In this case, we know from the prompt that we could have millions of numbers, which means we have a problem: we may run out of space if we try to keep the entire list in memory. It's also not as time-efficient to sort the entire array either. If we only care about the top M elements, why waste time sorting the rest of the list?
Sorting an array or using a max-heap will both take O(n log n) time and O(n) space, whereas using a min-heap of size m will take O(n log m) time and only O(m) space—which means it will be more efficient as long as m < n. This may not seem like a critical distinction, but at a larger scale, this solution could save you a lot of time and money!
public static Integer[] findLargest(int[] input, int m) { if(input==null || input.length==0) return null; PriorityQueue<Integer> minHeap=new PriorityQueue<Integer>(); for(int i:input) { if(minHeap.size()<m) { minHeap.add(i); } else { Integer top=minHeap.peek(); if(i>(int)top){ minHeap.poll(); minHeap.add(i); } } } Integer[] res=minHeap.toArray(new Integer[0]); Arrays.sort(res); return res; }