Most Common Words
We want to find the most frequently used words in a long string of text. Write a function most_common_words(text) that returns an array containing words with their frequencies, sorted from most common to least common, with ties broken by alphabetic order.
For greater accuracy, your solution should ignore punctuation and capitalization.
Pythondef most_common_words(text):
# Your code here
Examples
For example, given the string
Pythontext = 'It was the best of times, it was the worst of times.'
then most_common_words(text) should return
[
('it', 2),
('of', 2),
('the', 2),
('times', 2),
('was', 2),
('best', 1),
('worst', 1)
]You can use a built-in function like split to convert a string into an array of words.
Which data structure will make it easiest to map words to their frequencies?
You can put the word frequencies in a hash table, and then return the entries sorted by their count.
Does your solution correctly account for punctuation and capitalization? ('It' vs. 'it.')
Our solution processes the input text by making it lowercase, removing punctuation, splitting the words by whitespace, and then building a frequency dictionary. Finally, it returns the sorted entries, first sorting by alphabetical order and then by frequency (Note: this assumes the use of a stable sorting algorithm, which Python and JavaScript use by default).
We provide two potential implementations. The first uses the language's built-in regular expression library and string splitting, making it more readable and easier to implement. The second solution is more performant as it sanitizes and splits the input text in a single pass. Both implementations have similar time complexity because the bottleneck is the sorting step, which is O(n log n).
Solution 1: Using Regular Expressions and Built-in Split Method
import re
def most_common_words(text):
# Sanitize the text by removing punctuation and converting to lowercase
punctuation_re = r'[.,;!\"\'\(\)]'
sanitized_text = re.sub(punctuation_re, ' ', text).lower()
word_array = sanitized_text.split()
word_freq = {}
# Increment the count each time a word occurs
for word in word_array:
word_freq[word] = word_freq.get(word, 0) + 1
# Sort the words alphabetically and then by their frequency
words = sorted(word_freq.items())
words = sorted(words, reverse=True, key=lambda x: x[1])
return wordsSolution 2: Sanitizing and Splitting in One Pass
def most_common_words(text):
word_freq = {}
start_idx = 0
for i in range(len(text)):
# Iterate until we hit a word barrier
if text[i] in ' .,;!\"\'-()':
if i > start_idx:
word = text[start_idx:i].lower()
word_freq[word] = word_freq.get(word, 0) + 1
start_idx = i + 1
# Handle the last word if there's no punctuation at the end
if start_idx < len(text):
word = text[start_idx:].lower()
word_freq[word] = word_freq.get(word, 0) + 1
# Sort alphabetically and then by frequency
words = sorted(word_freq.items())
words = sorted(words, reverse=True, key=lambda x: x[1])
return wordsTime Complexity: Both solutions have a time complexity dominated by the sorting step, which is O(n log n), where n is the number of unique words.
Space Complexity: Both solutions use O(n) space for the frequency dictionary and the sorted list of words, where n is the number of unique words in the text.
import java.util.*; public class MostCommonWords { public static String[][] mostCommonWords(String text) { // your code goes here Map<String, Integer> map = new HashMap<>(); for(String s : text.replaceAll("[\\p{Punct}]", "").toLowerCase().split(" ")) { if(!s.isEmpty()) { map.merge(s, 1, Integer::sum); } } return map.entrySet().stream().sorted( (e1, e2) -> { int comp = Integer.compare(e2.getValue(), e1.getValue()); return comp == 0 ? e1.getKey().compareTo(e2.getKey()) : comp; } ).map( e -> new String[]{e.getKey(), String.valueOf(e.getValue())} ).toArray(String[][]::new); } public static void main(String[] args) { // debug your code below String text = "It was the best of times, it was the worst of times."; String[][] result = mostCommonWords(text); System.out.println("Most Common Words:"); for (String[] pair : result) { System.out.println(pair[0] + ": " + pair[1]); } } }