Autocorrect and Minimum Edit Distance

NLP
deep-learning
Published

October 25, 2021

This is my brief note from DeepLearning.AI’s NLP Specialization Course.

What is Autocorrect?

Autocorrect is an application that changes misspelled word into a correct word. When writing messages or drafting an email, you may have noticied that if we type any words that is misspelled, then that word automatically gets corrected with correct spelling and based on the context.

Fig. Autocorrect in action in google document

How does autocorrect work??

While typing the document we can see we get automatic correction in our document. The basic working of this automatic correction is:

  • Identifying a misspelled word
  • Find the strings n edit distance away
  • Filter the candidates
  • Calculate word probabilities

Now let’s see each of the points in detail.

1. Identifying a misspelled word:

Let’s say we are writing a sentence This is a draft docment of the APIs. Here we can see clearly see that the word docment is misspelled. But, how do we know that this is a misspelled word? Well, we will have a dictionary containing all correct words and if we do not encounter given string in the dictionary, that string is obviously a misspelled word.

if word not in vocab:
  misspelled = True       # If the word is not in vocab, we identify it as a misspelled word.

While identifying a misspelled words, we are only looking at the vocab but not the context. Consider a sentence Hello deah. Here, dear is misspelled as deah. If we write deer instead of dear, then we would not be able to identify misspelled word because deer is present in vocab, though it is contextually incorrect.

2. Find strings n edit distance away

Edit is an operation that is performed on a string to change it.

  • Types of edit: - Insert (add a letter) ‘to’: ‘top’, ‘two’ - Delete (remove a letter) ‘hat’: ‘ha’, ‘at’, ‘ht’ - Switch (swap 2 adjacent letters) ‘eta’: ‘eat’, ‘tea’ - Replace (change 1 letter to another) ‘jaw’: ‘jar’, ‘paw’ Using these edits, we can find all possible strings that are n edits away.

3. Filter candidates

After findings strings that are n edit distance away, next step is to filter those strings. After applying edits, the strings are compared with the vocab, and if those strings are not present in vocab, they are discarded. This, way we get a list of actual words.

4. Calculate the word probabilities

The final step is to calculate the word probabilities and find the most likely word from the vocab. Given the sentence I am learning AI because AI is the new electricity, we find occurrence of each word and calculate the probability. Probability of given word w can be calculated as the ratio of the count of word w to the total size of the corpus. Mathematically: \(P(w) = \frac{C(w)}{V}\),

where:

  • \(P(w) - Probability \ of\ a\ word\)
  • \(C(w) - Number \ of \ times \ the \ word \ appears\)
  • \(V - Total \ size \ of \ the \ corpus\)

Minimum Edit Distance

Till now, we have seen how edit distance works and what are its applications in NLP domain. Now let’s look at the minimum edit distance.

Minimum edit distance is the minimum number of edits required to transform one string to another. It is used in various applications such as spelling correction, machine translation, DNA sequencing, and many more. To calculate minimum edit distance, we use three types of operations which is also discussed above, i.e., insert, delete and replace.

For example: Consider source word deer and target word door. To change source to target, we need to perform two replace operations i.e., replace each of e to o. Here number of edits is 2, but in minimum distance distance, there are cost associated with different edit operations.

Edit Operations Cost
Insert 1
Delete 1
Replace 2

Using above table, the edit distance for the above problem is: Edit distance = 2 x replace cost = 2x2 = 4

This is a brute force method where we are simply looking at the source and target word to calculate the edit distance. If we have large sentence, calculating edit distance with mentioned approach becomes tedious. To make things simpler, we can opt for tabular method using Dynamic Programming approaches.