This is my brief note from DeepLearning.AI’s NLP Specialization Course.
What is Autocorrect?
Autocorrect is an application that changes misspelled word into a correct word. When writing messages or drafting an email, you may have noticied that if we type any words that is misspelled, then that word automatically gets corrected with correct spelling and based on the context.
How does autocorrect work??
While typing the document we can see we get automatic correction in our document. The basic working of this automatic correction is:
- Identifying a misspelled word
- Find the strings
n
edit distance away - Filter the candidates
- Calculate word probabilities
Now let’s see each of the points in detail.
1. Identifying a misspelled word:
Let’s say we are writing a sentence This is a draft docment of the APIs
. Here we can see clearly see that the word docment
is misspelled. But, how do we know that this is a misspelled word? Well, we will have a dictionary containing all correct words and if we do not encounter given string in the dictionary, that string is obviously a misspelled word.
if word not in vocab:
= True # If the word is not in vocab, we identify it as a misspelled word. misspelled
While identifying a misspelled words, we are only looking at the vocab but not the context. Consider a sentence Hello deah
. Here, dear
is misspelled as deah
. If we write deer
instead of dear
, then we would not be able to identify misspelled word because deer
is present in vocab, though it is contextually incorrect.
2. Find strings n edit distance away
Edit is an operation that is performed on a string to change it.
- Types of edit: - Insert (add a letter)
‘to’: ‘top’, ‘two’
- Delete (remove a letter)‘hat’: ‘ha’, ‘at’, ‘ht’
- Switch (swap 2 adjacent letters)‘eta’: ‘eat’, ‘tea’
- Replace (change 1 letter to another)‘jaw’: ‘jar’, ‘paw’
Using these edits, we can find all possible strings that aren
edits away.
3. Filter candidates
After findings strings that are n edit distance away, next step is to filter those strings. After applying edits, the strings are compared with the vocab, and if those strings are not present in vocab, they are discarded. This, way we get a list of actual words.
4. Calculate the word probabilities
The final step is to calculate the word probabilities and find the most likely word from the vocab. Given the sentence I am learning AI because AI is the new electricity
, we find occurrence of each word and calculate the probability. Probability of given word w
can be calculated as the ratio of the count of word w
to the total size of the corpus. Mathematically: \(P(w) = \frac{C(w)}{V}\),
where:
- \(P(w) - Probability \ of\ a\ word\)
- \(C(w) - Number \ of \ times \ the \ word \ appears\)
- \(V - Total \ size \ of \ the \ corpus\)
Minimum Edit Distance
Till now, we have seen how edit distance works and what are its applications in NLP domain. Now let’s look at the minimum edit distance.
Minimum edit distance is the minimum number of edits required to transform one string to another. It is used in various applications such as spelling correction, machine translation, DNA sequencing, and many more. To calculate minimum edit distance, we use three types of operations which is also discussed above, i.e., insert, delete and replace.
For example: Consider source word deer
and target word door
. To change source to target, we need to perform two replace operations i.e., replace each of e
to o
. Here number of edits is 2, but in minimum distance distance, there are cost associated with different edit operations.
Edit Operations | Cost |
---|---|
Insert | 1 |
Delete | 1 |
Replace | 2 |
Using above table, the edit distance for the above problem is: Edit distance = 2 x replace cost
= 2x2
= 4
This is a brute force method where we are simply looking at the source and target word to calculate the edit distance. If we have large sentence, calculating edit distance with mentioned approach becomes tedious. To make things simpler, we can opt for tabular method using Dynamic Programming approaches.