HuggingFace — Tokenization

MaheswaraReddy
3 min read · Feb 8, 2023


As we know, a machine learning model cannot take raw strings as input. Text must first be tokenized and encoded as numerical vectors before it is passed to the model. Tokenization is the step that optimally splits the words of a corpus into smaller units.

There are three main methods of tokenization.

1). Character Tokenization

2). Word Tokenization

3). SubWord Tokenization

Character Tokenization:

A string is nothing but an array of characters, so any text can be converted into a list of characters using Python's list() function.
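A minimal sketch of this step (the sample sentence below is only an illustration, not the text used in the article's own example):

text = "Tokenizing text is a core task of NLP."

# Character tokenization: split the string into a list of single characters.
tokenized_text = list(text)
print(tokenized_text)
# ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ...]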

But the model expects numbers, so these characters need to be converted into numbers. This process is called "numericalization". The simplest way to do it is to encode each unique token with a unique integer.

In the output, each character is associated with a unique integer, and the character tokens are then converted into a list of integers. (The ranking is assigned to special characters first, then capital letters in alphabetical order, then lower-case letters in alphabetical order.)

[4, 6, 19, 16, 22, 20, 1, 0, 11, 6, 15, 7, 13, 21, 2, 0, 5, 10, 0, 18, 14, 6, 24, 13, 16, 11, 0, 20, 18, 13, 16, 0, 6, 16, 9, 0, 7, 19, 6, 8, 13, 16, 11, 0, 10, 17, 19, 0, 3, 20, 12, 23, 13, 16]

Let's understand the above list with an example.

The first letter in the text was M. After numericalization, the key M is associated with the value 4, so the first element of the final list is 4. The final list is simply the character tokenization with each character replaced by its value from the token dictionary.
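A minimal sketch of this numericalization step, using the sample sentence from before (so the integers differ from the article's list above):

text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)

# Build a vocabulary: each unique character gets a unique integer.
# sorted() ranks special characters first, then capital letters, then lower-case letters.
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}

# Numericalization: replace every character with its integer id.
input_ids = [token2idx[ch] for ch in tokenized_text]
print(input_ids)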

Character tokenization produces very long sequences, so it will take more time compared to the other tokenization methods.

Word Tokenization:

With word tokenization, the text is split into words and a number is assigned to each word. This reduces the complexity of the training process compared to character tokenization.

Punctuation is not handled separately by this simple word tokenization. After splitting, we still have to assign a value to each word, as in the sketch below.
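A minimal sketch of word tokenization using a plain whitespace split (note that punctuation stays attached to the words, e.g. "NLP."):

text = "Tokenizing text is a core task of NLP."

# Word tokenization: split on whitespace.
tokenized_text = text.split()
print(tokenized_text)
# ['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']

# Assign a unique integer to each word.
token2idx = {word: idx for idx, word in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[word] for word in tokenized_text]
print(input_ids)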

The problem with word tokenization is the size of the vocabulary. Suppose the input layer has 1000 words and the next hidden layer has 100 neurons; then there are 1000*100 = 100,000 weights between them. As the input size and the hidden layer size grow, the number of weights grows as well, and the network becomes more expensive. To avoid this, we can remove rare words and very common words, but then we may lose some information. To overcome this kind of situation, we can use subword tokenization.

SubWord Tokenization:

This combines the best aspects of character tokenization and word tokenization. Rare words are split into smaller units, which allows the model to deal with complex words and misspellings. On the other hand, frequent words are kept as unique entities, so the length of our inputs stays at a manageable size.

Several subword tokenization algorithms are available, but here we will discuss WordPiece, which is widely used in BERT and DistilBERT.
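A minimal sketch with the transformers library, assuming the distilbert-base-uncased checkpoint (the article does not name the exact checkpoint it used):

from transformers import AutoTokenizer

# Load the WordPiece tokenizer that was trained along with DistilBERT.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

text = "Tokenizing text is a core task of NLP."

# Subword tokens: rarer words are split into smaller pieces marked with "##".
print(tokenizer.tokenize(text))
# e.g. ['token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl', '##p', '.']

# Encoding returns a dictionary with input_ids and attention_mask.
encoded = tokenizer(text)
print(encoded)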

In the tokenizer's output dictionary, the subword tokens are mapped to integers under the key "input_ids".

The attention_mask specifies which tokens the model should consider during modeling.

The input_ids will be padded when we use certain models, because these models require text of a specific length. In that case, the attention_mask is also set to zero at the padded positions.
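A minimal sketch of padding a batch of sentences, continuing with the distilbert-base-uncased tokenizer assumed above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

batch = ["Tokenizing text is a core task of NLP.", "Short sentence."]

# padding=True pads every sequence to the length of the longest one in the batch.
encoded = tokenizer(batch, padding=True)

# Padded positions get input_id 0 ([PAD]) and attention_mask 0, so the model ignores them.
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids)
    print(mask)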


Thank you for reading, and please follow me to support.
