# Tokenizer

Tokenizer converts words into a sequence of integers (tokens). Tokens are assigned based on how frequently each word appears in the corpus.

```python
tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=' ',
    char_level=False,
    oov_token=None,
    document_count=0,
    **kwargs
)
```

There are three steps to using the Tokenizer class:

1. First, initialize it with appropriate inputs.
2. Second, fit it on the corpus you are training on using `fit_on_texts`.
3. Third, convert the text to sequences for use in the neural network using `texts_to_sequences`.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# `sentences` is a list of strings making up the training corpus
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)
```

Tokenizer also strips punctuation and converts text to lowercase, and it builds a dictionary mapping each word to an index. To turn text into sequences of those indices, call `tokenizer.texts_to_sequences(sentences)`. It is usually a good idea to give a token for out-of-vocabulary words; otherwise they will simply be ignored.

Example output:

```python
tokenizer.word_index
{'my': 1, 'i': 2, 'love': 3, 'dog': 4, 'cat': 5, 'racoon': 6, 'do': 7, 'you': 8, 'think': 9, 'is': 10, 'amazing': 11}
```
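As a minimal sketch of all three steps together (the `sentences` corpus below is illustrative, not from the original notes): after fitting, calling `texts_to_sequences` on text that contains unseen words shows how the `<OOV>` token stands in for out-of-vocabulary words instead of dropping them.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative corpus, assumed for this example
sentences = [
    'I love my dog',
    'I love my cat',
    'Do you think my dog is amazing?'
]

# Step 1: initialize with a vocabulary cap and an out-of-vocabulary token
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

# Step 2: fit on the corpus to build the word index
tokenizer.fit_on_texts(sentences)

# Step 3: convert text to sequences of word indices
print(tokenizer.texts_to_sequences(sentences))

# Unseen words ('really', 'manatee') map to the <OOV> index rather than being dropped
print(tokenizer.texts_to_sequences(['I really love my manatee']))
```

Without `oov_token`, the unseen words in the last line would be omitted from the output sequence entirely, so the sequence would be shorter than the sentence.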