## Stop Words
Stop words are common words (such as "the", "is", or "and") that carry little meaning and can be removed from the text before training and testing. With some RNNs, removing stop words may not be a good idea, since the order and context of words carries information.
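As a minimal sketch, stop words can be filtered against a hand-picked list before tokenization (the list and helper function below are illustrative, not a standard API):

```python
# Illustrative, hand-picked stop word list (not exhaustive)
stop_words = {"a", "an", "the", "is", "are", "in", "on", "and", "of"}

def remove_stop_words(sentence):
    # Keep only the words that are not in the stop word list
    return " ".join(word for word in sentence.lower().split()
                    if word not in stop_words)

print(remove_stop_words("The dog is in the park"))  # "dog park"
```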
## Tokenizer
```python
tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', char_level=False, oov_token=None,
    document_count=0, **kwargs)
```
`Tokenizer` is a Keras utility class that builds a dictionary mapping each word in the given texts to an integer index.
The tokenizer has to be initialized before it is used. The arguments available during initialization include `num_words` and `oov_token`. The latter is the value used for "out of vocabulary" words, i.e. words that the neural network hasn't been trained on. Without the `oov_token`, such words are simply dropped. Here is an example of Tokenizer initialization:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
```
After initialization, the tokenizer needs to be fit with training data.
```python
tokenizer.fit_on_texts(texts)
```
After the tokenizer is fit, it is ready to convert texts into sequences of numbers:
```python
sequences = tokenizer.texts_to_sequences(texts)
```
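Putting these steps together, here is a small illustrative example (the sample sentences are made up, and the exact integer indices may differ depending on the Keras version):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["I love my dog", "I love my cat", "Do you love dogs?"]

tokenizer = Tokenizer(num_words=1000, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

print(tokenizer.word_index)
# e.g. {'<OOV>': 1, 'love': 2, 'i': 3, 'my': 4, 'dog': 5, 'cat': 6, 'do': 7, 'you': 8, 'dogs': 9}

sequences = tokenizer.texts_to_sequences(texts)
print(sequences)  # e.g. [[3, 2, 4, 5], [3, 2, 4, 6], [7, 8, 2, 9]]

# A word the tokenizer has never seen maps to the <OOV> index
print(tokenizer.texts_to_sequences(["I love my hamster"]))  # e.g. [[3, 2, 4, 1]]
```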
The sequences returned by `texts_to_sequences` can have varying lengths because the sentences themselves have varying lengths, so padding is an important next step.
## Padding
```python
tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=None, dtype='int32', padding='pre',
    truncating='pre', value=0.0)
```
Padding ensures all the sequences are of the same length. `maxlen` specifies the target length of each sequence: a sequence longer than `maxlen` is truncated, and a shorter one is padded with zeros (or with the value given in `value`). Both padding and truncation can be done at the start (`'pre'`) or the end (`'post'`) of the sequence.
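For example, a minimal sketch (the input sequences here are made up for illustration):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[3, 2, 4, 5], [3, 2, 4], [7, 8, 2, 9, 1, 6]]

# Pad/truncate every sequence to length 5, adding zeros at the end
padded = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')
print(padded)
# [[3 2 4 5 0]
#  [3 2 4 0 0]
#  [7 8 2 9 1]]
```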
## Embedding
```python
tf.keras.layers.Embedding(
input_dim, output_dim, embeddings_initializer='uniform',
embeddings_regularizer=None, activity_regularizer=None,
embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs)
```
An embedding is a vector of numbers that represents a word for the purpose of prediction. Words with similar meaning (or similar effect on the outcome) have similar embeddings. For example, the word "dog" might have an embedding similar to that of the word "canine".
The Embedding layer is typically the first layer in the deep neural network, and the embeddings are learned as part of training.
Parameters for `Embedding` include:
- `input_dim` - the number of words in the vocabulary
- `output_dim` - the dimension of the embedding (e.g. 16 or 32)
- `input_length` - the length of the input sequences, i.e. the `maxlen` used in the padding step
- `embeddings_regularizer` - TensorFlow regularizers can be used here and seem to help avoid overfitting
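As a minimal illustrative sketch, an `Embedding` layer is often followed by a pooling layer and a dense classifier; the layer sizes and the binary-classification head below are assumptions for illustration:

```python
import tensorflow as tf

vocab_size = 1000   # input_dim: matches num_words used by the tokenizer (assumed)
embedding_dim = 16  # output_dim
max_length = 5      # input_length: matches maxlen used in the padding step (assumed)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),   # average the embeddings across the sequence
    tf.keras.layers.Dense(1, activation='sigmoid')  # illustrative binary-classification head
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```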