# Sequence to Sequence

## Machine Translation

### Encoder and Decoder

- The encoder takes the full input sequence and produces an encoding.
- The decoder outputs one token per time step, taking the previous step's output as an input.

![[Pasted image 20210202222713.png]]

This type of model also works well for image captioning, where the encoder is a CNN and the decoder is an RNN.

### Beam Search vs Greedy Search

In machine translation, a sentence can be translated into several very similar-looking sentences with widely different meanings, so the way the output sentence is chosen matters.

#### Greedy Search

- At each time step, pick the single most likely word for that step. This can lead to a less optimal overall sentence, because a locally best word may rule out better continuations.

#### Beam Search

You set a "beam width". For example, with a beam width of 3, the search keeps the three most likely words at the first time step. At the next step it finds the most likely second word given each of those first words, then keeps the three best two-word combinations across all candidates. With a beam width of three, you effectively run three copies of the network.

##### Length Normalization

The score of a full sentence is the sum of the log probabilities of the predictions at each time step. Because each log probability is negative, this sum tends to favor short sentences, so it is normalized by the sentence length.

A larger beam width gives more accurate results at the expense of speed.

### BLEU Score

For a given French sentence, for example, there may be multiple acceptable English translations. BLEU stands for Bilingual Evaluation Understudy. The BLEU score is a way to compare a machine translation against one or more reference translations.

### Attention Model

When working on a long sequence, a human translator does not read the whole thing, memorize it, and then translate it all at once. Instead they read a section, translate it, and continue.

An attention model has two networks: the second network takes the outputs of the first network weighted by "attention weights", which determine how much weight to give to each time step's output.

The attention weights are computed with a softmax, so they are positive and sum to one.
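A minimal sketch of that softmax weighting, assuming the raw alignment scores `scores` come from a small scoring network (not shown here) that compares the current decoder state with each encoder output:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax: the weights are positive and sum to one."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def attention_context(encoder_outputs, scores):
    """Weight each encoder output by its attention weight and sum them.

    encoder_outputs: array of shape (T_x, hidden_dim)
    scores: alignment scores of shape (T_x,) for the current decoder step
    """
    alphas = softmax(scores)                               # attention weights
    context = (alphas[:, None] * encoder_outputs).sum(axis=0)
    return context, alphas

# Toy example: 4 encoder time steps with 3-dimensional hidden states.
encoder_outputs = np.random.randn(4, 3)
scores = np.array([0.1, 2.0, -1.0, 0.5])                   # hypothetical alignment scores
context, alphas = attention_context(encoder_outputs, scores)
print(alphas, alphas.sum())                                 # the weights add up to 1.0
```

The context vector is then what the second network consumes at that decoder time step.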
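Tying together the beam search and length-normalization ideas above, here is a minimal sketch. It assumes a hypothetical `next_log_probs(prefix)` function that wraps the decoder and returns log probabilities over the vocabulary for the next token; `alpha` softens the length penalty (a value around 0.7 is a common heuristic).

```python
import numpy as np

def beam_search(next_log_probs, beam_width=3, max_len=10, eos_id=0, alpha=0.7):
    """Beam search with length-normalized scoring (a sketch, not a full decoder)."""
    beams = [([], 0.0)]          # (token prefix, summed log probability)
    finished = []

    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            log_probs = next_log_probs(prefix)
            # Expand each beam with its `beam_width` most likely next tokens.
            for token in np.argsort(log_probs)[-beam_width:]:
                candidates.append((prefix + [int(token)], logp + float(log_probs[token])))
        # Keep only the `beam_width` best partial sentences overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_width]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, logp))
        if not beams:
            break

    # Length normalization: divide the summed log probability by len**alpha,
    # so longer sentences are not unfairly penalized for accumulating
    # more negative log-probability terms.
    pool = finished or beams
    return max(pool, key=lambda c: c[1] / (len(c[0]) ** alpha))
```

With `beam_width=1` this reduces to greedy search.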
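For the BLEU score above, here is a minimal sketch of the clipped n-gram precision and brevity penalty it is built on; the sentences and tokenization are just illustrative.

```python
from collections import Counter
import math

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as many
    times as it appears in any single reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    max_ref = Counter()
    for ref in references:
        ref_ngrams = Counter(tuple(ref[i:i + n])
                             for i in range(len(ref) - n + 1))
        for ng, count in ref_ngrams.items():
            max_ref[ng] = max(max_ref[ng], count)
    clipped = sum(min(count, max_ref[ng]) for ng, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    """Geometric mean of 1..max_n gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty compares the candidate length with the closest reference length.
    closest_ref = min(references, key=lambda r: abs(len(r) - len(candidate)))
    bp = 1.0 if len(candidate) > len(closest_ref) else math.exp(1 - len(closest_ref) / len(candidate))
    return bp * math.exp(log_avg)

candidate = "the cat sat on the mat".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
# Real BLEU typically uses up to 4-grams; 2 keeps this toy example non-zero.
print(round(bleu(candidate, references, max_n=2), 3))
```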