NLP Tutorial

Part 1: Introduction

Supervised Learning Paradigm

Supervised Learning: Learning in which the ground truth for the targets is available for the training observations.

Observations: Items we want to make predictions from. Denoted with x; also called the input.

Targets: Labels that correspond to an observation. Denoted with y; also called the ground truth.

Model: A function that takes observations and predicts targets.

Parameters: The weights that parameterize the model. Denoted with w.

Predictions: The model's estimates of the targets for given observations. Denoted with ŷ ("y hat").

Loss Function: A function that measures how far the predictions are from the ground truth and returns a real scalar value, the loss. Denoted with L.

Figure: The supervised learning paradigm.

For an input x, the model F produces predictions ŷ = F(x, w), and the loss function computes L(y, ŷ).

Gradient descent is commonly used to minimize the loss. The parameters w are iteratively updated using gradients computed over all of the training data until L falls below a certain threshold.

An alternative is Stochastic Gradient Descent (SGD), which updates the parameters using a single random data point or a random "minibatch" of the data instead of the full dataset.

Iteratively updating the parameters of the model this way is known as backpropagation: the forward step computes the loss, and the backward step computes the gradients used to update the parameters.
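Below is a minimal sketch of this forward/backward loop written with PyTorch (introduced later in this part); the toy data, linear model, and learning rate are purely illustrative and not part of the tutorial.

import torch

# Toy data: 100 observations x with targets y = 3x + 2 plus a little noise
x = torch.randn(100, 1)
y = 3 * x + 2 + 0.1 * torch.randn(100, 1)

model = torch.nn.Linear(1, 1)                            # the model F(x, w)
loss_fn = torch.nn.MSELoss()                             # the loss L(y, ŷ)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    y_hat = model(x)              # forward step: compute predictions
    loss = loss_fn(y_hat, y)      # compute the loss
    optimizer.zero_grad()         # clear gradients from the previous step
    loss.backward()               # backward step: compute gradients
    optimizer.step()              # update the parameters w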

Observation and Target Encoding

Figure: Encoding observations and targets.

Text can be represented as a numerical vector; the simplest approaches are count-based representations built on heuristics.

One-Hot Representation

Starts with a zero vector and sets the entry at a word's index to 1 if that word is present in the text.

Ex:
This is the best day ever.
This is the best laptop here.

Vocabulary: {‘this’, ‘is’, ‘the’, ‘best’, ‘day’, ‘laptop’, ‘ever’, ‘here’}

Binary encoding for “this is the best” is [1,1,1,1,0,0,0,0]

         this  is  the  best  day  laptop  ever  here
this       1    0    0     0    0       0     0     0
is         0    1    0     0    0       0     0     0
the        0    0    1     0    0       0     0     0
best       0    0    0     1    0       0     0     0
day        0    0    0     0    1       0     0     0
laptop     0    0    0     0    0       1     0     0
ever       0    0    0     0    0       0     1     0
here       0    0    0     0    0       0     0     1

One-hot representation for encoding the sentences “This is the best day ever” and “This is the best laptop here”
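A hand-rolled sketch of this encoding in plain Python (illustrative only; it assumes lowercased, whitespace-separated tokens and the vocabulary order listed above):

vocab = ['this', 'is', 'the', 'best', 'day', 'laptop', 'ever', 'here']

def one_hot(word):
    vector = [0] * len(vocab)          # start with a zero vector
    vector[vocab.index(word)] = 1      # set the word's index to 1
    return vector

print(one_hot('laptop'))  # [0, 0, 0, 0, 0, 1, 0, 0]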

TF Representation

TF Representation is the sum of one-hot representations of each word in the sentence. The sentence “This here is the best laptop this day” using the previous one-hot encoding would give the encoding [2, 1, 1, 1, 1, 1, 0, 1].

from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

corpus = ["This is the best day ever", "This is the best best laptop here"]

one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
vocab = one_hot_vectorizer.get_feature_names_out()  # column labels in the vectorizer's alphabetical order
sns.heatmap(one_hot, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])


Collapsed one-hot representation with the corpus “This is the best day ever” and “This is the best best laptop here”.
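For comparison, dropping binary=True gives the term-frequency (count) representation described above. This is an illustrative sketch over the same corpus; get_feature_names_out assumes a reasonably recent scikit-learn release.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the best day ever", "This is the best best laptop here"]

count_vectorizer = CountVectorizer()   # no binary=True, so raw counts (TF)
tf = count_vectorizer.fit_transform(corpus).toarray()
print(count_vectorizer.get_feature_names_out())
# ['best' 'day' 'ever' 'here' 'is' 'laptop' 'the' 'this']
print(tf)
# [[1 1 1 0 1 0 1 1]
#  [2 0 0 1 1 1 1 1]]   <- 'best' appears twice in the second sentence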

TF-IDF Representation

The TF-IDF representation weights each word by how informative it is across the corpus: words that appear in many documents get lower weight, while rarer words get higher weight.

The Inverse Document Frequency (IDF) of a word is the log of the total number of documents divided by the number of documents containing that word: IDF(w) = log(N / n_w). The TF-IDF score is the product TF(w) × IDF(w).
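As a quick illustrative check of that formula on the toy corpus (note that scikit-learn's TfidfVectorizer, used below, applies a smoothed IDF and L2 normalization by default, so its numbers differ slightly):

import math

corpus = ["This is the best day ever", "This is the best best laptop here"]
documents = [doc.lower().split() for doc in corpus]
N = len(documents)                                   # total number of documents

def idf(word):
    n_word = sum(word in doc for doc in documents)   # documents containing the word
    return math.log(N / n_word)

print(idf("best"))    # in both documents: log(2/2) = 0.0
print(idf("laptop"))  # in one document:   log(2/1) ≈ 0.69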

from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus).toarray()
vocab = tfidf_vectorizer.get_feature_names_out()  # same alphabetical column order as above
sns.heatmap(tfidf, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])

Figure: TF-IDF representation of the corpus.

Target Encoding

Most NLP tasks use categorical labels drawn from a fixed set, but this becomes problematic as the number of output labels grows, as in the language modeling problem of predicting the next word: the label space then includes everything from special characters to names, so a better solution is needed.

Other NLP problems use numerical labels; these are often handled by grouping the values into "bins" or ranges (e.g., 0-20, 20-40, ..., 180-200) and treating each bin as a categorical label.
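A minimal sketch of such binning with NumPy; the values and bin edges here are illustrative.

import numpy as np

values = np.array([5, 37, 142, 199])      # numerical labels in the range 0-200
bin_edges = np.arange(20, 200, 20)        # 20, 40, ..., 180
bin_ids = np.digitize(values, bin_edges)  # categorical bin index for each value
print(bin_ids)                            # [0 1 7 9]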

Computational Graphs

A computational graph is an abstraction that models mathematical expressions. For our purposes, computational graphs help us see what differentiation is required to obtain the parameter gradients.

For a linear model, the equation used would be

y = wx + b

which can be simplified into

z = wx   and   y = z + b

A directed acyclic graph (DAG) of the aforementioned equations would look like so:

Figure: Directed acyclic graph (DAG) of the linear model.
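A minimal sketch of this graph built with PyTorch (introduced in the next section); the scalar values are illustrative, and the exact grad_fn names can vary between PyTorch versions.

import torch

w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)

z = w * x      # first node of the graph
y = z + b      # second node of the graph
print(z)       # tensor(6., grad_fn=<MulBackward0>)
print(y)       # tensor(7., grad_fn=<AddBackward0>)

y.backward()   # differentiate back through the graph
print(w.grad)  # dy/dw = x = 3.0
print(b.grad)  # dy/db = 1.0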

PyTorch Basics

We will be using PyTorch for this course. PyTorch is a tensor manipulation library that offers a multitude of deep learning packages.

A tensor is a mathematical data structure that holds data in some number of dimensions. A rank 0 tensor is a single scalar, a rank 1 tensor is a vector, a rank 2 tensor is a matrix, and in general a rank N tensor is an N-dimensional array of scalars.

Installing PyTorch

Download using either Pip or Conda

Use this command for installation using conda

conda install pytorch torchvision -c pytorch

Creating Tensors

Create a helper function display(x) so that we can see important attributes of the tensor easily.

import torch

def display(x):
    print("Tensor Type: {}".format(x.type()))
    print("Tensor Size: {}".format(x.shape))
    print("Values: \n{}".format(x))
    

display(torch.Tensor(3, 4))
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([3, 4])
Values: 
tensor([[ 9.4774e-38,  2.2960e-38, -1.7292e-23,  4.5869e-41],
        [ 0.0000e+00,  1.0842e-19, -2.4786e+15, -1.0845e-19],
        [-1.7267e-23,  4.5869e-41, -1.5941e-24,  4.5869e-41]])

There are also a number of ways to randomly initialize a tensor's values, either uniformly at random or from a standard normal distribution.

display(torch.rand(2, 3))  # uniformly random
display(torch.randn(2, 3)) # random normal distribution
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[0.9510, 0.5519, 0.4149],
        [0.5744, 0.3234, 0.8496]])
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[ 0.9831,  0.1202, -0.3810],
        [-0.6465,  0.1269, -0.5633]])

Tensors can also be filled with a single value. torch.zeros and torch.ones create tensors of the given dimensions filled with 0s and 1s respectively, and the in-place method fill_() fills an existing tensor with a given value.

filledTensor = torch.zeros(2, 3) #Fills tensor with 0s
display(filledTensor)

filledTensor = torch.ones(2, 3) #Fills tensor with 1s
display(filledTensor)

filledTensor.fill_(8) #Fills tensor with 8s
display(filledTensor)
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[0., 0., 0.],
        [0., 0., 0.]])
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[1., 1., 1.],
        [1., 1., 1.]])
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[8., 8., 8.],
        [8., 8., 8.]])

Tensors can also be created from Python lists and from NumPy arrays.

import numpy as np

listTensor = torch.Tensor([[1, 2, 3],
                           [4, 5, 6]])
display(listTensor)

numpyArray = np.random.rand(2, 3)
display(torch.from_numpy(numpyArray))
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.]])
Tensor Type: torch.DoubleTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[0.2463, 0.6978, 0.6044],
        [0.7769, 0.3949, 0.9071]], dtype=torch.float64)

Tensor Types and Size

Just like numerical values, tensors vary in their types, which include float (the default), long, double, etc.

The type can be changed by calling the specific tensor type's constructor, by casting methods such as .long() or .double(), or by passing the desired dtype to torch.tensor().

display(listTensor)

listTensor = listTensor.long()
display(listTensor)

listIntTensor = torch.tensor([[1, 2, 3],
                             [4, 5, 6]], dtype=torch.int64)
display(listIntTensor)

listIntTensor = listIntTensor.double()
display(listIntTensor)
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.]])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[1, 2, 3],
        [4, 5, 6]])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[1, 2, 3],
        [4, 5, 6]])
Tensor Type: torch.DoubleTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.]], dtype=torch.float64)

Tensor Operations

The standard +, -, /, and * operators work elementwise on tensors, as do functions such as torch.add().

listTensor = torch.Tensor([[1, 2, 3],
                           [4, 5, 6]])
display(listTensor)

display(listTensor*3) # Multiply all values by 3

display(listTensor+listTensor) # Add it to itself

display(torch.add(listTensor,listTensor)) # Add it to itself using torch.add()

display(listTensor*listTensor) # Multiplies each value to the corresponding value (NOT matrix multiplication)
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.]])
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[ 3.,  6.,  9.],
        [12., 15., 18.]])
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[ 2.,  4.,  6.],
        [ 8., 10., 12.]])
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[ 2.,  4.,  6.],
        [ 8., 10., 12.]])
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[ 1.,  4.,  9.],
        [16., 25., 36.]])

There are also various ways to reshape, reduce, and view tensors, such as view(), sum(), and transpose().

tensorRankOne = torch.arange(6)
display(tensorRankOne)

tensorRankOne = tensorRankOne.view(2, 3) # rearrange it into a rank 2 tensor
display(tensorRankOne)

display(torch.sum(tensorRankOne, dim=0)) # get sum of each column
display(torch.sum(tensorRankOne, dim=1)) # get sum of each row

display(torch.transpose(tensorRankOne, 0, 1)) # swap the axis or transpose the tensor

Tensor Type: torch.LongTensor
Tensor Size: torch.Size([6])
Values: 
tensor([0, 1, 2, 3, 4, 5])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([3])
Values: 
tensor([3, 5, 7])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([2])
Values: 
tensor([ 3, 12])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([3, 2])
Values: 
tensor([[0, 3],
        [1, 4],
        [2, 5]])

Indexing, Slicing, and Joining

Indexing and slicing work much as they do in NumPy. PyTorch also has its own functions, such as index_select() and cat(), for more complex indexing and joining.

tensorRankTwo = torch.arange(6).view(2,3)
display(tensorRankTwo)

display(tensorRankTwo[:1, :2])

display(tensorRankTwo[0, 1])

indices = torch.LongTensor([0, 2])
display(torch.index_select(tensorRankTwo, dim=1, index=indices)) # get the first and third columns

indices = torch.LongTensor([0, 0, 0])
display(torch.index_select(tensorRankTwo, dim=0, index=indices)) # get the first row 3 times

display(torch.cat([tensorRankTwo,tensorRankTwo], dim=0)) # Concatenation of second tensor to the bottom of first tensor

display(torch.cat([tensorRankTwo,tensorRankTwo], dim=1)) # Concatenate the tensors side by side (along the columns)
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([2, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5]])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([1, 2])
Values: 
tensor([[0, 1]])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([])
Values: 
1
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([2, 2])
Values: 
tensor([[0, 2],
        [3, 5]])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([3, 3])
Values: 
tensor([[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([4, 3])
Values: 
tensor([[0, 1, 2],
        [3, 4, 5],
        [0, 1, 2],
        [3, 4, 5]])
Tensor Type: torch.LongTensor
Tensor Size: torch.Size([2, 6])
Values: 
tensor([[0, 1, 2, 0, 1, 2],
        [3, 4, 5, 3, 4, 5]])

Tensors and Computational Graphs

When creating a tensor, PyTorch can also be told to track the tensor's gradient and gradient function by passing requires_grad=True:

gradientTensor = torch.ones(2, 2, requires_grad=True)

display(gradientTensor)
print(gradientTensor.grad is None)

changedTensor = (gradientTensor + 2) * (gradientTensor + 5) + 3
display(changedTensor)
print(gradientTensor.grad is None)

meanTensor = changedTensor.mean()
display(meanTensor)
meanTensor.backward()
print(gradientTensor.grad is None)
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 2])
Values: 
tensor([[1., 1.],
        [1., 1.]], requires_grad=True)
True
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([2, 2])
Values: 
tensor([[21., 21.],
        [21., 21.]], grad_fn=<AddBackward0>)
True
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([])
Values: 
21.0
False

By setting requires_grad=True, PyTorch tracks the operations in the forward pass so that gradients can be computed. The backward pass is initiated by calling .backward() on a scalar result, typically the loss.

The gradient is the slope of a function's output with respect to its input. A node's gradient can be accessed through its .grad member variable; optimizers also use .grad to update the parameters.
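As an illustrative sketch of how an optimizer uses .grad (the tensor, loss expression, and learning rate here are placeholders):

import torch

w = torch.ones(2, 2, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = ((w + 2) * (w + 5) + 3).mean()
loss.backward()          # populates w.grad
print(w.grad)            # gradient of the loss with respect to w

optimizer.step()         # SGD update: w <- w - lr * w.grad
optimizer.zero_grad()    # clear .grad before the next backward pass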

CUDA Tensors

Utilizing a GPU allows for much faster processing. Through the CUDA API, tensors can be allocated in GPU memory on NVIDIA GPUs.

First check whether CUDA is available; if so, create a torch device that uses the GPU and call .to(device) to place tensors on it.

Operations cannot mix CUDA and non-CUDA tensors; all operands must be on the same device.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cudaTensor = torch.rand(3, 3).to(device)
display(cudaTensor)
cpu
Tensor Type: torch.FloatTensor
Tensor Size: torch.Size([3, 3])
Values: 
tensor([[0.2927, 0.6019, 0.4595],
        [0.0556, 0.1222, 0.7028],
        [0.1000, 0.7749, 0.7047]])
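As a short illustrative sketch (reusing device and cudaTensor from the block above), operands must be moved to the same device before an operation; on a CPU-only machine both tensors simply stay on the CPU.

cpuTensor = torch.rand(3, 3)                 # lives on the CPU
# cudaTensor + cpuTensor would raise a RuntimeError on a GPU machine,
# because the operands live on different devices.
result = cudaTensor + cpuTensor.to(device)   # move first, then operate
print(result.device)

backOnCpu = cudaTensor.cpu()                 # copy back to the CPU (e.g. for NumPy)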