Introduction to NumPy for Statistical Natural Language Processing in Python

LING / C SC 439/539 Statistical Natural Language Processing Numerical Python

Outline • Overview of NumPy • Creating arrays • Resizing arrays • Indexing and selection • Operations on one array • Operations on two arrays • Linear algebra

Numerical Python • Module that can be imported in Python • Allows for: • Datatypes for vectors and matrices (called Arrays) • Vectorized computations, similar to MATLAB • Highly efficient; calls numerical libraries coded in C • Code looks much more like math • Fewer explicitly coded loops • Results in concise code

Vectorized computing
• Standard Python:
    L = [1,2,3,4,5]
    L2 = []
    for i in range(len(L)):
        L2.append(L[i] * 3)
• NumPy:
    L = array([1,2,3,4,5])
    L2 = L * 3
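The comparison above can be run as a single script. This is a minimal sketch that assumes the conventional `import numpy as np` alias instead of the star import used elsewhere in these slides; the variable names are illustrative.

```python
import numpy as np

# Plain Python: an explicit loop over a list
values = [1, 2, 3, 4, 5]
tripled_list = []
for x in values:
    tripled_list.append(x * 3)

# NumPy: the multiplication is applied to every element at once
arr = np.array(values)
tripled_arr = arr * 3

print(tripled_list)   # [3, 6, 9, 12, 15]
print(tripled_arr)    # [ 3  6  9 12 15]
```

Note that `values * 3` on the plain list would repeat the list three times rather than multiply its elements, which is one reason the array type is needed.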

NumPy documentation • NumPy User Guide • http://docs.scipy.org/doc/ • Guide to NumPy by Travis Oliphant (creator of NumPy) • http://www.tramy.us/guidetoscipy.html

First, import NumPy >>> from numpy import *

help(functionname) >>> help(eye) Help on function eye in module numpy.lib.twodim_base: eye(N, M=None, k=0, dtype=<type 'float'>) Return a 2-D array with ones on the diagonal and zeros elsewhere. Parameters ---------- N : int Number of rows in the output. M : int, optional Number of columns in the output. If None, defaults to `N`. k : int, optional Index of the diagonal: 0 refers to the main diagonal, a positive value refers to an upper diagonal, and a negative value to a lower diagonal. dtype : dtype, optional Data-type of the returned array.

help(functionname) Returns ------- I : ndarray (N,M) An array where all elements are equal to zero, except for the `k`-th diagonal, whose values are equal to one. See Also -------- diag : Return a diagonal 2-D array using a 1-D array specified by the user. Examples -------- >>> np.eye(2, dtype=int) array([[1, 0], [0, 1]]) >>> np.eye(3, k=1) array([[ 0., 1., 0.], [ 0., 0., 1.], [ 0., 0., 0.]])

Arrays in NumPy • All these are arrays in NumPy: • One-dimensional vector • Two-dimensional matrix • Higher-dimensional matrix

Creating a vector (one-dimensional array) >>> v = array([1,2,3,4,5]) >>> v array([1, 2, 3, 4, 5]) >>> ndim(v) # number of dimensions 1 >>> shape(v) # 5 elements in first dim. (5,) >>> size(v) # total number of elements 5

Creating a matrix (two-dimensional array) >>> a = array([[1,2,3],[4,5,6]]) >>> a array([[1, 2, 3], [4, 5, 6]]) >>> ndim(a) # number of dimensions 2 >>> a.shape # 2 rows, 3 columns (2, 3) >>> size(a) # total number of elements 6

When coding comma-separated types in Python (e.g. arrays or lists), you can press Enter after a comma and continue on the next line
>>> # these all produce the same result:
>>> a = array([[1,2,3],[4,5,6]])
>>> a = array([[1, 2, 3],
               [4, 5, 6]])
>>> a = array([[1, 2, 3], [4, 5,
               6]])

Calling functions vs. object attributes >>> a.shape (2, 3) >>> shape(a) (2, 3) • Produces the same result whether you pass the object to a function or access the object's attribute • The function accesses the object's attribute • Both can be used interchangeably • But when a function with the same name is also defined elsewhere (e.g. Python's built-in max), access it through the object instead • You'll see this later with max • Also: a.ndim and ndim(a), a.size and size(a)

Special functions to create matrices(2-d arrays) >>> ones((2,3)) array([[ 1., 1., 1.], [ 1., 1., 1.]]) >>> zeros((2,3)) array([[ 0., 0., 0.], [ 0., 0., 0.]]) >>> eye(3) array([[ 1., 0., 0.], [ 0., 1., 0.], [ 0., 0., 1.]])

Type of an array >>> a array([[1, 2, 3], [4, 5, 6]]) >>> a.dtype dtype('int32') >>> b = ones([2,3]) >>> b array([[ 1., 1., 1.], [ 1., 1., 1.]]) >>> b.dtype dtype('float64')
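The element type can also be chosen explicitly instead of being inferred. The following is a small sketch, assuming the `np` alias and the standard `dtype` argument and `astype` method; the exact integer width that NumPy infers (int32 or int64) depends on the platform.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])               # integer dtype inferred
b = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)  # request floats explicitly
c = a.astype(np.float64)                           # converted copy of a

print(a.dtype.kind)   # i  (some integer type; width is platform-dependent)
print(b.dtype)        # float64
print(c.dtype)        # float64
```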

linspace >>> # vector with linearly spaced values >>> # linspace(start, stop, num values) >>> # function determines spacing of vals for you >>> linspace(3, 16, 5) array([ 3. , 6.25, 9.5 , 12.75, 16. ]) >>> linspace(15, 19, 4) array([ 15., 16.33333333, 17.66666667, 19.])

Arrays of random numbers >>> random.rand(2,3) # uniformly dist. between 0 and 1 array([[ 0.49386404, 0.12125634, 0.58045141], [ 0.80695113, 0.32188799, 0.63249074]]) >>> random.randn(2,3) # normal dist., mean=0, var=1 array([[-0.37422103, 1.03866716, -0.53547127], [ 0.30022273, 0.23015563, 0.80873554]])

Arrays of random numbers >>> # 2 x 3 matrix, uniformly dist. between 5 and 7 >>> random.uniform(5, 7, (2, 3)) array([[ 6.50654571, 5.77650203, 6.68806597], [ 6.29241871, 6.45282975, 6.4707847 ]]) >>> # 4 x 3 matrix, rand. ints from 3 up to (but not including) 6 >>> random.randint(3, 6, (4, 3)) array([[3, 3, 3], [5, 3, 5], [4, 3, 3], [5, 3, 3]])
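For reproducible experiments it usually helps to seed the random number generator. The sketch below assumes a reasonably recent NumPy (1.17 or later) that provides `np.random.default_rng`; the seed value 0 is arbitrary. It draws the same four kinds of arrays as the slides above.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed, fixed for reproducibility

u = rng.random((2, 3))             # uniform on [0, 1)
n = rng.standard_normal((2, 3))    # normal, mean 0, variance 1
r = rng.uniform(5, 7, (2, 3))      # uniform on [5, 7)
k = rng.integers(3, 6, (4, 3))     # integers 3, 4, 5 (upper bound excluded)

print(u.shape, n.shape, r.shape, k.shape)
```

Creating a second generator with the same seed reproduces the same numbers.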

Shape of an array >>> a.shape # or shape(a) (2, 3) >>> nrow = a.shape[0] >>> ncol = a.shape[1] >>> nrow 2 >>> ncol 3 >>> zeros(a.shape) # new array w/ same shape array([[ 0., 0., 0.], [ 0., 0., 0.]])

Transpose >>> a array([[1, 2, 3], [4, 5, 6]]) >>> transpose(a) # or a.transpose() array([[1, 4], [2, 5], [3, 6]]) >>> a # didn’t change it array([[1, 2, 3], [4, 5, 6]]) >>> a = transpose(a) # need to assign to variable >>> a array([[1, 4], [2, 5], [3, 6]])

Reshaping an array >>> a # 2 x 3 matrix array([[1, 2, 3], [4, 5, 6]]) >>> reshape(a, (3, 2)) # 3 x 2 matrix array([[1, 2], [3, 4], [5, 6]]) >>> reshape(a, (1,6)) # 1 x 6 matrix array([[1, 2, 3, 4, 5, 6]])
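A reshape only succeeds when the total number of elements stays the same. As a small sketch (using the `np` alias, an assumption relative to the slides), one dimension can be given as -1 so that NumPy infers it from the others.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

b = a.reshape(3, -1)    # -1 lets NumPy infer the second dimension (2)
c = a.reshape(-1)       # flatten to 1-d; same values as ravel(a)

print(b.shape)  # (3, 2)
print(c)        # [1 2 3 4 5 6]
```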

Concatenation >>> a = array([[1,2,3],[4,5,6]]) >>> a array([[1, 2, 3], [4, 5, 6]]) >>> b = zeros(a.shape) >>> b array([[ 0., 0., 0.], [ 0., 0., 0.]])

Concatenation >>> # note that it’s converted to float >>> concatenate((a,b), axis=0) array([[ 1., 2., 3.], [ 4., 5., 6.], [ 0., 0., 0.], [ 0., 0., 0.]]) >>> concatenate((a,b), axis=1) array([[ 1., 2., 3., 0., 0., 0.], [ 4., 5., 6., 0., 0., 0.]])

Try to concatenate a matrix with a vector >>> a array([[1, 2, 3], [4, 5, 6]]) >>> c = arange(3) >>> c array([0, 1, 2]) >>> concatenate((a,c), axis=0) Traceback (most recent call last): File "<pyshell#270>", line 1, in <module> concatenate((a,c), axis=0) ValueError: arrays must have same number of dimensions

Convert vector to matrix before concatenating >>> c.shape # one-dimensional (3,) >>> a.shape # two-dimensional (2, 3) >>> array([c]) array([[0, 1, 2]]) >>> concatenate((a, array([c])), axis=0) array([[1, 2, 3], [4, 5, 6], [0, 1, 2]])
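Two other common ways to add a vector as a new row are sketched below, assuming the `np` alias: indexing with `np.newaxis` to make the vector two-dimensional, or calling `np.vstack`, which treats a 1-d vector as a single row.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
c = np.arange(3)

row = c[np.newaxis, :]                        # shape (1, 3)
stacked1 = np.concatenate((a, row), axis=0)   # explicit 2-d concatenation
stacked2 = np.vstack((a, c))                  # vstack promotes c to a row

print(row.shape)   # (1, 3)
print(stacked2)
```

Both calls give the same 3 x 3 result as `concatenate((a, array([c])), axis=0)`.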

Turn matrix into 1-d vector >>> a array([[1, 2, 3], [4, 5, 6]]) >>> ravel(a) array([1, 2, 3, 4, 5, 6])

append: for vectors >>> a = array([1,2,3]) >>> a = append(a, 4) >>> a array([1, 2, 3, 4]) >>> append(a, array([7,8,9])) array([1, 2, 3, 4, 7, 8, 9])

Outline • Overview of NumPy • Creating arrays • Resizing arrays • Indexing and selection • Operations on one array • Operations on two arrays • Linear algebra

Indexing vectors >>> b = array([3,5,7,9,11,13,15]) >>> b array([ 3, 5, 7, 9, 11, 13, 15]) >>> b[0] 3 >>> b[5:] array([13, 15]) >>> b[[0,5,2]] # indices can be in any order array([ 3, 13, 7])

Indexing arrays • Let M be a matrix of size m x n • m rows • n columns • m * n total elements • M_{i,j} is the entry of M at row i and column j • NumPy indices start at 0 >>> a = array([[1, 2, 3], [4, 5, 6]]) >>> a[0,1] # first row, second column 2

Indexing arrays >>> b = reshape(arange(12), (3,4)) >>> b array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) >>> b[0,:] # first row, all cols array([0, 1, 2, 3]) >>> b[1:,:] # second row to end, all cols array([[ 4, 5, 6, 7], [ 8, 9, 10, 11]]) >>> b[[0,2],:] # first & third rows, all cols array([[ 0, 1, 2, 3], [ 8, 9, 10, 11]])

Indexing arrays >>> b array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) >>> # all rows, columns at index 1 and 2 (end index 3 is excluded) >>> b[:,1:3] array([[ 1, 2], [ 5, 6], [ 9, 10]]) >>> b[2,0] # third row, first column 8
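The indexing forms from these slides can be checked together in one script. This is a short sketch using the `np` alias (an assumption relative to the slides); the comments give zero-based indices.

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

print(b[2, 0])          # 8: row index 2, column index 0
print(b[0, :])          # [0 1 2 3]: first row, all columns
print(b[:, 1:3].shape)  # (3, 2): all rows, columns at index 1 and 2
print(b[[0, 2], :])     # rows at index 0 and 2, all columns
```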

Logical selection >>> a array([[1, 2, 3], [4, 5, 6]]) >>> a%2==0 array([[False, True, False], [ True, False, True]], dtype=bool) >>> # get even values of a >>> a[a%2==0] # returns a vector array([2, 4, 6])

Logical selection >>> # where(condition, x, y): >>> # when True, return x, else return y >>> where(a%2==0, 1, -1) # returns a matrix array([[-1, 1, -1], [ 1, -1, 1]]) >>> where(a%2==0, a, 0) # returns a matrix array([[0, 2, 0], [4, 0, 6]])
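Logical selection is handy for filtering by frequency. The sketch below uses made-up word counts (the words, counts, and threshold are hypothetical, chosen only for illustration) and the `np` alias.

```python
import numpy as np

# Hypothetical vocabulary and counts, made up for illustration
words = np.array(["the", "cat", "sat", "on", "mat"])
counts = np.array([12, 3, 1, 7, 1])

frequent = counts >= 3                       # boolean mask, one entry per word
print(words[frequent])                       # ['the' 'cat' 'on']

clipped = np.where(counts >= 3, counts, 0)   # zero out the rare words
print(clipped)                               # [12  3  0  7  0]
```

A mask computed from one array can select from any other array of the same shape, as with `words[frequent]` here.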

Unique >>> r = random.randint(0,5, (9,)) >>> r array([0, 3, 2, 2, 2, 1, 1, 4, 3]) >>> unique(r) array([0, 1, 2, 3, 4])

Modifying entries in an array >>> a array([[1, 2, 3], [4, 5, 6]]) >>> a[1,2] = 0 >>> a array([[1, 2, 3], [4, 5, 0]])

Modifying entries in an array >>> a[1,:] = array([7,8,9]) >>> a array([[1, 2, 3], [7, 8, 9]]) >>> a[:,0:2] = array([[-1,-2],[-3,-4]]) >>> a array([[-1, -2, 3], [-3, -4, 9]])

Sum >>> a array([[1, 2, 3], [4, 5, 6]]) >>> sum(a, axis=0) # sum down each column array([5, 7, 9]) >>> sum(a, 1) # sum across each row array([ 6, 15]) >>> sum(a) # sum of all elements 21 >>> sum(sum(a)) # often in Marsland’s code 21
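Sums along an axis are the basis for turning counts into probabilities. As a hedged sketch, assume the matrix below holds hypothetical co-occurrence counts (rows for a first word, columns for a second word); dividing each row by its total gives one probability distribution per row. The `keepdims=True` argument keeps the totals as a column so the division lines up row by row.

```python
import numpy as np

# Hypothetical counts, made up for illustration
counts = np.array([[1, 2, 3],
                   [4, 5, 6]])

row_totals = counts.sum(axis=1, keepdims=True)   # shape (2, 1): [[6], [15]]
probs = counts.astype(float) / row_totals        # each row now sums to 1

print(row_totals.ravel())   # [ 6 15]
print(probs.sum(axis=1))    # [1. 1.]
```

The explicit `astype(float)` avoids integer division under Python 2.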

Elementwise numerical operations >>> a + 1 array([[2, 3, 4], [5, 6, 7]]) >>> a**2 array([[ 1, 4, 9], [16, 25, 36]]) >>> sqrt(a) array([[ 1. , 1.41421356, 1.73205081], [ 2. , 2.23606798, 2.44948974]])

Division >>> a array([[1, 2, 3], [4, 5, 6]]) >>> a / 3 array([[0, 0, 1], [1, 1, 2]]) >>> a / 3.0 array([[ 0.33333333, 0.66666667, 1. ], [ 1.33333333, 1.66666667, 2. ]])
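The integer result of `a / 3` shown above is Python 2 behavior. Under Python 3, which this sketch assumes, `/` always performs true division and `//` is the operator that rounds down.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

floor_div = a // 3    # floor division keeps integers
true_div = a / 3      # true division returns floats under Python 3

print(floor_div)          # [[0 0 1]
                          #  [1 1 2]]
print(true_div.dtype)     # float64
```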

Try to call max >>> max(a) # calling built-in function! Traceback (most recent call last): File "<pyshell#323>", line 1, in <module> max(a) ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

max and min >>> a array([[1, 2, 3], [4, 5, 6]]) >>> # below we call max, min in NumPy >>> a.max(axis=0) # max for each column array([4, 5, 6]) >>> a.max(axis=1) # max for each row array([3, 6]) >>> a.min(1) # min for each row array([1, 4])
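When NumPy is imported under a name, the NumPy version of max can be called explicitly, which avoids the clash with the built-in. A minimal sketch, assuming the `np` alias:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(np.max(a))           # 6: max over all elements
print(np.max(a, axis=0))   # [4 5 6]: max of each column
print(a.max(axis=1))       # [3 6]: max of each row
print(max(a.ravel()))      # 6: the built-in max works on a 1-d array
```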

argmax and argmin >>> r = random.randint(0,20, (3,4)) >>> r array([[18, 3, 12, 7], [ 2, 12, 5, 4], [ 5, 8, 19, 15]]) >>> # find the index with the max value >>> argmax(r) # returns index into the flattened (1-d) array 10 >>> ravel(r) array([18, 3, 12, 7, 2, 12, 5, 4, 5, 8, 19, 15]) >>> ravel(r)[argmax(r)] 19
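To recover the row and column of the maximum rather than its flattened position, the flat index can be converted back with `np.unravel_index`. This sketch reuses the matrix from the slide above, written out explicitly so the result is deterministic.

```python
import numpy as np

r = np.array([[18, 3, 12, 7],
              [2, 12, 5, 4],
              [5, 8, 19, 15]])

flat_index = np.argmax(r)                           # 10
row, col = np.unravel_index(flat_index, r.shape)    # row 2, column 2
print(r[row, col])                                  # 19

print(np.argmax(r, axis=1))   # [0 1 2]: column of the max in each row
```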

Sorting >>> r = random.randint(0, 10, (3, 4)) >>> r array([[3, 8, 8, 1], [4, 5, 7, 7], [1, 1, 2, 8]]) >>> sort(r, axis=0) # sort each column array([[1, 1, 2, 1], [3, 5, 7, 7], [4, 8, 8, 8]]) >>> sort(r, 1) # sort each row array([[1, 3, 8, 8], [4, 5, 7, 7], [1, 1, 2, 8]])

argsort >>> q = array(['C','B','E','A','D']) >>> argsort(q) array([3, 1, 0, 4, 2]) >>> q[argsort(q)] array(['A', 'B', 'C', 'D', 'E'], dtype='|S1')

argsort
Index:                           0  1  2  3  4
Original array:                  C  B  E  A  D
Sorted array:                    A  B  C  D  E
Index of item in original array: 3  1  0  4  2
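A typical use of argsort is ranking one array by the values in another, for example listing the most frequent words. The words and counts below are hypothetical, made up for illustration; the sketch assumes the `np` alias.

```python
import numpy as np

# Hypothetical vocabulary and counts, made up for illustration
words = np.array(["the", "cat", "sat", "on", "mat"])
counts = np.array([12, 3, 1, 7, 2])

order = np.argsort(counts)[::-1]   # indices from largest to smallest count
top3 = words[order[:3]]

print(order)   # [0 3 1 4 2]
print(top3)    # ['the' 'on' 'cat']
```

Reversing with `[::-1]` is needed because argsort orders from smallest to largest.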
