Python Tutorial Part II--Numpy and Pandas

April 3, 2017 | Michael Gong

1. Numpy/Scipy

NumPy and SciPy are open-source add-on modules to Python that provide common mathematical and numerical routines in pre-compiled, fast functions. These are growing into highly mature packages that provide functionality that meets, or perhaps exceeds, that associated with common commercial software like MatLab.

The NumPy (Numeric Python) package provides basic routines for manipulating large arrays and matrices of numeric data.

The SciPy (Scientific Python) package extends the functionality of NumPy with a substantial collection of useful algorithms, like minimization, Fourier transformation, regression, and other applied mathematical techniques.

import numpy as np 

Matlab to Numpy translation click here

Numpy Complete documentation click here

Scipy Complete documentation click here

print(np.__version__)
np.__config__.show()
1.11.2
openblas_lapack_info:
  NOT AVAILABLE
blas_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    include_dirs = ['/home/michael/anaconda3/include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['/home/michael/anaconda3/lib']
lapack_mkl_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    include_dirs = ['/home/michael/anaconda3/include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['/home/michael/anaconda3/lib']
blas_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    include_dirs = ['/home/michael/anaconda3/include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['/home/michael/anaconda3/lib']
lapack_opt_info:
    libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']
    include_dirs = ['/home/michael/anaconda3/include']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    library_dirs = ['/home/michael/anaconda3/lib']

1.1 Array creation

a = np.array([1,2,3])
print(type(a))
print(a)
<class 'numpy.ndarray'>
[1 2 3]
b = np.array([[1,2,3],[3,4,5]])
print("this is b\n",b)
print("the shape of b is:",b.shape)
print("the dimension of b is:",b.ndim)
print("number of elements of b:", b.size)
this is b
 [[1 2 3]
 [3 4 5]]
the shape of b is: (2, 3)
the dimension of b is: 2
number of elements of b: 6

** A little bit introduction of object oriented programming**

a = np.array([1,2,3,3,4,5,5,6,7])
print("the type of a:",type(a))
b = a.reshape(3,3)
print("the type of b:",type(b))
c1 = b**2
print("the type of c:",type(c1))

# we can write:
c2 = np.array([1,2,3,3,4,5,5,6,7]).reshape(3,3)**2

print(np.allclose(c1,c2))
the type of a: <class 'numpy.ndarray'>
the type of b: <class 'numpy.ndarray'>
the type of c: <class 'numpy.ndarray'>
True

1.2 Array operation

# generate random ndarray
mA = np.random.randint(1,10,size=(3,3))
mB = np.random.randint(1,10,size=(3,3))
vC = np.random.randint(1,10,size=(3,1))
vD = np.random.randint(1,10,size=(1,3))
print("mA:\n",mA)
print("mB:\n",mB)
print("vC:\n",vC)
print("vD:\n",vD)
mA:
 [[8 9 6]
 [1 3 7]
 [5 3 2]]
mB:
 [[6 7 1]
 [3 2 4]
 [5 8 2]]
vC:
 [[9]
 [1]
 [2]]
vD:
 [[3 6 8]]

element-wise operation *,/,+,-

print(mA+mB)
[[14 16  7]
 [ 4  5 11]
 [10 11  4]]
print(mA-mB)
[[ 2  2  5]
 [-2  1  3]
 [ 0 -5  0]]
print(mA/mB)
[[ 1.33333333  1.28571429  6.        ]
 [ 0.33333333  1.5         1.75      ]
 [ 1.          0.375       1.        ]]
print(mA*mB)
[[48 63  6]
 [ 3  6 28]
 [25 24  4]]
print(mA-vC)
[[-1  0 -3]
 [ 0  2  6]
 [ 3  1  0]]
print(mA-vD)
[[ 5  3 -2]
 [-2 -3 -1]
 [ 2 -3 -6]]
print(mA*mB-vC + mB**2)
[[ 75 103  -2]
 [ 11   9  43]
 [ 48  86   6]]

Matrix operation

Matrix mupliplication @

# The matrix multiplication is done by @ operation (in python 3.0+), in python 2.*, it is done by "dot" method
print(mA@mB)
[[105 122  56]
 [ 50  69  27]
 [ 49  57  21]]

Matrix transpose T

print(mA.T)
[[8 1 5]
 [9 3 3]
 [6 7 2]]

Matrix inversion “inv”

print(np.linalg.inv(mA))
# to make it easier to type, we import the "inv" function alone
from numpy.linalg import inv
print(inv(mA))
[[ -1.42857143e-01   6.93889390e-18   4.28571429e-01]
 [  3.14285714e-01  -1.33333333e-01  -4.76190476e-01]
 [ -1.14285714e-01   2.00000000e-01   1.42857143e-01]]
[[ -1.42857143e-01   6.93889390e-18   4.28571429e-01]
 [  3.14285714e-01  -1.33333333e-01  -4.76190476e-01]
 [ -1.14285714e-01   2.00000000e-01   1.42857143e-01]]
# again, we can benefit from the OOP design
ret = mA.T@inv(mB) - vC + mB.T@vC - vD
print(ret)
[[ 57.5         53.89285714  47.46428571]
 [ 80.          75.28571429  69.42857143]
 [ 12.75         9.23214286   7.16071429]]

1.3 Slicing

A = np.random.rand(3,3)
print(A)
print(A[:,[0]])
[[ 0.07731399  0.10377817  0.1793915 ]
 [ 0.84609483  0.65710391  0.17246314]
 [ 0.7439164   0.89687702  0.1211748 ]]
[[ 0.07731399]
 [ 0.84609483]
 [ 0.7439164 ]]
# slicing is done through bracket []
print(mA)
print('the first element is:\n',mA[0,0])
print('the first column is:\n',mA[:,[0]])
print('the first two columns are:\n',mA[:,:2])

# boolean and integer indexing
indexB = np.array([True, False, True])
indexI = np.array([1,0,1])
array = np.array([1,2,3])
print("Boolean indexing result:\n" ,array[indexB])
print("Integer indexing result:\n", array[indexI])
[[8 9 6]
 [1 3 7]
 [5 3 2]]
the first element is:
 8
the first column is:
 [[8]
 [1]
 [5]]
the first two columns are:
 [[8 9]
 [1 3]
 [5 3]]
Boolean indexing result:
 [1 3]
Integer indexing result:
 [2 1 2]

1.4 Frequently used Numpy function

  • diag: create diagonal matrix from vector
  • diagonal: extract the diagonal element
  • eye: identity matrix
  • zeros: matrix filled with zero
  • ones: matrix filled with one
  • reshape: reshape the matrix
vY = np.random.rand(3)
print(vY.shape,type(vY))
(3,) <class 'numpy.ndarray'>
print(np.diag(vY))
[[ 0.36944492  0.          0.        ]
 [ 0.          0.46322172  0.        ]
 [ 0.          0.          0.80256682]]
print(np.diagonal(np.diag(vY)))
[ 0.36944492  0.46322172  0.80256682]
print(np.eye(3))
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
print(np.zeros([3,3]))
print(np.ones([3,3]))
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]
print(np.zeros_like(vY))
print(np.zeros_like(np.diag(vY)))
[ 0.  0.  0.]
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
# benefits of the OOS design
print((np.diag(vY)+np.random.rand(3,3)).reshape(9,1).max()**2)
1.38636810344

1.5 Some example and exercises

(1) Subtract the mean of each row of a matrix

a = np.random.rand(3,3)
b = a - a.mean(axis=1) #np.mean(a,axis=1)
print(b)
[[-0.27371758 -0.13738752  0.631215  ]
 [-0.25698583 -0.11128908  0.68699173]
 [-0.30855162  0.03770426 -0.26797938]]

(2) compute the moving average of an array

def moving_average(vY,window):
    T = vY.shape[0]
    ret = np.zeros(T - window)
    for j,t in enumerate(range(window,T)):
        ret[j] = vY[t-window:t].mean()
    return ret 

vY = np.random.rand(20)
window = 5
ret = moving_average(vY,window)
print(ret)
[ 0.31439101  0.3067207   0.2612112   0.4488305   0.51791912  0.40021872
  0.5155174   0.60879213  0.60470478  0.58250843  0.57880459  0.60341439
  0.62814638  0.6328429   0.73670997]

(3) find the closest point

def find_closePointV1(points,target):
    min_distance = 1
    point_index = 0
    for i,point in enumerate(points):
        distance = ((point-target)**2).sum()
        if distance < min_distance:
            min_distance = distance
            point_index = i
    return min_distance,point_index
    
def find_closePointV2(points,target):
    distances = ((points-target)**2).sum(axis=1)
    min_distance = distances.min()
    point_index = distances.argmin()
    return min_distance,point_index
points = np.random.rand(2000,2)
target = np.array([0.5,0.5])
min_distance1,point_index1 = find_closePointV1(points,target)
min_distance2,point_index2 = find_closePointV2(points,target)

print("Version 1 the minimum distance:",min_distance1)
print("Version 1 the point with minimum distance",points[point_index1])
print("Version 2 the minimum distance:",min_distance2)
print("Version 2 the point with minimum distance",points[point_index2])
Version 1 the minimum distance: 0.000113540004683
Version 1 the point with minimum distance [ 0.50663792  0.50833535]
Version 2 the minimum distance: 0.000113540004683
Version 2 the point with minimum distance [ 0.50663792  0.50833535]
# version 1 and version 2 have huge performance difference
%timeit find_closePointV1(points,target)
100 loops, best of 3: 5.76 ms per loop
%timeit find_closePointV2(points,target)
The slowest run took 4.57 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 46.9 µs per loop

(4) One line code of simulation of Pi

N = 1000000
pi = 4*((np.random.rand(N,2)**2).sum(axis=1)<1).sum()/N
print(pi)
3.141164

2. Pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

  • Use Pandas as possible as you can!
import pandas as pd

2.1 ** Object creation**

index = [i for i in range(4)]
columns = ['A','B','C','D','E']
data = pd.DataFrame(np.random.rand(4,5),index=index,columns=columns)
print(data)
          A         B         C         D         E
0  0.377511  0.935913  0.484039  0.494305  0.017102
1  0.674354  0.961497  0.719836  0.463993  0.966345
2  0.113718  0.558731  0.725746  0.287497  0.447674
3  0.799740  0.051104  0.585415  0.475480  0.651220

2.2 ** Selction**

data['A']
0    0.582853
1    0.205175
2    0.142668
3    0.517169
Name: A, dtype: float64
data[['A','B']]
A B
0 0.582853 0.050482
1 0.205175 0.676394
2 0.142668 0.728272
3 0.517169 0.224242
data[0:3]
A B C D E
0 0.582853 0.050482 0.257578 0.340971 0.907032
1 0.205175 0.676394 0.830756 0.963753 0.675394
2 0.142668 0.728272 0.353043 0.082804 0.585398
# Label based selection
data.loc[3,'B']
0.22424205926919127
data.loc[0:2,['A','B']]
A B
0 0.582853 0.050482
1 0.205175 0.676394
2 0.142668 0.728272
# integer based selection
data.iloc[1:4,0:2]
A B
1 0.205175 0.676394
2 0.142668 0.728272
3 0.517169 0.224242
data.iloc[1:3,:]
A B C D E
1 0.205175 0.676394 0.830756 0.963753 0.675394
2 0.142668 0.728272 0.353043 0.082804 0.585398
data.ix[1:3]
A B C D E
1 0.205175 0.676394 0.830756 0.963753 0.675394
2 0.142668 0.728272 0.353043 0.082804 0.585398
3 0.517169 0.224242 0.585978 0.654013 0.965902
# Boolean indexing
indexB = data['A'] > 0.2
print(indexB)
print(data[indexB])
0     True
1     True
2    False
3     True
Name: A, dtype: bool
          A         B         C         D         E
0  0.582853  0.050482  0.257578  0.340971  0.907032
1  0.205175  0.676394  0.830756  0.963753  0.675394
3  0.517169  0.224242  0.585978  0.654013  0.965902
# or conveniently 
print(data[data['A']>0.2])
          A         B         C         D         E
0  0.582853  0.050482  0.257578  0.340971  0.907032
1  0.205175  0.676394  0.830756  0.963753  0.675394
3  0.517169  0.224242  0.585978  0.654013  0.965902
# OOP
data[data['A']>0.2]['C'].ix[0]
0.25757750460410422

2.3 Arithmatic operations

data['F'] = (data['A'] + data['B'])/data['C'] - data['D']*data['E']
print(data)
          A         B         C         D         E           F
0  0.772780  0.251506  0.638696  0.712863  0.639697    1.147697
1  0.723921  0.383294  0.007697  0.661608  0.638037  143.427074
2  0.808509  0.679512  0.296928  0.872748  0.898736    4.227009
3  0.226408  0.768940  0.292466  0.896504  0.974840    2.529343

2.4 Apply

def func(series):
    return series.sum()/series.std()
ret = data.apply(func,axis=1)
data['ret'] = data.apply(func,axis=1)

print(ret)
print('\n')
print(data)
0    5.779099
1    2.757361
2    6.135847
3    5.855379
dtype: float64


          A         B         C         D         E           F       ret
0  0.772780  0.251506  0.638696  0.712863  0.639697    1.147697  5.779099
1  0.723921  0.383294  0.007697  0.661608  0.638037  143.427074  2.757361
2  0.808509  0.679512  0.296928  0.872748  0.898736    4.227009  6.135847
3  0.226408  0.768940  0.292466  0.896504  0.974840    2.529343  5.855379

2.5 Groupby

This is a extremely useful object, please go through the detail cookbook

By “group by” we are referring to a process involving one or more of the following steps

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
data
A B C D
0 foo one 0.785 -0.553
1 bar one 0.724 1.079
2 foo two -1.760 -0.818
3 bar three 0.721 0.440
4 foo two 0.026 0.068
5 bar two 0.442 0.321
6 foo one -0.328 0.176
7 foo three -0.473 -1.483
groupA = data.groupby('A')
groupAB = data.groupby(['A','B'])
print(groupA)
print(groupAB)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7eff7ff467b8>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7eff7ff46128>
# GroupBy object attributes
groupA.groups
{'bar': Int64Index([1, 3, 5], dtype='int64'),
 'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}
# some statistics
# Std
print(groupAB.std())
# Mean
print(groupAB.mean())
# Sum
print(groupAB.sum())
                  C         D
A   B                        
bar one         NaN       NaN
    three       NaN       NaN
    two         NaN       NaN
foo one    0.287689  0.533024
    three       NaN       NaN
    two    0.259739  0.182370
                  C         D
A   B                        
bar one    0.093289 -1.435398
    three  1.610691  0.099456
    two    1.674365 -1.969331
foo one   -0.433944  0.118717
    three  0.640498  0.695818
    two    1.002847 -0.708532
                  C         D
A   B                        
bar one    0.093289 -1.435398
    three  1.610691  0.099456
    two    1.674365 -1.969331
foo one   -0.867888  0.237433
    three  0.640498  0.695818
    two    2.005694 -1.417063

Iteration through groups

for group_name, group_data in groupA:
    print('group name: ',group_name)
    print('group data:\n',group_data)
    print('\n')
group name:  bar
group data:
      A      B         C         D
1  bar    one -1.763415 -1.015387
3  bar  three -1.104977 -2.578075
5  bar    two -1.650251 -0.601321


group name:  foo
group data:
      A      B         C         D
0  foo    one  0.738800  0.320824
2  foo    two -0.866414 -0.223506
4  foo    two  0.009047 -0.798006
6  foo    one  0.209088 -0.120664
7  foo  three -0.470803  0.823928
# you can flexibly run function on group data, for example: OLS regression
beta = np.zeros(2)
for i, (group_name,group_data) in enumerate(groupA):
    mY = group_data[['C']].values
    mX = group_data[['D']].values
    beta[i] = np.linalg.lstsq(mX,mY)[0]
print(beta)
[ 0.7005281   0.00695643]
# the method above is very flexible, you can use another useful method: group apply
def myfunc(group_data):
    print(group_data)
    mY = group_data[['C']].values
    mX = group_data[['D']].values
    return np.linalg.lstsq(mX,mY)[0]

groupA.apply(myfunc)
       B         C         D
1    one -1.763415 -1.015387
3  three -1.104977 -2.578075
5    two -1.650251 -0.601321
       B         C         D
1    one -1.763415 -1.015387
3  three -1.104977 -2.578075
5    two -1.650251 -0.601321
       B         C         D
0    one  0.738800  0.320824
2    two -0.866414 -0.223506
4    two  0.009047 -0.798006
6    one  0.209088 -0.120664
7  three -0.470803  0.823928





A
bar      [[0.700528102717]]
foo    [[0.00695642969842]]
dtype: object
def myfunc(group_data):
    mY = group_data[['C']].values
    mX = group_data[['D']].values
    return float(np.linalg.lstsq(mX,mY)[0])

result = groupA.apply(myfunc)
print(result)
A
bar    0.700528
foo    0.006956
dtype: float64
def myfunc(group_data):
    # if the return result is a dataframe, pandas will look at its index, and merge the result back to original 
    # dataframe by same index
    return (group_data['C']/group_data['D'].sum()).to_frame('ret')
data['ret'] = data.groupby('A').apply(myfunc)
data
A B C D ret
0 foo one -0.230517 -0.258188 0.476460
1 bar one 0.093289 -1.435398 -0.028224
2 foo two 1.186510 -0.579577 -2.452417
3 bar three 1.610691 0.099456 -0.487309
4 foo two 0.819184 -0.837487 -1.693185
5 bar two 1.674365 -1.969331 -0.506574
6 foo one -0.637371 0.495621 1.317392
7 foo three 0.640498 0.695818 -1.323855

2.6 Merge data

This is another power object, please go through detail

basic syntax:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False)
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       'A': ['A0', 'A1', 'A2', 'A3'],
                       'B': ['B0', 'B1', 'B2', 'B3']})


right = pd.DataFrame({'key': ['K1', 'K2', 'K3', 'K4'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
    A   B key
0  A0  B0  K0
1  A1  B1  K1
2  A2  B2  K2
3  A3  B3  K3
    C   D key
0  C0  D0  K1
1  C1  D1  K2
2  C2  D2  K3
3  C3  D3  K4
inner_merged = pd.merge(left,right,how='inner',left_on=['key'],right_on=['key'])
print(inner_merged)
    A   B key   C   D
0  A1  B1  K1  C0  D0
1  A2  B2  K2  C1  D1
2  A3  B3  K3  C2  D2
left_merged = pd.merge(left,right,how='left',left_on=['key'],right_on=['key'])
print(left_merged)
    A   B key    C    D
0  A0  B0  K0  NaN  NaN
1  A1  B1  K1   C0   D0
2  A2  B2  K2   C1   D1
3  A3  B3  K3   C2   D2

2.7 Pivot and reshape table

…Again…very power object, details click here

data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
data
A B C D
0 foo one 1.329326 0.515690
1 bar one 0.912863 -0.309627
2 foo two -0.219018 2.094762
3 bar three -0.159688 0.737186
4 foo two 1.182747 0.359100
5 bar two -0.013464 -1.514432
6 foo one -1.277657 0.657462
7 foo three 0.012385 0.535246
table = pd.pivot_table(data,index=['A'],columns=['B'],values=['C'])
print(table)
            C                    
B         one     three       two
A                                
bar  0.912863 -0.159688 -0.013464
foo  0.025835  0.012385  0.481865

2.7 Useful funciton on DataFrame

data.describe()
data.describe(percentiles=[0.2,0.3,0.4,0.8])
# shape
data.shape
# transpose
data.T
# link to numpy
data_values = data.values
print(type(data_values))
print(data_values)
# caveat, data_values is a "VIEW" of the DataFrame, which means that it is mutable!
data_values[0,0] = 15
data

3. Learn Pandas by example

import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('data.csv')
data.head(6)
Date minor Open High Low Close Volume Adj Close
0 2010-01-04 AAPL 213.430 214.500 212.380 214.010 123432400.000 27.727
1 2010-01-04 INTC 20.790 21.030 20.730 20.880 47800900.000 16.443
2 2010-01-04 MSFT 30.620 31.100 30.590 30.950 38409100.000 25.555
3 2010-01-05 AAPL 214.600 215.590 213.250 214.380 150476200.000 27.775
4 2010-01-05 INTC 20.940 20.990 20.600 20.870 52357700.000 16.435
5 2010-01-05 MSFT 30.850 31.100 30.640 30.960 49749600.000 25.564
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5412 entries, 0 to 5411
Data columns (total 8 columns):
Date         5412 non-null object
minor        5412 non-null object
Open         5412 non-null float64
High         5412 non-null float64
Low          5412 non-null float64
Close        5412 non-null float64
Volume       5412 non-null float64
Adj Close    5412 non-null float64
dtypes: float64(6), object(2)
memory usage: 338.3+ KB
# change the column name
data.rename(columns={'minor':'ticker'},inplace=True)
data.head(5)
Date ticker Open High Low Close Volume Adj Close
0 2010-01-04 AAPL 213.430 214.500 212.380 214.010 123432400.000 27.727
1 2010-01-04 INTC 20.790 21.030 20.730 20.880 47800900.000 16.443
2 2010-01-04 MSFT 30.620 31.100 30.590 30.950 38409100.000 25.555
3 2010-01-05 AAPL 214.600 215.590 213.250 214.380 150476200.000 27.775
4 2010-01-05 INTC 20.940 20.990 20.600 20.870 52357700.000 16.435
data.sort_values(by=['ticker','Date'],inplace=True)
data.head(5)
Date ticker Open High Low Close Volume Adj Close
0 2010-01-04 AAPL 213.430 214.500 212.380 214.010 123432400.000 27.727
3 2010-01-05 AAPL 214.600 215.590 213.250 214.380 150476200.000 27.775
6 2010-01-06 AAPL 214.380 215.230 210.750 210.970 138040000.000 27.333
9 2010-01-07 AAPL 211.750 212.000 209.050 210.580 119282800.000 27.283
12 2010-01-08 AAPL 210.300 212.000 209.060 211.980 111902700.000 27.464
data['Date'] = pd.to_datetime(data['Date'].astype(str), format='%Y-%m-%d')
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5412 entries, 0 to 5411
Data columns (total 8 columns):
Date         5412 non-null datetime64[ns]
ticker       5412 non-null object
Open         5412 non-null float64
High         5412 non-null float64
Low          5412 non-null float64
Close        5412 non-null float64
Volume       5412 non-null float64
Adj Close    5412 non-null float64
dtypes: datetime64[ns](1), float64(6), object(1)
memory usage: 380.5+ KB
# if we only interested in adjusted close price, we can pivot the table
close_price = pd.pivot_table(data,index='Date',columns='ticker',values='Adj Close')
close_price.head(6)
ticker AAPL INTC MSFT
Date
2010-01-04 27.727 16.443 25.555
2010-01-05 27.775 16.435 25.564
2010-01-06 27.333 16.380 25.407
2010-01-07 27.283 16.222 25.143
2010-01-08 27.464 16.403 25.316
2010-01-11 27.222 16.498 24.994
close_price.plot(figsize=(12,5))
<matplotlib.axes._subplots.AxesSubplot at 0x7eff568117b8>

png

# if we want the return, Pandas provide an extremely easy transformation
stock_return = close_price.pct_change()
stock_return.head(5)
ticker AAPL INTC MSFT
Date
2010-01-04 NaN NaN NaN
2010-01-05 0.001729 -0.000479 0.000323
2010-01-06 -0.015906 -0.003354 -0.006137
2010-01-07 -0.001849 -0.009615 -0.010400
2010-01-08 0.006648 0.011165 0.006897
#plot the cumulative return in one line
(1+stock_return).cumprod().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f76333135f8>

png

# just show the style of OOP
(1 + pd.pivot_table(data,index='Date',columns='ticker',values='Adj Close').pct_change()).cumprod().plot()

png

# save the return dataframe to csv
stock_return.to_csv('stockReturn.csv')
# or, to save it to HDF5 format, which is much faster
# create a HDF store
hdf = pd.HDFStore('stock.h5')
hdf['stock_return'] = stock_return
hdf['raw_data'] = data
hdf.close()
%timeit data = pd.read_csv('data.csv')
100 loops, best of 3: 5.75 ms per loop
hdf = pd.HDFStore('stock.h5')
%timeit data = hdf['raw_data']
The slowest run took 6.86 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.61 ms per loop