The Python Quants



Financial and Data Analytics with Python

by Dr. Yves J. Hilpisch

The Python Quants GmbH

www.pythonquants.com

yves@pythonquants.com

Dagstuhl Castle, 18 June 2013

Continuum Analytics

Continuum Analytics is about Python, Data, Analytics, and Visualization.

The team consists of 40+ people:

  • Travis Oliphant (CEO)
  • Peter Wang (President)
  • Francesc Alted (Senior Software Architect)
  • many more Python enthusiasts

Continuum builds the Python-based data analytics infrastructure of choice for financial, corporate and governmental institutions.

  • Anaconda Accelerate – Python distribution with HPC/parallel computing capabilities
  • NumbaPro – just-in-time compiler for CPU and GPU code
  • Wakari – Web-based, scalable, collaborative data analytics with e.g. the IPython Notebook
  • others, like IOPro or Blaze

One of my recent Python-based projects:

Eurex VSTOXX Advanced Services

Parallel Option Pricing

The LSM Algorithm

The following is an implementation of the Least-Squares Monte Carlo (LSM) algorithm of Longstaff and Schwartz (2001): "Valuing American Options by Simulation: A Simple Least-Squares Approach." Review of Financial Studies, Vol. 14, 113-147. The algorithm values American options by Monte Carlo simulation, using ordinary least-squares regression to estimate the continuation values.
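
In symbols, at each early exercise date t the continuation value is approximated by a fifth-degree polynomial in the index level, fitted by ordinary least squares to the discounted one-step-ahead option values (a sketch of the regression step implemented below):

    C_t(S) \approx \sum_{k=0}^{5} a_k S^k, \qquad
    \hat{a} = \arg\min_{a} \sum_{i=1}^{I} \left( e^{-r \Delta t} V_{t+\Delta t,\, i} - \sum_{k=0}^{5} a_k S_{t,i}^k \right)^2

Exercise on path i is then favored whenever the inner value h_t(S_{t,i}) exceeds C_t(S_{t,i}).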

In [1]:
def optionValue(S0, vol, T, K=40, M=50, I=4096, r=0.06):
    import numpy as np  # local import so that each cluster engine has NumPy available
    np.random.seed(150000)  # fix the seed for every valuation
    dt = T / M  # time interval
    df = np.exp(-r * dt)  # discount factor per time interval
    # Simulation of Index Levels
    S = np.zeros((M + 1, I), 'd')  # stock price matrix
    S[0, :] = S0  # initial values for stock price
    for t in range(1, M + 1):
        ran = np.random.standard_normal(I / 2)
        ran = np.concatenate((ran, -ran))  # antithetic variates
        ran = ran - np.mean(ran)  # correct first moment
        ran = ran / np.std(ran)  # correct second moment
        S[t, :] = S[t - 1, :] * np.exp((r - vol ** 2 / 2) * dt
                        + vol * ran * np.sqrt(dt))
    h = np.maximum(K - S, 0)  # inner values for put option
    V = np.zeros_like(h)  # value matrix
    V[-1] = h[-1]
    # Valuation by LSM
    for t in range(M - 1, 0, -1):
        rg = np.polyfit(S[t, :], V[t + 1, :] * df, 5)  # regression
        C = np.polyval(rg, S[t, :])  # evaluation of regression
        V[t, :] = np.where(h[t, :] > C, h[t, :],
                         V[t + 1, :] * df)  # exercise decision/optimization
    V0 = np.sum(V[1, :] * df) / I  # LSM estimator
    print "S0 %4.1f|vol %4.2f|T %2.1f| Option Value %8.3f" % (S0, vol, T, V0)
    return V0
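
As a quick sanity check (this call is not part of the original notebook), a single valuation should reproduce the first row of the output further below:

V = optionValue(36., 0.2, 1.0)
# expected output (cf. the table below):
# S0 36.0|vol 0.20|T 1.0| Option Value    4.500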

The Sequential Calculation

We want to replicate Table 1 of the seminal paper in full.

In [2]:
def seqValue():
    optionValues = []
    for S0 in (36., 38., 40., 42., 44.):  # initial stock price values
        for vol in (0.2, 0.4):  # volatility values
            for T in (1.0, 2.0):  # times-to-maturity
                optionValues.append(optionValue(S0, vol, T))
    return optionValues

Now we measure the time needed to value the 20 different American put options of Table 1 with sequential execution.

In [3]:
from time import time
t0 = time()
optionValues = seqValue()  # calculate all values
t1 = time(); d1 = t1 - t0
print "Duration in Seconds %6.3f" % d1
S0 36.0|vol 0.20|T 1.0| Option Value    4.500
S0 36.0|vol 0.20|T 2.0| Option Value    4.854
S0 36.0|vol 0.40|T 1.0| Option Value    7.112
S0 36.0|vol 0.40|T 2.0| Option Value    8.472
S0 38.0|vol 0.20|T 1.0| Option Value    3.262
S0 38.0|vol 0.20|T 2.0| Option Value    3.723
S0 38.0|vol 0.40|T 1.0| Option Value    6.112
S0 38.0|vol 0.40|T 2.0| Option Value    7.682
S0 40.0|vol 0.20|T 1.0| Option Value    2.305
S0 40.0|vol 0.20|T 2.0| Option Value    2.861
S0 40.0|vol 0.40|T 1.0| Option Value    5.258
S0 40.0|vol 0.40|T 2.0| Option Value    6.899
S0 42.0|vol 0.20|T 1.0| Option Value    1.569
S0 42.0|vol 0.20|T 2.0| Option Value    2.164
S0 42.0|vol 0.40|T 1.0| Option Value    4.544
S0 42.0|vol 0.40|T 2.0| Option Value    6.171
S0 44.0|vol 0.20|T 1.0| Option Value    1.020
S0 44.0|vol 0.20|T 2.0| Option Value    1.602
S0 44.0|vol 0.40|T 1.0| Option Value    3.912
S0 44.0|vol 0.40|T 2.0| Option Value    5.591
Duration in Seconds  2.673

The Parallel Calculation

First, start a local cluster if your machine has multiple cores.

In [4]:
# in the shell ...
####
# ipcluster start -n 8
####
# or more engines, if more cores are available

Enable parallel computing capabilities.

In [5]:
from IPython.parallel import Client
cluster_profile = "default"
# the profile is set to the default one
# see IPython docs for further details
c = Client(profile=cluster_profile)
view = c.load_balanced_view()
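
The number of engines actually registered can be checked via the client (this check is not in the original deck); the speed-up measured below depends directly on it:

print "Number of engines: %d" % len(c.ids)  # engines registered with the controller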

Again, a loop over the 20 options of Table 1, this time distributed asynchronously across the engines of the cluster.

In [6]:
def parValue():
    optionValues = []
    for S in (36., 38., 40., 42., 44.):
        for vol in (0.2, 0.4):
            for T in (1.0, 2.0):
                value = view.apply_async(optionValue, S, vol, T)
                # asynchronously calculate the option values
                optionValues.append(value)
    return optionValues

Now, we measure the time needed with parallel execution.

In [7]:
def execution():
    optionValues = parValue()  # calculate all values
    print "Submitted tasks %d" % len(optionValues)
    c.wait(optionValues)
    # wait for all tasks to be finished
    return optionValues
t0 = time()
optionValues = execution()
t1 = time(); d2 = t1 - t0
print "Duration in Seconds %6.3f" % d2
Submitted tasks 20
Duration in Seconds  0.654
In [8]:
d1 / d2  # speed-up of parallel execution
Out[8]:
4.084153618252551

Single option values can be accessed directly via the result attribute of the respective AsyncResult object.

In [9]:
optionValues[0].result
Out[9]:
4.4995294086976036

You can also inspect the metadata and access its individual data fields, much like with JSON data.

In [10]:
optionValues[0].metadata
Out[10]:
{'after': [],
 'completed': datetime.datetime(2013, 6, 18, 11, 24, 18, 801104),
 'data': {},
 'engine_id': 0,
 'engine_uuid': '7ab67f8d-1977-45a1-bc18-ecbb5c5ac028',
 'follow': [],
 'msg_id': '40682209-fa2a-412c-ae87-dd2bdb2c1c66',
 'outputs': [],
 'outputs_ready': True,
 'pyerr': None,
 'pyin': None,
 'pyout': None,
 'received': datetime.datetime(2013, 6, 18, 11, 24, 18, 805855),
 'started': datetime.datetime(2013, 6, 18, 11, 24, 18, 398935),
 'status': 'ok',
 'stderr': '',
 'stdout': 'S0 36.0|vol 0.20|T 1.0| Option Value    4.500\n',
 'submitted': datetime.datetime(2013, 6, 18, 11, 24, 18, 394793)}

The complete output of all 20 valuations:

In [11]:
for result in optionValues:
    print result.metadata['stdout'],
S0 36.0|vol 0.20|T 1.0| Option Value    4.500
S0 36.0|vol 0.20|T 2.0| Option Value    4.854
S0 36.0|vol 0.40|T 1.0| Option Value    7.112
S0 36.0|vol 0.40|T 2.0| Option Value    8.472
S0 38.0|vol 0.20|T 1.0| Option Value    3.262
S0 38.0|vol 0.20|T 2.0| Option Value    3.723
S0 38.0|vol 0.40|T 1.0| Option Value    6.112
S0 38.0|vol 0.40|T 2.0| Option Value    7.682
S0 40.0|vol 0.20|T 1.0| Option Value    2.305
S0 40.0|vol 0.20|T 2.0| Option Value    2.861
S0 40.0|vol 0.40|T 1.0| Option Value    5.258
S0 40.0|vol 0.40|T 2.0| Option Value    6.899
S0 42.0|vol 0.20|T 1.0| Option Value    1.569
S0 42.0|vol 0.20|T 2.0| Option Value    2.164
S0 42.0|vol 0.40|T 1.0| Option Value    4.544
S0 42.0|vol 0.40|T 2.0| Option Value    6.171
S0 44.0|vol 0.20|T 1.0| Option Value    1.020
S0 44.0|vol 0.20|T 2.0| Option Value    1.602
S0 44.0|vol 0.40|T 1.0| Option Value    3.912
S0 44.0|vol 0.40|T 2.0| Option Value    5.591
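
Since the random seed is fixed inside optionValue, the parallel results must match the sequential ones exactly; a minimal consistency check (not part of the original deck; re-running the sequential valuation prints the 20 lines again):

import numpy as np
par = np.array([v.result for v in optionValues])  # collect the parallel results
seq = np.array(seqValue())  # re-run sequentially for comparison
print "Max. absolute difference %.2e" % np.max(np.abs(par - seq))  # should be 0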

Collecting and Analyzing High-Frequency Financial Data

This example illustrates the collection and analysis first of daily and then of high-frequency stock price data – both "popular" topics in today's financial markets. It uses the powerful and convenient pandas library.

In [12]:
import pandas as pd
import pandas.io.data as pdd

Daily Stock Price Data

In [13]:
AAPL = pdd.DataReader('AAPL', data_source='yahoo', start='1/1/1984')  # retrieve stock prices since IPO 
In [14]:
AAPL
Out[14]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7256 entries, 1984-09-07 00:00:00 to 2013-06-17 00:00:00
Data columns:
Open         7256  non-null values
High         7256  non-null values
Low          7256  non-null values
Close        7256  non-null values
Volume       7256  non-null values
Adj Close    7256  non-null values
dtypes: float64(5), int64(1)
In [15]:
AAPL.head()
Out[15]:
             Open   High    Low  Close   Volume  Adj Close
Date
1984-09-07  26.50  26.87  26.25  26.50  2981600       2.96
1984-09-10  26.50  26.62  25.87  26.37  2346400       2.95
1984-09-11  26.62  27.37  26.62  26.87  5444000       3.00
1984-09-12  26.87  27.00  26.12  26.12  4773600       2.92
1984-09-13  27.50  27.62  27.50  27.50  7429600       3.07
In [16]:
AAPL['Close'].plot()
Out[16]:
<matplotlib.axes.AxesSubplot at 0x3b6cf10>
In [17]:
AAPL['log_ret'] = log(AAPL['Close'] / AAPL['Close'].shift(1))  # log is NumPy's log function (available here via IPython's pylab mode)
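
As an aside (not in the original notebook), the daily log returns directly yield an annualized volatility estimate, here assuming 252 trading days per year:

import numpy as np
vol_ann = AAPL['log_ret'].std() * np.sqrt(252)  # annualized volatility estimate
print "Annualized volatility %.4f" % vol_ann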
In [18]:
AAPL.head()
Out[18]:
             Open   High    Low  Close   Volume  Adj Close   log_ret
Date
1984-09-07  26.50  26.87  26.25  26.50  2981600       2.96       NaN
1984-09-10  26.50  26.62  25.87  26.37  2346400       2.95 -0.004918
1984-09-11  26.62  27.37  26.62  26.87  5444000       3.00  0.018783
1984-09-12  26.87  27.00  26.12  26.12  4773600       2.92 -0.028309
1984-09-13  27.50  27.62  27.50  27.50  7429600       3.07  0.051485
In [19]:
AAPL['log_ret'].describe()
Out[19]:
count    7255.000000
mean        0.000385
std         0.033125
min        -0.731247
25%        -0.014855
50%         0.000000
75%         0.015923
max         0.286796
In [20]:
AAPL['log_ret'].hist(bins=40)
Out[20]:
<matplotlib.axes.AxesSubplot at 0x3c0ddd0>

High-Frequency Stock Price Data

Using pandas, the following reads intraday, high-frequency data from a Web source, plots it, and resamples it.

In [21]:
url = 'http://hopey.netfonds.no/posdump.php?date=20130614&paper=AAPL.O&csv_format=csv'
# you may have to adjust the date since only recent dates are available
AAPL = pd.read_csv(url, index_col=0, header=0, parse_dates=True)
In [22]:
AAPL
Out[22]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 12533 entries, 2013-06-14 01:16:55 to 2013-06-14 23:19:32
Data columns:
bid                  12533  non-null values
bid_depth            12533  non-null values
bid_depth_total      12533  non-null values
offer                12533  non-null values
offer_depth          12533  non-null values
offer_depth_total    12533  non-null values
dtypes: float64(2), int64(4)
In [23]:
AAPL['bid'].plot()
Out[23]:
<matplotlib.axes.AxesSubplot at 0x41fc2d0>
In [24]:
import datetime
AAPL = AAPL[AAPL.index > datetime.datetime(2013, 6, 14, 12, 0, 0)]  # take a subset of the data
In [25]:
# this resamples the record frequency to 10 minutes, using mean as aggregation rule
AAPL_10min = AAPL.resample(rule='10min', how='mean')
AAPL_10min.head()
Out[25]:
                            bid  bid_depth  bid_depth_total       offer  offer_depth  offer_depth_total
time
2013-06-14 12:10:00  435.300000        100              100  435.740000   200.000000         200.000000
2013-06-14 12:20:00  435.250000        100              100  435.860000   100.000000         100.000000
2013-06-14 12:30:00  435.200000        100              100  435.860000   100.000000         100.000000
2013-06-14 12:40:00  435.216667        100              100  435.860000   133.333333         133.333333
2013-06-14 12:50:00  434.955000        200              200  435.774167   191.666667         191.666667
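
Other aggregation rules work the same way; for example, taking the last observation of each 10-minute bucket instead of the mean (a sketch using the same, era-appropriate resample API):

AAPL_10min_last = AAPL.resample(rule='10min', how='last')  # last tick per bucket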
In [26]:
AAPL_10min['bid'].plot()
Out[26]:
<matplotlib.axes.AxesSubplot at 0x4523d10>
In [27]:
# tongue-in-cheek: a function management might apply to reverse fortunes ...
def f(x):
    return 432 - x + 432  # mirrors the bid path at the 432 level
In [28]:
AAPL_10min['bid'].apply(f).plot()
Out[28]:
<matplotlib.axes.AxesSubplot at 0x45449d0>

Fast Disk-Based Data Analytics

This example illustrates typical data analytics tasks based on the PyTables library, which builds on the HDF5 data format standard.

In [29]:
import tables as tb
In [30]:
h5 = tb.open_file('/home/yhilpisch/Temp/example.h5', 'w')  # create (or overwrite) the HDF5 file
In [31]:
row_des = {
    'DATE': tb.StringCol(26, pos=1),
    'NO1': tb.Float64Col(pos=2),
    'NO2': tb.Float64Col(pos=3),
    'NO3': tb.Float64Col(pos=4),
    'NO4': tb.Float64Col(pos=5),
    }
filters = tb.Filters(complevel=2)
In [32]:
rows = 10000000
tab = h5.create_table('/', 'dummy_data', title='table', description=row_des, expectedrows=rows, filters=filters)
In [33]:
# populate the table with dummy data
r = tab.row
def genTab():
    for i in range(rows):
        rand = standard_normal(4)  # NumPy's standard_normal (pylab namespace)
        r['DATE'] = datetime.datetime.now()
        r['NO1'], r['NO2'], r['NO3'], r['NO4'] = rand
        r.append()
    tab.flush()  # write the buffered rows to disk
%time genTab()
tab
CPU times: user 1min 40s, sys: 11.4 s, total: 1min 52s
Wall time: 1min 54s
Out[33]:
/dummy_data (Table(10000000,), shuffle, zlib(2)) 'table'
  description := {
  "DATE": StringCol(itemsize=26, shape=(), dflt='', pos=0),
  "NO1": Float64Col(shape=(), dflt=0.0, pos=1),
  "NO2": Float64Col(shape=(), dflt=0.0, pos=2),
  "NO3": Float64Col(shape=(), dflt=0.0, pos=3),
  "NO4": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (4519,)
In [34]:
# looking into the table
tab[:3]
Out[34]:
array([ ('2013-06-18 11:24:28.272186', 1.1203410743749882, -0.8352379153375216, -1.2919781660873217, -1.6348680542132725),
       ('2013-06-18 11:24:28.272277', -0.5462648679055429, 0.8278247279360921, -0.6459374792177821, 1.2654821119198476),
       ('2013-06-18 11:24:28.272299', -1.388790769970219, 0.4573883550857721, 0.578149246726677, 0.04427853797707416)], 
      dtype=[('DATE', '|S26'), ('NO1', '<f8'), ('NO2', '<f8'), ('NO3', '<f8'), ('NO4', '<f8')])
In [35]:
# histogram of a column's values
hist(tab.cols.NO1[::100], bins=30)
grid(True)
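
Before running selections like the following one, PyTables can optionally index the queried columns to speed up repeated queries (a sketch, assuming PyTables 3.x naming; building the index costs extra time and disk space):

tab.cols.NO1.create_csindex(filters=filters)  # completely sorted index on NO1
tab.cols.NO2.create_csindex(filters=filters)  # ... and on NO2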
In [36]:
# a SQL-like query for data from two columns
%time res = array([(row['NO1'], row['NO2']) for row in tab.where('((NO1 < -0.5) | (NO1 > 0.5)) & ((NO2 < -1) | (NO2 > 1))')])
CPU times: user 14.5 s, sys: 144 ms, total: 14.6 s
Wall time: 14.2 s
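
An equivalent one-call alternative (not shown in the original) is Table.read_where, which returns the selected rows as a NumPy record array:

recs = tab.read_where('((NO1 < -0.5) | (NO1 > 0.5)) & ((NO2 < -1) | (NO2 > 1))')
res_alt = array([recs['NO1'], recs['NO2']]).T  # same shape as res above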
In [37]:
res = res[::100]  # keep only every 100th record
res[:5]
Out[37]:
array([[-0.92639639,  1.76924943],
       [-0.86429104,  1.99071732],
       [ 0.79151535,  2.18724794],
       [ 0.64515076,  3.6376278 ],
       [-0.77100211,  2.84507245]])
In [38]:
plot(res.T[0], res.T[1], 'ro')
grid(True)

In-Memory Analytics with pandas

This example illustrates in-memory data analytics with pandas – based on the HDF5 table generated previously.

In [39]:
import pandas as pd
%time data = pd.DataFrame(tab.read())
data.index = data['DATE']
del data['DATE']
h5.close()
CPU times: user 6.34 s, sys: 512 ms, total: 6.85 s
Wall time: 6.86 s
In [40]:
data
Out[40]:
<class 'pandas.core.frame.DataFrame'>
Index: 10000000 entries, 2013-06-18 11:24:28.272186 to 2013-06-18 11:26:22.263551
Data columns:
NO1    10000000  non-null values
NO2    10000000  non-null values
NO3    10000000  non-null values
NO4    10000000  non-null values
dtypes: float64(4)
In [41]:
%time data.hist(grid=True, bins=30)
CPU times: user 5.45 s, sys: 248 ms, total: 5.7 s
Wall time: 5.7 s
Out[41]:
array([[Axes(0.125,0.563043;0.336957x0.336957),
        Axes(0.563043,0.563043;0.336957x0.336957)],
       [Axes(0.125,0.125;0.336957x0.336957),
        Axes(0.563043,0.125;0.336957x0.336957)]], dtype=object)
In [42]:
%time res = data[['NO1', 'NO2']][(data.NO1 < -0.5) & (data.NO2 < -1)]
CPU times: user 212 ms, sys: 60 ms, total: 272 ms
Wall time: 272 ms
In [43]:
plot(res.NO1, res.NO2, 'ro')
grid(True)
In [44]:
data['SIGN'] = array(data['NO1']).round(0)  # generates a new column: NO1 rounded to whole numbers
In [45]:
data.head()
Out[45]:
                                 NO1       NO2       NO3       NO4  SIGN
DATE
2013-06-18 11:24:28.272186  1.120341 -0.835238 -1.291978 -1.634868     1
2013-06-18 11:24:28.272277 -0.546265  0.827825 -0.645937  1.265482    -1
2013-06-18 11:24:28.272299 -1.388791  0.457388  0.578149  0.044279    -1
2013-06-18 11:24:28.272310 -0.926396  1.769249 -1.341341  0.902475    -1
2013-06-18 11:24:28.272320 -0.446076 -1.578122 -0.815648  0.129584    -0
In [46]:
%time data.groupby(['SIGN']).count()
CPU times: user 1.74 s, sys: 736 ms, total: 2.48 s
Wall time: 2.48 s
Out[46]:
           NO1      NO2      NO3      NO4     SIGN
SIGN
-6           1        1        1        1        1
-5          25       25       25       25       25
-4        2328     2328     2328     2328     2328
-3       59545    59545    59545    59545    59545
-2      604778   604778   604778   604778   604778
-1     2417467  2417467  2417467  2417467  2417467
-0     3832749  3832749  3832749  3832749  3832749
 1     2417036  2417036  2417036  2417036  2417036
 2      604321   604321   604321   604321   604321
 3       59414    59414    59414    59414    59414
 4        2304     2304     2304     2304     2304
 5          32       32       32       32       32
In [47]:
data.groupby(['SIGN']).mean()
Out[47]:
            NO1       NO2       NO3       NO4
SIGN
-6    -5.586834  0.057630  0.671113 -1.092833
-5    -4.658663  0.078163 -0.161723 -0.246599
-4    -3.737356  0.009310  0.002855 -0.017930
-3    -2.786637 -0.003414 -0.002366  0.008004
-2    -1.847916  0.002868  0.001307 -0.001836
-1    -0.920871 -0.001191  0.000869 -0.001073
-0    -0.000151  0.000185  0.000875  0.000917
 1     0.920679 -0.001195  0.000605  0.001957
 2     1.848316 -0.001816  0.000322  0.000366
 3     2.786931  0.003385  0.001877 -0.004025
 4     3.735813 -0.009022  0.011115  0.006007
 5     4.720344  0.053851 -0.102913  0.162162
In [48]:
data.boxplot(column=['NO1'], by=['SIGN'])
Out[48]:
<matplotlib.axes.AxesSubplot at 0x522b150>
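
A natural closing step (not part of the original deck) is to persist the enriched DataFrame with pandas' own HDF5 interface, the HDFStore; the target path is hypothetical:

store = pd.HDFStore('/home/yhilpisch/Temp/data.h5', 'w')  # hypothetical file path
store['data'] = data  # store the complete DataFrame
store.close()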

The Python Quants



The Python Quants GmbH – Experts for Python in Quant Finance

The Python Quants GmbH – the company Web site

www.pythonquants.com

Dr. Yves J. Hilpisch – my personal Web site

www.hilpisch.com

Derivatives Analytics with Python – my new book

Read an Excerpt and Order the Book

Contact Us

yves@pythonquants.com

@dyjh