The Python Quants

Interactive Financial Analytics with Python & IPython

Tutorial with Examples based on the VSTOXX Volatility Index

Dr. Yves J. Hilpisch

The Python Quants GmbH

www.pythonquants.com

yves@pythonquants.com

@dyjh

For Python Quants – 14. March 2014

You can find the presentation and the IPython Notebook here:

About Me

A brief bio:

Managing Partner of The Python Quants
Founder of Visixion GmbH – The Python Quants
Lecturer in Mathematical Finance at Saarland University
Focus on Financial Industry and Financial Analytics
Book "Derivatives Analytics with Python" (2013)
Book "Python for Finance" O'Reilly (2014)
Dr. rer. pol. in Mathematical Finance
Graduate in Business Administration
Martial Arts Practitioner and Fan

See www.hilpisch.com.

Python for Finance

Python for Finance (O'Reilly Shop)

Python for Analytics

This tutorial focuses on

Python as a general purpose financial analytics environment
interactive analytics examples
prototyping-like Python usage

It does not address important issues such as

architectural issues regarding hardware and software
development processes, testing, documentation and production
real world problem modeling

A fundamental Python stack for interactive data analytics and visualization should contain at least the following libraries and tools:

Python – the Python interpreter itself
NumPy – high performance, flexible array structures and operations
SciPy – collection of scientific modules and functions (e.g. for regression, optimization, integration)
pandas – time series and panel data analysis and I/O
PyTables – hierarchical, high performance database (e.g. for out-of-memory analytics)
matplotlib – 2d and 3d visualization
IPython – interactive data analytics, visualization, publishing

It is best to use a Python distribution such as Anaconda to ensure that the library versions are consistent with each other.

First Financial Analytics Example

We need to make a couple of imports for what is to come.

In [1]:

import numpy as np
import pandas as pd
import pandas.io.data as pdd
from urllib import urlretrieve
%matplotlib inline
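
The imports above reflect the Python 2 setup used for this tutorial. Under Python 3, urlretrieve lives in urllib.request, and pandas.io.data was later moved out of pandas into the separate pandas-datareader package. A minimal compatibility sketch for the urlretrieve import, assuming nothing beyond the standard library:

```python
# Compatibility sketch: locate urlretrieve under both Python 3 and Python 2.
try:
    from urllib.request import urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlretrieve  # Python 2 location, as used in this tutorial

# the imported name is an ordinary function in both cases
print(callable(urlretrieve))
```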

The convenience function DataReader makes it easy to read historical stock price data from Yahoo! Finance (http://finance.yahoo.com).

In [2]:

try:
    # e.g. the EURO STOXX 50 ticker symbol is ^SX5E
    index = pdd.DataReader('^GDAXI', data_source='yahoo', start='2007/3/30')
except Exception:
    # fall back to a local copy of the data if the download fails
    index = pd.read_csv('dax.txt', index_col=0, parse_dates=True)

In [3]:

index.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1775 entries, 2007-03-30 00:00:00 to 2014-03-11 00:00:00
Data columns (total 6 columns):
Open         1775 non-null float64
High         1775 non-null float64
Low          1775 non-null float64
Close        1775 non-null float64
Volume       1775 non-null int64
Adj Close    1775 non-null float64
dtypes: float64(5), int64(1)

A strength of pandas is the handling of indexed/labeled/structured data, like time series data.

In [4]:

index.tail()

Out[4]:

	Open	High	Low	Close	Volume	Adj Close
Date
2014-03-05	9562.39	9599.00	9534.43	9542.02	73341700	9542.02
2014-03-06	9577.35	9587.44	9505.32	9542.87	103682600	9542.87
2014-03-07	9538.45	9543.24	9346.82	9350.75	103246700	9350.75
2014-03-10	9305.51	9382.98	9216.07	9265.50	84875400	9265.50
2014-03-11	9295.32	9375.29	9259.18	9307.79	72300800	9307.79

5 rows × 6 columns

pandas makes it easy to implement vectorized operations, like calculating log-returns over whole time series.

In [5]:

index['Returns'] = np.log(index['Close'] / index['Close'].shift(1))
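
To see what the shift(1) idiom does, here is a small sketch on a hypothetical price series (the four prices are made up for illustration and are not DAX data). It also shows a useful property of log returns, namely that they add up over time.

```python
import numpy as np
import pandas as pd

# hypothetical closing prices, for illustration only
close = pd.Series([100.0, 102.0, 99.0, 101.0],
                  index=pd.date_range('2014-03-03', periods=4, freq='B'))

# shift(1) moves every value one row down, so each price is divided by
# the previous day's price; the first return is therefore NaN
log_rets = np.log(close / close.shift(1))

# log returns are additive: their sum (NaN is skipped) equals the log of
# the last price divided by the first price
total = log_rets.sum()
print(log_rets)
print(total, np.log(close.iloc[-1] / close.iloc[0]))
```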

In addition, pandas makes plotting quite simple and compact.

In [6]:

index[['Close', 'Returns']].plot(subplots=True, style='b', figsize=(8, 5))

Out[6]:

array([<matplotlib.axes.AxesSubplot object at 0x10737fb10>,
       <matplotlib.axes.AxesSubplot object at 0x10756ab10>], dtype=object)

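We now want to check how annualized volatility changes over time. We estimate it as the rolling 252-day standard deviation of the daily log returns, scaled by the square root of 252.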

In [7]:

index['Mov_Vol'] = pd.rolling_std(index['Returns'], window=252) * np.sqrt(252)
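
The pd.rolling_std function above belongs to the pandas version used for this tutorial. More recent pandas releases expose the same calculation through the rolling method. A minimal sketch on synthetic returns (the 1% daily standard deviation is an assumption chosen for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic daily log returns with 1% daily standard deviation (assumption)
rets = pd.Series(rng.normal(0.0, 0.01, 1000))

# rolling 252-day standard deviation, scaled to a yearly horizon;
# the first 251 values are NaN because the window is not yet full
mov_vol = rets.rolling(window=252).std() * np.sqrt(252)
print(mov_vol.dropna().head())
```

With a true daily standard deviation of 1%, the annualized values should fluctuate around 0.01 * sqrt(252), i.e. roughly 16%.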

Obviously, the annualized volatility changes significantly over time.

In [8]:

index[['Close', 'Returns', 'Mov_Vol']].plot(subplots=True, style='b', figsize=(8, 5))

Out[8]:

array([<matplotlib.axes.AxesSubplot object at 0x1075b4cd0>,
       <matplotlib.axes.AxesSubplot object at 0x10291c950>,
       <matplotlib.axes.AxesSubplot object at 0x1029414d0>], dtype=object)

Exercise

Trend-based investment strategy with the EURO STOXX 50 index:

2 trends 42d & 252d
long, short, cash positions
no transaction costs

Signal generation:

invest (go long) when the 42d trend is more than 100 points above the 252d trend
sell (go short) when the 42d trend is more than 20 points below the 252d trend
hold cash (no interest) when neither of the two conditions is true
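
As a possible starting point for the signal generation, here is a hedged sketch. It assumes that "trend" means a simple moving average; the function name trend_signals and its parameters are hypothetical, and the price series are synthetic. The backtest itself is left to the exercise.

```python
import numpy as np
import pandas as pd

def trend_signals(close, short=42, long=252, up=100.0, down=20.0):
    """Return +1 (long), -1 (short) or 0 (cash) for every day.

    Assumption: the trends are simple moving averages of the index level.
    """
    diff = close.rolling(short).mean() - close.rolling(long).mean()
    # comparisons with NaN are False, so days without a full
    # 252-day window end up in cash (0)
    signal = np.where(diff > up, 1, np.where(diff < -down, -1, 0))
    return pd.Series(signal, index=close.index)

# synthetic, steadily rising and steadily falling index levels
rising = pd.Series(np.linspace(2000.0, 4000.0, 600))
falling = pd.Series(np.linspace(4000.0, 2000.0, 600))

print(trend_signals(rising).value_counts())
print(trend_signals(falling).value_counts())
```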

Historical Correlation between EURO STOXX 50 and VSTOXX

It is a stylized fact that stock indexes and related volatility indexes are highly negatively correlated. The following example analyzes this stylized fact based on the EURO STOXX 50 stock index and the VSTOXX volatility index using Ordinary Least-Squares regression (OLS).

First, we collect historical data for both the EURO STOXX 50 stock index and the VSTOXX volatility index.

In [9]:

import pandas as pd
import datetime as dt
from urllib import urlretrieve

In [10]:

es_url = 'http://www.stoxx.com/download/historical_values/hbrbcpe.txt'
vs_url = 'http://www.stoxx.com/download/historical_values/h_vstoxx.txt'
try:
    urlretrieve(es_url, 'es.txt')
    urlretrieve(vs_url, 'vs.txt')
except Exception:
    # download failed; continue with previously saved local copies, if any
    pass

The EURO STOXX 50 data is not yet in the right format. Some house cleaning is necessary (I).

In [11]:

lines = open('es.txt').readlines()  # reads the whole file line-by-line

In [12]:

lines[:5]  # header not well formatted

Out[12]:

['Price Indices - EURO Currency\n',
 'Date    ;Blue-Chip;Blue-Chip;Broad    ; Broad   ;Ex UK    ;Ex Euro Zone;Blue-Chip; Broad\n',
 '        ;  Europe ;Euro-Zone;Europe   ;Euro-Zone;         ;            ; Nordic  ; Nordic\n',
 '        ;  SX5P   ;  SX5E   ;SXXP     ;SXXE     ; SXXF    ;    SXXA    ;    DK5F ; DKXF\n',
 '31.12.1986;775.00 ;  900.82 ;   82.76 ;   98.58 ;   98.06 ;   69.06 ;  645.26  ;  65.56\n']

The EURO STOXX 50 data is not yet in the right format. Some house cleaning is necessary (II).

In [13]:

lines[3883:3890]  # from 27.12.2001 on, lines end with an additional semicolon

Out[13]:

['20.12.2001;3537.34;  3617.47;   286.07;   300.97;   317.10;   267.23;  5268.36 ;  363.19\n',
 '21.12.2001;3616.80;  3696.44;   291.39;   306.60;   322.55;   272.18;  5360.52 ;  370.94\n',
 '24.12.2001;3622.85;  3696.98;   291.90;   306.77;   322.69;   272.95;  5360.52 ;  370.94\n',
 '27.12.2001;3686.23;  3778.39;   297.11;   312.43;   327.57;   277.68;  5479.59;   378.69;\n',
 '28.12.2001;3706.93;  3806.13;   298.73;   314.52;   329.94;   278.87;  5585.35;   386.99;\n',
 '02.01.2002;3627.81;  3755.56;   293.69;   311.43;   326.77;   272.38;  5522.25;   380.09;\n',
 '03.01.2002;3699.09;  3833.09;   299.09;   317.54;   332.62;   277.08;  5722.57;   396.12;\n']

The EURO STOXX 50 data is not yet in the right format. Some house cleaning is necessary (III).

In [14]:

lines = open('es.txt').readlines()  # reads the whole file line-by-line
with open('es50.txt', 'w') as new_file:  # opens a new file, closes it when done
    # writes the corrected fourth line (index 3) of the original file,
    # with an additional column name, as the first line of the new file
    new_file.write('date' + lines[3][:-1].replace(' ', '') + ';DEL' + lines[3][-1])
    # writes the remaining lines of the original file
    new_file.writelines(lines[4:])

The EURO STOXX 50 data is not yet in the right format. Some house cleaning is necessary (IV).

In [15]:

list(open('es50.txt'))[:5]  # opens the new file for inspection

Out[15]:

['date;SX5P;SX5E;SXXP;SXXE;SXXF;SXXA;DK5F;DKXF;DEL\n',
 '31.12.1986;775.00 ;  900.82 ;   82.76 ;   98.58 ;   98.06 ;   69.06 ;  645.26  ;  65.56\n',
 '01.01.1987;775.00 ;  900.82 ;   82.76 ;   98.58 ;   98.06 ;   69.06 ;  645.26  ;  65.56\n',
 '02.01.1987;770.89 ;  891.78 ;   82.57 ;   97.80 ;   97.43 ;   69.37 ;  647.62  ;  65.81\n',
 '05.01.1987;771.89 ;  898.33 ;   82.82 ;   98.60 ;   98.19 ;   69.16 ;  649.94  ;  65.82\n']

Now, the data can be safely read into a DataFrame object.

In [16]:

es = pd.read_csv('es50.txt', index_col=0, parse_dates=True, sep=';', dayfirst=True)

In [17]:

del es['DEL']  # delete the helper column
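
Why the helper column works can be seen on a tiny example. The sketch below uses two shortened, hypothetical lines in the style of the raw file (only two data columns, no padding spaces): one without and one with the trailing semicolon. The extra header name DEL absorbs the empty trailing field, and the helper column can then be deleted.

```python
import io
import pandas as pd

# shortened sample in the style of es50.txt (illustration only)
raw = io.StringIO(
    "date;SX5P;SX5E;DEL\n"
    "24.12.2001;3622.85;3696.98\n"    # three fields, no trailing semicolon
    "27.12.2001;3686.23;3778.39;\n"   # trailing semicolon -> empty 4th field
)

es_demo = pd.read_csv(raw, index_col=0, parse_dates=True,
                      sep=';', dayfirst=True)
print(es_demo)   # the DEL column contains only NaN values

del es_demo['DEL']  # delete the helper column
print(es_demo)
```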

In [18]:

es.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6997 entries, 1986-12-31 00:00:00 to 2014-02-18 00:00:00
Data columns (total 8 columns):
SX5P    6997 non-null float64
SX5E    6997 non-null float64
SXXP    6997 non-null float64
SXXE    6997 non-null object
SXXF    6996 non-null float64
SXXA    6996 non-null float64
DK5F    6996 non-null float64
DKXF    6996 non-null float64
dtypes: float64(7), object(1)

The VSTOXX data can be read without touching the raw data.

In [19]:

vs = pd.read_csv('vs.txt', index_col=0, header=2, parse_dates=True, sep=',', dayfirst=True)

# you can alternatively read from the Web source directly
# without saving the csv file to disk:
# vs = pd.read_csv(vs_url, index_col=0, header=2,
#                  parse_dates=True, sep=',', dayfirst=True)

We now merge the data for further analysis.

In [20]:

import datetime as dt
data = pd.DataFrame({'EUROSTOXX' :
            es['SX5E'][es.index > dt.datetime(1999, 12, 31)]})
data = data.join(pd.DataFrame({'VSTOXX' :
            vs['V2TX'][vs.index > dt.datetime(1999, 12, 31)]}))
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3622 entries, 2000-01-03 00:00:00 to 2014-02-18 00:00:00
Data columns (total 2 columns):
EUROSTOXX    3622 non-null float64
VSTOXX       3600 non-null float64
dtypes: float64(2)

Let's inspect the two time series.

In [21]:

data.head()

Out[21]:

	EUROSTOXX	VSTOXX
date
2000-01-03	4849.22	30.9845
2000-01-04	4657.83	33.2225
2000-01-05	4541.75	32.5944
2000-01-06	4500.69	31.1811
2000-01-07	4648.27	27.4407

5 rows × 2 columns

A picture can tell almost the complete story.

In [22]:

data.plot(subplots=True, grid=True, style='b', figsize=(10, 5))

Out[22]:

array([<matplotlib.axes.AxesSubplot object at 0x108087d90>,
       <matplotlib.axes.AxesSubplot object at 0x107517850>], dtype=object)

We now generate log returns for both time series.

In [23]:

rets = np.log(data / data.shift(1)) 
rets.head()

Out[23]:

	EUROSTOXX	VSTOXX
date
2000-01-03	NaN	NaN
2000-01-04	-0.040268	0.069740
2000-01-05	-0.025237	-0.019087
2000-01-06	-0.009082	-0.044328
2000-01-07	0.032264	-0.127785

5 rows × 2 columns

To this new data set, also stored in a DataFrame object, we apply OLS.

In [24]:

xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
model = pd.ols(y=ydat, x=xdat)
model

Out[24]:


-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         3577
Number of Degrees of Freedom:   2

R-squared:         0.5544
Adj R-squared:     0.5543

Rmse:              0.0379

F-stat (1, 3575):  4447.8744, p-value:     0.0000

Degrees of Freedom: model 1, resid 3575

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -2.7183     0.0408     -66.69     0.0000    -2.7982    -2.6384
     intercept    -0.0007     0.0006      -1.10     0.2704    -0.0019     0.0005
---------------------------------End of Summary---------------------------------
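
The pd.ols function belongs to the pandas version used for this tutorial and is not available in more recent pandas releases. As a hedged alternative, the slope, intercept and R-squared of such a one-factor regression can be reproduced with NumPy alone. The sketch uses synthetic returns; the slope of -2.7 and the noise levels are assumptions chosen to resemble the setting, not estimates from the real data.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic stock index log returns (assumption: 1.5% daily std)
x = rng.normal(0.0, 0.015, 2000)
# synthetic volatility index log returns, negatively related to x
y = -2.7 * x + rng.normal(0.0, 0.04, 2000)

# drop observations where either series is missing, as OLS requires
mask = ~(np.isnan(x) | np.isnan(y))

# a degree-1 polynomial fit is an OLS regression of y on x with intercept
slope, intercept = np.polyfit(x[mask], y[mask], deg=1)

# for a single regressor, R-squared is the squared correlation coefficient
r_squared = np.corrcoef(x[mask], y[mask])[0, 1] ** 2

print('slope=%.4f intercept=%.4f R-squared=%.4f'
      % (slope, intercept, r_squared))
```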

Again, we want to see how our results look graphically.

In [25]:

import matplotlib.pyplot as plt
plt.plot(xdat, ydat, 'r.')
ax = plt.axis()  # grab axis values
x = np.linspace(ax[0], ax[1] + 0.01)
plt.plot(x, model.beta[1] + model.beta[0] * x, 'b', lw=2)
plt.grid(True)
plt.axis('tight')

Out[25]:

(-0.10000000000000001, 0.16, -0.43562265909764758, 0.43687964474802654)

Let us see whether we can identify systematic patterns over time. Indeed, during the 2007/2008 crisis (yellow dots), volatility was more pronounced than more recently (red dots).

In [26]:

import matplotlib as mpl
mpl_dates = mpl.dates.date2num(rets.index)
plt.figure(figsize=(8, 4))
plt.scatter(rets['EUROSTOXX'], rets['VSTOXX'], c=mpl_dates, marker='o')
plt.grid(True)
plt.xlabel('EUROSTOXX')
plt.ylabel('VSTOXX')
plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
          format=mpl.dates.DateFormatter('%d %b %y'))

Out[26]:

<matplotlib.colorbar.Colorbar instance at 0x10b2a17a0>

Exercise

We want to test whether the EURO STOXX 50 and/or the VSTOXX returns are normally distributed or not (e.g. if they might have fat tails). We want to do a

graphical illustration (using qqplot of statsmodels.api) and a
statistical test (using normaltest of scipy.stats)

Add on: plot a histogram of the log return frequencies and compare that to a normal distribution with same mean and variance (using e.g. norm.pdf from scipy.stats)
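
To get a feeling for the statistical test before applying it to the index returns, here is a small sketch on synthetic samples. It compares a normally distributed sample with a fat-tailed one (a Student-t distribution with 3 degrees of freedom, chosen here as an assumption for illustration). The application to the real return data remains the exercise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(0.0, 0.01, 5000)
# Student-t with 3 degrees of freedom has much fatter tails than a normal
fat_tailed_sample = 0.01 * rng.standard_t(df=3, size=5000)

for name, sample in [('normal', normal_sample),
                     ('fat-tailed', fat_tailed_sample)]:
    # null hypothesis of normaltest: the sample comes from a normal distribution
    stat, p_value = stats.normaltest(sample)
    print('%-10s  statistic=%10.2f  p-value=%.4f' % (name, stat, p_value))
```

A very small p-value leads to a rejection of the normality hypothesis. Real return series contain NaN values after the shift operation, so they should be cleaned (e.g. with dropna) before testing.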

Constant Proportion VSTOXX Investment

A number of studies have illustrated that constant proportion investments in volatility derivatives – given a diversified equity portfolio – might improve investment performance considerably. See, for instance, the study

The Benefits of Volatility Derivatives in Equity Portfolio Management

We now want to replicate (in a simplified fashion) what you can flexibly test here on the basis of two backtesting applications for VSTOXX-based investment strategies:

Two Assets Backtesting

Four Assets Backtesting

The strategy we are going to implement and test is characterized as follows:

An investor has total wealth of say 100,000 EUR
He invests, say, 70% of that into a diversified equity portfolio
The remainder, i.e. 30%, is invested in the VSTOXX index directly
Through (daily) trading the investor keeps the proportions constant
No transaction costs apply, all assets are infinitely divisible

We already have the necessary data available. However, we want to drop 'NaN' values and want to normalize the index values.

In [27]:

data = data.dropna()

In [28]:

data = data / data.iloc[0] * 100  # normalize to 100 using the first row

In [29]:

data.head()

Out[29]:

	EUROSTOXX	VSTOXX
date
2000-01-03	100.000000	100.000000
2000-01-04	96.053180	107.222966
2000-01-05	93.659393	105.195824
2000-01-06	92.812659	100.634511
2000-01-07	95.856035	88.562668

5 rows × 2 columns

First, the initial investment.

In [30]:

invest = 100
cratio = 0.3
data['Equity'] = (1 - cratio) * invest / data['EUROSTOXX'][0]
data['Volatility'] = cratio * invest / data['VSTOXX'][0]

This can already be considered a static investment strategy.

In [31]:

data['Static'] = (data['Equity'] * data['EUROSTOXX']
                + data['Volatility'] * data['VSTOXX'])

In [32]:

data[['EUROSTOXX', 'Static']].plot(figsize=(10, 5))

Out[32]:

<matplotlib.axes.AxesSubplot at 0x10b427d10>

Second, the dynamic strategy with daily adjustments to keep the value ratio constant.

In [33]:

for i in range(1, len(data)):
    evalue = data['Equity'][i - 1] * data['EUROSTOXX'][i]
      # value of equity position
    vvalue = data['Volatility'][i - 1] * data['VSTOXX'][i]
      # value of volatility position
    tvalue = evalue + vvalue
      # total wealth 
    data['Equity'][i] = (1 - cratio) * tvalue / data['EUROSTOXX'][i]
      # re-allocation of total wealth to equity ...
    data['Volatility'][i] = cratio * tvalue / data['VSTOXX'][i]
      # ... and volatility position
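
The loop above implies a simple recursion: because the weights are reset every day, total wealth grows each day by the weighted average of the two gross index returns. The following sketch expresses this with NumPy and applies it to the first five normalized index levels from the table above. The function name cp_wealth is hypothetical, and the sketch is one possible formulation, not the only one.

```python
import numpy as np

def cp_wealth(equity, vol, cratio=0.3, invest=100.0):
    """Wealth path of a daily rebalanced constant proportion strategy."""
    equity = np.asarray(equity, dtype=float)
    vol = np.asarray(vol, dtype=float)
    # gross daily growth factor of the rebalanced portfolio:
    # weighted average of the gross returns of both assets
    growth = ((1 - cratio) * equity[1:] / equity[:-1]
              + cratio * vol[1:] / vol[:-1])
    # wealth starts at the initial investment and compounds daily
    return invest * np.concatenate(([1.0], np.cumprod(growth)))

# first five normalized index levels from the table above
eurostoxx = [100.000000, 96.053180, 93.659393, 92.812659, 95.856035]
vstoxx = [100.000000, 107.222966, 105.195824, 100.634511, 88.562668]

wealth = cp_wealth(eurostoxx, vstoxx)
print(wealth)
```

Since no positions need to be stored, this formulation avoids the element-wise assignments into the DataFrame.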

Third, the total wealth position.

In [34]:

data['Dynamic'] = (data['Equity'] * data['EUROSTOXX']
                + data['Volatility'] * data['VSTOXX'])

In [35]:

data.head()

Out[35]:

	EUROSTOXX	VSTOXX	Equity	Volatility	Static	Dynamic
date
2000-01-03	100.000000	100.000000	0.700000	0.300000	100.000000	100.000000
2000-01-04	96.053180	107.222966	0.724420	0.278124	99.404116	99.404116
2000-01-05	93.659393	105.195824	0.725761	0.276930	97.120322	97.106211
2000-01-06	92.812659	100.634511	0.718221	0.283884	95.159214	95.228521
2000-01-07	95.856035	88.562668	0.686354	0.318376	93.668025	93.987330

5 rows × 6 columns

A brief check whether the ratios are indeed constant.

In [36]:

(data['Volatility'] * data['VSTOXX'] / data['Dynamic'])[:5]

Out[36]:

date
2000-01-03    0.3
2000-01-04    0.3
2000-01-05    0.3
2000-01-06    0.3
2000-01-07    0.3
dtype: float64

In [37]:

(data['Equity'] * data['EUROSTOXX'] / data['Dynamic'])[:5]

Out[37]:

date
2000-01-03    0.7
2000-01-04    0.7
2000-01-05    0.7
2000-01-06    0.7
2000-01-07    0.7
dtype: float64

Let us inspect the performance of the strategy.

In [38]:

data[['EUROSTOXX', 'Dynamic']].plot(figsize=(10, 5))

Out[38]:

<matplotlib.axes.AxesSubplot at 0x10b0d3410>

Exercise

Write a Python function which allows for an arbitrary but constant ratio to be invested in the VSTOXX index and which returns net performance values (in percent) for the constant proportion VSTOXX strategy.

Add on: find the ratio to be invested in the VSTOXX that gives the maximum performance.

Analyzing High Frequency Data

Using standard Python functionality and pandas, the code that follows reads intraday, high-frequency data from a Web source, plots it and resamples it.

In [39]:

url = 'http://hopey.netfonds.no/posdump.php?'
url += 'date=%s%s%s&paper=AAPL.O&csv_format=csv' % ('2014', '03', '12')
# you may have to adjust the date since only recent dates are available
try:
    urlretrieve(url, 'aapl.csv')
except Exception:
    # download failed; continue with a previously saved local copy, if any
    pass

In [40]:

AAPL = pd.read_csv('aapl.csv', index_col=0, header=0, parse_dates=True)

In [41]:

AAPL.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9637 entries, 2014-03-12 09:00:01 to 2014-03-12 19:21:50
Data columns (total 6 columns):
bid                  9637 non-null float64
bid_depth            9637 non-null int64
bid_depth_total      9637 non-null int64
offer                9637 non-null float64
offer_depth          9637 non-null int64
offer_depth_total    9637 non-null int64
dtypes: float64(2), int64(4)

The intraday evolution of the Apple stock price.

In [42]:

AAPL['bid'].plot()

Out[42]:

<matplotlib.axes.AxesSubplot at 0x10ba09090>

In [43]:

AAPL = AAPL[AAPL.index > dt.datetime(2014, 3, 12, 10, 0, 0)]
  # only data later than 10am on that day

A resampling of the data is easily accomplished with pandas.

In [44]:

# this resamples the record frequency to 5 minutes, using mean as aggregation rule
AAPL_5min = AAPL.resample(rule='5min', how='mean').fillna(method='ffill')
AAPL_5min.head()

Out[44]:

	bid	bid_depth	bid_depth_total	offer	offer_depth	offer_depth_total
time
2014-03-12 10:00:00	534.850000	100.000000	100.000000	536.086000	100.000000	100.000000
2014-03-12 10:05:00	534.850000	100.000000	100.000000	536.086000	100.000000	100.000000
2014-03-12 10:10:00	535.355000	100.000000	100.000000	536.090000	100.000000	100.000000
2014-03-12 10:15:00	535.054286	100.000000	100.000000	536.090000	142.857143	142.857143
2014-03-12 10:20:00	534.600000	133.333333	133.333333	536.023333	116.666667	116.666667

5 rows × 6 columns
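
The how='mean' argument and fillna(method='ffill') belong to the pandas version used for this tutorial. More recent pandas releases express the same steps as a chain of method calls. The sketch below uses four hypothetical bid quotes with a gap in between. It also illustrates the effect of the forward fill: a 5-minute interval without any quotes simply repeats the value of the previous interval, which may be why the first two rows above are identical.

```python
import pandas as pd

# hypothetical bid quotes; there are no quotes between 10:05 and 10:10
idx = pd.DatetimeIndex(['2014-03-12 10:00:30', '2014-03-12 10:02:10',
                        '2014-03-12 10:11:00', '2014-03-12 10:13:45'])
bid = pd.Series([534.80, 534.90, 535.30, 535.40], index=idx)

# average per 5-minute interval; empty intervals become NaN and are
# then filled with the last available value
bid_5min = bid.resample('5min').mean().ffill()
print(bid_5min)
```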

Let's have a graphical look at the new data set.

In [45]:

AAPL_5min['bid'].plot()

Out[45]:

<matplotlib.axes.AxesSubplot at 0x10bf9fe90>

With pandas you can easily apply custom functions to time series data.

In [46]:

AAPL_5min['bid'].apply(lambda x: 2 * 530 - x).plot()
  # this mirrors the stock price development at the 530 level

Out[46]:

<matplotlib.axes.AxesSubplot at 0x10c866910>

Why Python for Financial Analytics & Visualization?

10 years ago, Python was considered exotic in the analytics space – at best. Languages/packages like R and Matlab dominated the scene. Today, Python has become a major force in financial analytics & visualization due to a number of characteristics:

syntax: Python syntax is pretty close to the symbolic language used in mathematical finance (also: symbolic Python with SymPy)
multi-purpose: prototyping, development, production, systems administration – Python is one for all
libraries: there is a library for almost any task or problem you face
efficiency: Python speeds up all IT development tasks for analytics applications and reduces maintenance costs
performance: Python has evolved from a scripting language to a 'meta' language with bridges to all high performance environments (e.g. LLVM, multi-core CPUs, GPUs, clusters)
interoperability: Python seamlessly integrates with almost any other language and technology
interactivity: Python allows domain experts to get closer to their business and financial data pools and to do real-time analytics
collaboration: solutions like Wakari with IPython Notebook allow the easy sharing of code, data, results, graphics, etc.

The Python Quants GmbH – Python Data Exploration & Visualization

The Python Quants – the company Web site

www.pythonquants.com

Dr. Yves J. Hilpisch – my personal Web site

www.hilpisch.com

Python for Finance – my NEW book (out as Early Release)

Python for Finance (O'Reilly)

Derivatives Analytics with Python – my current book

www.derivatives-analytics-with-python.com

Contact Us

yves@pythonquants.com | @dyjh