pandas: Data Handling and Analysis in Python

Author: Spencer Lyon

Many research tasks require you to use and analyze data. Most of the time, before you can actually get to the analysis, you have to collect, organize, clean, munge, and otherwise prepare the data. As most data scientists will tell you, this preparation phase takes the majority of the user's time. As with most things, there are difficult ways and easier ways to work with data on a computer. Domain-specific platforms like Stata, SQL, SAS, and R are very capable data analysis and handling packages, but they are all lacking in one way or another. Python is a very flexible and powerful language that overcomes most, if not all, of the major pitfalls inherent in the domain-specific languages, but historically it has not been as good at data handling. Enter pandas. pandas is an extremely efficient, powerful, and flexible data package for Python. The following is a quote taken from the pandas website:

[pandas] has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

The purpose of this lab is to introduce you to the basic data structures in pandas, teach you some helpful pandas tricks, and get you some experience handling and analyzing real data. The text of this lab is somewhat detailed and should be thought of as more of a reference to look back on when you need to look something up, rather than a text you need to fully understand on the first read.

The remainder of this lab will be structured a bit differently than the other labs you have seen so far. I will present a lot of information at the beginning and save most assignments until the end of the lab. It is structured this way because there are a number of foundational things you need to know about pandas before being able to do meaningful analysis with actual data. As such, I recommend that you spend the first hour or so reading through the material and the last 3 hours (probably a bit more) working through the exercises. Don’t worry about mastering all of the material presented on your initial reading. It is meant to introduce you to the topics and serve as a reference for you to look back on while completing the assignments.

Note

A lot of the content of this lab was drawn from two main sources: the pandas documentation and the excellent book Python for Data Analysis by Wes McKinney (the main author of pandas).

Core pandas Data Structures

Complete the introduction found on the pandas objects page.

To use in examples below, I define the following objects:
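
Note that the snippets throughout this lab assume the following imports have already been run (Series and DataFrame are pulled directly into the namespace, and randn is numpy.random.randn):

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numpy.random import randn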

In [1]: ser = Series(np.arange(10, 16))

In [2]: df = DataFrame(np.arange(30).reshape(6, 5), columns=list('abcde'))

In [3]: mul_df = DataFrame(np.arange(12).reshape((4, 3)),
   ...:                    index=[['a', 'a', 'b', 'b'],
   ...:                            [1, 2, 1, 2]],
   ...:                    columns=[['Ohio', 'Ohio', 'Colorado'],
   ...:                                ['Green', 'Red', 'Green']])
   ...:

In [4]: mul_df.index.names = ['let', 'num']

In [5]: mul_df.columns.names = ['state', 'color']

In [6]: ts_df = pd.util.testing.makeTimeDataFrame(nper=500)

In [7]: big_df = DataFrame(np.random.randint(500, size=(500, 10)),
   ...:                    columns=list('abcdefghij'))
   ...:

Essential Functionality

Viewing Data

Now that we understand how to construct pandas objects, I will go over some basic functionality.

pandas has an intelligent printing system that won’t take over your screen by trying to print something that is too large.

In [8]: big_df
Out[8]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 499
Data columns (total 10 columns):
a    500  non-null values
b    500  non-null values
c    500  non-null values
d    500  non-null values
e    500  non-null values
f    500  non-null values
g    500  non-null values
h    500  non-null values
i    500  non-null values
j    500  non-null values
dtypes: int64(10)

There are many ways to force pandas to show you at least parts of your data.

# Slicing
In [9]: big_df.ix[:5]
Out[9]: 
     a    b    c    d    e    f    g    h    i    j
0  413  478  357   74  354   71  258  397  187  460
1  172  479  289   44   60   82  122  357   15   65
2  423  144  374  161  233  379  143  357  187  321
3  449  473   12  183  110  414  289  294  450  391
4  214  461  119  381   54   26  117  220  230   52
5  294  320  400  232  157  436  268   20  402   52

In [10]: big_df.ix[:5, ::5]
Out[10]: 
     a    f
0  413   71
1  172   82
2  423  379
3  449  414
4  214   26
5  294  436

# Using the head() method -- shows first n rows
In [11]: big_df.head()
Out[11]: 
     a    b    c    d    e    f    g    h    i    j
0  413  478  357   74  354   71  258  397  187  460
1  172  479  289   44   60   82  122  357   15   65
2  423  144  374  161  233  379  143  357  187  321
3  449  473   12  183  110  414  289  294  450  391
4  214  461  119  381   54   26  117  220  230   52

In [12]: big_df.head(5)  # default is 5
Out[12]: 
     a    b    c    d    e    f    g    h    i    j
0  413  478  357   74  354   71  258  397  187  460
1  172  479  289   44   60   82  122  357   15   65
2  423  144  374  161  233  379  143  357  187  321
3  449  473   12  183  110  414  289  294  450  391
4  214  461  119  381   54   26  117  220  230   52

In [13]: big_df.head(7)
Out[13]: 
     a    b    c    d    e    f    g    h    i    j
0  413  478  357   74  354   71  258  397  187  460
1  172  479  289   44   60   82  122  357   15   65
2  423  144  374  161  233  379  143  357  187  321
3  449  473   12  183  110  414  289  294  450  391
4  214  461  119  381   54   26  117  220  230   52
5  294  320  400  232  157  436  268   20  402   52
6  274  312  280  213   30  336  182   22   31  445

# Using the tail method -- shows last n rows
In [14]: big_df.tail()
Out[14]: 
       a    b    c    d    e    f    g    h    i    j
495  219  312  243  351  459  303  297  282   73  109
496  179  425   31  233  281   24  330  278  225   56
497  179  231  381  342  476  286   63  454  231  288
498  212  257   17  174  416  228  225   83  290  434
499  369  245  401  109  253   87   71  380   29  171

In [15]: big_df.tail(5)  # default is 5
Out[15]: 
       a    b    c    d    e    f    g    h    i    j
495  219  312  243  351  459  303  297  282   73  109
496  179  425   31  233  281   24  330  278  225   56
497  179  231  381  342  476  286   63  454  231  288
498  212  257   17  174  416  228  225   83  290  434
499  369  245  401  109  253   87   71  380   29  171

In [16]: big_df.tail(7)
Out[16]: 
       a    b    c    d    e    f    g    h    i    j
493   50  241  293  407   50  265  119  199  404  247
494   69  340   56  381   38  471  288  442   17  439
495  219  312  243  351  459  303  297  282   73  109
496  179  425   31  233  281   24  330  278  225   56
497  179  231  381  342  476  286   63  454  231  288
498  212  257   17  174  416  228  225   83  290  434
499  369  245  401  109  253   87   71  380   29  171

Reindexing

A very important method for pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [17]: ser
Out[17]: 
0    10
1    11
2    12
3    13
4    14
5    15
dtype: int64

In [18]: ser.reindex([1, 2, 4, 6])
Out[18]: 
1    11
2    12
4    14
6   NaN
dtype: float64

Notice that the call to reindex used the data contained in ser where possible and put placeholders for missing values where the data didn’t exist in ser. The same holds for DataFrame objects:

In [19]: df
Out[19]: 
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
5  25  26  27  28  29

In [20]: df.reindex([1, 3, 7])
Out[20]: 
    a   b   c   d   e
1   5   6   7   8   9
3  15  16  17  18  19
7 NaN NaN NaN NaN NaN

If you don't want missing data to appear in the reindexed object, you can apply simple interpolation logic by passing a fill rule to the method argument. The two main options are ffill (propagate values forward) and bfill (propagate values backward).

# Read docstrings for more information
In [21]: df.reindex([1, 3, 5, 7], method='ffill')
Out[21]: 
    a   b   c   d   e
1   5   6   7   8   9
3  15  16  17  18  19
5  25  26  27  28  29
7  25  26  27  28  29
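
The backward-fill variant looks the same with method='bfill' (output omitted; label 7 would remain NaN here because there is no later value to pull backward):

df.reindex([1, 3, 5, 7], method='bfill')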

With a DataFrame you can reindex the rows as we have been doing, but you can also reindex the rows while keeping only a subset of the columns, or reindex just the columns themselves.

In [22]: df.reindex([1, 3], columns=['a', 'b', 'e'])
Out[22]: 
    a   b   e
1   5   6   9
3  15  16  19

In [23]: df.reindex(columns=['a', 'b', 'e'])
Out[23]: 
    a   b   e
0   0   1   4
1   5   6   9
2  10  11  14
3  15  16  19
4  20  21  24
5  25  26  29

Note

As will be seen later, pandas has very flexible data alignment and handling routines. A lot of the work done behind the scenes to align data sets uses the reindex method.

Also recall the use of the ix indexing field that was introduced with the data structures on the other page. It is a very robust and powerful way to accomplish many re-indexing tasks.
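
For instance, the row-and-column reindex from above can be written (roughly equivalently) with ix, as long as every label you ask for already exists in the frame:

# roughly the same result as df.reindex([1, 3], columns=['a', 'b', 'e'])
df.ix[[1, 3], ['a', 'b', 'e']]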

Arithmetic and Data Alignment

We have already seen that when we do basic arithmetic with pandas objects, the intrinsic link between data and labels is not broken. The result is a new pandas object whose row/column index is the union of the indexes of the original objects, with NaN propagated where the indexes didn't align. This is very useful for doing computation with data sets that need to be aligned, but sometimes it isn't what we want. All pandas objects have add, sub, mul, and div methods that give you more flexibility over this behavior.

In [24]: df1 = DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'),
   ....:                 index=['AZ', 'TX', 'NY'])
   ....:

In [25]: df2 = DataFrame(np.arange(12).reshape((4, 3)), columns=list('bcd'),
   ....:                 index=['UT', 'AZ', 'TX', 'NY'])
   ....:

# Produces NaN values
In [26]: df1 + df2
Out[26]: 
     b   c   d
AZ   3   5   7
NY  15  17  19
TX   9  11  13
UT NaN NaN NaN

# Same as above
In [27]: df1.add(df2)
Out[27]: 
     b   c   d
AZ   3   5   7
NY  15  17  19
TX   9  11  13
UT NaN NaN NaN

# No NaN's
In [28]: df1.add(df2, fill_value=0)
Out[28]: 
     b   c   d
AZ   3   5   7
NY  15  17  19
TX   9  11  13
UT   0   1   2

Arithmetic between a DataFrame and a Series is well defined and is analogous to operations between 1- and 2-dimensional numpy arrays, with the added benefit of data alignment.

In [29]: series = df2.ix[0]

In [30]: df2 - series
Out[30]: 
    b  c  d
UT  0  0  0
AZ  3  3  3
TX  6  6  6
NY  9  9  9

In [31]: ser2 = Series(range(4), index=list('bcde'))

# Note NaNs for column 'e', which doesn't appear in df2
In [32]: df2 + ser2
Out[32]: 
    b   c   d   e
UT  0   2   4 NaN
AZ  3   5   7 NaN
TX  6   8  10 NaN
NY  9  11  13 NaN

# Can also align on rows
In [33]: ser3 = df2['c']

In [34]: ser3 + df2
Out[34]: 
    AZ  NY  TX  UT   b   c   d
UT NaN NaN NaN NaN NaN NaN NaN
AZ NaN NaN NaN NaN NaN NaN NaN
TX NaN NaN NaN NaN NaN NaN NaN
NY NaN NaN NaN NaN NaN NaN NaN

# But to broadcast along rows we need the .add/.sub/etc. methods with axis=0
In [35]: df2.sub(ser3, axis=0)
Out[35]: 
    b  c  d
UT -1  0  1
AZ -1  0  1
TX -1  0  1
NY -1  0  1

Function Application and Mapping

As we saw on the other page, pandas objects are a subclass of numpy.ndarray and as such can be passed as arguments to all numpy ufuncs.

In [36]: np.sin(df)
Out[36]: 
          a         b         c         d         e
0  0.000000  0.841471  0.909297  0.141120 -0.756802
1 -0.958924 -0.279415  0.656987  0.989358  0.412118
2 -0.544021 -0.999990 -0.536573  0.420167  0.990607
3  0.650288 -0.287903 -0.961397 -0.750987  0.149877
4  0.912945  0.836656 -0.008851 -0.846220 -0.905578
5 -0.132352  0.762558  0.956376  0.270906 -0.663634

In [37]: np.abs(np.sin(df))
Out[37]: 
          a         b         c         d         e
0  0.000000  0.841471  0.909297  0.141120  0.756802
1  0.958924  0.279415  0.656987  0.989358  0.412118
2  0.544021  0.999990  0.536573  0.420167  0.990607
3  0.650288  0.287903  0.961397  0.750987  0.149877
4  0.912945  0.836656  0.008851  0.846220  0.905578
5  0.132352  0.762558  0.956376  0.270906  0.663634

Another very common operation is to apply a function to the 1d arrays made up of the columns or rows of a pandas object. For example, if we wanted to find the total distance (range) spanned by the values in each column (or row) of a data frame, we could do something like this:

In [38]: f = lambda x: x.max() - x.min()

In [39]: rand_df = DataFrame(randn(5, 5), columns=list('abcde'))

In [40]: rand_df
Out[40]: 
          a         b         c         d         e
0 -0.503157 -1.411566  0.707046  2.748645  0.145893
1 -1.826045  0.457516  0.860263 -0.917852 -0.645512
2 -0.747378 -0.531364 -0.172163 -0.860901 -1.235838
3  0.174375 -0.182662 -0.436242  0.083190 -1.520176
4  0.263203 -0.415688  0.239489  0.399649  1.131151

In [41]: rand_df.apply(f)
Out[41]: 
a    2.089248
b    1.869083
c    1.296505
d    3.666497
e    2.651327
dtype: float64

In [42]: rand_df.apply(f, axis=1)
Out[42]: 
0    4.160211
1    2.686308
2    1.063675
3    1.694551
4    1.546839
dtype: float64

The function sent to apply doesn't need to return just scalars. In fact, it can return any Python object you want.

In [43]: f2 = lambda x: 'Max: %.3f, Min: %.3f' % (x.max(), x.min())

In [44]: f3 = lambda x: (x.max(), x.min())

In [45]: f4 = lambda x: Series([x.max(), x.min()], index=['max', 'min'])

In [46]: rand_df.apply(f2)
Out[46]: 
a    Max: 0.263, Min: -1.826
b    Max: 0.458, Min: -1.412
c    Max: 0.860, Min: -0.436
d    Max: 2.749, Min: -0.918
e    Max: 1.131, Min: -1.520
dtype: object

In [47]: rand_df.apply(f3)
Out[47]: 
a     (0.263202922997, -1.82604541073)
b     (0.457516490127, -1.41156615518)
c    (0.860262971401, -0.436241770117)
d     (2.74864493448, -0.917852431896)
e      (1.13115127829, -1.52017570556)
dtype: object

In [48]: rand_df.apply(f4, axis=1)
Out[48]: 
        max       min
0  2.748645 -1.411566
1  0.860263 -1.826045
2 -0.172163 -1.235838
3  0.174375 -1.520176
4  1.131151 -0.415688

You can also apply functions element-wise to every item in a DataFrame using the applymap method.

In [49]: formatter = lambda x: '%.2f' % x

In [50]: rand_df.applymap(formatter)
Out[50]: 
       a      b      c      d      e
0  -0.50  -1.41   0.71   2.75   0.15
1  -1.83   0.46   0.86  -0.92  -0.65
2  -0.75  -0.53  -0.17  -0.86  -1.24
3   0.17  -0.18  -0.44   0.08  -1.52
4   0.26  -0.42   0.24   0.40   1.13

As I am sure you are beginning to see, you can be very creative in your use of apply and applymap to accomplish some very sophisticated and powerful tasks.
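
As one small sketch of that idea, here is a column-wise standardization done with apply, followed by applymap to format the result for display:

# demean each column and scale it to unit standard deviation
standardized = rand_df.apply(lambda col: (col - col.mean()) / col.std())

# format every entry of the standardized frame as a signed two-decimal string
standardized.applymap(lambda x: '%+.2f' % x)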

Sorting and Ranking

Another common task is to sort the data based on its values or keys. This is very easy and efficient in pandas (I make the note about efficiency because all of the operations below are easy enough to compute on your own, but pandas has highly optimized code for doing them).

In [51]: frame = DataFrame([[4, 0], [7, 1], [-3, 0], [2, 1]],
   ....:                   columns=['three', 'one'], index=[20, 40, 10, 30])
   ....:

In [52]: frame
Out[52]: 
    three  one
20      4    0
40      7    1
10     -3    0
30      2    1

# Sort rows numerically
In [53]: frame.sort_index()
Out[53]: 
    three  one
10     -3    0
20      4    0
30      2    1
40      7    1

# Sort column labels alphabetically
In [54]: frame.sort_index(axis=1)
Out[54]: 
    one  three
20    0      4
40    1      7
10    0     -3
30    1      2

# Sort by column 'one'
In [55]: frame.sort_index(by='one')
Out[55]: 
    three  one
20      4    0
10     -3    0
40      7    1
30      2    1

# Sort by 'one', then 'three'
In [56]: frame.sort_index(by=['one', 'three'])
Out[56]: 
    three  one
10     -3    0
20      4    0
30      2    1
40      7    1

# Sort by 'three', then 'one'
In [57]: frame.sort_index(by=['three', 'one'])
Out[57]: 
    three  one
10     -3    0
30      2    1
20      4    0
40      7    1

Other useful methods

There is much more that pandas can do than we have time to look at individually right now. I will simply list useful DataFrame or Series methods and group them into logical categories; a quick sketch using a few of them follows the lists.

Summarizing Data

  • count
  • describe
  • min, max
  • idxmin, idxmax
  • argmin, argmax
  • quantile
  • sum, mean, median, mad
  • var, std, corr, cov, corrwith
  • skew, kurt
  • cumsum, cummin, cummax, cumprod
  • diff
  • pct_change

Membership, unique values, and value counts

  • unique
  • value_counts
  • isin

Missing Data

  • dropna
  • fillna
  • isnull
  • notnull

Manipulating Data

  • swaplevel
  • sortlevel
  • 'level' argument to most of the above methods

Plotting

  • plot: This one has a ton of functionality so you will want to refer to the plotting docs for more info.
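
As a quick sketch of a few of the methods listed above, applied to the big_df frame defined earlier (output omitted):

big_df.describe()            # summary statistics for every column
big_df['a'].value_counts()   # how often each value appears in column 'a'
big_df.pct_change().head()   # period-over-period percent change
big_df.dropna()              # drop rows with missing values (none here, so all 500 rows remain)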

For an overview of other pandas features, see the documentation.

Time Series Analysis

Time series analysis is a huge topic that could fill books, so I will only briefly highlight some of the key features. For more information see the time series docs.

Generating date ranges

One common task in economic or financial data analysis is to generate a range of equally spaced dates. An example might be that you know you have quarterly data starting in Q1 2000 and ending in Q4 2012 and you would like to tell python (pandas) about those dates. To do this you will use the pandas.date_range function. I will illustrate with some examples below:

# Example from above. Note robust string parsing
In [58]: pd.date_range(start='1-1-2000', end='12/31/2012', freq='Q')
Out[58]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-03-31 00:00:00, ..., 2012-12-31 00:00:00]
Length: 52, Freq: Q-DEC, Timezone: None

# Maybe we wanted start of period instead of ending period data
In [59]: pd.date_range(start='1-1-2000', end='12/31/2012', freq='QS')
Out[59]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2012-10-01 00:00:00]
Length: 52, Freq: QS-JAN, Timezone: None

# We might have frequency, starting period, and number of periods
In [60]: pd.date_range(start='January 1 2000', freq='QS', periods=52)
Out[60]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2012-10-01 00:00:00]
Length: 52, Freq: QS-JAN, Timezone: None

In the above examples I only tried quarterly data, but there are many different frequencies I could have used. Below I list many of them.

  • B: business day frequency
  • D: calendar day frequency
  • W: weekly frequency
  • M: month end frequency
  • BM: business month end frequency
  • MS: month start frequency
  • BMS: business month start frequency
  • Q: quarter end frequency
  • BQ: business quarter end frequency
  • QS: quarter start frequency
  • BQS: business quarter start frequency
  • A: year end frequency
  • BA: business year end frequency
  • AS: year start frequency
  • BAS: business year start frequency
  • H: hourly frequency
  • T: minutely frequency
  • S: secondly frequency
  • L: milliseconds
  • U: microseconds

Note

You can combine any of the above frequencies with a logical anchor point. For example, you can combine the weekly frequency with a three-letter abbreviation for the day you want repeated each week (e.g. W-TUE is every Tuesday). With annual and quarterly frequencies you can give a three-letter abbreviation for the anchor month (BQS-JUN gives business quarter starts, with quarters beginning in June).

I could use any of these frequencies, or any logical combination of them.

# Business annual data
In [61]: pd.date_range(end='1-1-1990', freq='BA', periods=20)
Out[61]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[1970-12-31 00:00:00, ..., 1989-12-29 00:00:00]
Length: 20, Freq: BA-DEC, Timezone: None

# Every 5 hours 10 minutes
In [62]: pd.date_range(start='April 25th 1989 3:40 PM', periods=100, freq='5h10min')
Out[62]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[1989-04-25 15:40:00, ..., 1989-05-16 23:10:00]
Length: 100, Freq: 310T, Timezone: None

DatetimeIndex

Notice that the return value of the date_range function is an object of type DatetimeIndex. This means that the result could be used as the index for a Series or DataFrame. When you do this, you open up a whole set of time-series specific functionality.
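
For example, a minimal sketch of attaching a DatetimeIndex produced by date_range to a Series (the variable names here are just illustrative):

idx = pd.date_range(start='1-1-2000', periods=500, freq='B')
ts = Series(np.random.randn(500), index=idx)  # ts now supports the time-series operations below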

One of the most common and most useful tasks is to resample data at one frequency to another frequency.

# Notice the frequency is B, or business daily
In [63]: ts_df
Out[63]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 500 entries, 2000-01-03 00:00:00 to 2001-11-30 00:00:00
Freq: B
Data columns (total 4 columns):
A    500  non-null values
B    500  non-null values
C    500  non-null values
D    500  non-null values
dtypes: float64(4)

# Get weekly data
In [64]: ts_df.resample('W')
Out[64]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100 entries, 2000-01-09 00:00:00 to 2001-12-02 00:00:00
Freq: W-SUN
Data columns (total 4 columns):
A    100  non-null values
B    100  non-null values
C    100  non-null values
D    100  non-null values
dtypes: float64(4)

# Use how argument
In [65]: ts_df.resample('2M', how='first')
Out[65]: 
                   A         B         C         D
2000-01-31  0.496714  0.926178  1.399355  0.778361
2000-03-31 -0.225776  0.543360 -0.573700  0.582098
2000-05-31  1.356240 -0.408075 -1.119670 -0.281328
2000-07-31  0.257550 -1.778720 -0.623769  0.323079
2000-09-30  0.346448  1.126565 -2.153390  0.325796
2000-11-30  0.385317 -0.309546  0.440475  1.938929
2001-01-31  0.852433  1.014370  1.628397  1.710613
2001-03-31 -1.237815  0.610586  0.956702  0.760056
2001-05-31 -1.214189  1.453384 -1.975467  0.441941
2001-07-31 -0.773010  2.010205 -0.884803 -0.410814
2001-09-30 -1.534114  0.249384  0.455888  0.638051
2001-11-30 -0.792873  1.117296 -1.760763  1.374438

# Different how
In [66]: ts_df.resample('8M', how='max')
Out[66]: 
                   A         B         C         D
2000-01-31  1.579213  1.909417  2.075261  1.854093
2000-09-30  2.720169  2.573360  2.439752  3.193108
2001-05-31  3.852731  2.632382  2.579709  2.170943
2002-01-31  3.078881  2.526932  2.601683  3.137749

# ohlc: open-high-low-close. only works on series objects
In [67]: ts_df['A'].resample('8M', how='ohlc')
Out[67]: 
                open      high       low     close
2000-01-31  0.496714  1.579213 -1.913280  1.465649
2000-09-30 -0.225776  2.720169 -2.619745  0.173181
2001-05-31  0.385317  3.852731 -3.241267  0.097676
2002-01-31 -0.773010  3.078881 -2.301921 -1.382800

# use lambda function
In [68]: ts_df.resample('8M', how=lambda x: np.sqrt(np.abs(x.min() - x.max())))
Out[68]: 
                   A         B         C         D
2000-01-31  1.868821  1.891777  1.662663  1.928893
2000-09-30  2.310825  2.295702  2.309980  2.472743
2001-05-31  2.663456  2.171247  2.273929  2.207753
2002-01-31  2.319656  2.225042  2.137297  2.465388

There is a lot more available in the resample method, but I refer the reader to the time series docs for more information.

Other Notable Topics

Other topics that you will want to look at when you have time are given below:

  • Time series plotting (just use the plot method on an object with a DatetimeIndex)
  • Indexing DatetimeIndex objects. (To get all data from December 2012 through the end of May 2013 you could do df.ix['December 2012':'May 2013'])
  • Moving window functions (a brief sketch of this and the string-based indexing appears after this list)
  • Up-sampling and down-sampling
  • Time zone handling
  • Periods vs. time stamps (we have been using time stamps)
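
As a brief sketch of the string-based indexing and a moving-window calculation, using the ts_df frame from above (pd.rolling_mean is the older top-level API assumed here; recent pandas versions use the rolling method instead):

# all rows from March 2000 through the end of June 2000, selected with date strings
spring_2000 = ts_df.ix['March 2000':'June 2000']

# 20-period (roughly one month of business days) moving average of every column
ma_20 = pd.rolling_mean(ts_df, 20)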

Group-By

One of the most flexible and powerful constructs in pandas is accessed via the groupby method for a pandas data structure. The process of groupby can be summarized in three stages:

  1. Split
  2. Apply
  3. Combine

As the steps indicate, when using groupby we will be deciding how to split up our data, what to do with the categorized chunks, and then how to combine them again (really, pandas usually takes care of this last step for us). This topic can feel a bit esoteric, but once understood it should appeal to anyone who needs to aggregate and act on data (pretty much everyone who ever works with data!). I will not have time to cover all the possible use-cases of groupby in this material, but the groupby docs are a good place to look for more examples and inspiration.

Note

The tips data below originally came from Bryant & Smith (1995) and contains the total bill and tip amount for tips given at a restaurant, along with the time of day, day of the week, the size of the party, the sex of the payer, and whether or not the payer was a smoker.

We will need some objects to play with:

In [69]: df  = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
   ....:                  'key2' : ['one', 'two', 'one', 'two', 'one'],
   ....:                  'data1' : np.random.randn(5),
   ....:                  'data2' : np.random.randn(5)})
   ....:

In [70]: colors = ['red', 'blue', 'red', 'blue', 'red']

In [71]: ranks = [1, 2, 1, 1, 2]

# Some data I loaded in before
In [72]: tips
Out[72]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 8 columns):
total_bill    244  non-null values
tip           244  non-null values
sex           244  non-null values
smoker        244  non-null values
day           244  non-null values
time          244  non-null values
size          244  non-null values
tip_pct       244  non-null values
dtypes: float64(3), int64(1), object(4)

We can use groupby on our DataFrame and pass it Python iterable(s) of the correct size, or we can use the DataFrame's columns themselves as the keys to group on.

# Group the series of just 'data1' column
In [73]: df['data1'].groupby(colors)
Out[73]: <pandas.core.groupby.SeriesGroupBy at 0x110bdf750>

# Apply it to the whole frame
In [74]: df.groupby(ranks)
Out[74]: <pandas.core.groupby.DataFrameGroupBy at 0x110bdf810>

# Group on more than one thing (also works for series)
In [75]: df.groupby([colors, ranks])
Out[75]: <pandas.core.groupby.DataFrameGroupBy at 0x110bdf210>

# Group by the columns of df
In [76]: by_keys = df.groupby(['key1', 'key2'])

In [77]: by_keys
Out[77]: <pandas.core.groupby.DataFrameGroupBy at 0x110bdf350>

As you have seen, the GroupBy object returned by groupby isn't very useful to look at on its own, but we can do some very cool things with it.

# Compute mean of each group
In [78]: by_keys.mean()
Out[78]: 
              data1     data2
key1 key2                    
a    one   0.006013  0.693592
     two   1.066719  0.630959
b    one  -0.321913  0.135072
     two  -0.274178 -1.227735

# Compute max of just the 'data2' column
In [79]: by_keys['data2'].max()
Out[79]: 
key1  key2
a     one     0.774125
      two     0.630959
b     one     0.135072
      two    -1.227735
Name: data2, dtype: float64

# The above was a Series. We can unstack it into a DataFrame
In [80]: by_keys['data2'].prod().unstack()
Out[80]: 
key2       one       two
key1                    
a     0.474584  0.630959
b     0.135072 -1.227735

There are a number of highly optimized groupby methods. I will list some of the most common below:

  • count
  • sum
  • mean
  • median
  • std
  • var
  • min
  • max
  • prod
  • first
  • last

You can also use the agg method of a GroupBy object to apply more complicated functions, as well as to pass a list of functions to apply at once.

In [81]: ppl_grp = tips.groupby(['sex', 'smoker'])

In [82]: ppl_pct = ppl_grp['tip_pct']

In [83]: ppl_pct.agg('mean')
Out[83]: 
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64

# Notice function names become column labels
In [84]: ppl_pct.agg(['mean', 'sum', 'std', lambda x: x.max() - x.min()])
Out[84]: 
                   mean        sum       std  <lambda>
sex    smoker                                         
Female No      0.156921   8.473732  0.036421  0.195876
       Yes     0.182150   6.010962  0.071595  0.360233
Male   No      0.160669  15.584865  0.041849  0.220186
       Yes     0.152771   9.166271  0.090588  0.674707

# Aggregate multiple columns with a list of functions
In [85]: funcs = ['count', 'mean', 'max']

In [86]: ppl_grp['tip_pct', 'total_bill'].agg(funcs)
Out[86]: 
               tip_pct                      total_bill                  
                 count      mean       max       count       mean    max
sex    smoker                                                           
Female No           54  0.156921  0.252672          54  18.105185  35.83
       Yes          33  0.182150  0.416667          33  17.977879  44.30
Male   No           97  0.160669  0.291990          97  19.791237  48.33
       Yes          60  0.152771  0.710345          60  22.284500  50.81

# Compute different aggregations for different columns
In [87]: mapping = {'tip_pct' : ['min', 'max', 'mean', 'std'], 'size' : ['sum', 'mean']}

In [88]: ppl_grp.agg(mapping)
Out[88]: 
                tip_pct                                size          
                    min       max      mean       std   sum      mean
sex    smoker                                                        
Female No      0.056797  0.252672  0.156921  0.036421   140  2.592593
       Yes     0.056433  0.416667  0.182150  0.071595    74  2.242424
Male   No      0.071804  0.291990  0.160669  0.041849   263  2.711340
       Yes     0.035638  0.710345  0.152771  0.090588   150  2.500000

As with other topics, there is so much more that you can do with groupby than what I can cover here. See the documentation for more information.
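
For example, two other common patterns are iterating over the groups directly and using apply with your own function; a rough sketch using the df and by_keys objects from above:

# iterate over the (key1, key2) groups; each grp is a sub-DataFrame
for (k1, k2), grp in df.groupby(['key1', 'key2']):
    print k1, k2
    print grp

# apply an arbitrary function to each group; pandas glues the pieces back together
demeaned = by_keys.apply(lambda g: g['data1'] - g['data1'].mean())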

Data Collection and Input/Output

pandas is also an excellent tool for getting data into and out of Python. This section will illustrate the differences between the pandas I/O functions and those found in other Python packages.

Using pandas

Depending on your version of pandas, there are between 4 and 7 utility functions that can be used to get data into and out of pandas.

Spend about 10 minutes reading through the data IO documentation and familiarize yourself with the read_table and read_csv functions. There is a lot of information there, so don’t worry about capturing all of it. The function arguments you should focus on are the following:

  • filepath_or_buffer
  • sep
  • dtype
  • header
  • index_col
  • parse_dates
  • thousands
  • usecols

To make sure that you have understood the material on the pandas website, you should be able to follow the examples below:

# Show contents of file
In [89]: print open('./source/resources/Data/flat.csv').read()
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [90]: pd.read_csv('./source/resources/Data/flat.csv')
Out[90]: 
   1   2   3   4  hello
0  5   6   7   8  world
1  9  10  11  12    foo

# We really didn't want first row to be header
In [91]: pd.read_csv('./source/resources/Data/flat.csv', header=None)
Out[91]: 
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

# Can give column names
In [92]: names = ['a', 'b', 'c', 'd', 'message']

In [93]: pd.read_csv('./source/resources/Data/flat.csv', header=None, names=names)
Out[93]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

# Can set named column as index
In [94]: pd.read_csv('./source/resources/Data/flat.csv', header=None,
   ....:             names=names, index_col='message')
   ....:
Out[94]: 
         a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
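
The remaining arguments from the list above work in a similar fashion. A sketch combining a few of them might look like this (the file name and column names here are hypothetical):

pd.read_csv('macro_data.csv',
            sep=',',                         # field separator
            parse_dates=['date'],            # convert the 'date' column to timestamps
            index_col='date',                # ... and use it as the index
            usecols=['date', 'gdp', 'cpi'],  # only read these columns
            thousands=',')                   # strip thousands separators like 1,234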

Other Tools

Imagine you wanted to do some basic macroeconomic analysis with common data sets you could obtain from FRED. As you have already seen, there are various functions within numpy and scipy for loading data from text or binary files into numpy arrays. Two of the functions you may or may not have met up to this point are:

  • numpy.genfromtxt: Flexible input of plain text files, allowing the user to specify many things like the data type, file delimiter, rows/columns to skip, whether the data should be unpacked into separate arrays (one column per array), and much more. This is a more powerful version of numpy.loadtxt.
  • scipy.io.loadmat: Functional import of MATLAB .mat data files. The result is put in a python dictionary with keys being the object names from the .mat file. Various options can be given that control the behavior of the input.

While these functions are capable, they are somewhat low-level compared to the functions in pandas. The exercises below will demonstrate this idea.
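
For a flavor of the lower-level approach, here is a minimal numpy.genfromtxt sketch (the file name and layout are assumptions about what a FRED CSV download looks like, with a DATE column followed by a VALUE column):

import numpy as np

# skip the header row, split on commas, and keep only the second (value) column
values = np.genfromtxt('DGS10.csv', delimiter=',', skip_header=1, usecols=(1,))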

Todo

Disclaimer: Spend only 10-15 minutes trying to complete this exercise. I don’t actually care if you finish it - I DO however care that you spend at least a little bit of time on it.

Collect the following data sets from the FRED website and read them into Python using tools from numpy/scipy (note: the items listed below are the actual data set names on FRED). Make sure that you collect the data starting January 1, 1985 and going out as far as you can:

  • DGS3MO
  • DGS10
  • M2
  • CPIAUCSL
  • GDPC1

Use the date information given in the data files to manually align the data. Notice that there are 4 different frequencies for the data sets listed above. To get all of them on the same frequency, compute a period-average. In other words, if you have daily data and you want weekly data, compute the mean of every 7 items and call that the weekly value. (Note: This probably won’t be easy because you will end up needing quarterly data and trying to align data across months of different lengths is difficult.)

Finally, use your resultant data set and rudimentary econometric techniques to solve this equation:

\[GDPC1 = \beta_0 + \beta_1 (DGS10 - DGS3MO) + \beta_2 M2 + \beta_3 CPIAUCSL\]

You should get the following as answers (a sketch of one way to run the regression follows this list):

  • \(\beta_0 = 85.665327\)
  • \(\beta_1 = -116.884105\)
  • \(\beta_2 = -0.102965\)
  • \(\beta_3 = 65.954251\)
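
One rudimentary way to estimate the coefficients, once you have built aligned quarterly arrays (the variable names below are purely illustrative), is ordinary least squares via numpy:

import numpy as np

# design matrix: constant, term spread, money supply, price level
X = np.column_stack((np.ones(len(gdp)), dgs10 - dgs3mo, m2, cpi))
beta, resid, rank, sv = np.linalg.lstsq(X, gdp)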

Todo

Disclaimer: I DO care that you finish this one.

Repeat the above analysis using pandas. Read the data into python in two ways:

  1. Using pandas.read_table or pandas.read_csv to read in the files you used above.
  2. Using the pandas.io.data.DataReader function (read the docstring)

Hint: When using read_xxx you will want to make sure to parse the dates and set which column should be the index (the one with the dates is probably a good choice).
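
A sketch of the DataReader approach, assuming the older pandas location of the function (newer releases moved it into the separate pandas-datareader package):

import datetime as dt
from pandas.io import data as web

start = dt.datetime(1985, 1, 1)
gdp = web.DataReader('GDPC1', 'fred', start=start)  # one FRED series per call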

Combining Data Sets

The last main topic I will cover here deals with combining data sets from different sources. We have seen many examples of how the intrinsic link between data and labels has aided computation and aggregation. It is also crucial to data alignment and the merging and joining of data sets.

concat

The most general combining function in pandas is the pandas.concat function. It does most of the heavy lifting for the more specialized data merging routines in pandas. The concat function will feel quite familiar to those who are used to the numpy.concatenate function (or any of its siblings like numpy.row_stack, etc.). I will demonstrate some of this function's capabilities with a few examples (note: this function applies equally well to DataFrame and Series objects):

In [95]: s1 = Series([0, 1], index=['a', 'b'])

In [96]: s2 = Series([2, 3, 4], index=['c', 'd', 'e'])

In [97]: s3 = Series([5, 6], index=['f', 'g'])

In [98]: s4 = pd.concat([s1 + 10, s3])

# Concatenate along rows (default: axis=0)
In [99]: pd.concat([s1, s2, s3])
Out[99]: 
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

# concat along columns
In [100]: pd.concat([s1, s2, s3], axis=1)
Out[100]: 
    0   1   2
a   0 NaN NaN
b   1 NaN NaN
c NaN   2 NaN
d NaN   3 NaN
e NaN   4 NaN
f NaN NaN   5
g NaN NaN   6

# Use intersection of index to avoid NaN's
In [101]: pd.concat([s1, s4], axis=1, join='inner')
Out[101]: 
   0   1
a  0  10
b  1  11

# specify index labels to be used
In [102]: pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
Out[102]: 
    0   1
a   0  10
c NaN NaN
b   1  11
e NaN NaN

# Give names to concatenated objects (become MultiIndex instead of cols)
In [103]: pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
Out[103]: 
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

Now for some more complicated examples using DataFrame objects:

In [104]: df1 = DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
   .....:                 columns=['one', 'two'])
   .....:

In [105]: df2 = DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
   .....:                 columns=['three', 'four'])
   .....:

# Notice that the keys become top-level column index
In [106]: pd.concat([df1, df2], axis=1, keys=['l1', 'l2'])
Out[106]: 
    l1          l2      
   one  two  three  four
a    0    1      5     6
b    2    3    NaN   NaN
c    4    5      7     8

# Could have passed a dict
In [107]: pd.concat({'l1': df1, 'l2': df2}, axis=1)
Out[107]: 
    l1          l2      
   one  two  three  four
a    0    1      5     6
b    2    3    NaN   NaN
c    4    5      7     8

# Create hierarchical index with keys and names args
In [108]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
   .....:           names=['upper', 'lower'])
   .....:
Out[108]: 
upper  level1       level2      
lower     one  two   three  four
a           0    1       5     6
b           2    3     NaN   NaN
c           4    5       7     8

# We can tell it to ignore the index of the original objects
In [109]: pd.concat([df1, df2], ignore_index=1)
Out[109]: 
   four  one  three  two
0   NaN    0    NaN    1
1   NaN    2    NaN    3
2   NaN    4    NaN    5
3     6  NaN      5  NaN
4     8  NaN      7  NaN

merge and join

The merge and join functions will feel very familiar to those who have used relational databases like SQL. They combine datasets by linking rows using one or more keys. The best way to understand these operations is by example, so here we go!

Merge examples:

In [110]: df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   .....:                 'data1': range(7)})
   .....:

In [111]: df2 = DataFrame({'key': ['a', 'b', 'd'],
   .....:                  'data2': range(3)})
   .....:

In [112]: df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
   .....:                  'data1': range(7)})
   .....:

In [113]: df4 = DataFrame({'rkey': ['a', 'b', 'd'],
   .....:                  'data2': range(3)})
   .....:

In [114]: left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
   .....:                    'value': range(6)})
   .....:

In [115]: right1 = DataFrame({'group_val': [3.5, 7]},
   .....:                    index=['a', 'b'])
   .....:

In [116]: lefth = DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
   .....:                    'key2': [2000, 2001, 2002, 2001, 2002],
   .....:                    'data': np.arange(5.)})
   .....:

In [117]: righth = DataFrame(np.arange(12).reshape((6, 2)),
   .....:                    index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
   .....:                           [2001, 2000, 2000, 2000, 2001, 2002]],
   .....:                     columns=['event1', 'event2'])
   .....:

In [118]: left2 = DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
   .....:                    columns=['Ohio', 'Nevada'])
   .....:

In [119]: right2 = DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

# merge defaults to combining on overlapping column names
In [120]: pd.merge(df1, df2)
Out[120]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

# But, it is a good idea to specify column
In [121]: pd.merge(df1, df2, on='key')
Out[121]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0

# If column names are different, we can specify separate ones for each frame
In [122]: pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[122]: 
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

# Keep all (union) of the row labels by using 'how=' param
In [123]: pd.merge(df1, df2, how='outer')
Out[123]: 
   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0
6      3   c    NaN
7    NaN   d      2

# Use the index of the right frame for the merge
In [124]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[124]: 
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

# Hierarchical indexes work too, but are more complicated
In [125]: lefth
Out[125]: 
   data    key1  key2
0     0    Ohio  2000
1     1    Ohio  2001
2     2    Ohio  2002
3     3  Nevada  2001
4     4  Nevada  2002

In [126]: righth
Out[126]: 
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11

In [127]: pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
Out[127]: 
   data    key1  key2  event1  event2
0     0    Ohio  2000       4       5
0     0    Ohio  2000       6       7
1     1    Ohio  2001       8       9
2     2    Ohio  2002      10      11
3     3  Nevada  2001       0       1

In [128]: pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True,
   .....:          how='outer')
   .....:
Out[128]: 
   data    key1  key2  event1  event2
0     0    Ohio  2000       4       5
0     0    Ohio  2000       6       7
1     1    Ohio  2001       8       9
2     2    Ohio  2002      10      11
3     3  Nevada  2001       0       1
4     4  Nevada  2002     NaN     NaN
4   NaN  Nevada  2000       2       3

# Merging on index
In [129]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[129]: 
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
b   NaN     NaN         7        8
c     3       4         9       10
d   NaN     NaN        11       12
e     5       6        13       14

The DataFrame also has a convenient join method for merging on the index. This is used when you have objects with similar row labels, but different columns.

# Should be the same as the last example
In [130]: left2.join(right2, how='outer')
Out[130]: 
   Ohio  Nevada  Missouri  Alabama
a     1       2       NaN      NaN
b   NaN     NaN         7        8
c     3       4         9       10
d   NaN     NaN        11       12
e     5       6        13       14

Exercises

Before you jump into the exercises you will need to download a number of data sets. Below I will give you the URLs for the data:

Todo

Using the MovieLens 1M data and pandas.read_table, read all three files (users, ratings, movies) into pandas DataFrames. I recommend giving column names directly to read_table in each case.

HINT: you will need to open the files in a text editor to see what the separator character is. Also, the columns for each file correspond to the following:

  • users.dat: user id, gender, age, occupation code, zip
  • ratings.dat: user id, movie id, rating, timestamp
  • movies.dat: movie id, title, genre

Use the data combining tools discussed above to combine these three objects into a single object named movie_data.

Todo

Use the movie_data object from the previous exercise and compute the following things:

  • The 5 movies with the greatest number of ratings

  • A new object called active_titles that is made up of movies each having at least 250 ratings

  • For the subset of movies in the active_titles list compute the following:
    • The 3 movies with the highest average rating for females. Do the same for males.
    • The 10 movies men liked much more than women and the 10 movies women liked more than men (use the difference in average ratings and sort ascending and descending).
    • The 5 movies that had the highest standard deviation in rating.

HINT: For some of these operations it might be helpful to compute a pivot table using the pivot_table method of a DataFrame.

NOTE: You might also want to do some analysis on the movie genre. To do this you would have to be comfortable with using pandas vectorized string methods and probably some regular expressions. Those topics are a bit bigger than what we can cover here, but if you are looking for more practice, they are a good place to start.

Todo

Using the baby names data set, define a pandas object called names_data that includes all the years.

HINT: You should probably do this in a for loop. Also, to get a list of all the text files in a directory (so you can loop over them), use the following code:

from glob import glob

files = glob('*.TXT')

Todo

Using the names_data object you just created, generate the following items:

  • The most popular male and female name in your birth year
  • The number of people born with your name in the same year (note, if you have a somewhat uncommon name, this may not appear. For example, my wife’s name is Shannen and due to its irregular spelling, wasn’t reported for her birth year)
  • A plot of the number of instances of your name over time.
  • A plot of the total number of boy names and the total number of girl names each year.
  • A plot showing the fraction of male and female babies given a name similar to Lesley. By similar I mean the name starts with ‘lesl’ (make sure you make the name lowercase).

Todo

Use the built-in Python json module to load the food data into a Python list. Your code should look like this:

import json

db = json.load(open('foods-2011-10-03.json'))

db will be a list of 6636 python dictionaries, each containing nutritional information for a different food item. Each dictionary will have the following keys:

  • portions
  • description
  • tags
  • nutrients
  • group
  • id
  • manufacturer

Create a DataFrame of meta_data using the description, group, id, and manufacturer items in each dictionary.

Todo

Loop over db and construct a list of DataFrames containing the nutritional information for each record in db. Make sure to add a column to each of these DataFrames that contains the unique food id (the id key in the dictionary). HINT: This food id will be constant for every row of a given record's DataFrame.

Finally, use the pandas combining techniques to create a nutrients DataFrame. After you have done this, drop the duplicate entries in this DataFrame. For example, if you had named the object nuts you would do

nuts = nuts.drop_duplicates()

Todo

Use the rename method to make sure that the description and group columns are unambiguous for both the meta_data DataFrame and the nutrients DataFrame (these column names are duplicated because every food has a description and group, and each nutrient also has those identifiers). HINT: you will need to use the columns and copy arguments of the rename method.

Finally, use the data combining routines to come up with a foods DataFrame containing all the meta_data and nutritional information. Make sure to do an outer style merge on the correct columns.

Todo

Using the foods DataFrame you have been building, compute the following things:

  • The food item with the highest content of each nutrient.
  • A function that accepts a nutrient name and a quantile value and generates a horizontal bar plot of the amount of that nutrient in each food group. Provide a plot title. HINT: You will need to use the quantile and sort (or order) methods in order for this to work. Also, you can generate the horizontal bar plot using df.plot(kind='barh').