There are three main pandas data structures: Series, DataFrame, and Index. Each of them is a subclass of numpy.ndarray. This means that the functionality you have learned about numpy arrays still applies to these data structures, but these pandas objects have additional, data-specific features. We will provide some high-level principles that apply to each of them and then dig into their specific features. (Note that many of the materials in this section come from the Intro to Data Structures page in the pandas documentation.)
One very important thing to understand about pandas data structures is that the they are like numpy arrays, but they have axis labels. Think of them more like a working version of an Excel spreadsheet where you have labels in the first row and column and then the data corresponding to those labels on the inside. An important characteristics about pandas objects is that the link between the data and the labels is intrinsic, meaning that the labels are actually a part of the data and that link will not be broken unless you explicitly break it.
Throughout the rest of the lab, the code examples assume that you have made the following imports.
In [183]: import numpy as np
In [184]: from numpy.random import randn
In [185]: import pandas as pd
In [186]: from pandas import Series, DataFrame, Index
The Series can be thought of as a labeled one dimensional array. It consists if a single dimension of data and an Index containing the labels. The data in a Series can be of any data type: float, int, string, matplotlib figures, and any other python object.
There are many different ways you can create a Series. Two of the most common are to create a Series using a numpy array or a python list.
# Using a numpy array
In [187]: s1 = Series(randn(5))
In [188]: s1
Out[188]:
0 0.496714
1 -0.138264
2 0.647689
3 1.523030
4 -0.234153
dtype: float64
In [189]: s1.index
Out[189]: Int64Index([0, 1, 2, 3, 4], dtype=int64)
In [190]: s1.values
Out[190]: array([ 0.49671415, -0.1382643 , 0.64768854, 1.52302986, -0.23415337])
# Using a python list
In [191]: s2 = Series([i ** 2 for i in xrange(1, 6)])
In [192]: s2
Out[192]:
0 1
1 4
2 9
3 16
4 25
dtype: int64
In [193]: s2.index
Out[193]: Int64Index([0, 1, 2, 3, 4], dtype=int64)
In [194]: s2.values
Out[194]: array([ 1, 4, 9, 16, 25])
Notice that each of these Series objects has both index and values attributes. The index is an Index type and values is the underlying numpy array holding the data. Below we will show a few more ways we could have created a Series.
# We can explicitly specify the index
In [195]: s3 = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [196]: s3
Out[196]:
a -0.234137
b 1.579213
c 0.767435
d -0.469474
e 0.542560
dtype: float64
# Create using a dict (note the data are actual python lists)
# Notice the dict keys become the index
In [197]: s4 = Series({'first': [1, 2], 'second':[2, 4]})
In [198]: s4
Out[198]:
first [1, 2]
second [2, 4]
dtype: object
In [199]: s4.values
Out[199]: array([[1, 2], [2, 4]], dtype=object)
In [200]: s4.index
Out[200]: Index([first, second], dtype=object)
# Passing a dict and index uses the index we pass, but fills values from the dict.
# notice NaN that appears because we 'd' in the index, but not in the dict
In [201]: s5 = Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])
In [202]: s5
Out[202]:
b 1
c 2
d NaN
a 0
dtype: float64
Once we have created a Series object, there is a lot that we can do with it. Below we will demonstrate that they act a lot like their parent class numpy.ndarray.
In [203]: s3
Out[203]:
a -0.234137
b 1.579213
c 0.767435
d -0.469474
e 0.542560
dtype: float64
In [204]: s3[0]
Out[204]: -0.23413695694918055
In [205]: s3[:3]
Out[205]:
a -0.234137
b 1.579213
c 0.767435
dtype: float64
In [206]: s3[s3 > s3.mean()]
Out[206]:
b 1.579213
c 0.767435
e 0.542560
dtype: float64
In [207]: s3[[2, 3, 1, -1, 1]]
Out[207]:
c 0.767435
d -0.469474
b 1.579213
e 0.542560
b 1.579213
dtype: float64
In [208]: np.exp(s3)
Out[208]:
a 0.791253
b 4.851136
c 2.154233
d 0.625331
e 1.720406
dtype: float64
Because Series objects have a labeled index, they also behave a lot like python dictionaries:
In [209]: s3['a']
Out[209]: -0.23413695694918055
In [210]: s3['b':'d']
Out[210]:
b 1.579213
c 0.767435
d -0.469474
dtype: float64
In [211]: 'e' in s3
Out[211]: True
The same vectorized operations that you are used to doing on numpy.ndarray s can be done on Series. The BIG difference between arrays and series objects when it comes to arithmetic is that because axis labels are an intrinsic property, data is aligned every operation. The example below will demonstrate.
In [212]: s3[1:] + s3[:-1]
Out[212]:
a NaN
b 3.158426
c 1.534869
d -0.938949
e NaN
dtype: float64
Notice that the ‘a’ and ‘e’ (or 1, -1 if doing integer indexing) values are NaN. This happened because the slice s3[1:] doesn’t contain the ‘a’ element and the other slice doesn’t contain ‘e’. Therefore, when doing the sum of the two slices we get NaN values where one of the Series doesn’t have data. This illustrates a more general principle. The combination of Series objects, via arithmetic operations or otherwise, creates a new Series where the index is the union of the index of the former Series. The adoption of this new index will lead to NaN values if any of the objects being combined don’t have data for a particular index. This might seem like unwanted behavior, but being able to write code without doing any explicit data alignment provides great freedom and flexibility and is one feature of pandas that really distinguishes it from the other domain specific platforms.
The last main thing to talk about with regards to Series is that they have a name attribute. This will show up below the printout when you ask python to print or otherwise display a series and is used in various other pandas tasks like combining Series objects into a DataFrame (more to come on DataFrame s next!).
In [213]: s3
Out[213]:
a -0.234137
b 1.579213
c 0.767435
d -0.469474
e 0.542560
dtype: float64
In [214]: print(s3.name)
None
In [215]: s3.name = 'letter_ind'
In [216]: s3
Out[216]:
a -0.234137
b 1.579213
c 0.767435
d -0.469474
e 0.542560
Name: letter_ind, dtype: float64
In [217]: print(s3.name)
letter_ind
A DataFrame is a 2-dimensional labeled data structure. It is very much like a Series, but it can have multiple columns with associated labels. For this reason, you can think of a DataFrame as being like a spreadsheet or SQL table. It is by far the most commonly used pandas object, as a Series is simply a DataFrame with only one column.
Like a Series, there are many ways to create a DataFrame. We will illustrate some of them below.
# Create DataFrame from 2d-numpy array
In [218]: df1 = DataFrame(randn(4, 4))
In [219]: df1
Out[219]:
0 1 2 3
0 -0.463418 -0.465730 0.241962 -1.913280
1 -1.724918 -0.562288 -1.012831 0.314247
2 -0.908024 -1.412304 1.465649 -0.225776
3 0.067528 -1.424748 -0.544383 0.110923
# Give explicit index for columns using columns= keyword argument
In [220]: df2 = DataFrame(np.arange(16).reshape(4, 4), columns=list('abcd'))
In [221]: df2
Out[221]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [222]: s1
Out[222]:
0 0.496714
1 -0.138264
2 0.647689
3 1.523030
4 -0.234153
dtype: float64
In [223]: s2
Out[223]:
0 1
1 4
2 9
3 16
4 25
dtype: int64
# Create DataFrame from dict of Series
In [224]: df3 = DataFrame({'one': s1, 'two':s2})
In [225]: df3
Out[225]:
one two
0 0.496714 1
1 -0.138264 4
2 0.647689 9
3 1.523030 16
4 -0.234153 25
# Note: the union of the two indecies becomes the DataFrame Index and
# missing values at an index result in NaN
In [226]: df4 = DataFrame({'s3': s3, 's5':s5})
In [227]: df4
Out[227]:
s3 s5
a -0.234137 0
b 1.579213 1
c 0.767435 2
d -0.469474 NaN
e 0.542560 NaN
# Use list of numpy arrays. Note each array becomes a row
In [228]: ar1 = np.arange(1, 6)
In [229]: ar2 = ar1 * .1
In [230]: df5 = DataFrame([ar1, ar2], index=['ar1', 'ar2'])
There are other ways to create both DataFrame and Series objects, and I encourage you to test out different ways as well as browse docstrings and documentation.
Indexing (getting slices or chunks of data) a DataFrame is very similar to indexing a Series. An important thing to keep in mind is that the [] syntax indexes along the columns.
In [231]: data = {'state': ['Utah', 'Arizona', 'Idaho', 'Nevada', 'Colorado'],
.....: 'year': [2000, 2001, 2002, 2001, 2002],
.....: 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
.....:
In [232]: df = DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])
In [233]: df
Out[233]:
pop state year
one 1.5 Utah 2000
two 1.7 Arizona 2001
three 3.6 Idaho 2002
four 2.4 Nevada 2001
five 2.9 Colorado 2002
# Access using dict notation
In [234]: df['state']
Out[234]:
one Utah
two Arizona
three Idaho
four Nevada
five Colorado
Name: state, dtype: object
# Access using "dot" notation
In [235]: df.year
Out[235]:
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
# Get slice using boolean array (actually a boolean Series)
In [236]: year2001 = df.year == 2001
In [237]: year2001
Out[237]:
one False
two True
three False
four True
five False
Name: year, dtype: bool
In [238]: df[year2001]
Out[238]:
pop state year
two 1.7 Arizona 2001
four 2.4 Nevada 2001
If instead you want to slice along the rows you use the special .ix field
# Access using row label
In [239]: df.ix['one']
Out[239]:
pop 1.5
state Utah
year 2000
Name: one, dtype: object
# Slice along Row labels
In [240]: df.ix['two':'five']
Out[240]:
pop state year
two 1.7 Arizona 2001
three 3.6 Idaho 2002
four 2.4 Nevada 2001
five 2.9 Colorado 2002
# list of row labels
In [241]: df.ix[['one', 'three', 'four']]
Out[241]:
pop state year
one 1.5 Utah 2000
three 3.6 Idaho 2002
four 2.4 Nevada 2001
# Get slices with integers instead of row labels
In [242]: df.ix[1]
Out[242]:
pop 1.7
state Arizona
year 2001
Name: two, dtype: object
# An integer range slice
In [243]: df.ix[1:3]
Out[243]:
pop state year
two 1.7 Arizona 2001
three 3.6 Idaho 2002
# list of integers
In [244]: df.ix[[1, 3, 4]]
Out[244]:
pop state year
two 1.7 Arizona 2001
four 2.4 Nevada 2001
five 2.9 Colorado 2002
It should be noted that when you have integer row labels, indexing by label or indexing by position can be ambiguous. To overcome this, it is recommended you learn to use loc for row label and iloc for row position. Note that the default behavior is to use label indexing first and fall back on positional indexing if row lookup fails. Note that the loc and iloc methods are new in pandas 0.11. If you don’t have them you should first check your pandas version using pd.__version__ and if you don’t have \(\ge 0.11\) you should update (ask Spencer if you don’t know how to do this).
In [245]: ambiguous = DataFrame(randn(4, 4), index=[1, 1, 3, 4])
In [246]: ambiguous
Out[246]:
0 1 2 3
1 -1.150994 0.375698 -0.600639 -0.291694
1 -0.601707 1.852278 -0.013497 -1.057711
3 0.822545 -1.220844 0.208864 -1.959670
4 -1.328186 0.196861 0.738467 0.171368
In [247]: ambiguous.ix[1]
Out[247]:
0 1 2 3
1 -1.150994 0.375698 -0.600639 -0.291694
1 -0.601707 1.852278 -0.013497 -1.057711
In [248]: ambiguous.loc[1]
Out[248]:
0 1 2 3
1 -1.150994 0.375698 -0.600639 -0.291694
1 -0.601707 1.852278 -0.013497 -1.057711
In [249]: ambiguous.iloc[1]
Out[249]:
0 -0.601707
1 1.852278
2 -0.013497
3 -1.057711
Name: 1, dtype: float64
You can create new columns using dictionary syntax
# Notice single value is broadcasted, like in numpy arrays
In [250]: df['budget'] = '?'
# Could also give list where length is equal to index
In [251]: df['GDP'] = [20, 34., 11, 26, 13]
# Can create new columns using existing ones
In [252]: df['perCapita'] = df.GDP / df['pop']
In [253]: df
Out[253]:
pop state year budget GDP perCapita
one 1.5 Utah 2000 ? 20 13.333333
two 1.7 Arizona 2001 ? 34 20.000000
three 3.6 Idaho 2002 ? 11 3.055556
four 2.4 Nevada 2001 ? 26 10.833333
five 2.9 Colorado 2002 ? 13 4.482759
The same note about data alignment and NaN values when combining Series is also true when combining DataFrame objects. In this case, however, the data alignment happens for both the row and column labels.
In [254]: df6 = DataFrame(randn(10, 4), columns=list('ABCD'))
In [255]: df7 = DataFrame(randn(7, 3), columns=list('ABC'))
In [256]: df6 + df7
Out[256]:
A B C D
0 1.362246 -0.819374 -2.287016 NaN
1 -0.962396 1.972524 0.672369 NaN
2 -0.205676 0.128185 -0.579844 NaN
3 1.999645 0.229227 -1.166880 NaN
4 -0.060845 -0.487970 -0.183054 NaN
5 -0.845280 -1.191093 0.577939 NaN
6 -1.487381 0.582888 0.018922 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
The transpose attribute of a DataFrame simply swaps rows and columns.
In [257]: df
Out[257]:
pop state year budget GDP perCapita
one 1.5 Utah 2000 ? 20 13.333333
two 1.7 Arizona 2001 ? 34 20.000000
three 3.6 Idaho 2002 ? 11 3.055556
four 2.4 Nevada 2001 ? 26 10.833333
five 2.9 Colorado 2002 ? 13 4.482759
In [258]: df.T
Out[258]:
one two three four five
pop 1.5 1.7 3.6 2.4 2.9
state Utah Arizona Idaho Nevada Colorado
year 2000 2001 2002 2001 2002
budget ? ? ? ? ?
GDP 20 34 11 26 13
perCapita 13.33333 20 3.055556 10.83333 4.482759
You have already seen and dealt with Index objects, but we have barely scratched the surface of what Index objects can be or do. The following list gives a few important properties regarding Index objects
One very useful thing about Index objects is that they can be hierarchal, or have multiple levels. This is best understood through example. Below we will create a hierarchal index and apply it to a Series.
In [259]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
.....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
.....:
In [260]: tuples = zip(*arrays)
In [261]: h_ind = pd.MultiIndex.from_tuples(tuples, names=['level1', 'level2'])
In [262]: hi_s = Series(randn(8), index=h_ind)
In [263]: hi_f = DataFrame(randn(8, 4), index=h_ind)
Listing out all the labels and their multiplicities before passing to zip and from_tuples is definitely not the only way to create a MultiIndex. I will show a few more of my favorites below.
# Use cartesian product of a few lists. Note this index has 3 levels
In [264]: from itertools import product
In [265]: words = ['hi', 'hello', 'hola']
In [266]: nums = [10, 20]
In [267]: letters = ['a', 'b']
In [268]: ind = pd.MultiIndex.from_tuples(list(product(words, nums, letters)),
.....: names=['word', 'num', 'let'])
.....:
In [269]: hi_f2 = DataFrame(randn(12, 3), columns=['A', 'B', 'C'], index=ind)
# have column names of DataFrame become outer level of series index using unstack()
In [270]: df3
Out[270]:
one two
0 0.496714 1
1 -0.138264 4
2 0.647689 9
3 1.523030 16
4 -0.234153 25
In [271]: hi_f3 = df3.unstack()
In [272]: hi_f3
Out[272]:
one 0 0.496714
1 -0.138264
2 0.647689
3 1.523030
4 -0.234153
two 0 1.000000
1 4.000000
2 9.000000
3 16.000000
4 25.000000
dtype: float64
# Use index from other DataFrame
In [273]: hi_f4 = DataFrame(np.arange(30).reshape(10, 3), index=hi_f3.index)
In [274]: hi_f4
Out[274]:
0 1 2
one 0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
two 0 15 16 17
1 18 19 20
2 21 22 23
3 24 25 26
4 27 28 29
To index into a hierarchal indexed object, just use the .ix field with multiple slice categories, one for each level, starting at the outermost level. Note if you don’t give inner slices, the whole level is taken. (If some of the examples don’t work it is because I use the development version of pandas and there may be new features. Don’t worry about it for now).
In [275]: hi_f2
Out[275]:
A B C
word num let
hi 10 a 0.184634 0.259883 0.781823
b -1.236951 -1.320457 0.521942
20 a 0.296985 0.250493 0.346448
b -0.680025 0.232254 0.293072
hello 10 a -0.714351 1.865775 0.473833
b -1.191303 0.656554 -0.974682
20 a 0.787085 1.158596 -0.820682
b 0.963376 0.412781 0.822060
hola 10 a 1.896793 -0.245388 -0.753736
b -0.889514 -0.815810 -0.077102
20 a 0.341152 0.276691 0.827183
b 0.013002 1.453534 -0.264657
In [276]: hi_f2.ix['hi']
Out[276]:
A B C
num let
10 a 0.184634 0.259883 0.781823
b -1.236951 -1.320457 0.521942
20 a 0.296985 0.250493 0.346448
b -0.680025 0.232254 0.293072
In [277]: hi_f2.ix['hi', :, :]
Out[277]:
A B C
num let
10 a 0.184634 0.259883 0.781823
b -1.236951 -1.320457 0.521942
20 a 0.296985 0.250493 0.346448
b -0.680025 0.232254 0.293072
In [278]: hi_f2.ix['hi', 10, 'b']
Out[278]:
A -1.236951
B -1.320457
C 0.521942
Name: (hi, 10, b), dtype: float64
# Slicing works too, but......
In [279]: hi_f2.ix['hi':'hello']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-279-e0483a8ad44c> in <module>()
----> 1 hi_f2.ix['hi':'hello']
/Users/spencerlyon2/anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getitem__(self, key)
46 return self._getitem_tuple(key)
47 else:
---> 48 return self._getitem_axis(key, axis=0)
49
50 def _get_label(self, label, axis=0):
/Users/spencerlyon2/anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
389 labels = self.obj._get_axis(axis)
390 if isinstance(key, slice):
--> 391 return self._get_slice_axis(key, axis=axis)
392 elif _is_list_like(key) and not (isinstance(key, tuple) and
393 isinstance(labels, MultiIndex)):
/Users/spencerlyon2/anaconda/lib/python2.7/site-packages/pandas/core/indexing.pyc in _get_slice_axis(self, slice_obj, axis)
636 else:
637 try:
--> 638 indexer = labels.slice_indexer(start, stop, slice_obj.step)
639 except Exception:
640 if _is_index_slice(slice_obj):
/Users/spencerlyon2/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in slice_indexer(self, start, end, step)
1178 This function assumes that the data is sorted, so use at your own peril
1179 """
-> 1180 start_slice, end_slice = self.slice_locs(start, end)
1181 return slice(start_slice, end_slice, step)
1182
/Users/spencerlyon2/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in slice_locs(self, start, end, strict)
2224 if not isinstance(start, tuple):
2225 start = start,
-> 2226 start_slice = self._partial_tup_index(start, side='left')
2227
2228 if end is None:
/Users/spencerlyon2/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in _partial_tup_index(self, tup, side)
2238 if len(tup) > self.lexsort_depth:
2239 raise KeyError('MultiIndex lexsort depth %d, key was length %d' %
-> 2240 (self.lexsort_depth, len(tup)))
2241
2242 n = len(tup)
KeyError: 'MultiIndex lexsort depth 0, key was length 1'
What happened in the last example? The problem with taking slices of indexes with string labels is that they labels need to be sorted lexicographically in order for pandas to be able to slice them up. Luckily for us, pandas will sort the index for us.
In [280]: hi_f2 = hi_f2.sort_index()
In [281]: hi_f2
Out[281]:
A B C
word num let
hello 10 a -0.714351 1.865775 0.473833
b -1.191303 0.656554 -0.974682
20 a 0.787085 1.158596 -0.820682
b 0.963376 0.412781 0.822060
hi 10 a 0.184634 0.259883 0.781823
b -1.236951 -1.320457 0.521942
20 a 0.296985 0.250493 0.346448
b -0.680025 0.232254 0.293072
hola 10 a 1.896793 -0.245388 -0.753736
b -0.889514 -0.815810 -0.077102
20 a 0.341152 0.276691 0.827183
b 0.013002 1.453534 -0.264657
In [282]: hi_f2.ix['hello':'hi']
Out[282]:
A B C
word num let
hello 10 a -0.714351 1.865775 0.473833
b -1.191303 0.656554 -0.974682
20 a 0.787085 1.158596 -0.820682
b 0.963376 0.412781 0.822060
hi 10 a 0.184634 0.259883 0.781823
b -1.236951 -1.320457 0.521942
20 a 0.296985 0.250493 0.346448
b -0.680025 0.232254 0.293072
In [283]: hi_f2.ix[('hello', 10, 'a'):('hi', 20, 'b')]
Out[283]:
A B C
word num let
hello 10 a -0.714351 1.865775 0.473833
b -1.191303 0.656554 -0.974682
20 a 0.787085 1.158596 -0.820682
b 0.963376 0.412781 0.822060
hi 10 a 0.184634 0.259883 0.781823
b -1.236951 -1.320457 0.521942
20 a 0.296985 0.250493 0.346448
b -0.680025 0.232254 0.293072
One very common operation to perform on a hierarchal indexed data set is to take cross sections. This is done using the xs method and is given two arguments: 1.) The name of the row label(s) you would like, 2.) the level at at which they appear in the index.
# Specify level using integer
In [284]: hi_f2.xs(10, level=1)
Out[284]:
A B C
word let
hello a -0.714351 1.865775 0.473833
b -1.191303 0.656554 -0.974682
hi a 0.184634 0.259883 0.781823
b -1.236951 -1.320457 0.521942
hola a 1.896793 -0.245388 -0.753736
b -0.889514 -0.815810 -0.077102
# Or specify level using the level name for same result.
In [285]: hi_f2.xs(10, level='num')
Out[285]:
A B C
word let
hello a -0.714351 1.865775 0.473833
b -1.191303 0.656554 -0.974682
hi a 0.184634 0.259883 0.781823
b -1.236951 -1.320457 0.521942
hola a 1.896793 -0.245388 -0.753736
b -0.889514 -0.815810 -0.077102
# pass multiple row labels and levels to get all 'hello', 'b' combinations
In [286]: hi_f2.xs(['hello', 'b'], level=[0, 'let'])
Out[286]:
A B C
num
10 -1.191303 0.656554 -0.974682
20 0.963376 0.412781 0.822060
This has been a very brief overview of the main data structures in pandas. We will be using them more later so if you do not feel like you have a good understanding of them, consult the pandas documentation
Return to the main pandas lab page.