.. _pandas_types:

Core ``pandas`` Data Structures
===============================

There are three main ``pandas`` data structures: ``Series``, ``DataFrame``, and
``Index``. Each of them is built on top of ``numpy.ndarray``. This means that
most of what you have learned about ``numpy`` arrays still applies to these data
structures, but these ``pandas`` objects have additional, data-specific
features. We will provide some high-level principles that apply to each of them
and then dig into their specific features. (Note that many of the materials in
this section come from the `Intro to Data Structures `_ page in the ``pandas``
documentation.)

One very important thing to understand about ``pandas`` data structures is that
they are like ``numpy`` arrays, but they have axis labels. Think of them as a
working version of an Excel spreadsheet, with labels in the first row and column
and the data corresponding to those labels on the inside. An important
characteristic of ``pandas`` objects is that the **link between the data and the
labels is intrinsic**, meaning that the labels are actually a part of the data
and that link will not be broken unless you explicitly break it.

Throughout the rest of the lab, the code examples assume that you have made the
following imports.

.. ipython:: python

    import numpy as np
    from numpy.random import randn
    import pandas as pd
    from pandas import Series, DataFrame, Index

.. ipython:: python
    :suppress:

    np.random.seed(42)  # consistent random numbers

``Series``
----------

The ``Series`` can be thought of as a labeled one-dimensional array. It consists
of a single dimension of data and an ``Index`` containing the labels. The data
in a ``Series`` can be of *any* data type: float, int, string, matplotlib
figures, and any other python object.

There are many different ways you can create a ``Series``. Two of the most
common are to create a ``Series`` using a numpy ``array`` or a python ``list``.

.. ipython:: python

    # Using a numpy array
    s1 = Series(randn(5))
    s1
    s1.index
    s1.values

    # Using a python list
    s2 = Series([i ** 2 for i in range(1, 6)])
    s2
    s2.index
    s2.values

Notice that each of these ``Series`` objects has both ``index`` and ``values``
attributes. The ``index`` is an ``Index`` type and ``values`` is the underlying
numpy ``array`` holding the data. Below we will show a few more ways we could
have created a ``Series``.

.. ipython:: python

    # We can explicitly specify the index
    s3 = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
    s3

    # Create using a dict (note the data are actual python lists)
    # Notice the dict keys become the index
    s4 = Series({'first': [1, 2], 'second': [2, 4]})
    s4
    s4.values
    s4.index

    # Passing a dict and index uses the index we pass, but fills values from the dict.
    # Notice the NaN that appears because 'd' is in the index, but not in the dict
    s5 = Series({'a': 0., 'b': 1., 'c': 2.}, index=['b', 'c', 'd', 'a'])
    s5

Once we have created a ``Series`` object, there is a lot that we can do with it.
Below we will demonstrate that they act a lot like ``numpy.ndarray``.

.. ipython:: python

    s3
    s3[0]
    s3[:3]
    s3[s3 > s3.mean()]
    s3[[2, 3, 1, -1, 1]]
    np.exp(s3)

Because ``Series`` objects have a labeled index, they also behave a lot like
python dictionaries:

.. ipython:: python

    s3['a']
    s3['b':'d']
    'e' in s3

The same vectorized operations that you are used to doing on ``numpy`` arrays
can be done on ``Series``. The **BIG** difference between arrays and ``Series``
objects when it comes to arithmetic is that, because axis labels are an
intrinsic property, data is aligned on those labels in every operation. The
example below demonstrates this.

.. ipython:: python

    s3[1:] + s3[:-1]

Notice that the 'a' and 'e' (or 0, -1 if doing integer indexing) values are
``NaN``. This happened because the slice ``s3[1:]`` doesn't contain the 'a'
element and the other slice doesn't contain 'e'.
Therefore, when doing the sum of the two slices we get ``NaN`` values where one
of the ``Series`` doesn't have data. This illustrates a more general principle.
The combination of ``Series`` objects, via arithmetic operations or otherwise,
creates a new ``Series`` whose index is the union of the indexes of the original
``Series``. The adoption of this new index will lead to ``NaN`` values wherever
any of the objects being combined doesn't have data for a particular label. This
might seem like unwanted behavior, but being able to write code without doing
any explicit data alignment provides great freedom and flexibility and is one
feature of ``pandas`` that really distinguishes it from the other domain
specific platforms.

The last main thing to talk about with regards to ``Series`` is that they have a
``name`` attribute. This will show up below the printout when you ask python to
print or otherwise display a series and is used in various other ``pandas``
tasks like combining ``Series`` objects into a ``DataFrame`` (more to come on
``DataFrame`` objects next!).

.. ipython:: python

    s3
    print(s3.name)
    s3.name = 'letter_ind'
    s3
    print(s3.name)

.. TODO: I need to come up with some questions about building Series

``DataFrame``
-------------

A ``DataFrame`` is a 2-dimensional labeled data structure. It is very much like
a ``Series``, but it can have multiple columns with associated labels. For this
reason, you can think of a ``DataFrame`` as being like a spreadsheet or SQL
table. It is by far the most commonly used ``pandas`` object, and a ``Series``
can be thought of as a single column of a ``DataFrame``.

Like a ``Series``, there are many ways to create a ``DataFrame``. We will
illustrate some of them below.

.. ipython:: python

    # Create DataFrame from 2d-numpy array
    df1 = DataFrame(randn(4, 4))
    df1

    # Give explicit labels for columns using columns= keyword argument
    df2 = DataFrame(np.arange(16).reshape(4, 4), columns=list('abcd'))
    df2

    s1
    s2

    # Create DataFrame from dict of Series
    df3 = DataFrame({'one': s1, 'two': s2})
    df3

    # Note: the union of the two indices becomes the DataFrame Index and
    # missing values at an index result in NaN
    df4 = DataFrame({'s3': s3, 's5': s5})
    df4

    # Use list of numpy arrays. Note each array becomes a row
    ar1 = np.arange(1, 6)
    ar2 = ar1 * .1
    df5 = DataFrame([ar1, ar2], index=['ar1', 'ar2'])
    df5

There are other ways to create both ``DataFrame`` and ``Series`` objects, and I
encourage you to test out different ways as well as browse docstrings and
documentation.

Indexing (getting slices or chunks of data) a ``DataFrame`` is very similar to
indexing a ``Series``. An important thing to keep in mind is that the ``[]``
syntax indexes along the columns.

.. ipython:: python

    data = {'state': ['Utah', 'Arizona', 'Idaho', 'Nevada', 'Colorado'],
            'year': [2000, 2001, 2002, 2001, 2002],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
    df = DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])
    df

    # Access using dict notation
    df['state']

    # Access using "dot" notation
    df.year

    # Get slice using boolean array (actually a boolean Series)
    year2001 = df.year == 2001
    year2001
    df[year2001]

If instead you want to slice along the rows, you use the special ``.ix`` field.

.. ipython:: python

    # Access using row label
    df.ix['one']

    # Slice along row labels
    df.ix['two':'five']

    # List of row labels
    df.ix[['one', 'three', 'four']]

    # Get slices with integers instead of row labels
    df.ix[1]

    # An integer range slice
    df.ix[1:3]

    # List of integers
    df.ix[[1, 3, 4]]

It should be noted that when you have integer row labels, indexing by label or
indexing by position can be ambiguous. To overcome this, it is recommended you
learn to use ``loc`` for row label and ``iloc`` for row position.
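
As a small illustration of why the distinction matters, the sketch below builds
a ``Series`` whose integer labels do not match the row positions. The name
``s_int`` and its values are made up for this example and are not used elsewhere
in the lab.

.. ipython:: python

    import pandas as pd

    # Hypothetical data: labels start at 1, positions start at 0
    s_int = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])

    # Label lookup: the row whose label is 1 (the first row)
    by_label = s_int.loc[1]
    by_label

    # Positional lookup: the row at position 1 (the second row)
    by_position = s_int.iloc[1]
    by_position

    # Label slices include the end label; positional slices exclude the end
    label_slice = s_int.loc[1:2]
    position_slice = s_int.iloc[1:2]
    label_slice
    position_slice

Here ``loc[1:2]`` returns two rows while ``iloc[1:2]`` returns one, which is
another reason to be explicit about which kind of indexing you mean.
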
Note that the default behavior of ``.ix`` is to use label indexing first and
fall back on positional indexing if the label lookup fails.

Note that the ``loc`` and ``iloc`` indexers are new in pandas ``0.11``. If you
don't have them you should first check your pandas version using
``pd.__version__`` and if you don't have :math:`\ge 0.11` you should update (ask
Spencer if you don't know how to do this). In later releases ``.ix`` was
deprecated (``0.20``) and then removed (``1.0``), so on a recent pandas use
``loc`` and ``iloc`` in place of the ``.ix`` calls in this lab.

.. ipython:: python

    ambiguous = DataFrame(randn(4, 4), index=[1, 1, 3, 4])
    ambiguous
    ambiguous.ix[1]
    ambiguous.loc[1]
    ambiguous.iloc[1]

You can create new columns using dictionary syntax.

.. ipython:: python

    # Notice single value is broadcast, like in numpy arrays
    df['budget'] = '?'

    # Could also give list where length is equal to index
    df['GDP'] = [20, 34., 11, 26, 13]

    # Can create new columns using existing ones
    df['perCapita'] = df.GDP / df['pop']
    df

The same note about data alignment and ``NaN`` values when combining ``Series``
is also true when combining ``DataFrame`` objects. In this case, however, the
data alignment happens for both the row and column labels.

.. ipython:: python

    df6 = DataFrame(randn(10, 4), columns=list('ABCD'))
    df7 = DataFrame(randn(7, 3), columns=list('ABC'))
    df6 + df7

The transpose attribute (``T``) of a ``DataFrame`` simply swaps rows and
columns.

.. ipython:: python

    df
    df.T

.. TODO: I need to come up with some questions about building DataFrames

``Index``
---------

You have already seen and dealt with ``Index`` objects, but we have barely
scratched the surface of what ``Index`` objects can be or do. The following list
gives a few important properties of ``Index`` objects:

- They are immutable, meaning once they are created, the user cannot change
  them. This is very desirable because immutable indexes preserve the intrinsic
  link between data and labels and allow you to safely share ``Index`` objects
  across data.
- They behave a lot like the python ``set`` type and respond to logic like
  ``x in index``.

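
Both properties in the list above are easy to check. The following is a minimal
sketch; the names ``idx`` and ``other`` and their labels are made up for
illustration, and the exact error message may differ between pandas versions.

.. ipython:: python

    import pandas as pd

    # Hypothetical labels, used only for this illustration
    idx = pd.Index(['a', 'b', 'c'])
    other = pd.Index(['b', 'c', 'd'])

    # Membership tests work as they do for a python set
    has_a = 'a' in idx
    has_a

    # Set-style operations return new Index objects
    both = idx.intersection(other)
    either = idx.union(other)
    both
    either

    # Trying to change a label in place raises a TypeError
    try:
        idx[0] = 'z'
        assignment_failed = False
    except TypeError as err:
        assignment_failed = True
        print('cannot modify an Index:', err)

Because the failed assignment leaves ``idx`` untouched, any ``Series`` or
``DataFrame`` sharing that ``Index`` keeps its labels intact.
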
One very useful thing about ``Index`` objects is that they can be hierarchical,
or have multiple levels. This is best understood through example. Below we will
create a hierarchical index and apply it to a ``Series`` and a ``DataFrame``.

.. ipython:: python

    arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
              ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
    tuples = list(zip(*arrays))
    h_ind = pd.MultiIndex.from_tuples(tuples, names=['level1', 'level2'])
    hi_s = Series(randn(8), index=h_ind)
    hi_f = DataFrame(randn(8, 4), index=h_ind)

Listing out all the labels and their multiplicities before passing to ``zip``
and ``from_tuples`` is definitely not the only way to create a ``MultiIndex``. I
will show a few more of my favorites below.

.. ipython:: python

    # Use cartesian product of a few lists. Note this index has 3 levels
    from itertools import product
    words = ['hi', 'hello', 'hola']
    nums = [10, 20]
    letters = ['a', 'b']
    ind = pd.MultiIndex.from_tuples(list(product(words, nums, letters)),
                                    names=['word', 'num', 'let'])
    hi_f2 = DataFrame(randn(12, 3), columns=['A', 'B', 'C'], index=ind)

    # Have column names of DataFrame become outer level of series index using unstack()
    df3
    hi_f3 = df3.unstack()
    hi_f3

    # Reuse the index from another pandas object
    hi_f4 = DataFrame(np.arange(30).reshape(10, 3), index=hi_f3.index)
    hi_f4

To index into a hierarchically indexed object, just use the ``.ix`` field with
multiple slice categories, one for each level, starting at the outermost level.
Note that if you don't give inner slices, the whole level is taken. (If some of
the examples don't work it is because I use the development version of pandas
and there may be new features. Don't worry about it for now.)

.. ipython:: python

    hi_f2
    hi_f2.ix['hi']
    hi_f2.ix['hi', :, :]
    hi_f2.ix['hi', 10, 'b']

    # Slicing works too, but......
    hi_f2.ix['hi':'hello']

What happened in the last example?
The problem with taking slices of indexes with string labels is that the labels
need to be sorted lexicographically in order for ``pandas`` to be able to slice
them up. Luckily for us, ``pandas`` will sort the index for us.

.. ipython:: python

    hi_f2 = hi_f2.sort_index()
    hi_f2
    hi_f2.ix['hello':'hi']
    hi_f2.ix[('hello', 10, 'a'):('hi', 20, 'b')]

One very common operation to perform on a hierarchically indexed data set is to
take cross sections. This is done using the ``xs`` method, which is given two
arguments: (1) the row label(s) you would like and (2) the level at which they
appear in the index.

.. ipython:: python

    # Specify level using integer
    hi_f2.xs(10, level=1)

    # Or specify level using the level name for same result.
    hi_f2.xs(10, level='num')

    # Pass a tuple of row labels and a list of levels to get all 'hello', 'b' combinations
    hi_f2.xs(('hello', 'b'), level=[0, 'let'])

This has been a very brief overview of the main data structures in ``pandas``.
We will be using them more later, so if you do not feel like you have a good
understanding of them, consult the ``pandas`` `documentation `_.

Return to the :ref:`main ` ``pandas`` lab page.

.. TODO: Write some questions for creating and using indexes