.. _pandas_types:

Core ``pandas`` Data Structures
===============================

There are three main ``pandas`` data structures: ``Series``, ``DataFrame``, and ``Index``. Each of them is built on top of ``numpy.ndarray`` and behaves much like one. This means that much of the functionality you have learned about ``numpy`` arrays still applies to these data structures, but these ``pandas`` objects have additional, data-specific features. We will provide some high-level principles that apply to each of them and then dig into their specific features. (Note that much of the material in this section comes from the `Intro to Data Structures <http://pandas.pydata.org/pandas-docs/dev/dsintro.html>`_ page in the ``pandas`` documentation.)

One very important thing to understand about ``pandas`` data structures is that they are like ``numpy`` arrays, but they have axis labels. Think of them as a working version of an Excel spreadsheet, where you have labels in the first row and column and the data corresponding to those labels on the inside. An important characteristic of ``pandas`` objects is that the **link between the data and the labels is intrinsic**, meaning that the labels are actually a part of the data and that link will not be broken unless you explicitly break it.

Throughout the rest of the lab, the code examples assume that you have made the following imports.

.. ipython:: python

    import numpy as np
    from numpy.random import randn
    import pandas as pd
    from pandas import Series, DataFrame, Index


.. ipython:: python
    :suppress:

    np.random.seed(42)  # consistent random numbers


``Series``
----------

The ``Series`` can be thought of as a labeled one-dimensional array. It consists of a single dimension of data and an ``Index`` containing the labels. The data in a ``Series`` can be of *any* data type: float, int, string, matplotlib figures, or any other python object.

There are many different ways you can create a ``Series``. Two of the most common are to create a ``Series`` using a numpy ``array`` or a python ``list``.

.. ipython:: python

    # Using a numpy array
    s1 = Series(randn(5))
    s1
    s1.index
    s1.values

    # Using a python list
    s2 = Series([i ** 2 for i in range(1, 6)])
    s2
    s2.index
    s2.values

Notice that each of these ``Series`` objects has both ``index`` and ``values`` attributes. The ``index`` is an ``Index`` type and ``values`` is the underlying numpy ``array`` holding the data. Below we will show a few more ways we could have created a ``Series``.

.. ipython:: python

    # We can explicitly specify the index
    s3 = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
    s3

    # Create using a dict (note the data are actual python lists)
    # Notice the dict keys become the index
    s4 = Series({'first': [1, 2], 'second':[2, 4]})
    s4
    s4.values
    s4.index

    # Passing a dict and index uses the index we pass, but fills values from the dict.
    # Notice the NaN that appears because 'd' is in the index, but not in the dict
    s5 = Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])
    s5

Once we have created a ``Series`` object, there is a lot that we can do with it. Below we will demonstrate that a ``Series`` acts a lot like a ``numpy.ndarray``.

.. ipython:: python

    s3
    # .iloc selects by integer position, just like indexing a numpy array
    s3.iloc[0]
    s3.iloc[:3]
    s3[s3 > s3.mean()]
    s3.iloc[[2, 3, 1, -1, 1]]
    np.exp(s3)

Because ``Series`` objects have a labeled index, they also behave a lot like python dictionaries:

.. ipython:: python

    s3['a']
    s3['b':'d']
    'e' in s3
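
The dictionary analogy extends to a few more methods. The sketch below uses a made-up ``prices`` series (it is not part of the lab data) to show ``get`` with a default value and ``to_dict``. Treat it as an illustration rather than a complete list.

.. code-block:: python

    import pandas as pd

    # Hypothetical data, used only for this illustration
    prices = pd.Series([1.5, 2.0], index=['apple', 'pear'])

    # Like dict.get, a missing label returns the default instead of raising
    known = prices.get('apple')
    missing = prices.get('kiwi', 0.0)
    print(known, missing)

    # keys() returns the index, and to_dict() converts to a real dictionary
    labels = list(prices.keys())
    as_dict = prices.to_dict()
    print(labels, as_dict)

Unlike a dictionary, the ``Series`` keeps its labels in a fixed order, which is what makes label slices such as ``s3['b':'d']`` possible.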

The same vectorized operations that you are used to doing on ``numpy.ndarray`` objects can be done on ``Series``. The **BIG** difference between arrays and ``Series`` objects when it comes to arithmetic is that, because axis labels are an intrinsic property, the data are aligned by label in every operation. The example below will demonstrate.

.. ipython:: python

    s3[1:] + s3[:-1]

Notice that the 'a' and 'e' values (positions 0 and -1 if doing integer indexing) are ``NaN``. This happened because the slice ``s3[1:]`` doesn't contain the 'a' element and the other slice doesn't contain 'e'. Therefore, when summing the two slices we get ``NaN`` values wherever one of the ``Series`` doesn't have data. This illustrates a more general principle: combining ``Series`` objects, via arithmetic operations or otherwise, creates a new ``Series`` whose index is the union of the indexes of the original ``Series``. The adoption of this new index will lead to ``NaN`` values if any of the objects being combined doesn't have data for a particular label. This might seem like unwanted behavior, but being able to write code without doing any explicit data alignment provides great freedom and flexibility and is one feature of ``pandas`` that really distinguishes it from other domain-specific platforms.
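
When the ``NaN`` values produced by alignment are not what you want, there are a couple of common ways to handle them. The following is a minimal sketch using two made-up series, ``left`` and ``right``, whose indexes only partly overlap.

.. code-block:: python

    import pandas as pd

    # Hypothetical series whose indexes only partly overlap
    left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
    right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

    # Plain addition aligns on labels; the result index is the union
    total = left + right
    print(total)

    # Series.add accepts fill_value, which stands in for a label
    # that is missing from one of the two operands
    filled = left.add(right, fill_value=0)
    print(filled)

    # dropna() discards the labels that only one operand had
    print(total.dropna())

Which option is appropriate depends on whether a missing label really means "zero" or means "unknown" in your data.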

The last main thing to talk about with regard to ``Series`` is that they have a ``name`` attribute. The name shows up below the printout when you ask python to print or otherwise display a ``Series``, and it is used in various other ``pandas`` tasks like combining ``Series`` objects into a ``DataFrame`` (more to come on ``DataFrame`` objects next!).

.. ipython:: python

    s3
    print(s3.name)
    s3.name = 'letter_ind'
    s3
    print(s3.name)
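
To see how the ``name`` attribute is used when combining ``Series`` objects, here is a small sketch. The series ``height`` and ``weight`` are invented for the illustration, and ``pd.concat`` is just one of several ways to do the combination.

.. code-block:: python

    import pandas as pd

    # Hypothetical named series sharing the same index
    height = pd.Series([1.6, 1.8], index=['ann', 'bob'], name='height')
    weight = pd.Series([55.0, 80.0], index=['ann', 'bob'], name='weight')

    # Concatenating along axis=1 places each Series in its own column,
    # and the column labels are taken from the name attributes
    table = pd.concat([height, weight], axis=1)
    print(table)

    # Selecting a column gives back a Series named after that column
    print(table['height'].name)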

.. TODO: I need to come up with some questions about building Series

``DataFrame``
-------------

A ``DataFrame`` is a 2-dimensional labeled data structure. It is very much like a ``Series``, but it can have multiple columns, each with its own label. For this reason, you can think of a ``DataFrame`` as being like a spreadsheet or SQL table. It is by far the most commonly used ``pandas`` object, and each of its columns can be thought of as a ``Series``.

As with a ``Series``, there are many ways to create a ``DataFrame``. We will illustrate some of them below.

.. ipython:: python

    # Create DataFrame from 2d-numpy array
    df1 = DataFrame(randn(4, 4))
    df1

    # Give explicit index for columns using columns= keyword argument
    df2 = DataFrame(np.arange(16).reshape(4, 4), columns=list('abcd'))
    df2

    s1
    s2

    # Create DataFrame from dict of Series
    df3 = DataFrame({'one': s1, 'two':s2})
    df3

    # Note: the union of the two indices becomes the DataFrame Index and
    #       missing values at an index result in NaN
    df4 = DataFrame({'s3': s3, 's5':s5})
    df4

    # Use list of numpy arrays. Note each array becomes a row
    ar1 = np.arange(1, 6)
    ar2 = ar1 * .1
    df5 = DataFrame([ar1, ar2], index=['ar1', 'ar2'])
    df5

There are other ways to create both ``DataFrame`` and ``Series`` objects, and I encourage you to test out different ways as well as browse docstrings and documentation.

Indexing (getting slices or chunks of data) a ``DataFrame`` is very similar to indexing a ``Series``. An important thing to keep in mind is that the ``[]`` syntax indexes along the columns.

.. ipython:: python

    data = {'state': ['Utah', 'Arizona', 'Idaho', 'Nevada', 'Colorado'],
            'year': [2000, 2001, 2002, 2001, 2002],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
    df = DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])
    df

    # Access using dict notation
    df['state']

    # Access using "dot" notation
    df.year

    # Get slice using boolean array (actually a boolean Series)
    year2001 = df.year == 2001
    year2001
    df[year2001]

If instead you want to select along the rows, use the ``.loc`` indexer for row labels or the ``.iloc`` indexer for integer row positions. (Older versions of ``pandas`` also provided a mixed ``.ix`` indexer that tried label indexing first and fell back on positional indexing if the label lookup failed. It was deprecated in ``0.20`` and removed in ``1.0``, so it is not used here.)

.. ipython:: python

    # Access using row label
    df.loc['one']

    # Slice along row labels (both end points are included)
    df.loc['two':'five']

    # list of row labels
    df.loc[['one', 'three', 'four']]

    # Get rows by integer position instead of row label
    df.iloc[1]

    # An integer range slice (the end point is excluded, as in numpy)
    df.iloc[1:3]

    # list of integers
    df.iloc[[1, 3, 4]]

It should be noted that when you have integer row labels, indexing by label and indexing by position are easy to confuse, which is exactly the ambiguity that made ``.ix`` error-prone. With ``loc`` a number always means a row label, and with ``iloc`` it always means a row position. Note that ``loc`` and ``iloc`` were new in pandas ``0.11``. If you don't have them, first check your pandas version using ``pd.__version__``, and if it is older than ``0.11`` you should update (ask Spencer if you don't know how to do this).

.. ipython:: python

    ambiguous = DataFrame(randn(4, 4), index=[1, 1, 3, 4])
    ambiguous

    # Label lookup: returns both rows labeled 1
    ambiguous.loc[1]

    # Position lookup: returns the second row
    ambiguous.iloc[1]
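
The difference between label and position lookups also shows up in slices. The sketch below uses a small made-up ``frame`` with integer row labels 1, 2, 3 to contrast the two indexers. The names are hypothetical and chosen only for this example.

.. code-block:: python

    import pandas as pd

    # Hypothetical frame whose integer labels do not match the positions
    frame = pd.DataFrame({'x': [10, 20, 30]}, index=[1, 2, 3])

    # Label 1 is the first row; position 1 is the second row
    by_label = frame.loc[1, 'x']
    by_position = frame.iloc[1, 0]
    print(by_label, by_position)

    # A label slice includes both end points ...
    label_slice = frame.loc[1:2]
    # ... while a position slice excludes the end point, as in numpy
    position_slice = frame.iloc[1:2]
    print(len(label_slice), len(position_slice))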

You can create new columns using dictionary syntax:

.. ipython:: python

    # Notice single value is broadcast, like in numpy arrays
    df['budget'] = '?'

    # Could also give list where length is equal to index
    df['GDP'] = [20, 34., 11, 26, 13]

    # Can create new columns using existing ones
    df['perCapita'] = df.GDP / df['pop']
    df

The same note about data alignment and ``NaN`` values when combining ``Series`` is also true when combining ``DataFrame`` objects. In this case, however, the data alignment happens for both the row and column labels.

.. ipython:: python

    df6 = DataFrame(randn(10, 4), columns=list('ABCD'))
    df7 = DataFrame(randn(7, 3), columns=list('ABC'))
    df6 + df7
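
Because the random frames above are large, it can be hard to see exactly where the ``NaN`` values come from. Here is a smaller sketch with made-up frames ``big`` and ``small``. It also shows the ``fill_value`` argument of ``DataFrame.add`` as one possible way of handling the cells that only one frame has.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical frames: small lacks row 2 and column 'C'
    big = pd.DataFrame(np.ones((3, 3)), columns=list('ABC'))
    small = pd.DataFrame(np.ones((2, 2)), columns=list('AB'))

    # Alignment happens on both row and column labels, so only the
    # four cells present in both frames get a number
    summed = big + small
    print(summed)

    # fill_value substitutes for a cell missing from one frame
    filled = big.add(small, fill_value=0)
    print(filled)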

The transpose attribute of a ``DataFrame`` simply swaps rows and columns.

.. ipython:: python

    df
    df.T

.. TODO: I need to come up with some questions about building DataFrames


``Index``
---------

You have already seen and dealt with ``Index`` objects, but we have barely scratched the surface of what ``Index`` objects can be or do. The following list gives a few important properties of ``Index`` objects:

- They are immutable, meaning once they are created, the user cannot change them. This is very desirable because immutable indexes preserve the intrinsic link between data and labels and allow you to safely share ``Index`` objects across data.
- They behave a lot like the python ``set`` type and respond to logic like ``x in index``.
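
Both properties can be checked directly. The following is a minimal sketch with made-up indexes ``idx`` and ``other``. It assumes that item assignment on an ``Index`` raises ``TypeError``, which is what current ``pandas`` versions do.

.. code-block:: python

    import pandas as pd

    # Hypothetical indexes, used only for this illustration
    idx = pd.Index(['a', 'b', 'c'])
    other = pd.Index(['b', 'c', 'd'])

    # Item assignment is rejected because an Index is immutable
    try:
        idx[0] = 'z'
        rejected = False
    except TypeError:
        rejected = True
    print(rejected)

    # Membership tests and set-style methods
    print('b' in idx)
    common = idx.intersection(other)
    combined = idx.union(other)
    print(list(common), list(combined))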

One very useful thing about ``Index`` objects is that they can be hierarchical, or have multiple levels. This is best understood through example. Below we will create a hierarchical index and apply it to a ``Series`` and to a ``DataFrame``.

.. ipython:: python

    arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
              ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

    tuples = list(zip(*arrays))
    h_ind = pd.MultiIndex.from_tuples(tuples, names=['level1', 'level2'])

    hi_s = Series(randn(8), index=h_ind)

    hi_f = DataFrame(randn(8, 4), index=h_ind)

Listing out all the labels and their multiplicities before passing to ``zip`` and ``from_tuples`` is definitely not the only way to create a ``MultiIndex``. I will show a few more of my favorites below.

.. ipython:: python

    # Use cartesian product of a few lists. Note this index has 3 levels
    from itertools import product
    words = ['hi', 'hello', 'hola']
    nums = [10, 20]
    letters = ['a', 'b']
    ind = pd.MultiIndex.from_tuples(list(product(words, nums, letters)),
                                    names=['word', 'num', 'let'])
    hi_f2 = DataFrame(randn(12, 3), columns=['A', 'B', 'C'], index=ind)

    # have column names of DataFrame become outer level of series index using unstack()
    df3
    hi_f3 = df3.unstack()
    hi_f3

    # Use index from other DataFrame
    hi_f4 = DataFrame(np.arange(30).reshape(10, 3), index=hi_f3.index)
    hi_f4
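
Depending on your ``pandas`` version, there are also constructors that skip the intermediate list of tuples. The sketch below assumes ``MultiIndex.from_product`` and ``MultiIndex.from_arrays`` are available (they are in recent versions) and checks that ``from_product`` gives the same index as the ``itertools.product`` approach above.

.. code-block:: python

    from itertools import product

    import pandas as pd

    words = ['hi', 'hello', 'hola']
    nums = [10, 20]
    letters = ['a', 'b']
    names = ['word', 'num', 'let']

    # The approach used above: build the tuples by hand
    via_tuples = pd.MultiIndex.from_tuples(
        list(product(words, nums, letters)), names=names)

    # from_product takes the lists themselves
    via_product = pd.MultiIndex.from_product([words, nums, letters],
                                             names=names)
    same = via_product.equals(via_tuples)
    print(same)

    # from_arrays takes one array per level, so no zip is needed
    arrays = [['bar', 'bar', 'baz', 'baz'],
              ['one', 'two', 'one', 'two']]
    via_arrays = pd.MultiIndex.from_arrays(arrays,
                                           names=['level1', 'level2'])
    print(via_arrays)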

To index into a hierarchically indexed object, use ``.loc`` with one label for each level, starting at the outermost level. Note that if you don't give the inner labels, the whole level is taken. (If some of the examples don't work, it is because I use the development version of pandas and there may be new features. Don't worry about it for now.)

.. ipython:: python

    hi_f2
    hi_f2.loc['hi']
    # Explicit "take everything" slices for the inner levels select the same rows
    hi_f2.loc[('hi', slice(None), slice(None)), :]
    hi_f2.loc[('hi', 10, 'b')]

    # Slicing works too, but......
    hi_f2.loc['hi':'hello']

What happened in the last example? The problem with taking slices of indexes with string labels is that the labels need to be sorted lexicographically in order for ``pandas`` to be able to slice them up. Luckily, ``pandas`` will sort the index for us.

.. ipython:: python

    hi_f2 = hi_f2.sort_index()
    hi_f2

    hi_f2.loc['hello':'hi']

    hi_f2.loc[('hello', 10, 'a'):('hi', 20, 'b')]
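
If you are unsure whether an index is sorted, you can ask it. The following sketch builds a small made-up two-level ``frame`` (not the ``hi_f2`` from above) and uses the ``is_monotonic_increasing`` property to check the ordering before and after ``sort_index``.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical frame whose outer labels are not in sorted order
    ind = pd.MultiIndex.from_product([['hi', 'hello', 'hola'], [10, 20]],
                                     names=['word', 'num'])
    frame = pd.DataFrame({'A': np.arange(6)}, index=ind)

    # 'hi' comes before 'hello', so the index is not sorted
    sorted_before = frame.index.is_monotonic_increasing
    print(sorted_before)

    # After sorting, label slices on the outer level are safe
    ordered = frame.sort_index()
    sorted_after = ordered.index.is_monotonic_increasing
    print(sorted_after)

    subset = ordered.loc['hello':'hi']
    print(subset)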

One very common operation to perform on a hierarchically indexed data set is to take cross sections. This is done using the ``xs`` method, which is given two arguments: (1) the row label(s) you would like and (2) the level at which they appear in the index.

.. ipython:: python

    # Specify level using integer
    hi_f2.xs(10, level=1)

    # Or specify level using the level name for same result.
    hi_f2.xs(10, level='num')

    # Pass a tuple of row labels and a list of levels to get all 'hello', 'b' combinations
    hi_f2.xs(('hello', 'b'), level=[0, 'let'])
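
One way to understand what ``xs`` does is to compare it with a boolean mask built from the index. The sketch below uses a small made-up ``frame`` and assumes the ``drop_level`` argument of ``xs`` and the ``get_level_values`` method of the index, both of which exist in recent ``pandas`` versions.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical two-level frame, used only for this illustration
    ind = pd.MultiIndex.from_product([['hello', 'hi'], [10, 20]],
                                     names=['word', 'num'])
    frame = pd.DataFrame({'A': np.arange(4)}, index=ind)

    # xs selects the rows where 'num' is 10 and drops that level
    cross = frame.xs(10, level='num')
    print(cross)

    # The same rows via a boolean mask; here both levels are kept
    mask = frame.index.get_level_values('num') == 10
    masked = frame[mask]
    print(masked)

    # drop_level=False makes xs keep the selected level as well
    kept = frame.xs(10, level='num', drop_level=False)
    print(kept)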

This has been a very brief overview of the main data structures in ``pandas``. We will be using them more later, so if you do not feel like you have a good understanding of them, consult the ``pandas`` `documentation <http://pandas.pydata.org/pandas-docs/dev/index.html>`_.

Return to the :ref:`main <pandas_lab>` ``pandas`` lab page.

.. TODO: Write some questions for creating and using indexes