.. _pandas_lab:

************************************************
pandas: Data Handling and Analysis in Python
************************************************

Author: Spencer Lyon

.. ipython:: python
   :suppress:

   # Setting up namespace
   import numpy as np
   from numpy.random import randn
   import pandas as pd
   from pandas import Series, DataFrame, Index
   np.random.seed(42)  # consistent random numbers
   tips = pd.read_csv('https://raw.github.com/pydata/pydata-book/master/ch08/tips.csv')
   tips['tip_pct'] = tips['tip'] / tips['total_bill']

Many research tasks require you to use and analyze data. Most of the time, before you can actually get to the analysis you have to collect, organize, clean, munge, and otherwise prepare the data. As most data scientists will tell you, the preparation phase takes the majority of the user's time.

As with most things, there are difficult ways and easier ways to work with data on a computer. Domain-specific platforms like Stata, SQL, SAS, and R are very capable data analysis and handling packages, but they are all lacking in one way or another. Python is a very flexible and powerful language that overcomes most, if not all, of the major pitfalls inherent in the domain-specific languages, but historically has not been as good at data handling. Enter pandas. pandas is an extremely efficient, powerful, and flexible data package for python. The following is a quote taken from the pandas website:

   [pandas] has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language.

It is already well on its way toward this goal.

The purpose of this lab is to introduce you to the basic data structures in pandas, teach you some helpful pandas tricks, and get you some experience handling and analyzing real data. The text of this lab is somewhat detailed and should be thought of as a reference to look back on when you need to look something up, rather than a text you need to fully understand on the first read.

The remainder of this lab is structured a bit differently from the other labs you have seen so far. I will present a lot of information at the beginning and save most assignments until the end of the lab. It is structured this way because there are a number of foundational things you need to know about pandas before being able to do meaningful analysis with actual data. As such, I recommend that you spend the first hour or so reading through the material and the last 3 hours (probably a bit more) working through the exercises. Don't worry about mastering all of the material on your initial reading. It is meant to introduce you to the topics and serve as a reference for you to look back on while completing the assignments.

.. note::

   A lot of the content of this lab was drawn from two main sources: the `pandas documentation `_, and the excellent book by Wes McKinney (main author of pandas), `Python for Data Analysis `_.

Core pandas Data Structures
===============================

Complete the introduction found on the :ref:`pandas objects ` page.

To use in the examples below, I define the following objects:

.. ipython:: python

   ser = Series(np.arange(10, 16))
   df = DataFrame(np.arange(30).reshape(6, 5), columns=list('abcde'))
   mul_df = DataFrame(np.arange(12).reshape((4, 3)),
                      index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                      columns=[['Ohio', 'Ohio', 'Colorado'],
                               ['Green', 'Red', 'Green']])
   mul_df.index.names = ['let', 'num']
   mul_df.columns.names = ['state', 'color']
   ts_df = pd.util.testing.makeTimeDataFrame(nper=500)
   big_df = DataFrame(np.random.randint(500, size=(500, 10)),
                      columns=list('abcdefghij'))
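If you want a quick reminder of how objects like these are built in the first place, the following is a minimal sketch (separate from the lab's setup, with made-up numbers) showing that Series and DataFrame objects can also be constructed directly from dicts, with the dict keys becoming labels:

.. code-block:: python

   from pandas import Series, DataFrame

   # A Series built from a dict: the keys become the index
   pop = Series({'UT': 2.9, 'TX': 26.1, 'NY': 19.6})

   # A DataFrame built from a dict of equal-length lists: the keys become columns
   states = DataFrame({'state': ['UT', 'TX', 'NY'],
                       'pop': [2.9, 26.1, 19.6]})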
Essential Functionality
=======================

Viewing Data
------------

Now that we understand how to construct pandas objects, I will go over some basic functionality. pandas has an intelligent printing system that won't take over your screen by trying to print something that is too large.

.. ipython:: python

   big_df

There are many ways to force pandas to show you at least parts of your data.

.. ipython:: python

   # Slicing
   big_df.ix[:5]
   big_df.ix[:5, ::5]

   # Using the head() method -- shows the first n rows
   big_df.head()
   big_df.head(5)  # default is 5
   big_df.head(7)

   # Using the tail() method -- shows the last n rows
   big_df.tail()
   big_df.tail(5)  # default is 5
   big_df.tail(7)

Reindexing
----------

A very important method for pandas objects is ``reindex``, which means to create a new object with the data *conformed* to a new index.

.. ipython:: python

   ser
   ser.reindex([1, 2, 4, 6])

Notice that the call to ``reindex`` used the data contained in ``ser`` where possible and put placeholders for missing values where the data didn't exist in ``ser``. The same holds for DataFrame objects:

.. ipython:: python

   df
   df.reindex([1, 3, 7])

If you don't want missing data to appear in the reindexed object, you can use some interpolation logic by passing an interpolation rule to the ``method`` argument. The two main values that can be passed are ``ffill`` and ``bfill``.

.. ipython:: python

   # Read the docstring for more information
   df.reindex([1, 3, 5, 7], method='ffill')

With a DataFrame you can reindex the rows like we have been doing, but you can also reindex the rows and take only a few columns, or just reindex the columns themselves.

.. ipython:: python

   df.reindex([1, 3], columns=['a', 'b', 'e'])
   df.reindex(columns=['a', 'b', 'e'])

.. note::

   As will be seen later, pandas has very flexible data alignment and handling routines. A lot of the work done behind the scenes to align data sets uses the ``reindex`` method. Also recall the use of the ``ix`` indexing field that was introduced with the data structures on the other page. It is a very robust and powerful way to accomplish many re-indexing tasks.
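Missing entries don't have to come out as ``NaN`` either. If you know what a missing observation should look like, ``reindex`` also accepts a ``fill_value`` argument. A minimal sketch, reusing the ``df`` defined above (the labels 6, 7, and 'z' are simply ones not present in ``df``):

.. code-block:: python

   # Rows 6 and 7 are not in df's index; without fill_value they would be NaN
   df.reindex([1, 3, 6, 7], fill_value=0)

   # The same idea works when reindexing the columns
   df.reindex(columns=['a', 'b', 'z'], fill_value=-1)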
Arithmetic and Data Alignment
-----------------------------

We have already seen that when we do basic arithmetic with pandas objects, the intrinsic link between data and labels is not broken. This results in a new pandas object whose row/column index is the union of the indexes of the original objects, with ``NaN`` propagated where the indexes didn't align. This is very useful for doing computation with data sets that need to be aligned, but sometimes isn't what we want. All pandas objects have ``add``, ``sub``, ``mul``, and ``div`` methods that give you more flexibility over this behavior.

.. ipython:: python

   df1 = DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'),
                   index=['AZ', 'TX', 'NY'])
   df2 = DataFrame(np.arange(12).reshape((4, 3)), columns=list('bcd'),
                   index=['UT', 'AZ', 'TX', 'NY'])

   # Produces NaN values
   df1 + df2

   # Same as above
   df1.add(df2)

   # No NaN's
   df1.add(df2, fill_value=0)

Arithmetic between DataFrame and Series is well defined and is analogous to operations between 1- and 2-dimensional numpy arrays, with the added benefit of data alignment.

.. ipython:: python

   series = df2.ix[0]
   df2 - series

   ser2 = Series(range(4), index=list('bcde'))

   # Note the NaN's for column 'e', which doesn't appear in df2
   df2 + ser2

   # We can also align on the rows
   ser3 = df2['c']
   ser3 + df2

   # But to broadcast down the rows we need the arithmetic methods
   df2.sub(ser3, axis=0)

Function Application and Mapping
--------------------------------

As we saw on the other page, pandas objects are a subclass of ``numpy.ndarray`` and as such can be passed as arguments to all numpy ufuncs.

.. ipython:: python

   np.sin(df)
   np.abs(np.sin(df))

Another very common operation is to apply a function to the 1d arrays made up of the columns or rows of a pandas object. For example, if we wanted to find the total distance spanned by the values in each row (column) of a data frame, we could do something like this:

.. ipython:: python

   f = lambda x: x.max() - x.min()
   rand_df = DataFrame(randn(5, 5), columns=list('abcde'))
   rand_df
   rand_df.apply(f)
   rand_df.apply(f, axis=1)

The function passed to ``apply`` doesn't need to return scalars. In fact, it can return any python object you want.

.. ipython:: python

   f2 = lambda x: 'Max: %.3f, Min: %.3f' % (x.max(), x.min())
   f3 = lambda x: (x.max(), x.min())
   f4 = lambda x: Series([x.max(), x.min()], index=['max', 'min'])
   rand_df.apply(f2)
   rand_df.apply(f3)
   rand_df.apply(f4, axis=1)

You can also apply functions element-wise to every item in a DataFrame using the ``applymap`` method.

.. ipython:: python

   formatter = lambda x: '%.2f' % x
   rand_df.applymap(formatter)

As I am sure you are beginning to see, you can be very creative in your use of ``apply`` and ``applymap`` to accomplish some very sophisticated and powerful tasks.

Sorting and Ranking
-------------------

Another common task is to sort the data based on values or keys. This is very easy and efficient in pandas (I make the note about efficiency because all of the operations below are easy enough to compute on your own, but pandas has highly optimized code for doing them).

.. ipython:: python

   frame = DataFrame([[4, 0], [7, 1], [-3, 0], [2, 1]],
                     columns=['three', 'one'],
                     index=[20, 40, 10, 30])
   frame

   # Sort rows numerically by index
   frame.sort_index()

   # Sort column labels alphabetically
   frame.sort_index(axis=1)

   # Sort by the values in column 'one'
   frame.sort_index(by='one')

   # Sort by 'one', then 'three'
   frame.sort_index(by=['one', 'three'])

   # Sort by 'three', then 'one'
   frame.sort_index(by=['three', 'one'])
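The heading also mentions ranking: ``rank`` assigns each value its position in the sorted order rather than reordering the data (ties receive the average rank by default). A minimal sketch using the ``frame`` just defined:

.. code-block:: python

   # Rank the values in each column (1 = smallest); ties get the average rank
   frame.rank()

   # Rank across each row instead
   frame.rank(axis=1)

   # Rank a single column, breaking ties by order of appearance
   frame['one'].rank(method='first')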
Other useful methods
--------------------

There is so much more that pandas can do than we have time to look at individually right now. I will simply list useful DataFrame or Series methods, grouped into logical categories.

Summarizing data:

- ``count``
- ``describe``
- ``min``, ``max``
- ``idxmin``, ``idxmax``
- ``argmin``, ``argmax``
- ``quantile``
- ``sum``, ``mean``, ``median``, ``mad``
- ``var``, ``std``, ``corr``, ``cov``, ``corrwith``
- ``skew``, ``kurt``
- ``cumsum``, ``cummin``, ``cummax``, ``cumprod``
- ``diff``
- ``pct_change``

Membership, unique values, and value counts:

- ``unique``
- ``value_counts``
- ``isin``

Missing data:

- ``isnull``
- ``notnull``
- ``dropna``
- ``fillna``

Manipulating data:

- ``swaplevel``
- ``sortlevel``
- the ``level`` argument to most of the above methods

Plotting:

- ``plot``: This one has a ton of functionality, so you will want to refer to the `plotting docs `_ for more info.

For an overview of other pandas features, see the documentation.

Time Series Analysis
====================

Time series analysis is a huge topic that could fill books, so I will only briefly highlight some of the key features. For more information see the `time series docs `_.

Generating date ranges
----------------------

One common task in economic or financial data analysis is to generate a range of equally spaced dates. An example might be that you know you have quarterly data starting in Q1 2000 and ending in Q4 2012 and you would like to tell python (pandas) about those dates. To do this you will use the ``pandas.date_range`` function. I will illustrate with some examples below:

.. ipython:: python

   # Example from above. Note the robust string parsing
   pd.date_range(start='1-1-2000', end='12/31/2012', freq='Q')

   # Maybe we wanted start-of-period instead of end-of-period dates
   pd.date_range(start='1-1-2000', end='12/31/2012', freq='QS')

   # We might have the frequency, starting period, and number of periods
   pd.date_range(start='January 1 2000', freq='QS', periods=52)

In the above examples I only used quarterly data, but there are many different frequencies I could have used. Below I list many of them.

.. hlist::
   :columns: 2

   * ``B``: business day frequency
   * ``D``: calendar day frequency
   * ``W``: weekly frequency
   * ``M``: month end frequency
   * ``BM``: business month end frequency
   * ``MS``: month start frequency
   * ``BMS``: business month start frequency
   * ``Q``: quarter end frequency
   * ``BQ``: business quarter end frequency
   * ``QS``: quarter start frequency
   * ``BQS``: business quarter start frequency
   * ``A``: year end frequency
   * ``BA``: business year end frequency
   * ``AS``: year start frequency
   * ``BAS``: business year start frequency
   * ``H``: hourly frequency
   * ``T``: minutely frequency
   * ``S``: secondly frequency
   * ``L``: milliseconds
   * ``U``: microseconds

.. note::

   You can combine any of the above frequencies with a logical anchor point. For example, you can combine weekly with a three letter abbreviation for which day you want repeated each week (e.g. ``W-TUE`` is every Tuesday). With annual and quarterly frequencies you can give a three letter abbreviation for the month (``BQS-JUN`` is the start of business quarters, starting at June 1st).

I could use any of these frequencies, or any logical combination of them.

.. ipython:: python

   # Business annual data
   pd.date_range(end='1-1-1990', freq='BA', periods=20)

   # Every 5 hours 10 minutes
   pd.date_range(start='April 25th 1989 3:40 PM', periods=100, freq='5h10min')

``DatetimeIndex``
-----------------

Notice that the return value of the ``date_range`` function is an object of type ``DatetimeIndex``. This means that the result can be used as the index for a Series or DataFrame. When you do this, you open up a whole set of time-series specific functionality.
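For instance, once the index is a ``DatetimeIndex`` you can select observations with partially specified date strings. A minimal sketch (the series here is made up purely for illustration):

.. code-block:: python

   import numpy as np
   import pandas as pd
   from pandas import Series

   # A year of daily data with a DatetimeIndex
   dates = pd.date_range('1-1-2012', periods=366, freq='D')
   ts = Series(np.random.randn(366), index=dates)

   # Select all of March 2012 with a single string
   ts['March 2012']

   # Or slice between two dates
   ts['2012-06-15':'2012-07-04']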
One of the most common and most useful tasks is to resample data at one frequency to another frequency.

.. ipython:: python

   # Notice the frequency is B, or business daily
   ts_df

   # Get weekly data
   ts_df.resample('W')

   # Use the how argument
   ts_df.resample('2M', how='first')

   # A different how
   ts_df.resample('8M', how='max')

   # ohlc: open-high-low-close. Only works on Series objects
   ts_df['A'].resample('8M', how='ohlc')

   # Use a lambda function
   ts_df.resample('8M', how=lambda x: np.sqrt(np.abs(x.min() - x.max())))

There is a lot more available to the ``resample`` method, but I refer the reader to the `time series docs `_ for more information.

Other Notable Topics
--------------------

Other topics that you will want to look at when you have time are given below:

- Time series plotting (just use the ``plot`` method on an object with a DatetimeIndex)
- Indexing DatetimeIndex objects (to get all data from December 2012 through the end of May 2013 you could do ``df.ix['December 2012':'May 2013']``)
- Moving window functions
- Up-sampling and down-sampling
- Time zone handling
- Periods vs. time stamps (we have been using time stamps)

Group-By
========

One of the most flexible and powerful constructs in pandas is accessed via the ``groupby`` method of a pandas data structure. The process of ``groupby`` can be summarized in three stages:

1. Split
2. Apply
3. Combine

As the steps indicate, when using ``groupby`` we will be deciding how to split up our data, what to do with the categorized chunks, and then how to combine the results again (really, pandas usually takes care of this last step for us). This topic can feel a bit esoteric, but once understood it should be appealing to anyone who needs to aggregate and act on data (pretty much everyone who ever works with data!). I will not have time to cover all the possible use-cases of ``groupby`` in this material, but the `groupby docs `_ are a good place to look for more examples and inspiration.

.. note::

   The ``tips`` data below originally came from Bryant & Smith (1995) and has the total bill and tip amount for the tips given at a restaurant based on the time of day, day of the week, the size of the party, sex of the payer, and whether or not the payer was a smoker.

We will need some objects to play with:

.. ipython:: python

   df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
   colors = ['red', 'blue', 'red', 'blue', 'red']
   ranks = [1, 2, 1, 1, 2]

   # Some data I loaded in before
   tips

We can use ``groupby`` on our DataFrame and pass it python iterable(s) of the correct size, or we can use the DataFrame's columns themselves as the keys to aggregate on.

.. ipython:: python

   # Group the Series of just the 'data1' column
   df['data1'].groupby(colors)

   # Apply it to the whole frame
   df.groupby(ranks)

   # Group on more than one thing (also works for Series)
   df.groupby([colors, ranks])

   # Group by the columns of df
   by_keys = df.groupby(['key1', 'key2'])
   by_keys

As you have seen, the return value of the ``groupby`` call isn't very useful on its own, but we can do some very cool things with it.

.. ipython:: python

   # Compute the mean of each group
   by_keys.mean()

   # Compute the max of just the 'data2' column
   by_keys['data2'].max()

   # The above was a Series. We can unstack it into a DataFrame
   by_keys['data2'].prod().unstack()
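The GroupBy object is also iterable, which is handy when no built-in aggregation does what you want. A minimal sketch using the ``by_keys`` object from above:

.. code-block:: python

   # Each iteration yields the group key(s) and the corresponding sub-DataFrame
   for (k1, k2), group in by_keys:
       print k1, k2
       print group

   # You can also pull out a single group directly
   by_keys.get_group(('a', 'one'))

   # Or count how many rows fell into each group
   by_keys.size()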
There are a number of highly optimized ``groupby`` methods. I will list some of the most common below.

.. hlist::
   :columns: 4

   - ``count``
   - ``sum``
   - ``mean``
   - ``median``
   - ``std``
   - ``var``
   - ``min``
   - ``max``
   - ``prod``
   - ``first``
   - ``last``

You can also use the ``agg`` method of a GroupBy object to apply more complicated functions, as well as to pass a list of functions to apply at once.

.. ipython:: python

   ppl_grp = tips.groupby(['sex', 'smoker'])
   ppl_pct = ppl_grp['tip_pct']
   ppl_pct.agg('mean')

   # Notice function names become column labels
   ppl_pct.agg(['mean', 'sum', 'std', lambda x: x.max() - x.min()])

   # Aggregate multiple columns with a list of functions
   funcs = ['count', 'mean', 'max']
   ppl_grp['tip_pct', 'total_bill'].agg(funcs)

   # Compute different aggregations for different columns
   mapping = {'tip_pct' : ['min', 'max', 'mean', 'std'],
              'size' : ['sum', 'mean']}
   ppl_grp.agg(mapping)

As with other topics, there is so much more that you can do with ``groupby`` than I can cover here. See the documentation for more information.

Data Collection and Input/Output
================================

pandas is also an excellent tool for getting data in and out of python. This section will illustrate the differences between the pandas i/o functions and those in other python packages.

Using pandas
------------

Depending on your version of pandas, there are between 4 and 7 utility functions that can be used to get data in and out of pandas. Spend about 10 minutes reading through the `data IO `_ documentation and familiarize yourself with the ``read_table`` and ``read_csv`` functions. There is a lot of information there, so don't worry about capturing all of it. The function arguments you should focus on are the following:

- ``filepath_or_buffer``
- ``sep``
- ``dtype``
- ``header``
- ``index_col``
- ``parse_dates``
- ``thousands``
- ``usecols``

To make sure that you have understood the material on the pandas website, you should be able to follow the examples below:

.. ipython:: python

   # Show the contents of the file
   print open('./source/resources/Data/flat.csv').read()

   pd.read_csv('./source/resources/Data/flat.csv')

   # We really didn't want the first row to be a header
   pd.read_csv('./source/resources/Data/flat.csv', header=None)

   # We can give column names
   names = ['a', 'b', 'c', 'd', 'message']
   pd.read_csv('./source/resources/Data/flat.csv', header=None, names=names)

   # We can set a named column as the index
   pd.read_csv('./source/resources/Data/flat.csv', header=None, names=names,
               index_col='message')

Other Tools
-----------

Imagine you wanted to do some basic macroeconomic analysis with common data sets you could obtain from FRED. As you have already seen, there are various functions within ``numpy`` and ``scipy`` for loading data from text or binary files into numpy arrays. Two of the functions you may or may not have met up to this point are:

- ``numpy.genfromtxt``: Flexible input of plain text files, allowing the user to specify many things like the datatype, the file delimiter, rows/columns to skip, whether the data should be unpacked into separate arrays (one column per array), and much more. This is a more powerful version of ``numpy.loadtxt``.
- ``scipy.io.loadmat``: Functional import of MATLAB ``.mat`` data files. The result is put in a python dictionary with keys being the object names from the ``.mat`` file. Various options can be given that control the behavior of the input.

While these functions are capable, they are somewhat low-level compared to the functions in pandas. The exercises below will demonstrate this idea.
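To give a feel for what the lower-level route looks like, here is a minimal sketch of reading a delimited text file with ``numpy.genfromtxt``. The file name and layout are made up for illustration; they are not one of the files used in the exercises:

.. code-block:: python

   import numpy as np

   # Suppose rates.csv has a header row and two columns: a date string and a value
   raw = np.genfromtxt('rates.csv', delimiter=',', skip_header=1,
                       dtype=[('date', 'S10'), ('value', 'f8')])

   dates = raw['date']    # still plain strings -- any date alignment is up to you
   values = raw['value']  # the numeric column as a float array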
.. todo::

   Disclaimer: Spend only 10-15 minutes trying to complete this exercise. I don't actually care if you finish it - I DO however care that you spend at least a little bit of time on it.

   Collect the following data sets from the `FRED website `_ and read them into python using tools from numpy/scipy (note: the items listed below are the actual data set names on FRED). Make sure that you collect the data starting January 1 1985 and going out as far as you can:

   - DGS3MO
   - DGS10
   - M2
   - CPIAUCSL
   - GDPC1

   Use the date information given in the data files to manually align the data. Notice that there are 4 different frequencies for the data sets listed above. To get all of them on the same frequency, compute a period-average. In other words, if you have daily data and you want weekly data, compute the mean of every 7 items and call that the weekly value. (Note: This probably won't be easy because you will end up needing quarterly data, and trying to align data across months of different lengths is difficult.)

   Finally, use your resultant data set and rudimentary econometric techniques to estimate this equation:

   .. math::

      GDPC1 = \beta_0 + \beta_1 (DGS10 - DGS3MO) + \beta_2 M2 + \beta_3 CPIAUCSL

   You should get the following as answers:

   - :math:`\beta_0 = 85.665327`
   - :math:`\beta_1 = -116.884105`
   - :math:`\beta_2 = -0.102965`
   - :math:`\beta_3 = 65.954251`

.. todo::

   Disclaimer: I DO care that you finish this one.

   Repeat the above analysis using pandas. Read the data into python in two ways:

   1. Using ``pandas.read_table`` or ``pandas.read_csv`` to read in the files you used above.
   2. Using the ``pandas.io.data.DataReader`` function (read the docstring).

   Hint: When using ``read_xxx`` you will want to make sure to parse the dates and set which column should be the index (the one with the dates is probably a good idea).

Combining Data Sets
===================

The last main topic I will cover here deals with combining data sets from different sources. We have seen many examples of how the intrinsic link between data and labels has aided computation and aggregation. It is also crucial to data alignment and the merging and joining of data sets.

``concat``
----------

The most general combining function in pandas is the ``pandas.concat`` function. It does most of the heavy lifting for the more specialized data merging routines in pandas. The ``concat`` function will feel quite familiar to those who are used to using the ``numpy.concatenate`` function (or any of its siblings like ``numpy.row_stack``, etc.). I will demonstrate some of this function's capabilities with a few examples (note: this function applies equally well to DataFrame and Series objects):

.. ipython:: python

   s1 = Series([0, 1], index=['a', 'b'])
   s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
   s3 = Series([5, 6], index=['f', 'g'])
   s4 = pd.concat([s1 + 10, s3])

   # Concatenate along rows (default: axis=0)
   pd.concat([s1, s2, s3])

   # Concatenate along columns
   pd.concat([s1, s2, s3], axis=1)

   # Use the intersection of the indexes to avoid NaN's
   pd.concat([s1, s4], axis=1, join='inner')

   # Specify the index labels to be used
   pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])

   # Give names to the concatenated objects (they become a MultiIndex instead of cols)
   pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
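A pattern that comes up constantly in practice is building up a list of pieces in a loop and concatenating them all at once at the end. A minimal sketch with made-up pieces (the year labels and values are placeholders):

.. code-block:: python

   import numpy as np
   import pandas as pd
   from pandas import DataFrame

   pieces = []
   for year in [2010, 2011, 2012]:  # hypothetical labels
       piece = DataFrame({'year': year, 'value': np.arange(3.)})
       pieces.append(piece)

   # One concat at the end is much cheaper than appending row-by-row
   combined = pd.concat(pieces, ignore_index=True)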
Now for some more complicated examples using DataFrame objects.

.. ipython:: python

   df1 = DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
   df2 = DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

   # Notice that the keys become the top-level column index
   pd.concat([df1, df2], axis=1, keys=['l1', 'l2'])

   # Could have passed a dict instead
   pd.concat({'l1': df1, 'l2': df2}, axis=1)

   # Create a hierarchical index with the keys and names args
   pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
             names=['upper', 'lower'])

   # We can tell it to ignore the index of the original objects
   pd.concat([df1, df2], ignore_index=True)

``merge`` and ``join``
----------------------

The ``merge`` and ``join`` functions will feel very familiar to those who have used relational databases like SQL. They combine datasets by linking rows using one or more keys. The best way to understand these operations is by example, so here we go!

Merge examples:

.. ipython:: python

   df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
   df2 = DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
   df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
   df4 = DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
   left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
   right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
   lefth = DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
   righth = DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
   left2 = DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
   right2 = DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

   # merge defaults to combining on overlapping column names
   pd.merge(df1, df2)

   # But it is a good idea to specify the column
   pd.merge(df1, df2, on='key')

   # If column names are different, we can specify separate ones for each frame
   pd.merge(df3, df4, left_on='lkey', right_on='rkey')

   # Keep all (the union) of the row labels by using the 'how' param
   pd.merge(df1, df2, how='outer')

   # Use the index of the right frame for the merge
   pd.merge(left1, right1, left_on='key', right_index=True)

   # Hierarchical indexes work too, but are more complicated
   lefth
   righth
   pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
   pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')

   # Merging on the index
   pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

The DataFrame also has a convenient ``join`` method for merging on the index. This is used when you have objects with similar row labels, but different columns.

.. ipython:: python

   # Should be the same as the last example
   left2.join(right2, how='outer')
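When the two frames share column names that are *not* the merge keys, ``merge`` keeps both copies and appends suffixes so you can tell them apart; you can control these with the ``suffixes`` argument. A minimal sketch with made-up frames (this same idea comes up again in the food-data exercises below):

.. code-block:: python

   import pandas as pd
   from pandas import DataFrame

   left = DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})
   right = DataFrame({'key': ['a', 'b', 'd'], 'value': [10, 20, 40]})

   # Default suffixes are '_x' and '_y'
   pd.merge(left, right, on='key')

   # More descriptive suffixes make downstream code easier to read
   pd.merge(left, right, on='key', suffixes=('_left', '_right'))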
Exercises
=========

Before you jump into the exercises you will need to download a number of data sets. Below I give you the URLs for the data:

- http://www.grouplens.org/system/files/ml-1m.zip: This data set has over 1,000,000 movie ratings for almost 4,000 movies from over 6,000 people.
- http://www.ssa.gov/oact/babynames/names.zip: This data set has a single file for each year from 1880. Each file is broken up into gender, name, and births per name for all babies that were born in that year and issued a social security number.
- http://ashleyw.co.uk/files/foods-2011-10-03.json.zip: This data set is a compilation of data provided by the US Department of Agriculture. The data set contains data for 6,636 unique foods and 94 unique nutrients (with an average of 56.5 nutrients per food).

.. todo::

   Using the movie-lens 1M data and ``pandas.read_table``, read all three files (users, ratings, movies) into pandas DataFrames. I recommend giving the column names directly to ``read_table`` in each case.

   HINT: You will need to open the files in a text editor to see what the separator character is. Also, the columns for each file correspond to the following:

   - users.dat: user id, gender, age, occupation code, zip
   - ratings.dat: user id, movie id, rating, timestamp
   - movies.dat: movie id, title, genre

   Use the data combining tools discussed above to combine these three objects into a single object named ``movie_data``.

.. todo::

   Use the ``movie_data`` object from the previous exercise and compute the following things:

   - The 5 movies with the largest number of ratings
   - A new object called ``active_titles`` that is made up of movies each having at least 250 ratings
   - For the subset of movies in the ``active_titles`` list, compute the following:

     - The 3 movies with the highest average rating for females. Do the same for males.
     - The 10 movies men liked much more than women and the 10 movies women liked more than men (use the difference in average ratings and sort ascending and descending).
     - The 5 movies that had the highest standard deviation in rating.

   HINT: For some of these operations it might be helpful to compute a pivot table using the ``pivot_table`` method of a DataFrame.

   NOTE: You might also want to do some analysis on the movie genre. To do this you would have to be comfortable with using pandas' vectorized string methods and probably some regular expressions. Those topics are a bit bigger than what we can cover here, but if you are looking for more practice, they are a good place to start.

.. todo::

   Using the baby names data set, define a pandas object called ``names_data`` that includes all the years.

   HINT: You should probably do this in a ``for`` loop. Also, to get a list of all the text files in a directory (so you can loop over them), use the following code::

      from glob import glob
      files = glob('*.TXT')

.. todo::

   Using the ``names_data`` object you just created, generate the following items:

   - The most popular male and female name in your birth year
   - The number of people born with your name in the same year (note: if you have a somewhat uncommon name, this may not appear. For example, my wife's name is Shannen and, due to its irregular spelling, it wasn't reported for her birth year)
   - A plot of the number of instances of your name over time
   - A plot of the total number of boy names and the total number of girl names each year
   - A plot showing the fraction of male and female babies given a name similar to Lesley. By similar I mean the name starts with 'lesl' (make sure you make the names lowercase).

.. todo::

   Use the built-in python ``json`` module to load the food data into a python list. Your code should look like this::

      import json
      db = json.load(open('foods-2011-10-03.json'))

   ``db`` will be a list of 6636 python dictionaries, each containing nutritional information for a different food item. Each dictionary will have the following keys:

   - portions
   - description
   - tags
   - nutrients
   - group
   - id
   - manufacturer

   Create a DataFrame of meta_data using the description, group, id, and manufacturer items in each dictionary.
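If the shape of that last step isn't obvious, remember that ``DataFrame`` accepts a list of dicts directly, and the ``columns`` argument selects (and orders) the keys you care about. A minimal sketch with toy records rather than the real food data:

.. code-block:: python

   from pandas import DataFrame

   # Two made-up records with the same general shape as the entries in db
   records = [{'description': 'Cheddar', 'group': 'Dairy', 'id': 1008,
               'nutrients': [{'value': 25.18, 'units': 'g'}]},
              {'description': 'Carrots', 'group': 'Vegetables', 'id': 11124,
               'nutrients': [{'value': 0.93, 'units': 'g'}]}]

   # Keep only the columns we want; keys not listed are dropped,
   # and any missing keys would simply show up as NaN
   meta = DataFrame(records, columns=['description', 'group', 'id'])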
.. todo::

   Loop over ``db`` and construct a list of DataFrames containing the nutritional information for each record in ``db``. Make sure to add a column to each of these DataFrames that contains the unique food id (the ``id`` key in the dictionary).

   HINT: This food id will be a constant.

   Finally, use the pandas combining techniques to create a nutrients DataFrame. After you have done this, drop the duplicate entries in this DataFrame. For example, if you had named the object ``nuts`` you would do

   .. code-block:: python

      nuts = nuts.drop_duplicates()

.. todo::

   Use the ``rename`` method to make sure that the ``description`` and ``group`` columns are unambiguous for both the meta_data DataFrame and the nutrients DataFrame (these column names are duplicated because every food has a description and group, and each nutrient also has those identifiers).

   HINT: You will need to use the ``columns`` and ``copy`` arguments of the ``rename`` method.

   Finally, use the data combining routines to come up with a ``foods`` DataFrame containing all the meta_data and nutritional information. Make sure to do an ``outer`` style merge on the correct columns.

.. todo::

   Using the ``foods`` DataFrame you have been building, compute the following things:

   - The food item with the highest content of each nutrient.
   - A function that accepts a nutrient name and a quantile value and generates a horizontal bar plot of the amount of that nutrient in each food group. Provide a plot title.

   HINT: You will need to use the ``quantile`` and ``sort`` (or ``order``) methods in order for this to work. Also, you can generate the horizontal bar plot using ``df.plot(kind='barh')``.
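The general shape of that last plot does not depend on the food data, so here is a minimal sketch of the pattern with made-up numbers (the column names ``fgroup`` and ``value`` are placeholders, not the ones in ``foods``):

.. code-block:: python

   from pandas import DataFrame
   import matplotlib.pyplot as plt

   # Toy stand-in for the real data: one value per (group, item)
   toy = DataFrame({'fgroup': ['Dairy', 'Dairy', 'Veg', 'Veg', 'Fruit', 'Fruit'],
                    'value': [25.1, 18.0, 0.9, 1.4, 0.3, 0.5]})

   # Median of the value within each group, sorted so the bars read cleanly
   med = toy.groupby('fgroup')['value'].quantile(0.5)

   # .order() is the older Series method; newer pandas spells it .sort_values()
   med.order().plot(kind='barh', title='Median value by group')
   plt.show()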