pandas question

Postby djs051085 » Thu Jul 18, 2013 3:14 pm

Hi All - I'm trying to understand the pandas module. In the sample code in the documentation, df is declared as a DataFrame and filled with random values, 10 rows by 4 columns. Then they show breaking this data frame into pieces, and this is where I'm confused. Is df[:3] the third column of df? Is df[3:7] the element at index x=3, y=7? Is df[7:] the 7th row? Assuming all that is correct, why is the concatenation of those three pieces of df the EXACT SAME as df?!?!


Code: Select all
In [1255]: df = DataFrame(np.random.randn(10, 4))

In [1256]: df
Out[1256]:
          0         1         2         3
0  0.469112 -0.282863 -1.509059 -1.135632
1  1.212112 -0.173215  0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929  1.071804
3  0.721555 -0.706771 -1.039575  0.271860
4 -0.424972  0.567020  0.276232 -1.087401
5 -0.673690  0.113648 -1.478427  0.524988
6  0.404705  0.577046 -1.715002 -1.039268
7 -0.370647 -1.157892 -1.344312  0.844885
8  1.075770 -0.109050  1.643563 -1.469388
9  0.357021 -0.674600 -1.776904 -0.968914

# break it into pieces
In [1257]: pieces = [df[:3], df[3:7], df[7:]]

In [1258]: concatenated = concat(pieces)

In [1259]: concatenated
Out[1259]:
          0         1         2         3
0  0.469112 -0.282863 -1.509059 -1.135632
1  1.212112 -0.173215  0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929  1.071804
3  0.721555 -0.706771 -1.039575  0.271860
4 -0.424972  0.567020  0.276232 -1.087401
5 -0.673690  0.113648 -1.478427  0.524988
6  0.404705  0.577046 -1.715002 -1.039268
7 -0.370647 -1.157892 -1.344312  0.844885
8  1.075770 -0.109050  1.643563 -1.469388
9  0.357021 -0.674600 -1.776904 -0.968914

Re: pandas question

Postby tnknepp » Thu Jul 18, 2013 9:10 pm

Pandas approaches indexing quite differently from what you have done in numpy/scipy. Instead of addressing values only by their position, it addresses them by labels (the index and the column names), which can be quite beneficial. For example:

In numpy you have a plain array:

Code: Select all
import numpy as np

a = np.arange(10).reshape(2, 5)
# a is
# [[0 1 2 3 4]
#  [5 6 7 8 9]]

# You reference everything in <a> according to its matrix position
>>> a[0, 0]
0
>>> a[1, 3]
8


Pandas instead labels every value by its "index" value (the row label) and its column name. So, if you have a data frame such as:

Code: Select all
# Per the user manual:
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

>>>df
                   A         B        C         D
2013-01-01 -0.927670 -1.589070 -1.862371  1.593280
2013-01-02 -1.343014  0.564932  0.196953 -0.856496
2013-01-03  0.365228  0.515918  0.221643  0.612525
2013-01-04 -0.969555  0.018029  0.144415  0.566328
2013-01-05  1.026814 -0.401009  0.038801  0.528235
2013-01-06 -1.536789  0.618596 -0.462335  0.495144

>>>df.A
2013-01-01   -0.927670
2013-01-02   -1.343014
2013-01-03    0.365228
2013-01-04   -0.969555
2013-01-05    1.026814
2013-01-06   -1.536789

# A plain slice is positional: rows 2, 3 and 4 of column B
>>> df.B[2:5]
2013-01-03    0.515918
2013-01-04    0.018029
2013-01-05   -0.401009

>>> df.index[0]
<Timestamp: 2013-01-01 00:00:00>


When you type "df" into your console, the returned table is not a matrix or array. As a matter of fact, the leftmost column (the index) and the top row (the column names) aren't really part of the data at all; they are labels that identify the rows and columns within the data frame. This becomes nice when, e.g., you begin combining data sets.
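To make the split between labels and data concrete, here is a minimal sketch. It assumes a reasonably recent pandas, and it fills the frame with np.arange instead of random numbers (my choice, not the manual's) so the values are predictable:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic values 0..23 instead of random ones (assumption for illustration)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# The labels live in two separate objects...
row_labels = df.index      # the six dates
col_labels = df.columns    # 'A', 'B', 'C', 'D'

# ...and the data itself is a plain 6x4 block of values.
values = df.values
print(values.shape)        # (6, 4)

# .loc looks a value up by label, .iloc looks it up by position.
by_label = df.loc[pd.Timestamp("2013-01-03"), "B"]
by_position = df.iloc[2, 1]
print(by_label == by_position)   # True
```

Both lookups land on the same cell: the third row, second column.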

If we redefine <dates> from above:
Code: Select all
dates = pd.date_range('20130101 12:00',periods=24,freq='H')
df2 = pd.DataFrame(np.random.randn(24,4),index=dates,columns=list('ABCD'))

df2
                            A         B         C         D
2013-01-01 12:00:00  0.841761  1.119553 -0.023832  0.340919
2013-01-01 13:00:00  0.114046  1.528263  1.245665 -1.006409
2013-01-01 14:00:00  0.676895 -0.086183  0.288721  0.830452
2013-01-01 15:00:00  2.102636  0.689814  0.651200 -1.230939
2013-01-01 16:00:00 -0.918325  0.099830  0.565656  0.223494
... Repeating hourly, 24 times

# Now add the two frames; pandas aligns them on the index and column labels
df3 = df + df2
>>> df3
                           A         B         C         D
...
2013-01-01 23:00:00      NaN       NaN       NaN       NaN
2013-01-02 00:00:00 -1.01193  1.107888  0.351392 -1.729168
2013-01-02 01:00:00      NaN       NaN       NaN       NaN
...


In df3 we see that df and df2 share only one time stamp (i.e. index value), 2013-01-02 00:00:00. Every other row is NaN because it has a value in only one of the two frames. This alignment happened automatically and very quickly, without any fancy upfront work by the user to "time match" the data (something I have to do a lot of).
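The same alignment can be seen on a much smaller pair of frames. This is only a sketch with made-up values; the names daily and hourly are mine, and it assumes a reasonably recent pandas:

```python
import pandas as pd

# Two daily readings (made-up numbers)
daily = pd.DataFrame(
    {"A": [1.0, 2.0]},
    index=pd.to_datetime(["2013-01-01", "2013-01-02"]),
)

# Four hourly readings that straddle midnight (made-up numbers)
hourly = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0, 40.0]},
    index=pd.to_datetime([
        "2013-01-01 22:00",
        "2013-01-01 23:00",
        "2013-01-02 00:00",
        "2013-01-02 01:00",
    ]),
)

# Addition matches rows by time stamp, not by position.
total = daily + hourly

# Only 2013-01-02 00:00 exists in both frames, so only that row has a value.
matched = total.dropna()
print(matched)

# If a missing partner should count as zero instead of producing NaN:
total_filled = daily.add(hourly, fill_value=0)
print(total_filled)
```

The result has five rows (the union of both indexes), and the single matched row holds 2.0 + 30.0.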

The data frame really isn't that difficult to wrap your head around. Once you do, you will see definite benefits to making the change from traditional matrix analysis (though you likely will not be able to totally break from arrays, etc.). Check out the pandas tutorial and video on the homepage.
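Coming back to the original question: a plain slice such as df[:3] selects rows, not a column or a single element. So the three pieces are rows 0-2, rows 3-6 and rows 7-9, and gluing them back together gives the original frame. A minimal sketch, assuming a reasonably recent pandas (the variable names first, middle and last are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))

first = df[:3]     # rows 0, 1, 2
middle = df[3:7]   # rows 3, 4, 5, 6
last = df[7:]      # rows 7, 8, 9

# Stacking the row blocks in order rebuilds the whole frame.
concatenated = pd.concat([first, middle, last])
print(concatenated.equals(df))   # True

# What the question was actually looking for:
third_column = df[2]       # the column labelled 2 (a Series of 10 values)
element = df.iloc[7, 3]    # the single value at row position 7, column position 3
seventh_row = df.iloc[7]   # the row at position 7 (a Series of 4 values)
```

Note that df[2] works here only because the columns happen to be labelled 0 to 3; with named columns you would write df['A'] or use df.iloc[:, 2] for a purely positional lookup.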
Python: 2.7 via Anaconda
Numpy: 1.7
Pandas: 0.11
OS: Windows 7
IDE: Spyder/IPython