Chapter 11 pandas

11.1 Module Imports

import pandas as pd

## Other Libraries
import numpy as np
import datetime as dt
from datetime import datetime
from datetime import date

11.2 Pandas Objects

11.2.1 Types of Objects

  • pandas.Timestamp
  • pandas.Timedelta
  • pandas.Period
  • pandas.Interval
  • pandas.DatetimeIndex
  • pandas.DataFrame
  • pandas.Series

11.2.2 Series and DataFrame

Type       Dimension  Size       Values   Constructor
Series     1          Immutable  Mutable  pandas.Series( data, index, dtype, name, copy )
DataFrame  2          Mutable    Mutable  pandas.DataFrame( data, index, columns, dtype, copy )

(The 3-dimensional Panel has been removed from pandas; use a DataFrame with a MultiIndex instead.)

data can be ndarray, list, dict or scalar constants
index must be the same length as data; labels can be integers or strings. If none is given, a default RangeIndex (0,1,2,…) is created
dtype if None, it will be inferred from the data
copy copy the input data. Default False
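
A minimal sketch of both constructors using the parameters above (the names s and df here are illustrative):

s  = pd.Series([1, 2, 3], index=['a','b','c'], dtype='int64', name='demo')
df = pd.DataFrame([[1, 2], [3, 4]], index=['r1','r2'], columns=['c1','c2'])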

11.3 Class Methods

11.3.1 Conversion

11.3.1.1 Convert To Timestamp

Use to_datetime() to create various Timestamp-related objects:

  • Convert a list of dates to a DatetimeIndex
  • Convert a list of dates to a Series of Timestamps
  • Convert a single date into a Timestamp
  • The source can be a string, date or datetime object

Create Single Timestamp object

pd.to_datetime('2011-01-03')
pd.to_datetime(date(2018,4,13))
pd.to_datetime(datetime(2018,3,1,7,30))
## Timestamp('2011-01-03 00:00:00')
## Timestamp('2018-04-13 00:00:00')
## Timestamp('2018-03-01 07:30:00')

11.3.1.2 Convert To DatetimeIndex

data = ['2011-01-03',             # from string
         date(2018,4,13),         # from date
         datetime(2018,3,1,7,30)] # from datetime
dti = pd.to_datetime(data)  ## Convert into DatetimeIndex

type(dti)      ## DatetimeIndex
dti.dtype      ## <M8[ns]
type(dti[1])   ## Elements are Timestamp objects
## <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
## dtype('<M8[ns]')
## <class 'pandas._libs.tslibs.timestamps.Timestamp'>

11.3.2 Creating DateTimeIndex

Use date_range() to automatically generate a range of Timestamps based on start/end/frequency. The result is a DatetimeIndex object containing the range of Timestamps.

11.3.2.1 Hourly

If the start time is not specified, it defaults to 00:00:00.
If the start time is specified, it is honored on all subsequent Timestamp elements.
If both start and end are specified, the sequence automatically distributes Timestamps according to the frequency.

print(
  pd.date_range('2018-01-01', periods=3, freq='H'),
  pd.date_range(datetime(2018,1,1,12,30), periods=3, freq='H'),
  pd.date_range(start='2018-01-03-1230', end='2018-01-03-18:30', freq='H'))
## DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00'], dtype='datetime64[ns]', freq='H') DatetimeIndex(['2018-01-01 12:30:00', '2018-01-01 13:30:00', '2018-01-01 14:30:00'], dtype='datetime64[ns]', freq='H') DatetimeIndex(['2018-01-03 12:30:00', '2018-01-03 13:30:00', '2018-01-03 14:30:00',
##                '2018-01-03 15:30:00', '2018-01-03 16:30:00', '2018-01-03 17:30:00',
##                '2018-01-03 18:30:00'],
##               dtype='datetime64[ns]', freq='H')

11.3.2.2 Daily

When the frequency is Day and time is not specified, output is date distributed.
When time is specified, output will honor the time.

pd.date_range(date(2018,1,2), periods=3, freq='D')
pd.date_range('2018-01-01-1230', periods=4, freq='D')
## DatetimeIndex(['2018-01-02', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='D')
## DatetimeIndex(['2018-01-01 12:30:00', '2018-01-02 12:30:00', '2018-01-03 12:30:00',
##                '2018-01-04 12:30:00'],
##               dtype='datetime64[ns]', freq='D')

11.3.2.3 First Day Of Month

Use freq='MS'; M stands for month, S stands for Start. If a day is specified, the sequence starts from the first day of the following month.

pd.date_range('2018-01', periods=4, freq='MS')
pd.date_range('2018-01-09', periods=4, freq='MS')
pd.date_range('2018-01-09 12:30:00', periods=4, freq='MS')
## DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'], dtype='datetime64[ns]', freq='MS')
## DatetimeIndex(['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01'], dtype='datetime64[ns]', freq='MS')
## DatetimeIndex(['2018-02-01 12:30:00', '2018-03-01 12:30:00', '2018-04-01 12:30:00',
##                '2018-05-01 12:30:00'],
##               dtype='datetime64[ns]', freq='MS')

11.3.2.4 Last Day of Month

Sequence always starts from the end of the specified month.

pd.date_range('2018-01', periods=4, freq='M')
pd.date_range('2018-01-09', periods=4, freq='M')
pd.date_range('2018-01-09 12:30:00', periods=4, freq='M')
## DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='M')
## DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='M')
## DatetimeIndex(['2018-01-31 12:30:00', '2018-02-28 12:30:00', '2018-03-31 12:30:00',
##                '2018-04-30 12:30:00'],
##               dtype='datetime64[ns]', freq='M')

11.3.3 Frequency Table (crosstab)

crosstab() returns a DataFrame object

crosstab( index = <SeriesObj>, columns = <new_colName> )                # one dimension table
crosstab( index = <SeriesObj>, columns = <SeriesObj> )                  # two dimension table
crosstab( index = <SeriesObj>, columns = [<SeriesObj1>, <SeriesObj2>] ) # multi dimension table   
crosstab( index = <SeriesObj>, columns = <SeriesObj>, margins=True )    # add column and row margins

Sample Data

n = 200
comp = ['C' + i for i in np.random.randint( 1,4, size  = n).astype(str)] # 3x Company
dept = ['D' + i for i in np.random.randint( 1,6, size  = n).astype(str)] # 5x Department
grp =  ['G' + i for i in np.random.randint( 1,3, size  = n).astype(str)] # 2x Groups
value1 = np.random.normal( loc=50 , scale=5 , size = n)
value2 = np.random.normal( loc=20 , scale=3 , size = n)
value3 = np.random.normal( loc=5 , scale=30 , size = n)

mydf = pd.DataFrame({
    'comp':comp, 
    'dept':dept, 
    'grp': grp,
    'value1':value1, 
    'value2':value2,
    'value3':value3 })
mydf.head()
##   comp dept grp     value1     value2     value3
## 0   C1   D4  G2  45.422995  22.467314 -31.945606
## 1   C3   D5  G1  47.139210  20.328063  55.326412
## 2   C2   D3  G2  43.643161  25.207357  12.574508
## 3   C3   D3  G2  50.145878  20.179638  -0.896219
## 4   C3   D4  G2  58.134735  21.226297 -22.531270

One Dimension Frequency Table

pd.crosstab(index=mydf.comp, columns='counter')
## col_0  counter
## comp          
## C1          82
## C2          61
## C3          57

Two Dimension Frequency Table

Row dimension uses comp variable, Column dimension uses dept variable.

pd.crosstab(index=mydf.comp, columns=mydf.dept)
## dept  D1  D2  D3  D4  D5
## comp                    
## C1    16  20  16  18  12
## C2    16  11  13  12   9
## C3    15  10   7  17   8

Higher Dimension Table

The crosstab header is a multi-level index when more than one column is specified. The multi-level index consists of all combinations of the dept and grp variables.

tb = pd.crosstab(index=mydf.comp, columns=[mydf.dept, mydf.grp])
tb.columns   ## observe two levels of headers generated
tb           ## two levels of headers
## MultiIndex([('D1', 'G1'),
##             ('D1', 'G2'),
##             ('D2', 'G1'),
##             ('D2', 'G2'),
##             ('D3', 'G1'),
##             ('D3', 'G2'),
##             ('D4', 'G1'),
##             ('D4', 'G2'),
##             ('D5', 'G1'),
##             ('D5', 'G2')],
##            names=['dept', 'grp'])
## dept  D1    D2     D3    D4     D5   
## grp   G1 G2 G1  G2 G1 G2 G1  G2 G1 G2
## comp                                 
## C1     7  9  9  11  9  7  9   9  7  5
## C2    12  4  4   7  8  5  6   6  5  4
## C3     8  7  3   7  4  3  7  10  6  2

Select sub-dataframe using multi-level referencing.

print( 'Under D2:\n', tb['D2'], '\n\n',
       'Under D2-G1:\n',tb['D2','G1'])
## Under D2:
##  grp   G1  G2
## comp        
## C1     9  11
## C2     4   7
## C3     3   7 
## 
##  Under D2-G1:
##  comp
## C1    9
## C2    4
## C3    3
## Name: (D2, G1), dtype: int64

Getting Margin

Extend the crosstab with ‘margins=True’ to get the sum of rows/columns, presented in a new column/row named ‘All’.

tb = pd.crosstab(index=mydf.dept, columns=mydf.grp, margins=True)
tb
## grp    G1  G2  All
## dept              
## D1     27  20   47
## D2     16  25   41
## D3     21  15   36
## D4     22  25   47
## D5     18  11   29
## All   104  96  200
print(
  'Row Sums:     \n', tb.loc[:,'All'],
  '\n\nColumn Sums:\n', tb.loc['All'])
## Row Sums:     
##  dept
## D1      47
## D2      41
## D3      36
## D4      47
## D5      29
## All    200
## Name: All, dtype: int64 
## 
## Column Sums:
##  grp
## G1     104
## G2      96
## All    200
## Name: All, dtype: int64

Getting Proportion

Use a matrix operation to divide each row by its respective column sum.

tb/tb.loc['All']
## grp         G1        G2    All
## dept                           
## D1    0.259615  0.208333  0.235
## D2    0.153846  0.260417  0.205
## D3    0.201923  0.156250  0.180
## D4    0.211538  0.260417  0.235
## D5    0.173077  0.114583  0.145
## All   1.000000  1.000000  1.000
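
Similarly, dividing row-wise with div() along axis=0 gives row proportions; a minimal sketch:

tb.div(tb['All'], axis=0)  ## each cell divided by its row sum; the 'All' column becomes 1.0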

11.3.4 Concatenation

Sample Data

s1 = pd.Series(['A1','A2','A3','A4'])
s2 = pd.Series(['B1','B2','B3','B4'], name='B')
s3 = pd.Series(['C1','C2','C3','C4'], name='C')

Column-Wise

Combining Multiple Series Into A New DataFrame

  • The Series name becomes the column name in the DataFrame.
  • If a Series has no name, the default column name is its position: 0,1,2,3
  • axis=1 means column-wise
pd.concat([s1,s2,s3, None], axis=1)  ## observe that None is ignored
##     0   B   C
## 0  A1  B1  C1
## 1  A2  B2  C2
## 2  A3  B3  C3
## 3  A4  B4  C4

Add Multiple Series Into An Existing DataFrame

  • No change to the original DataFrame column names
  • Added Series keep their names as column names; unnamed Series get 0,1,2,… as column names
df = pd.DataFrame({ 'A': s1, 'B': s2})
df
##     A   B
## 0  A1  B1
## 1  A2  B2
## 2  A3  B3
## 3  A4  B4
pd.concat([df,s3,s2,s1, None],axis=1)
##     A   B   C   B   0
## 0  A1  B1  C1  B1  A1
## 1  A2  B2  C2  B2  A2
## 2  A3  B3  C3  B3  A3
## 3  A4  B4  C4  B4  A4

Row-Wise
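
With axis=0 (the default), concat stacks the Series vertically. A minimal sketch using the sample Series above; ignore_index=True renumbers the combined index (without it, the original indexes 0–3 would repeat):

pd.concat([s1, s2], axis=0, ignore_index=True)
## 0    A1
## 1    A2
## 2    A3
## 3    A4
## 4    B1
## 5    B2
## 6    B3
## 7    B4
## dtype: object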

11.3.5 External Data

html_table Parser

This method requires the html5lib library.
- Reads the web page and creates a list containing one or more DataFrames, one per HTML table found
- Scrapes all detectable HTML tables
- Auto-detects column headers
- Auto-creates an integer index starting from 0

read_html(url)  # returns a list of DataFrame(s) mapping to the web table(s) structure
df_list = pd.read_html('https://www.malaysiastock.biz/Listed-Companies.aspx?type=S&s1=18')  ## read all tables
df = df_list[6]  ## get the specific table

print ('Total Table(s) Found : ', len(df_list), '\n',
       'First Table Found:      ',df)
## Total Table(s) Found :  14 
##  First Table Found:                                                          0  \
## 0  0 - 9  A  B  C  D  E  F  G  H  I  J  K  L  M  ...   
## 
##                                                    1  
## 0  Bursa Malaysia Market Watch  Top Articles  1. ...

CSV Writing

Syntax

DataFrame.to_csv(
  path_or_buf=None,   ## if not provided, result is returned as string
  sep=',', 
  na_rep='', 
  float_format=None, 
  columns=None,       ## list of columns name to write, if not provided, all columns are written
  header=True,        ## write out column names
  index=True,         ## write row label
  index_label=None, 
  mode='w', 
  encoding=None,      ## if not provided, default to 'utf-8'
  quoting=None, quotechar='"', 
  line_terminator=None, 
  chunksize=None, 
  date_format=None, 
  doublequote=True, 
  escapechar=None, 
  decimal='.')

The example below shows column values containing different special characters. Note that pandas handles these well by default.

mydf = pd.DataFrame({'Id':[10,20,30,40], 
                     'Name':  ['Aaa','Bbb','Ccc','Ddd'],
                     'Funny': ["world's most \clever", 
                     "Bloody, damn, good", 
                     "many\nmany\nline", 
                     'Quoting "is" tough']})
mydf.set_index('Id', inplace=True)
mydf.to_csv('data/csv_test.csv', index=True)
mydf
##    Name                 Funny
## Id                           
## 10  Aaa  world's most \clever
## 20  Bbb    Bloody, damn, good
## 30  Ccc      many\nmany\nline
## 40  Ddd    Quoting "is" tough

This is the file saved. Reading it back as raw text shows how commas, newlines and quotes were escaped:

print(open('data/csv_test.csv').read())
## Id,Name,Funny
## 10,Aaa,world's most \clever
## 20,Bbb,"Bloody, damn, good"
## 30,Ccc,"many
## many
## line"
## 40,Ddd,"Quoting ""is"" tough"

All content retained when reading back by Pandas

pd.read_csv('data/csv_test.csv', index_col='Id')
##    Name                 Funny
## Id                           
## 10  Aaa  world's most \clever
## 20  Bbb    Bloody, damn, good
## 30  Ccc      many\nmany\nline
## 40  Ddd    Quoting "is" tough

11.3.5.1 CSV Reading

Syntax

pandas.read_csv( 
    'url or filePath',                     # path to file or url 
    encoding    = 'utf_8',                 # optional: default is 'utf_8'
    index_col   = ['colName1', ...],       # optional: specify one or more index columns
    parse_dates = ['dateCol1', ...],       # optional: string columns to parse as dates (default False)
    na_values   = ['.','na','NA','N/A'],   # optional: values that are considered NA
    names       = ['newColName1', ... ],   # optional: overwrite column names
    thousands   = '.',                     # optional: thousands separator symbol
    nrows       = n,                       # optional: load only the first n rows
    skiprows    = n,                       # optional: skip the first n rows
    infer_datetime_format = False          # optional: infer the date format to speed up parsing
)

Refer to the full list of codecs in the Python codecs documentation.

Default Import

  • index defaults to a sequence of integers 0,1,2,…
  • only two data types are detected: numbers (float64/int64) and strings (object)
  • dates are not parsed, hence stay as strings
goo = pd.read_csv('data/goog.csv', encoding='utf_8')
print(goo.head(), '\n\n',
      goo.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 61 entries, 0 to 60
## Data columns (total 6 columns):
##  #   Column  Non-Null Count  Dtype  
## ---  ------  --------------  -----  
##  0   Date    61 non-null     object 
##  1   Open    61 non-null     float64
##  2   High    61 non-null     float64
##  3   Low     61 non-null     float64
##  4   Close   61 non-null     float64
##  5   Volume  61 non-null     int64  
## dtypes: float64(4), int64(1), object(1)
## memory usage: 3.0+ KB
##          Date        Open        High         Low       Close   Volume
## 0  12/19/2016  790.219971  797.659973  786.270020  794.200012  1225900
## 1  12/20/2016  796.760010  798.650024  793.270020  796.419983   925100
## 2  12/21/2016  795.840027  796.676025  787.099976  794.559998  1208700
## 3  12/22/2016  792.359985  793.320007  788.580017  791.260010   969100
## 4  12/23/2016  790.900024  792.739990  787.280029  789.909973   623400 
## 
##  None

Specify Data Types

  • To customize the data types, use the dtype parameter with a dict of column-to-type definitions.
d_types = {'Volume': str}
pd.read_csv('data/goog.csv', dtype=d_types).info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 61 entries, 0 to 60
## Data columns (total 6 columns):
##  #   Column  Non-Null Count  Dtype  
## ---  ------  --------------  -----  
##  0   Date    61 non-null     object 
##  1   Open    61 non-null     float64
##  2   High    61 non-null     float64
##  3   Low     61 non-null     float64
##  4   Close   61 non-null     float64
##  5   Volume  61 non-null     object 
## dtypes: float64(4), object(2)
## memory usage: 3.0+ KB

Parse Datetime

You can specify multiple date-like columns for parsing

pd.read_csv('data/goog.csv', parse_dates=['Date']).info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 61 entries, 0 to 60
## Data columns (total 6 columns):
##  #   Column  Non-Null Count  Dtype         
## ---  ------  --------------  -----         
##  0   Date    61 non-null     datetime64[ns]
##  1   Open    61 non-null     float64       
##  2   High    61 non-null     float64       
##  3   Low     61 non-null     float64       
##  4   Close   61 non-null     float64       
##  5   Volume  61 non-null     int64         
## dtypes: datetime64[ns](1), float64(4), int64(1)
## memory usage: 3.0 KB

Parse Datetime, Then Set as Index
- Specify the names of date columns in parse_dates=
- When a date is set as index, its type is DatetimeIndex

goo3 = pd.read_csv('data/goog.csv',index_col='Date', parse_dates=['Date'])
goo3.info()
## <class 'pandas.core.frame.DataFrame'>
## DatetimeIndex: 61 entries, 2016-12-19 to 2017-03-17
## Data columns (total 5 columns):
##  #   Column  Non-Null Count  Dtype  
## ---  ------  --------------  -----  
##  0   Open    61 non-null     float64
##  1   High    61 non-null     float64
##  2   Low     61 non-null     float64
##  3   Close   61 non-null     float64
##  4   Volume  61 non-null     int64  
## dtypes: float64(4), int64(1)
## memory usage: 2.9 KB

11.3.6 Inspection

Structure info

info() prints information to the screen. It doesn’t return any object.

dataframe.info()  # display columns and the number of non-null rows per column
goo.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 61 entries, 0 to 60
## Data columns (total 6 columns):
##  #   Column  Non-Null Count  Dtype  
## ---  ------  --------------  -----  
##  0   Date    61 non-null     object 
##  1   Open    61 non-null     float64
##  2   High    61 non-null     float64
##  3   Low     61 non-null     float64
##  4   Close   61 non-null     float64
##  5   Volume  61 non-null     int64  
## dtypes: float64(4), int64(1), object(1)
## memory usage: 3.0+ KB

11.4 Class: Timestamp

This is an enhanced version of the standard library datetime.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp

11.4.1 Constructor

From Number

type(pd.Timestamp(2017,12,11))
## <class 'pandas._libs.tslibs.timestamps.Timestamp'>
pd.Timestamp(year=2017, month=1, day=1)  # date-like numbers
pd.Timestamp(2017,1,1)                   # date-like numbers
pd.Timestamp(2017,12,11,5,45)            # datetime-like numbers
pd.Timestamp(2017,12,11,5,45,55,999)     # + microseconds
pd.Timestamp(2017,12,11,5,45,55,999,8)   # + nanoseconds
## Timestamp('2017-01-01 00:00:00')
## Timestamp('2017-01-01 00:00:00')
## Timestamp('2017-12-11 05:45:00')
## Timestamp('2017-12-11 05:45:55.000999')
## Timestamp('2017-12-11 05:45:55.000999008')

From String

Observe that pandas supports many string input formats

Year Month Day

pd.Timestamp('2017-12-11')   # date-like string: year-month-day
pd.Timestamp('2017 12 11')   # date-like string: year-month-day
pd.Timestamp('2017 Dec 11')  # date-like string: year-month-day
pd.Timestamp('Dec 11, 2017') # date-like string: year-month-day
## Timestamp('2017-12-11 00:00:00')
## Timestamp('2017-12-11 00:00:00')
## Timestamp('2017-12-11 00:00:00')
## Timestamp('2017-12-11 00:00:00')

YMD Hour Minute Second Ms

pd.Timestamp('2017-12-11 0545')     ## hour minute
pd.Timestamp('2017-12-11-05:45')
pd.Timestamp('2017-12-11T0545')
pd.Timestamp('2017-12-11 054533')   ## hour minute seconds
pd.Timestamp('2017-12-11 05:45:33')
## Timestamp('2017-12-11 05:45:00')
## Timestamp('2017-12-11 05:45:00')
## Timestamp('2017-12-11 05:45:00')
## Timestamp('2017-12-11 05:45:33')
## Timestamp('2017-12-11 05:45:33')

Timezone can be included in various ways.

pd.Timestamp('2017-01-01T0545Z')     # GMT 
pd.Timestamp('2017-01-01T0545+9')    # GMT+9
pd.Timestamp('2017-01-01T0545+0800') # GMT+0800
pd.Timestamp('2017-01-01 0545', tz='Asia/Singapore')
## Timestamp('2017-01-01 05:45:00+0000', tz='UTC')
## Timestamp('2017-01-01 05:45:00+0900', tz='pytz.FixedOffset(540)')
## Timestamp('2017-01-01 05:45:00+0800', tz='pytz.FixedOffset(480)')
## Timestamp('2017-01-01 05:45:00+0800', tz='Asia/Singapore')

From datetime / date

pd.Timestamp(date(2017,3,5))          # from date
pd.Timestamp(datetime(2017,3,5,4,30)) # from datetime
pd.Timestamp(datetime(2017,3,5,4,30), tz='Asia/Kuala_Lumpur') # from datetime, + tz
## Timestamp('2017-03-05 00:00:00')
## Timestamp('2017-03-05 04:30:00')
## Timestamp('2017-03-05 04:30:00+0800', tz='Asia/Kuala_Lumpur')

11.4.2 Attributes

We can tell many things about a Timestamp object. Note that timezone (tz) is a pytz object.

ts = pd.Timestamp('2017-01-01T054533+0800') # GMT+0800
ts.month
ts.day
ts.year
ts.hour
ts.minute
ts.second
ts.microsecond
ts.nanosecond
ts.dayofweek
ts.dayofyear
ts.daysinmonth
ts.is_leap_year
ts.is_month_end
ts.is_month_start
ts.dayofweek
ts.tz                 ## pytz object
type(ts.tz)
## 1
## 1
## 2017
## 5
## 45
## 33
## 0
## 0
## 6
## 1
## 31
## False
## False
## True
## 6
## pytz.FixedOffset(480)
## <class 'pytz._FixedOffset'>

11.4.3 Timezones

Adding Timezones and Clock Shifting

  • Use tz_localize to make a timestamp timezone-aware. It does not shift the clock.
  • Once a timestamp has a timezone, you can shift the clock to another timezone using tz_convert().
ts  = pd.Timestamp(2017,1,10,10,34)  ## No timezone
ts
ts1 = ts.tz_localize('Asia/Kuala_Lumpur')  ## Add timezone, retain time
ts1
ts1.tz_convert('UTC')                ## Convert to diff timezone
## Timestamp('2017-01-10 10:34:00')
## Timestamp('2017-01-10 10:34:00+0800', tz='Asia/Kuala_Lumpur')
## Timestamp('2017-01-10 02:34:00+0000', tz='UTC')

Removing Timezone

Pass None to tz_localize() to remove the TZ info.

ts = pd.Timestamp(2017,1,10,10,34).tz_localize('Asia/Kuala_Lumpur')
ts
ts.tz_localize(None)   ## no longer TZ aware
## Timestamp('2017-01-10 10:34:00+0800', tz='Asia/Kuala_Lumpur')
## Timestamp('2017-01-10 10:34:00')

11.4.4 Formatting

strftime

Use strftime() to customize the string format. For the complete list of directives, see the Python strftime documentation.

ts = pd.Timestamp(2017,1,10,15,34,10).tz_localize('Asia/Kuala_Lumpur')
ts
ts.strftime("%Y-%m-%d %b %I:%M:%S %p")
## Timestamp('2017-01-10 15:34:10+0800', tz='Asia/Kuala_Lumpur')
## '2017-01-10 Jan 03:34:10 PM'

isoformat

Use isoformat() to produce an ISO 8601 string; the offset is included only when the Timestamp is timezone-aware.

ts = pd.Timestamp(2017,1,10,15,34,10)
ts1 = ts.tz_localize('Asia/Kuala_Lumpur') 

ts.isoformat()   ## without TZ
ts1.isoformat()  ## with Tz
## '2017-01-10T15:34:10'
## '2017-01-10T15:34:10+08:00'

11.4.5 Conversion

Convert To datetime

Use to_pydatetime() to convert into the standard library datetime.datetime. From the datetime object, apply date() to get a datetime.date.

ts = pd.Timestamp(2017,1,10,7,30,52).tz_localize('Asia/Kuala_Lumpur') 
ts                          ## Timestamp, with TZ
ts.to_pydatetime()          ## datetime, with TZ
ts.to_pydatetime().date()   ## datetime.date (no TZ)
ts.date()                   ## same as above (no TZ)
## Timestamp('2017-01-10 07:30:52+0800', tz='Asia/Kuala_Lumpur')
## datetime.datetime(2017, 1, 10, 7, 30, 52, tzinfo=<DstTzInfo 'Asia/Kuala_Lumpur' +08+8:00:00 STD>)
## datetime.date(2017, 1, 10)
## datetime.date(2017, 1, 10)

Convert To numpy.datetime64

Use to_datetime64() to convert into numpy.datetime64

ts = pd.Timestamp(2017,1,10,7,30,52)
ts.to_datetime64()
## numpy.datetime64('2017-01-10T07:30:52.000000000')

11.4.6 Updating

replace()

ts.replace(year=2000, month=1,day=1)
## Timestamp('2000-01-01 07:30:52')

11.4.7 Other Utilities

Attribute-Like Methods

ts = pd.Timestamp('2017-01-01T054533+0800') # GMT+0800
ts.day_name()
ts.month_name()
ts.date()   ## convert to datetime.date object
ts.time()   ## convert to datetime.time object
## 'Sunday'
## 'January'
## datetime.date(2017, 1, 1)
## datetime.time(5, 45, 33)

Ceiling - ceil()

ts
ts.ceil(freq='D')  # ceiling to day
## Timestamp('2017-01-01 05:45:33+0800', tz='pytz.FixedOffset(480)')
## Timestamp('2017-01-02 00:00:00+0800', tz='pytz.FixedOffset(480)')
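
floor() is the counterpart, rounding down to the start of the period; a minimal sketch:

ts.floor(freq='D')  # floor to day
## Timestamp('2017-01-01 00:00:00+0800', tz='pytz.FixedOffset(480)')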

11.5 Class: DatetimeIndex

11.5.1 Creating

Refer to Pandas class method above.

11.5.2 Conversion

Convert To datetime.datetime
Use to_pydatetime() to convert into an array of standard library datetime.datetime objects (note below that the result is a numpy.ndarray)

print('Converted to List:', dti.to_pydatetime(), '\n\n',
      'Converted Type:',    type(dti.to_pydatetime()))
## Converted to List: [datetime.datetime(2011, 1, 3, 0, 0) datetime.datetime(2018, 4, 13, 0, 0)
##  datetime.datetime(2018, 3, 1, 7, 30)] 
## 
##  Converted Type: <class 'numpy.ndarray'>

11.5.3 Structure Conversion

Convert To Series: to_series()
This creates a Series where the index and the data have the same values

#dti = pd.date_range('2018-02', periods=4, freq='M')
dti.to_series()
## 2011-01-03 00:00:00   2011-01-03 00:00:00
## 2018-04-13 00:00:00   2018-04-13 00:00:00
## 2018-03-01 07:30:00   2018-03-01 07:30:00
## dtype: datetime64[ns]

Convert To DataFrame: to_frame()
This converts to a single-column DataFrame where the index holds the same values

dti.to_frame()
##                                       0
## 2011-01-03 00:00:00 2011-01-03 00:00:00
## 2018-04-13 00:00:00 2018-04-13 00:00:00
## 2018-03-01 07:30:00 2018-03-01 07:30:00

11.5.4 Attributes

All Timestamp attributes can be used on a DatetimeIndex.

print( dti.weekday, '\n',
       dti.month, '\n',
       dti.daysinmonth)
## Int64Index([0, 4, 3], dtype='int64') 
##  Int64Index([1, 4, 3], dtype='int64') 
##  Int64Index([31, 30, 31], dtype='int64')

11.6 Class: Series

Series allows different data types (object class) as its elements

pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
- data: array-like, iterable, dict or scalar
- If dtype is not specified, it will be inferred from data.

11.6.1 Constructor

Empty Series

Passing no data to the constructor results in an empty Series. Unless a dtype is given, an empty Series defaults to float64.

s = pd.Series(dtype='object')
print (s, '\n',
       type(s))
## Series([], dtype: object) 
##  <class 'pandas.core.series.Series'>

From Scalar

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index

pd.Series( 99, index = ['a','b','c','d'])
## a    99
## b    99
## c    99
## d    99
## dtype: int64

From array-like

From list

pd.Series(['a','b','c','d','e'])           # from Python list
## 0    a
## 1    b
## 2    c
## 3    d
## 4    e
## dtype: object

From numpy.array
If index is not specified, it defaults to 0 and continues incrementally

pd.Series(np.array(['a','b','c','d','e']))
## 0    a
## 1    b
## 2    c
## 3    d
## 4    e
## dtype: object

From DateTimeIndex

pd.Series(pd.date_range('2011-1-1','2011-1-3'))
## 0   2011-01-01
## 1   2011-01-02
## 2   2011-01-03
## dtype: datetime64[ns]

11.6.1.1 From Dictionary

The dictionary keys become the index. Keys keep their insertion order (they are not sorted).

pd.Series({'a' : 0., 'c' : 5., 'b' : 2.})
## a    0.0
## c    5.0
## b    2.0
## dtype: float64

If an index sequence is specified, the Series will follow that index order
Observe that missing data (an index label without a value) is marked as NaN

pd.Series({'a' : 0., 'c' : 1., 'b' : 2.},index = ['a','b','c','d'])
## a    0.0
## b    2.0
## c    1.0
## d    NaN
## dtype: float64

Specify Index

pd.Series(['a','b','c','d','e'], index=[10,20,30,40,50])
## 10    a
## 20    b
## 30    c
## 40    d
## 50    e
## dtype: object

11.6.1.2 Mix Element Types

dtype will be ‘object’ when there is a mixture of element types

ser = pd.Series(['a',1,2,3])
print('Object Type :  ', type(ser),'\n',
      'Object dType:  ', ser.dtype,'\n',
      'Element 1 Type: ',type(ser[0]),'\n',
      'Element 2 Type: ',type(ser[1]))
## Object Type :   <class 'pandas.core.series.Series'> 
##  Object dType:   object 
##  Element 1 Type:  <class 'str'> 
##  Element 2 Type:  <class 'int'>

11.6.1.3 Specify Data Types

By default, dtype is inferred from data.

ser1 = pd.Series([1,2,3])
ser2 = pd.Series([1,2,3], dtype="int8")
ser3 = pd.Series([1,2,3], dtype="object")

print(' Inferred:        ',ser1.dtype, '\n',
      'Specified int8:  ',ser2.dtype, '\n',
      'Specified object:',ser3.dtype)
##  Inferred:         int64 
##  Specified int8:   int8 
##  Specified object: object

11.6.2 Accessing Series

series     ( single/list/range_of_row_label/number ) # can cause confusion
series.loc ( single/list/range_of_row_label )
series.iloc( single/list/range_of_row_number )

Sample Data

s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e']) 
s
## a    1
## b    2
## c    3
## d    4
## e    5
## dtype: int64

by Row Number(s)

Single Item. Notice that passing a number and a list of numbers gives different results.

print( 'Referencing by number:',s.iloc[1],'\n\n',
       '\nReferencing by list of number:\n',s.iloc[[1]])
## Referencing by number: 2 
## 
##  
## Referencing by list of number:
##  b    2
## dtype: int64

Multiple Items

s.iloc[[1,3]] 
## b    2
## d    4
## dtype: int64

Range (First 3)

s.iloc[:3]
## a    1
## b    2
## c    3
## dtype: int64

Range (Last 3)

s.iloc[-3:]
## c    3
## d    4
## e    5
## dtype: int64

Range (in between)

s.iloc[2:3]
## c    3
## dtype: int64

by Index(es)

Single Label. Notice the difference between referencing with a single label and a list of labels.
Warning: if the label is invalid, this results in a KeyError.

print( s.loc['c'], '\n',
       s[['c']])
## 3 
##  c    3
## dtype: int64

Multiple Labels

If a label in the list is not found, a KeyError is raised (older pandas versions returned NaN instead)

s.loc[['k','c']]
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: "['k'] not in index"
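
To get NaN for missing labels instead of a KeyError, use reindex(); a minimal sketch:

s.reindex(['k','c'])
## k    NaN
## c    3.0
## dtype: float64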

Range of Labels

s.loc['b':'d']
## b    2
## c    3
## d    4
## dtype: int64

Filtering

Use logical array to filter

s = pd.Series(range(1,8))
s[s<5]
## 0    1
## 1    2
## 2    3
## 3    4
## dtype: int64

Use where
The where method is an application of the if-then idiom. For each element in the calling Series, if cond is True the element is used; otherwise other is used.

.where(cond, other=nan, inplace=False)
print(s.where(s<4),'\n\n',
      s.where(s<4,other=None) )
## 0    1.0
## 1    2.0
## 2    3.0
## 3    NaN
## 4    NaN
## 5    NaN
## 6    NaN
## dtype: float64 
## 
##  0    1.0
## 1    2.0
## 2    3.0
## 3    NaN
## 4    NaN
## 5    NaN
## 6    NaN
## dtype: float64

11.6.3 Updating Series

by Row Number(s)

s = pd.Series(range(1,7), index=['a','b','c','d','e','f'])
s[2] = 999
s[[3,4]] = 888,777
s
## a      1
## b      2
## c    999
## d    888
## e    777
## f      6
## dtype: int64

by Index(es)

s = pd.Series(range(1,7), index=['a','b','c','d','e','f'])
s['e'] = 888
s[['c','d']] = 777,888
s
## a      1
## b      2
## c    777
## d    888
## e    888
## f      6
## dtype: int64
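
The same updates can be written with the explicit .iloc/.loc accessors, which avoids the positional-versus-label ambiguity of plain []; a sketch:

s = pd.Series(range(1,7), index=['a','b','c','d','e','f'])
s.iloc[1] = 999              ## by position
s.loc[['c','d']] = 777, 888  ## by label
s
## a      1
## b    999
## c    777
## d    888
## e      5
## f      6
## dtype: int64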

11.6.4 Series Attributes

The Data

s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'],name='SuperHero') 
s
## a    1
## b    2
## c    3
## d    4
## e    5
## Name: SuperHero, dtype: int64

11.6.4.1 The Attributes

print( ' Series Index:    ',s.index, '\n',
       'Series dType:    ', s.dtype, '\n',
       'Series Size:     ', s.size, '\n',
       'Series Shape:    ', s.shape, '\n',
       'Series Dimension:', s.ndim)
##  Series Index:     Index(['a', 'b', 'c', 'd', 'e'], dtype='object') 
##  Series dType:     int64 
##  Series Size:      5 
##  Series Shape:     (5,) 
##  Series Dimension: 1

11.6.5 Instance Methods

Index Manipulation

.rename_axis()

s.rename_axis('haribulan')
## haribulan
## a    1
## b    2
## c    3
## d    4
## e    5
## Name: SuperHero, dtype: int64

.reset_index()

Resetting index will:
- Convert the index to a normal column, named ‘index’
- Renumber the index to 0,1,2,…
- Return a DataFrame (the Series becomes two columns)

s.reset_index()
##   index  SuperHero
## 0     a          1
## 1     b          2
## 2     c          3
## 3     d          4
## 4     e          5

Structure Conversion

  • A Series structure contains its values (in a numpy array) and its dtype (the data type of that numpy array).
  • Use values to retrieve the numpy.ndarray. Use dtype to check the data type.
s = pd.Series([1,2,3,4,5])
print(' Series value:      ', s.values, '\n', 
      'Series value type: ',  type(s.values), '\n',
      'Series dtype:      ',  s.dtype)
##  Series value:       [1 2 3 4 5] 
##  Series value type:  <class 'numpy.ndarray'> 
##  Series dtype:       int64

Convert To List using .tolist()

s.tolist()
## [1, 2, 3, 4, 5]

DataType Conversion

Use astype() to convert to another numpy-supported data type; the result is a new Series.
Warning: casting to an incompatible type results in an error, as shown in the sketch after the example below.

s.astype('int8')
## 0    1
## 1    2
## 2    3
## 3    4
## 4    5
## dtype: int8
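
A sketch of the warning above: casting non-numeric strings to an integer type raises ValueError.

try:
    pd.Series(['a','b']).astype('int8')
except ValueError as e:
    print('Cannot cast:', e)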

11.6.6 Operators

Applying an operator (arithmetic or logic) to a Series object returns a new Series object

Arithmetic Operator

s1 = pd.Series( [100,200,300,400,500] )
s2 = pd.Series( [10, 20, 30, 40, 50] )

Apply To One Series Object

s1 - 100
## 0      0
## 1    100
## 2    200
## 3    300
## 4    400
## dtype: int64

Apply To Two Series Objects

s1 - s2
## 0     90
## 1    180
## 2    270
## 3    360
## 4    450
## dtype: int64

Logic Operator

  • Applying a logic operator to a Series returns a new Series of boolean results
  • This can be used for Series or DataFrame filtering
bs = pd.Series(range(0,10))
bs>3
## 0    False
## 1    False
## 2    False
## 3    False
## 4     True
## 5     True
## 6     True
## 7     True
## 8     True
## 9     True
## dtype: bool
~((bs>3) & (bs<8) | (bs>7))
## 0     True
## 1     True
## 2     True
## 3     True
## 4    False
## 5    False
## 6    False
## 7    False
## 8    False
## 9    False
## dtype: bool

11.6.7 String Accessor .str

If the underlying data is of str type, pandas exposes various properties and methods through the str accessor.

SeriesObj.str.operatorFunction()

Available Functions

Nearly all Python’s built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:

len()      lower()       translate()   islower()
ljust()    upper()       startswith()  isupper()
rjust()    find()        endswith()    isnumeric()
center()   rfind()       isalnum()     isdecimal()
zfill()    index()       isalpha()     split()
strip()    rindex()      isdigit()     rsplit()
rstrip()   capitalize()  isspace()     partition()
lstrip()   swapcase()    istitle()     rpartition()

11.6.7.1 Regex Extractor

Extract capture groups in the regex pattern, by default into a DataFrame (expand=True).

Series.str.extract(self, pat, flags=0, expand=True)
- expand=False: if the result is a single column, return a Series instead of a DataFrame.
s = pd.Series(['a1', 'b2', 'c3'])
print( 
  ' Extracted Dataframe:\n', s.str.extract(r'([ab])(\d)'),'\n\n',
  'Extracted Dataframe with Names:\n', s.str.extract(r'(?P<Letter>[ab])(\d)'))
##  Extracted Dataframe:
##       0    1
## 0    a    1
## 1    b    2
## 2  NaN  NaN 
## 
##  Extracted Dataframe with Names:
##    Letter    1
## 0      a    1
## 1      b    2
## 2    NaN  NaN

The output below is a single column; use expand=False to make the result a Series instead of a DataFrame.

r = s.str.extract(r'[ab](\d)', expand=False)
print( r, '\n\n', type(r) )
## 0      1
## 1      2
## 2    NaN
## dtype: object 
## 
##  <class 'pandas.core.series.Series'>

11.6.7.2 Character Extractor

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte
## 0    Graham Chapman
## 1       John Cleese
## 2     Terry Gilliam
## 3         Eric Idle
## 4       Terry Jones
## 5     Michael Palin
## dtype: object

startswith

monte.str.startswith('T')
## 0    False
## 1    False
## 2     True
## 3    False
## 4     True
## 5    False
## dtype: bool

Slicing

monte.str[0:3]
## 0    Gra
## 1    Joh
## 2    Ter
## 3    Eri
## 4    Ter
## 5    Mic
## dtype: object

Splitting

Split strings around a given separator/delimiter, specified as either a string or a regex.

Series.str.split(self, pat=None, n=-1, expand=False)
- pat: can be string or regex
s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h_i_j'])
s
## 0        a_b_c
## 1        c_d_e
## 2          NaN
## 3    f_g_h_i_j
## dtype: object

By default, str.split() splits each item into a list

s.str.split('_')
## 0          [a, b, c]
## 1          [c, d, e]
## 2                NaN
## 3    [f, g, h, i, j]
## dtype: object

expand=True returns a DataFrame instead of a Series. By default, expand splits into all possible columns.

print( s.str.split('_', expand=True) )
##      0    1    2     3     4
## 0    a    b    c  None  None
## 1    c    d    e  None  None
## 2  NaN  NaN  NaN   NaN   NaN
## 3    f    g    h     i     j

It is possible to limit the number of splits (and hence columns)

print( s.str.split('_', expand=True, n=1) )
##      0        1
## 0    a      b_c
## 1    c      d_e
## 2  NaN      NaN
## 3    f  g_h_i_j

str.rsplit()

rsplit stands for reverse split; it works the same way, except splitting starts from the end of the string

print( s.str.rsplit('_', expand=True, n=1) )
##          0    1
## 0      a_b    c
## 1      c_d    e
## 2      NaN  NaN
## 3  f_g_h_i    j

Case Conversion

SeriesObj.str.upper()
SeriesObj.str.lower()
SeriesObj.str.capitalize()
s = pd.Series(['A', 'B', 'C', 'aAba', 'bBaca', np.nan, 'cCABA', 'dog', 'cat'])
print( s.str.upper(), '\n',
       s.str.capitalize())
## 0        A
## 1        B
## 2        C
## 3     AABA
## 4    BBACA
## 5      NaN
## 6    CCABA
## 7      DOG
## 8      CAT
## dtype: object 
##  0        A
## 1        B
## 2        C
## 3     Aaba
## 4    Bbaca
## 5      NaN
## 6    Ccaba
## 7      Dog
## 8      Cat
## dtype: object

Number of Characters

s.str.len()
## 0    1.0
## 1    1.0
## 2    1.0
## 3    4.0
## 4    5.0
## 5    NaN
## 6    5.0
## 7    3.0
## 8    3.0
## dtype: float64

String Indexing

This return specified character from each item.

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
s.str[0].values    # first char
## array(['A', 'B', 'C', 'A', 'B', nan, 'C', 'd', 'c'], dtype=object)
s.str[0:2].values  # first and second char
## array(['A', 'B', 'C', 'Aa', 'Ba', nan, 'CA', 'do', 'ca'], dtype=object)

Substring Extraction

Sample Data

s = pd.Series(['a1', 'b2', 'c3'])
s
## 0    a1
## 1    b2
## 2    c3
## dtype: object

Extract based on regex matching. Note that with more than one capture group, the result is a DataFrame even when expand=False:

type(s.str.extract('([ab])(\d)', expand=False))
## <class 'pandas.core.frame.DataFrame'>

11.6.8 Date Accessor .dt

If the underlying data is of datetime64 type, pandas exposes various properties and methods through the dt accessor.

Sample Data

s = pd.Series([
    datetime(2000,1,1,0,0,0),
    datetime(1999,12,15,12,34,55),
    datetime(2020,3,8,5,7,12),
    datetime(2018,1,1,0,0,0),
    datetime(2003,3,4,5,6,7)
])
s
## 0   2000-01-01 00:00:00
## 1   1999-12-15 12:34:55
## 2   2020-03-08 05:07:12
## 3   2018-01-01 00:00:00
## 4   2003-03-04 05:06:07
## dtype: datetime64[ns]

Convert To

datetime.datetime
Use to_pydatetime() to convert into a numpy.ndarray of standard library datetime.datetime objects

pdt  = s.dt.to_pydatetime()
print( type(pdt) )
## <class 'numpy.ndarray'>
pdt
## array([datetime.datetime(2000, 1, 1, 0, 0),
##        datetime.datetime(1999, 12, 15, 12, 34, 55),
##        datetime.datetime(2020, 3, 8, 5, 7, 12),
##        datetime.datetime(2018, 1, 1, 0, 0),
##        datetime.datetime(2003, 3, 4, 5, 6, 7)], dtype=object)

datetime.date
Use dt.date to convert into a pandas.Series of standard library datetime.date objects
Is it possible to have a pandas.Series of datetime.datetime? No, because pandas stores datetimes as its own Timestamp type.

sdt = s.dt.date
print( type(sdt[1] ))
## <class 'datetime.date'>
print( type(sdt))
## <class 'pandas.core.series.Series'>
sdt
## 0    2000-01-01
## 1    1999-12-15
## 2    2020-03-08
## 3    2018-01-01
## 4    2003-03-04
## dtype: object

11.6.8.1 Timestamp Attributes

A Series of datetime64 dtype supports the properties below:
- date
- month
- day
- year
- dayofweek
- dayofyear
- weekday
- weekday_name (removed in pandas 1.0; use dt.day_name() instead)
- quarter
- daysinmonth
- hour
- minute

Full list below:
https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetimelike-properties

s.dt.date
## 0    2000-01-01
## 1    1999-12-15
## 2    2020-03-08
## 3    2018-01-01
## 4    2003-03-04
## dtype: object
s.dt.month
## 0     1
## 1    12
## 2     3
## 3     1
## 4     3
## dtype: int64
s.dt.dayofweek
## 0    5
## 1    2
## 2    6
## 3    0
## 4    1
## dtype: int64
s.dt.weekday
## 0    5
## 1    2
## 2    6
## 3    0
## 4    1
## dtype: int64
s.dt.quarter
## 0    1
## 1    4
## 2    1
## 3    1
## 4    1
## dtype: int64
s.dt.daysinmonth
## 0    31
## 1    31
## 2    31
## 3    31
## 4    31
## dtype: int64
s.dt.time   # extract time as time Object
## 0    00:00:00
## 1    12:34:55
## 2    05:07:12
## 3    00:00:00
## 4    05:06:07
## dtype: object
s.dt.hour  # extract hour as integer
## 0     0
## 1    12
## 2     5
## 3     0
## 4     5
## dtype: int64
s.dt.minute # extract minute as integer
## 0     0
## 1    34
## 2     7
## 3     0
## 4     6
## dtype: int64

11.7 Class: DataFrame

11.7.1 Constructor

  • pd.DataFrame() constructor creates DataFrame object.
  • When index (row label) and column names are not specified (or cannot be induced from data source), it will default to 0,1,2,3…
  • Specify Row Label and Column Header with columns and index parameters in constructor.
  • When columns is specified AND data source has keys that can be used as column names, pandas will select those column names and order specified in columns parameter.
  • index and row label are used interchangeably in this book

Empty DataFrame

By default, an empty DataFrame contains no columns and no index.

empty_df1 = pd.DataFrame()
empty_df2 = pd.DataFrame()

empty_df1
id(empty_df1)
id(empty_df2)
## Empty DataFrame
## Columns: []
## Index: []
## 1637942427808
## 1637942431216

Even though it is empty, you can still initialize it with columns and an index.

## empty dataframe with columns
empty_df = pd.DataFrame(columns=['A','B','C'])
empty_df  ## index is empty

## empty dataframe with columns and index
empty_df = pd.DataFrame(columns=['A','B','C'], index=[1,2,3])
empty_df
## Empty DataFrame
## Columns: [A, B, C]
## Index: []
##      A    B    C
## 1  NaN  NaN  NaN
## 2  NaN  NaN  NaN
## 3  NaN  NaN  NaN

When initializing multiple empty DataFrames at once (in one line), all of them refer to the same memory location, meaning they share the same data.

empty_df1 = empty_df2 = pd.DataFrame()
id(empty_df1)
id(empty_df2)
## 1637942429008
## 1637942429008
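
To get independent objects, call the constructor once per variable (a sketch):

empty_df1, empty_df2 = pd.DataFrame(), pd.DataFrame()
empty_df1 is empty_df2
## False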

From Row Oriented Data

DataFrame( [row_list1, row_list2, row_list3] )
DataFrame( [row_list1, row_list2, row_list3], columns = columnName_list )
DataFrame( [row_list1, row_list2, row_list3], index   = row_label_list )
  • Create a DataFrame object from a List of Lists
  • Use ‘columns’ to select the columns and their preferred order.
data = [  [101, 'Alice',  40000, 2017],
          [102, 'Bob',    24000, 2017], 
          [103, 'Charles',31000, 2017]]

## Initialize with default row label and column names  
pd.DataFrame(data)
##      0        1      2     3
## 0  101    Alice  40000  2017
## 1  102      Bob  24000  2017
## 2  103  Charles  31000  2017
## Initialize with column names and row index
col_names  =  ['empID','name','salary','year']
row_labels =  ['r1','r2','r3']
pd.DataFrame(data, columns=col_names, index=row_labels)
##     empID     name  salary  year
## r1    101    Alice   40000  2017
## r2    102      Bob   24000  2017
## r3    103  Charles   31000  2017

From Record Oriented Data

DataFrame( [dict1, dict2, dict3] )
DataFrame( [dict1, dict2, dict3], columns = column_list )
DataFrame( [dict1, dict2, dict3], index   = row_label_list )
  • Create a DataFrame object from a List of Dicts.
  • By default, column names follow the dictionary keys, unless manually specified.
  • Some dict objects may have missing keys; these are treated as NaN by pandas.
  • Use ‘columns’ to select the columns and their preferred order.
data = [
  {"name":"Yong", "id":1,"zkey":101},
  {"name":"Gan",  "id":2           }]  ## missing zkey in this record

## Default Index and include all detected columns
pd.DataFrame(data)
##    name  id   zkey
## 0  Yong   1  101.0
## 1   Gan   2    NaN
## specify Index, must be same length of DataFrame
pd.DataFrame(data, index=['row1','row2'])
##       name  id   zkey
## row1  Yong   1  101.0
## row2   Gan   2    NaN
## Filter Which Columns To Include, in the prefered order
pd.DataFrame(data, columns=['zkey','name'])
##     zkey  name
## 0  101.0  Yong
## 1    NaN   Gan

From Column Oriented Data

DataFrame(  data = {'column1': list1,
                    'column2': list2,
                    'column3': list3 } , 
            index    = row_label_list, 
            columns  = column_list)
  • Create a DataFrame object from a Dict of Lists.
  • The constructor arranges the columns according to the dict keys’ order by default, unless columns is specified.
  • Use ‘columns’ to select the columns in preferred order. A NaN column is returned if it does not exist.
## columns arrangement default to dict's keys
data = {'empID':  [100,      101,    102,      103,     104],
        'year1':   [2017,     2017,   2017,      2018,    2018],
        'year2':   [2027,     2027,   2027,      2028,    2028],
        'salary': [40000,    24000,  31000,     20000,   30000],
        'name':   ['Alice', 'Bob',  'Charles', 'David', 'Eric']}

## Default Index and Include All Columns
pd.DataFrame(data)
##    empID  year1  year2  salary     name
## 0    100   2017   2027   40000    Alice
## 1    101   2017   2027   24000      Bob
## 2    102   2017   2027   31000  Charles
## 3    103   2018   2028   20000    David
## 4    104   2018   2028   30000     Eric
## Select which columns to show (and their order); specify row labels
pd.DataFrame(data, columns=['name','salary','year1','year2','not_exist'], index=data.get('empID'))
##         name  salary  year1  year2 not_exist
## 100    Alice   40000   2017   2027       NaN
## 101      Bob   24000   2017   2027       NaN
## 102  Charles   31000   2017   2027       NaN
## 103    David   20000   2018   2028       NaN
## 104     Eric   30000   2018   2028       NaN

11.7.2 Operators

Sample Data

  • Two dataframes are created, each with 3 columns and 3 rows.
  • However, there are only two matching column names and two matching row names between them.
  • We shall notice that the operators work cell-wise, honoring the row/column names.
df1 = pd.DataFrame(data=
  {'idx': ['row1','row2','row3'],
   'x': [10, 20, 30],
   'y': [1,2,3],
   'z': [0.1, 0.2, 0.3]}).set_index('idx')
   
df2 = pd.DataFrame(data=
  {'idx': ['row1','row2','row4'],
   'x': [13, 23, 33],
   'z': [0.1, 0.2, 0.3],
   'k': [11,21,31]}).set_index('idx')

Addition

Adding Two DataFrame

  • When using the + operator, non-matching row/column names result in NaN.
  • .add() has the same result as + by default.
  • However, add supports a fill_value= parameter. When specified, cells missing on ONE side are assumed to have the fill_value. Only cells missing on BOTH sides result in NaN.
df1 + df2
df1.add(df2)
df1.add(df2, fill_value=1)
##        k     x   y    z
## idx                    
## row1 NaN  23.0 NaN  0.2
## row2 NaN  43.0 NaN  0.4
## row3 NaN   NaN NaN  NaN
## row4 NaN   NaN NaN  NaN
##        k     x   y    z
## idx                    
## row1 NaN  23.0 NaN  0.2
## row2 NaN  43.0 NaN  0.4
## row3 NaN   NaN NaN  NaN
## row4 NaN   NaN NaN  NaN
##          k     x    y    z
## idx                       
## row1  12.0  23.0  2.0  0.2
## row2  22.0  43.0  3.0  0.4
## row3   NaN  31.0  4.0  1.3
## row4  32.0  34.0  NaN  1.3

Adding Series To DataFrame

  • Specify the appropriate axis depending on the orientation of the Series data.
  • Column and row names are respected in this operation.
  • However, fill_value is not supported when the other operand is a Series.
  • Columns or rows present only in the Series are created in the result. This behavior is similar to adding two DataFrames.
## Series to add into Rows
s3 = pd.Series([1,1,1], index=['row1','row2','row4'])

## Series to add into Columns
s4 = pd.Series([3,3,3], index=['x','y','s'])

print('Original DataFrame:\n',df1,'\n\n',
      'Add Series as Rows: \n', df1.add(s3, axis=0), '\n\n',
      'Add Series as Columns: \n', df1.add(s4, axis=1))
## Original DataFrame:
##         x  y    z
## idx             
## row1  10  1  0.1
## row2  20  2  0.2
## row3  30  3  0.3 
## 
##  Add Series as Rows: 
##           x    y    z
## row1  11.0  2.0  1.1
## row2  21.0  3.0  1.2
## row3   NaN  NaN  NaN
## row4   NaN  NaN  NaN 
## 
##  Add Series as Columns: 
##         s     x    y   z
## idx                    
## row1 NaN  13.0  4.0 NaN
## row2 NaN  23.0  5.0 NaN
## row3 NaN  33.0  6.0 NaN

Subtraction

Refer to .add() above for fill_value explanation.

df2 - df1
df2.sub(df1,fill_value=1000)
##        k    x   y    z
## idx                   
## row1 NaN  3.0 NaN  0.0
## row2 NaN  3.0 NaN  0.0
## row3 NaN  NaN NaN  NaN
## row4 NaN  NaN NaN  NaN
##           k      x      y      z
## idx                             
## row1 -989.0    3.0  999.0    0.0
## row2 -979.0    3.0  998.0    0.0
## row3    NaN  970.0  997.0  999.7
## row4 -969.0 -967.0    NaN -999.7

Operator &

  • Non-matching columns return NaN.
  • Matching cells where both values are True return True
  • Matching cells with one True and one False return False
df1>5
df2<20
(df1>5) & (df2<20)
##          x      y      z
## idx                     
## row1  True  False  False
## row2  True  False  False
## row3  True  False  False
##           x     z      k
## idx                     
## row1   True  True   True
## row2  False  True  False
## row4  False  True  False
##        k      x   y      z
## idx                       
## row1 NaN   True NaN  False
## row2 NaN  False NaN  False
## row3 NaN  False NaN  False
## row4 NaN  False NaN  False

11.7.3 Attributes

df = pd.DataFrame(data)
df
##    empID  year1  year2  salary     name
## 0    100   2017   2027   40000    Alice
## 1    101   2017   2027   24000      Bob
## 2    102   2017   2027   31000  Charles
## 3    103   2018   2028   20000    David
## 4    104   2018   2028   30000     Eric
df.shape          ## tuple (rows, columns)
df.index          ## default index is RangeIndex 
df.index.values   ## array of integer
df.columns        ## array of string
df.columns.values ## array of string
df.values         ## array of list
## (5, 5)
## RangeIndex(start=0, stop=5, step=1)
## array([0, 1, 2, 3, 4], dtype=int64)
## Index(['empID', 'year1', 'year2', 'salary', 'name'], dtype='object')
## array(['empID', 'year1', 'year2', 'salary', 'name'], dtype=object)
## array([[100, 2017, 2027, 40000, 'Alice'],
##        [101, 2017, 2027, 24000, 'Bob'],
##        [102, 2017, 2027, 31000, 'Charles'],
##        [103, 2018, 2028, 20000, 'David'],
##        [104, 2018, 2028, 30000, 'Eric']], dtype=object)

11.7.4 Index Manipulation

Sample Data

df
##    empID  year1  year2  salary     name
## 0    100   2017   2027   40000    Alice
## 1    101   2017   2027   24000      Bob
## 2    102   2017   2027   31000  Charles
## 3    103   2018   2028   20000    David
## 4    104   2018   2028   30000     Eric

Convert Column To Index

set_index('column_name', inplace=False)
  • Specified column will turn into Index
  • inplace=True means don’t create a new dataframe. Modify existing DataFrame.
  • inplace=False means return a new DataFrame
df.index   ## 0,1,2,3,...
df.set_index('empID',inplace=True)
df.index   ## 100,101,102,etc
df
## RangeIndex(start=0, stop=5, step=1)
## Int64Index([100, 101, 102, 103, 104], dtype='int64', name='empID')
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
## 103     2018   2028   20000    David
## 104     2018   2028   30000     Eric

Convert Index To Column

  • Resetting the index re-sequences it as 0,1,2, etc.
  • The old index is converted back to a normal column, with its name retained as the column name.
  • The operation supports the inplace option.
df.reset_index(inplace=True)
df
##    empID  year1  year2  salary     name
## 0    100   2017   2027   40000    Alice
## 1    101   2017   2027   24000      Bob
## 2    102   2017   2027   31000  Charles
## 3    103   2018   2028   20000    David
## 4    104   2018   2028   30000     Eric

Updating Index ( .index= )

  • This simply replaces the entire index with the new values.
  • The number of elements in the new index must match the length of the DataFrame, otherwise an error is raised.
  • Duplicate labels are allowed, but discouraged.
  • This operation is not reversible.
df.index = [201, 202, 203, 204, 205]
df
##      empID  year1  year2  salary     name
## 201    100   2017   2027   40000    Alice
## 202    101   2017   2027   24000      Bob
## 203    102   2017   2027   31000  Charles
## 204    103   2018   2028   20000    David
## 205    104   2018   2028   30000     Eric

Reordering Index (.reindex )

  • Returns a new DataFrame with rows in the order specified. There is NO ‘inplace’ option.
  • Doesn’t work if duplicate labels exist (raises an error)
  • Only matching rows are returned.
  • Non-matching index labels result in NaN rows, without error.

Change the order of Index, always return a new dataframe

df.reindex([203,202,300])
##      empID   year1   year2   salary     name
## 203  102.0  2017.0  2027.0  31000.0  Charles
## 202  101.0  2017.0  2027.0  24000.0      Bob
## 300    NaN     NaN     NaN      NaN      NaN

Rename Axis

  • The example below renames the axis of the index.
  • Use axis=0 for the row index (default); use axis=1 for the column index.
  • Use the inplace option to apply changes to the DataFrame, otherwise a new DataFrame is returned.
df.rename_axis('super_id', axis=0, inplace=True)
df
##           empID  year1  year2  salary     name
## super_id                                      
## 201         100   2017   2027   40000    Alice
## 202         101   2017   2027   24000      Bob
## 203         102   2017   2027   31000  Charles
## 204         103   2018   2028   20000    David
## 205         104   2018   2028   30000     Eric

11.7.5 Subsetting Rows

dataframe.loc[ row_label       ]  # return series, single row
dataframe.loc[ row_label_list  ]  # multiple rows
dataframe.loc[ boolean_list    ]  # multiple rows

dataframe.iloc[ row_number       ]  # return series, single row
dataframe.iloc[ row_number_list  ]  # multiple rows
dataframe.iloc[ number_range     ]  # multiple rows

dataframe.sample(frac=)             # frac = 0.6 means sampling 60% of rows randomly

Sample Data

df  = pd.DataFrame(data).set_index('empID')
df
df2 = pd.DataFrame(data, index = ['row2','row3','row1','row5','row4'])
df2
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
## 103     2018   2028   20000    David
## 104     2018   2028   30000     Eric
##       empID  year1  year2  salary     name
## row2    100   2017   2027   40000    Alice
## row3    101   2017   2027   24000      Bob
## row1    102   2017   2027   31000  Charles
## row5    103   2018   2028   20000    David
## row4    104   2018   2028   30000     Eric

Using Row Label (loc)

  • Non matching single row label will result in KeyError.
  • Non matching range of row labels will not trigger error.

Row Label is number

df.loc[ 101]         # by single row label, return Series
## year1      2017
## year2      2027
## salary    24000
## name        Bob
## Name: 101, dtype: object
df.loc[ [100,103] ]  # by multiple row labels, returns DataFrame
##        year1  year2  salary   name
## empID                             
## 100     2017   2027   40000  Alice
## 103     2018   2028   20000  David
df.loc[ 100:103   ]  # by range of row labels, returns DataFrame
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
## 103     2018   2028   20000    David
df.loc[ 999:9999  ]  # invalid range, No Error !!
## Empty DataFrame
## Columns: [year1, year2, salary, name]
## Index: []
df.loc[ 999 ]        # invalid key, KeyError
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 999

Row Label is string

df2.loc[ 'row3' ]          # by single row label, return Series
## empID       101
## year1      2017
## year2      2027
## salary    24000
## name        Bob
## Name: row3, dtype: object
df2.loc[ ['row1','row3'] ] # by multiple row labels, returns DataFrame
##       empID  year1  year2  salary     name
## row1    102   2017   2027   31000  Charles
## row3    101   2017   2027   24000      Bob
df2.loc[ 'row1':'row3'  ]  # by range of row labels, return empty DataFrame because there is no row3 after row1
## Empty DataFrame
## Columns: [empID, year1, year2, salary, name]
## Index: []
df2.loc[ 'row1':'row4'  ]  # by range of row labels, returns DataFrame
##       empID  year1  year2  salary     name
## row1    102   2017   2027   31000  Charles
## row5    103   2018   2028   20000    David
## row4    104   2018   2028   30000     Eric
df2.loc[ 'row1':'baba'  ]  # Invalid range, KeyError
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 'baba'

Using Row Number (iloc)

Multiple rows returned as dataframe object

df.iloc[1]          # by single row number
df.iloc[ [0,3] ]    # by row numbers
df.iloc[  0:3  ]    # by row number range
df.iloc[ 0:999 ]    # invalid range, no error
df.iloc[999]        # invalid row, IndexError
## year1      2017
## year2      2027
## salary    24000
## name        Bob
## Name: 101, dtype: object
##        year1  year2  salary   name
## empID                             
## 100     2017   2027   40000  Alice
## 103     2018   2028   20000  David
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
## 103     2018   2028   20000    David
## 104     2018   2028   30000     Eric
## Error in py_call_impl(callable, dots$args, dots$keywords): IndexError: single positional indexer is out-of-bounds

Using Boolean List

criteria = (df.salary > 30000) & (df.year1==2017)
print (criteria)
print (df.loc[criteria])
## empID
## 100     True
## 101    False
## 102     True
## 103    False
## 104    False
## dtype: bool
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 102     2017   2027   31000  Charles

Using Expression (.query)

.query(expr, inplace=False)

num1 = 31000
num2 = 2017
df.query(f'salary<={num1} and year1=={num2}')
##        year1  year2  salary     name
## empID                               
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
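query() can also reference Python variables directly with the @ prefix, which avoids string formatting. A minimal sketch, equivalent to the f-string version above:

df.query('salary <= @num1 and year1 == @num2')  ## @name reads the local variable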

Sampling (.sample)

Specify the percentage of rows to be returned at random.

np.random.seed(15)  ## ensure consistency of results
df.sample(frac=0.7) ## randomly pick 70% of rows, without replacement
##        year1  year2  salary     name
## empID                               
## 102     2017   2027   31000  Charles
## 103     2018   2028   20000    David
## 104     2018   2028   30000     Eric
## 101     2017   2027   24000      Bob
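Besides frac=, .sample() also accepts n= for an exact number of rows and random_state= to seed the draw without touching the global NumPy seed. A minimal sketch:

df.sample(n=2, random_state=15)  ## exactly 2 random rows, reproducible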

11.7.6 Row Manipulation

Sample Data

df
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
## 103     2018   2028   20000    David
## 104     2018   2028   30000     Eric

Append Rows

.append() has been deprecated. Refer to Concatenating section.
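As a quick preview, a single row can be appended by wrapping it in a one-row DataFrame and concatenating. A minimal sketch; the new employee 105/'Fiona' is made up for illustration:

new_row = pd.DataFrame({'year1': [2019], 'year2': [2029],
                        'salary': [28000], 'name': ['Fiona']},
                       index=[105])   ## hypothetical new empID
pd.concat([df, new_row])             ## returns a new DataFrame; df is unchanged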

Drop Rows (.drop)

.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

By Row Label(s)

df.drop(index=100)                         # drop single row
df.drop(index=[100,103], columns='salary') # drop selected rows and columns
##        year1  year2  salary     name
## empID                               
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
## 103     2018   2028   20000    David
## 104     2018   2028   30000     Eric
##        year1  year2     name
## empID                       
## 101     2017   2027      Bob
## 102     2017   2027  Charles
## 104     2018   2028     Eric

11.7.7 Column Manipulation

Sample Data

df = pd.DataFrame(data)
df
##    empID  year1  year2  salary     name
## 0    100   2017   2027   40000    Alice
## 1    101   2017   2027   24000      Bob
## 2    102   2017   2027   31000  Charles
## 3    103   2018   2028   20000    David
## 4    104   2018   2028   30000     Eric

Renaming Columns

Method 1 : Rename All Columns (.columns =).

Construct the full list of new column names; any missing column name will result in an error.

new_columns = ['empID', 'year.1','year.2','salary', 'glamour']
df.columns = new_columns
df.head(2)
##    empID  year.1  year.2  salary glamour
## 0    100    2017    2027   40000   Alice
## 1    101    2017    2027   24000     Bob

Method 2 : Renaming Specific Column (.rename (columns=) )

  • Change column names through the rename function.
  • Supports the inplace option.
  • Specify only the columns that require a name change.
  • Non-matching column names do not raise an error, which is convenient.
df.rename( columns={'year.1':'year1', 'year.2':'year2'}, inplace=True)
df.head(2)
##    empID  year1  year2  salary glamour
## 0    100   2017   2027   40000   Alice
## 1    101   2017   2027   24000     Bob
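rename(columns=...) also accepts a function instead of a dict; the function is applied to every column name. A minimal sketch using the built-in str.upper:

df.rename(columns=str.upper).head(2)  ## EMPID, YEAR1, YEAR2, SALARY, GLAMOUR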

Reordering Columns

  • Reordering columns basically means returning a new DataFrame with the columns listed in the preferred order.
  • Refer to Sub-setting Column section.

Create New Column

  • A new column is created instantly using [] notation.
  • DO NOT USE dot notation to create a column; attribute access is a view-only shortcut.
df['year3'] = df.year1
df
##    empID  year1  year2  salary  glamour  year3
## 0    100   2017   2027   40000    Alice   2017
## 1    101   2017   2027   24000      Bob   2017
## 2    102   2017   2027   31000  Charles   2017
## 3    103   2018   2028   20000    David   2018
## 4    104   2018   2028   30000     Eric   2018
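An alternative that avoids mutating df is .assign(), which returns a new DataFrame carrying the extra column(s). A minimal sketch; the column name year4 is arbitrary:

df.assign(year4 = df.year1 + 10)  ## new DataFrame with an extra column; df unchanged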

Dropping Columns (.drop)

dataframe.drop( columns='column_name',    inplace=True/False)   # delete single column
dataframe.drop( columns=list_of_colnames, inplace=True/False)   # delete multiple column

dataframe.drop( index='row_label',         inplace=True/False)   # delete single row
dataframe.drop( index= list_of_row_labels, inplace=True/False)   # delete multiple rows

inplace=True means the column is deleted from the original dataframe. The default is False, which returns a modified copy of the dataframe.

By Column Name(s)

df.drop( columns='year1')           # drop single column
df.drop( columns=['year1','year2'])  # drop multiple columns
##    empID  year2  salary  glamour  year3
## 0    100   2027   40000    Alice   2017
## 1    101   2027   24000      Bob   2017
## 2    102   2027   31000  Charles   2017
## 3    103   2028   20000    David   2018
## 4    104   2028   30000     Eric   2018
##    empID  salary  glamour  year3
## 0    100   40000    Alice   2017
## 1    101   24000      Bob   2017
## 2    102   31000  Charles   2017
## 3    103   20000    David   2018
## 4    104   30000     Eric   2018

By Column Number(s)

Use dataframe.columns to produce an interim list of column names

## before dropping columns
df
df.drop( columns=df.columns[[1,2]])  # drop second and third columns
df.drop( columns=df.columns[0:3] )   # drop first, second and third columns
##    empID  year1  year2  salary  glamour  year3
## 0    100   2017   2027   40000    Alice   2017
## 1    101   2017   2027   24000      Bob   2017
## 2    102   2017   2027   31000  Charles   2017
## 3    103   2018   2028   20000    David   2018
## 4    104   2018   2028   30000     Eric   2018
##    empID  salary  glamour  year3
## 0    100   40000    Alice   2017
## 1    101   24000      Bob   2017
## 2    102   31000  Charles   2017
## 3    103   20000    David   2018
## 4    104   30000     Eric   2018
##    salary  glamour  year3
## 0   40000    Alice   2017
## 1   24000      Bob   2017
## 2   31000  Charles   2017
## 3   20000    David   2018
## 4   30000     Eric   2018

11.7.8 Subsetting Columns

Select Single Column Return Series

dataframe.columnName               # single column, name based, return Series object
dataframe[ single_col_name ]       # single column, name based, return Series object

Select Single/Multiple Columns Return DataFrame

dataframe    [  list_of_col_names   ]  # name based, return Dataframe object
dataframe.loc[ : , single_col_name  ]  # single column, series
dataframe.loc[ : , col_name_list    ]  # multiple columns, dataframe
dataframe.loc[ : , col_name_range   ]  # multiple columns, dataframe

dataframe.iloc[ : , col_number      ]  # single column, series
dataframe.iloc[ : , col_number_list ]  # multiple columns, dataframe
dataframe.iloc[ : , number_range    ]  # multiple columns, dataframe

By Number (.iloc) and Label (.loc)

Get Single Column (.loc, iloc)

Selecting a single column always returns a pandas Series, except when using [[ ]] notation, which returns a DataFrame.

df = pd.DataFrame(data)
df.name
df['name']
df.loc[:, 'name']
df.iloc[:, 3]
## 0      Alice
## 1        Bob
## 2    Charles
## 3      David
## 4       Eric
## Name: name, dtype: object
## 0      Alice
## 1        Bob
## 2    Charles
## 3      David
## 4       Eric
## Name: name, dtype: object
## 0      Alice
## 1        Bob
## 2    Charles
## 3      David
## 4       Eric
## Name: name, dtype: object
## 0    40000
## 1    24000
## 2    31000
## 3    20000
## 4    30000
## Name: salary, dtype: int64

Get Multiple Columns (.loc, iloc)

Selecting multiple columns returns a pandas DataFrame object

The example below returns a DataFrame with a single column

df[['name']]                # return one column dataframe
df[['name','year1']] 
df.loc[:,['name','year1']]
##       name
## 0    Alice
## 1      Bob
## 2  Charles
## 3    David
## 4     Eric
##       name  year1
## 0    Alice   2017
## 1      Bob   2017
## 2  Charles   2017
## 3    David   2018
## 4     Eric   2018
##       name  year1
## 0    Alice   2017
## 1      Bob   2017
## 2  Charles   2017
## 3    David   2018
## 4     Eric   2018

Select Range of Columns

df.loc [ : , 'year1':'year2']  ## by range of column names
df.loc[ : , ['empID','year2']] ## select two columns only
df.iloc[ : , 1:4]              ## by range of column number
df.iloc[ : , [0,3]]            ## select two columns only
##    year1  year2
## 0   2017   2027
## 1   2017   2027
## 2   2017   2027
## 3   2018   2028
## 4   2018   2028
##    empID  year2
## 0    100   2027
## 1    101   2027
## 2    102   2027
## 3    103   2028
## 4    104   2028
##    year1  year2  salary
## 0   2017   2027   40000
## 1   2017   2027   24000
## 2   2017   2027   31000
## 3   2018   2028   20000
## 4   2018   2028   30000
##    empID  salary
## 0    100   40000
## 1    101   24000
## 2    102   31000
## 3    103   20000
## 4    104   30000

By Column Name (.filter, .reindex)

.filter(items=None, like=None, regex=None, axis=None)

Filter by like - substring matching; always returns a DataFrame.

df.filter( like='year', axis=1)  ## axis=1 or axis='columns'
##    year1  year2
## 0   2017   2027
## 1   2017   2027
## 2   2017   2027
## 3   2018   2028
## 4   2018   2028
df.filter( like='name')
##       name
## 0    Alice
## 1      Bob
## 2  Charles
## 3    David
## 4     Eric

Filter by items - list of exact column names

df.filter( items=('year1','year2', 'name', 'not_exist'),  axis=1)
##    year1  year2     name
## 0   2017   2027    Alice
## 1   2017   2027      Bob
## 2   2017   2027  Charles
## 3   2018   2028    David
## 4   2018   2028     Eric

Filter by regex - regular expression matching on column names

Select column names that contain a digit.

df.filter(regex=r'\d')  ## default axis is columns for a DataFrame
##    year1  year2
## 0   2017   2027
## 1   2017   2027
## 2   2017   2027
## 3   2018   2028
## 4   2018   2028

.reindex(columns = .. )

  • It returns a new dataframe. There is no inplace option for reordering columns.
  • Non-matching column names result in NaN values.
new_colorder = [ 'salary', 'year1', 'baba']
df.reindex(columns = new_colorder)  ## Non matching column name returns as NaN
##    salary  year1  baba
## 0   40000   2017   NaN
## 1   24000   2017   NaN
## 2   31000   2017   NaN
## 3   20000   2018   NaN
## 4   30000   2018   NaN

By Data Types (.select_dtypes)

Always returns a pandas DataFrame, even if only a single column matches. Allowed types are:

  • number (integer and float)
  • integer / int64
  • float
  • datetime
  • timedelta
  • category
  • bool
  • object (string)

Use .dtypes.value_counts() to inspect the available data types.

df.dtypes.value_counts()                       ## inspect available data types
df.select_dtypes( exclude='integer')           ## exclude integer columns
df.select_dtypes( include=['number','object']) ## include number and object columns
## int64     4
## object    1
## dtype: int64
##       name
## 0    Alice
## 1      Bob
## 2  Charles
## 3    David
## 4     Eric
##    empID  year1  year2  salary     name
## 0    100   2017   2027   40000    Alice
## 1    101   2017   2027   24000      Bob
## 2    102   2017   2027   31000  Charles
## 3    103   2018   2028   20000    David
## 4    104   2018   2028   30000     Eric

11.7.9 Concatenating

Concatenating Two DataFrames

  • When ignore_index=True, pandas creates a new 0,1,2,3… index in the result.
  • It is recommended to ignore the index IF the source indexes are not unique.
  • Non-matching columns or rows are added to the result (filled with NaN where absent).
my_df = pd.DataFrame(
          data= {'Id':   [10,20,30],
                 'Name': ['Aaa','Bbb','Ccc']})
#                 .set_index('Id')
                 
my_df_new = pd.DataFrame(
            data= {'Id':   [40,50],
                   'Name': ['Ddd','Eee'],
                   'Age':  [12,13]})  
                   #.set_index('Id')
                   
my_df_append  = pd.concat( [my_df, my_df_new])
my_df_noindex = pd.concat( [my_df, my_df_new], ignore_index=True)

print("Original DataFrame:\n", my_df,
      "\n\nTo Be Appended DataFrame:\n", my_df_new,
      "\n\nAppended DataFrame (index maintained):\n", my_df_append,
      "\n\nAppended DataFrame (index ignored):\n", my_df_noindex)
## Original DataFrame:
##     Id Name
## 0  10  Aaa
## 1  20  Bbb
## 2  30  Ccc 
## 
## To Be Appended DataFrame:
##     Id Name  Age
## 0  40  Ddd   12
## 1  50  Eee   13 
## 
## Appended DataFrame (index maintained):
##     Id Name   Age
## 0  10  Aaa   NaN
## 1  20  Bbb   NaN
## 2  30  Ccc   NaN
## 0  40  Ddd  12.0
## 1  50  Eee  13.0 
## 
## Appended DataFrame (index ignored):
##     Id Name   Age
## 0  10  Aaa   NaN
## 1  20  Bbb   NaN
## 2  30  Ccc   NaN
## 3  40  Ddd  12.0
## 4  50  Eee  13.0
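pd.concat() can also glue DataFrames side by side with axis=1, aligning rows on the index; index labels present in only one frame produce NaN. A minimal sketch using the same two frames:

pd.concat([my_df, my_df_new], axis=1)  ## row 2 exists only in my_df, so the
                                       ## second frame's columns are NaN there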

11.7.10 Slicing

Sample Data

df = pd.DataFrame(data).set_index('empID')
df
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles
## 103     2018   2028   20000    David
## 104     2018   2028   30000     Eric

Getting One Cell

By Row Label and Column Name (loc)

dataframe.loc [ row_label , col_name   ]    # by row label and column names
dataframe.loc [ bool_list , col_name   ]    # by boolean list and column name
dataframe.iloc[ row_number, col_number ]    # by row and column number
df.loc [102,'year1']  ## row label 102, column 'year1'
df.iloc[2,0]          ## same result as above
## 2017
## 2017

Getting Range of Cells

Specify rows and columns (by individual or range)

dataframe.loc [ list/range_of_row_labels , list/range_col_names   ]    # by row label and column names
dataframe.iloc[ list/range_row_numbers,    list/range_col_numbers ]    # by row number

By Index and Column Name (loc)

df.loc[ [101,103], ['name', 'year1'] ]  # by list of row label and column names
df.loc[  101:104 ,  'year1':'year2'  ]  # by range of row label and column names
##         name  year1
## empID              
## 101      Bob   2017
## 103    David   2018
##        year1  year2
## empID              
## 101     2017   2027
## 102     2017   2027
## 103     2018   2028
## 104     2018   2028

By Boolean Row and Column Names (loc)

b = df.year1==2017
df.loc[ b ]
##        year1  year2  salary     name
## empID                               
## 100     2017   2027   40000    Alice
## 101     2017   2027   24000      Bob
## 102     2017   2027   31000  Charles

By Row and Column Number (iloc)

print (df.iloc[ [1,4], [0,3]],'\n' )   # by individual rows/columns
##        year1  name
## empID             
## 101     2017   Bob
## 104     2018  Eric
print (df.iloc[  1:4 ,  0:3], '\n')    # by range
##        year1  year2  salary
## empID                      
## 101     2017   2027   24000
## 102     2017   2027   31000
## 103     2018   2028   20000

11.7.11 Chained Indexing

Chained indexing may create a copy of the dataframe; any modification made through the chain lands on the copy and does not affect the original dataframe

dataframe.loc  [...]  [...]
dataframe.iloc [...]  [...]

Suggestion: never use chained indexing for assignment (a correct single-step alternative is sketched after this example)

df = pd.DataFrame(
    { 'empID':  [100,      101,    102,      103,     104],
      'year1':   [2017,     2017,   2017,      2018,    2018],
      'name':   ['Alice',  'Bob',  'Charles','David', 'Eric'],
      'year2':   [2001,     1907,   2003,      1998,    2011],
      'salary': [40000,    24000,  31000,     20000,   30000]},
    columns = ['year1','salary','year2','empID','name']).set_index(['empID'])
df
##        year1  salary  year2     name
## empID                               
## 100     2017   40000   2001    Alice
## 101     2017   24000   1907      Bob
## 102     2017   31000   2003  Charles
## 103     2018   20000   1998    David
## 104     2018   30000   2011     Eric
df.loc[100]['year1'] = 2000
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df  ## notice row label 100 has not been updated, because the assignment landed on a copy due to chained indexing
##        year1  salary  year2     name
## empID                               
## 100     2017   40000   2001    Alice
## 101     2017   24000   1907      Bob
## 102     2017   31000   2003  Charles
## 103     2018   20000   1998    David
## 104     2018   30000   2011     Eric
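The warning-free way to update a cell is a single indexing call; .loc with both row and column labels writes to the original DataFrame directly. A minimal sketch:

df.loc[100, 'year1'] = 2000  ## single-step assignment; df is updated in place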

11.7.12 Find and Replace

Slicing deals with rectangular selections of cells. Use mask() or where() to conditionally replace specific cell(s); these functions respect column and row labels and preserve the frame's shape.

mask()

mask() replaces a cell's value with other= when the condition is met.

data={'x': [1,4,7],
      'y': [2,5,8],
      'z': [3,6,9]}
df = pd.DataFrame(data)
df.mask(df>4, other=999)
##      x    y    z
## 0    1    2    3
## 1    4  999  999
## 2  999  999  999

where()

This is the reverse of mask(): it keeps the cell value where the condition is met, and replaces it with other= where it is not.

df.where(df>4, other=0)
##    x  y  z
## 0  0  0  0
## 1  0  5  6
## 2  7  8  9
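other= is not limited to scalars; it may also be another DataFrame of the same shape (or a callable), evaluated element-wise. A minimal sketch that negates the masked cells instead of using a constant:

df.mask(df > 4, other=-df)  ## cells > 4 are replaced by their negated value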

11.7.13 Iteration

Iterating Rows (.iterrows)

df = pd.DataFrame(data=
    { 'empID':  [100,      101,    102,      103,     104],
      'Name':   ['Alice',  'Bob',  'Charles','David', 'Eric'],
      'Year':   [1999,     1988,   2001,     2010,     2020]}).set_index(['empID'])

for idx, row in df.iterrows():
  print(idx, row.Name)
## 100 Alice
## 101 Bob
## 102 Charles
## 103 David
## 104 Eric
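For larger frames, .itertuples() is usually much faster than .iterrows(), because each row is returned as a lightweight namedtuple rather than a Series. A minimal sketch over the same df:

for row in df.itertuples():
    print(row.Index, row.Name)  ## Index carries the empID row label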

Iterating Columns (.items)

for label, content in df.items():
  print('Column Label:',              label,   '\n\n', 
        'Column Content (Series):\n', content, '\n\n')
## Column Label: Name 
## 
##  Column Content (Series):
##  empID
## 100      Alice
## 101        Bob
## 102    Charles
## 103      David
## 104       Eric
## Name: Name, dtype: object 
## 
## 
## Column Label: Year 
## 
##  Column Content (Series):
##  empID
## 100    1999
## 101    1988
## 102    2001
## 103    2010
## 104    2020
## Name: Year, dtype: int64

11.7.14 Data Structure

Inspect Structure

Find out the column names and data types in one summary. The output is printed for display only; it is not a data object

df.info()  # return text output
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 5 entries, 100 to 104
## Data columns (total 2 columns):
##  #   Column  Non-Null Count  Dtype 
## ---  ------  --------------  ----- 
##  0   Name    5 non-null      object
##  1   Year    5 non-null      int64 
## dtypes: int64(1), object(1)
## memory usage: 120.0+ bytes
df.dtypes.value_counts() # return Series
## object    1
## int64     1
## dtype: int64

Format Conversion

.to_dict()

df.to_dict('dict')    ## dict of dict by column
## {'Name': {100: 'Alice', 101: 'Bob', 102: 'Charles', 103: 'David', 104: 'Eric'}, 'Year': {100: 1999, 101: 1988, 102: 2001, 103: 2010, 104: 2020}}
df.to_dict('index')   ## dict of dict by row index
## {100: {'Name': 'Alice', 'Year': 1999}, 101: {'Name': 'Bob', 'Year': 1988}, 102: {'Name': 'Charles', 'Year': 2001}, 103: {'Name': 'David', 'Year': 2010}, 104: {'Name': 'Eric', 'Year': 2020}}
df.to_dict('list')    ## dict of list
## {'Name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'], 'Year': [1999, 1988, 2001, 2010, 2020]}
df.to_dict('records') ## list of dict
## [{'Name': 'Alice', 'Year': 1999}, {'Name': 'Bob', 'Year': 1988}, {'Name': 'Charles', 'Year': 2001}, {'Name': 'David', 'Year': 2010}, {'Name': 'Eric', 'Year': 2020}]

11.7.15 Import/Export

CSV

df.to_csv()
## 'empID,Name,Year\r\n100,Alice,1999\r\n101,Bob,1988\r\n102,Charles,2001\r\n103,David,2010\r\n104,Eric,2020\r\n'
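Passing a path writes a file instead of returning a string; pd.read_csv() reads it back. A minimal sketch; the file name is arbitrary:

df.to_csv('employees.csv')          ## write to file, index (empID) included
df2 = pd.read_csv('employees.csv')  ## read back; empID comes in as a normal column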

11.8 Class: MultiIndex

A MultiIndex gives columns (or rows) several levels of headers.

11.8.1 The Data

df = pd.DataFrame({
     'myindex': [0, 1, 2],
     'One_X':   [1.1,  1.1,  1.1],
     'One_Y':   [1.2,  1.2,  1.2],
     'Two_X':   [1.11, 1.11, 1.11],
     'Two_Y':   [1.22, 1.22, 1.22]})
df.set_index('myindex',inplace=True)
df
##          One_X  One_Y  Two_X  Two_Y
## myindex                            
## 0          1.1    1.2   1.11   1.22
## 1          1.1    1.2   1.11   1.22
## 2          1.1    1.2   1.11   1.22

11.8.2 Creating MultiIndex Object

Create From Tuples

A MultiIndex can easily be created from tuples:
- Step 1: Create a MultiIndex object by splitting each column name into a tuple.
- Step 2: Assign the MultiIndex object to the dataframe's columns property.

my_tuples = [tuple(c.split('_')) for c in df.columns]
df.columns = pd.MultiIndex.from_tuples(my_tuples)

print(' Column Headers :\n\n',           my_tuples,
        '\n\nNew Columns: \n\n',         df.columns,
        '\n\nTwo Layers Header DF:\n\n', df)
##  Column Headers :
## 
##  [('One', 'X'), ('One', 'Y'), ('Two', 'X'), ('Two', 'Y')] 
## 
## New Columns: 
## 
##  MultiIndex([('One', 'X'),
##             ('One', 'Y'),
##             ('Two', 'X'),
##             ('Two', 'Y')],
##            ) 
## 
## Two Layers Header DF:
## 
##           One        Two      
##            X    Y     X     Y
## myindex                      
## 0        1.1  1.2  1.11  1.22
## 1        1.1  1.2  1.11  1.22
## 2        1.1  1.2  1.11  1.22

11.8.3 MultiIndex Object

Levels

  • A MultiIndex object contains multiple levels; each level (header) is an Index object.
  • Use MultiIndex.get_level_values() to retrieve the entire header for the desired level.
print(df.columns.get_level_values(0), '\n',
      df.columns.get_level_values(1))
## Index(['One', 'One', 'Two', 'Two'], dtype='object') 
##  Index(['X', 'Y', 'X', 'Y'], dtype='object')

MultiIndex.levels returns the unique values of each level.

print(df.columns.levels[0], '\n',
      df.columns.levels[1])
## Index(['One', 'Two'], dtype='object') 
##  Index(['X', 'Y'], dtype='object')

Convert MultiIndex Back To Tuples

df.columns.to_list()
## [('One', 'X'), ('One', 'Y'), ('Two', 'X'), ('Two', 'Y')]
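The same tuples make it easy to flatten a two-level header back into single strings, e.g. to restore the original 'One_X' style names. A minimal sketch:

flat = ['_'.join(t) for t in df.columns]  ## ['One_X', 'One_Y', 'Two_X', 'Two_Y']
# df.columns = flat                       ## assign back to collapse to one level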

11.8.4 Selecting Column(s)

Sample Data

import itertools

### Create The Columns Tuple
level0_sex = ['Male','Female','Pondan']
level1_age = ['Medium','High','Low']
my_columns = list(itertools.product(level0_sex, level1_age))

test_df = pd.DataFrame([
             [1,2,3,4,5,6,7,8,9],
             [11,12,13,14,15,16,17,18,19],
             [21,22,23,24,25,26,27,28,29]], index=['row1','row2','row3'])

### Create Multiindex From Tuple
test_df.columns = pd.MultiIndex.from_tuples(my_columns)
print( test_df ) 
##        Male          Female          Pondan         
##      Medium High Low Medium High Low Medium High Low
## row1      1    2   3      4    5   6      7    8   9
## row2     11   12  13     14   15  16     17   18  19
## row3     21   22  23     24   25  26     27   28  29

Select Level0 Header(s)

Use [L0] notation, where L0 is a single level 0 header name or a list of names

print( test_df[['Male','Pondan']] ,'\n\n',  ## Include multiple Level0 Header
       test_df['Male'] ,          '\n\n',   ## Include single Level0 Header
       test_df.Male )                       ## Same as above
##        Male          Pondan         
##      Medium High Low Medium High Low
## row1      1    2   3      7    8   9
## row2     11   12  13     17   18  19
## row3     21   22  23     27   28  29 
## 
##        Medium  High  Low
## row1       1     2    3
## row2      11    12   13
## row3      21    22   23 
## 
##        Medium  High  Low
## row1       1     2    3
## row2      11    12   13
## row3      21    22   23

Using .loc[]

Use .loc[ :, L0 ], where L0 is a list of header names (or a single name)

print( test_df.loc[:, ['Male','Pondan']] , '\n\n',  ## Multiple Level0 Header
       test_df.loc[:, 'Male'] )                     ## Single Level0 Header
##        Male          Pondan         
##      Medium High Low Medium High Low
## row1      1    2   3      7    8   9
## row2     11   12  13     17   18  19
## row3     21   22  23     27   28  29 
## 
##        Medium  High  Low
## row1       1     2    3
## row2      11    12   13
## row3      21    22   23

Selecting Level 1 Header(s)

Use .loc[ :, (All, L1) ], where L1 is a list of level 1 header names and All = slice(None)

All = slice(None)
print( test_df.loc[ : , (All, 'High')],  '\n\n',  ## Single L1 header
       test_df.loc[ : , (All, ['High','Low'])] )  ## Multiple L1 headers
##      Male Female Pondan
##      High   High   High
## row1    2      5      8
## row2   12     15     18
## row3   22     25     28 
## 
##       Male Female Pondan Male Female Pondan
##      High   High   High  Low    Low    Low
## row1    2      5      8    3      6      9
## row2   12     15     18   13     16     19
## row3   22     25     28   23     26     29
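pandas also provides pd.IndexSlice, a more readable alternative to slice(None) for multi-level slicing. A minimal sketch equivalent to the selection above:

idx = pd.IndexSlice
test_df.loc[:, idx[:, 'High']]  ## same result as (All, 'High')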

Select Level 0 and Level1 Headers

Use .loc[ :, (L0, L1) ], where L0 and L1 are lists of header names

test_df.loc[ : , (['Male','Pondan'], ['Medium','High'])]
##        Male      Pondan     
##      Medium High Medium High
## row1      1    2      7    8
## row2     11   12     17   18
## row3     21   22     27   28

Select single L0,L1 Header

Use .loc[ :, (L0, L1) ]; the result is a Series.
Use .loc[ :, (L0, [L1]) ]; the result is a DataFrame.

print( test_df.loc[ : , ('Female', 'High')], '\n\n',
       test_df.loc[ : , ('Female', ['High'])])
## row1     5
## row2    15
## row3    25
## Name: (Female, High), dtype: int64 
## 
##       Female
##        High
## row1      5
## row2     15
## row3     25

11.8.5 Headers Ordering

Note that the column order specified by [ ] selection is not respected. This can be remedied either by sorting or by rearranging.

Sort Headers

Use .sort_index() on the DataFrame to sort the headers. Note that when level 1 is sorted, it jumbles up the level 0 headers.

test_df_sorted_l0 = test_df.sort_index(axis=1, level=0)
test_df_sorted_l1 = test_df.sort_index(axis=1, level=1, ascending=False)
print(test_df, '\n\n',test_df_sorted_l0, '\n\n', test_df_sorted_l1)
##        Male          Female          Pondan         
##      Medium High Low Medium High Low Medium High Low
## row1      1    2   3      4    5   6      7    8   9
## row2     11   12  13     14   15  16     17   18  19
## row3     21   22  23     24   25  26     27   28  29 
## 
##       Female            Male            Pondan           
##        High Low Medium High Low Medium   High Low Medium
## row1      5   6      4    2   3      1      8   9      7
## row2     15  16     14   12  13     11     18  19     17
## row3     25  26     24   22  23     21     28  29     27 
## 
##       Pondan   Male Female Pondan Male Female Pondan Male Female
##      Medium Medium Medium    Low  Low    Low   High High   High
## row1      7      1      4      9    3      6      8    2      5
## row2     17     11     14     19   13     16     18   12     15
## row3     27     21     24     29   23     26     28   22     25

Rearranging Headers

Use .reindex() to arrange columns in a specific order. The example below shows how to control the order of the level 1 headers.

cats = ['Low','Medium','High']
test_df.reindex(cats, level=1, axis=1)
##      Male             Female             Pondan            
##       Low Medium High    Low Medium High    Low Medium High
## row1    3      1    2      6      4    5      9      7    8
## row2   13     11   12     16     14   15     19     17   18
## row3   23     21   22     26     24   25     29     27   28

11.8.6 Stacking and Unstacking

df.stack()
##            One   Two
## myindex             
## 0       X  1.1  1.11
##         Y  1.2  1.22
## 1       X  1.1  1.11
##         Y  1.2  1.22
## 2       X  1.1  1.11
##         Y  1.2  1.22

Stacking Columns to Rows

Stacking with DataFrame.stack(level_no) moves the chosen column header level down into the row index.

print('Stacking Header Level 0: \n\n', df.stack(0),
      '\n\nStacking Header Level 1: \n\n', df.stack(1))
## Stacking Header Level 0: 
## 
##                  X     Y
## myindex                
## 0       One  1.10  1.20
##         Two  1.11  1.22
## 1       One  1.10  1.20
##         Two  1.11  1.22
## 2       One  1.10  1.20
##         Two  1.11  1.22 
## 
## Stacking Header Level 1: 
## 
##             One   Two
## myindex             
## 0       X  1.1  1.11
##         Y  1.2  1.22
## 1       X  1.1  1.11
##         Y  1.2  1.22
## 2       X  1.1  1.11
##         Y  1.2  1.22
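.unstack() is the inverse: it moves the innermost row index level back up into the columns, so a stack followed by an unstack restores the original layout. A minimal sketch:

df.stack(1).unstack()  ## round trip back to the two-level column header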

11.8.7 Exploratory Analysis

Sample Data

df
##          One        Two      
##            X    Y     X     Y
## myindex                      
## 0        1.1  1.2  1.11  1.22
## 1        1.1  1.2  1.11  1.22
## 2        1.1  1.2  1.11  1.22

All Stats in One - .describe()

df.describe(include='number') # default
df.describe(include='object') # display for non-numeric columns
df.describe(include='all')    # display both numeric and non-numeric

When applied to a DataFrame object, describe() shows all basic statistics for all numeric columns:

  • Count (non-NA)
  • Unique (for string)
  • Top (for string)
  • Frequency (for string)
  • Percentile
  • Mean
  • Min / Max
  • Standard Deviation

For Numeric Columns only

You can customize the percentiles required. Notice the 0.5 percentile (median) is always included even when not specified

df.describe()
##        One        Two      
##          X    Y     X     Y
## count  3.0  3.0  3.00  3.00
## mean   1.1  1.2  1.11  1.22
## std    0.0  0.0  0.00  0.00
## min    1.1  1.2  1.11  1.22
## 25%    1.1  1.2  1.11  1.22
## 50%    1.1  1.2  1.11  1.22
## 75%    1.1  1.2  1.11  1.22
## max    1.1  1.2  1.11  1.22
df.describe(percentiles=[0.9,0.3,0.2,0.1])
##        One        Two      
##          X    Y     X     Y
## count  3.0  3.0  3.00  3.00
## mean   1.1  1.2  1.11  1.22
## std    0.0  0.0  0.00  0.00
## min    1.1  1.2  1.11  1.22
## 10%    1.1  1.2  1.11  1.22
## 20%    1.1  1.2  1.11  1.22
## 30%    1.1  1.2  1.11  1.22
## 50%    1.1  1.2  1.11  1.22
## 90%    1.1  1.2  1.11  1.22
## max    1.1  1.2  1.11  1.22

For both Numeric and Object

df.describe(include='all')
##        One        Two      
##          X    Y     X     Y
## count  3.0  3.0  3.00  3.00
## mean   1.1  1.2  1.11  1.22
## std    0.0  0.0  0.00  0.00
## min    1.1  1.2  1.11  1.22
## 25%    1.1  1.2  1.11  1.22
## 50%    1.1  1.2  1.11  1.22
## 75%    1.1  1.2  1.11  1.22
## max    1.1  1.2  1.11  1.22

min/max/mean/median

df.min()  # default axis=0, column-wise
## One  X    1.10
##      Y    1.20
## Two  X    1.11
##      Y    1.22
## dtype: float64
df.min(axis=1) # axis=1, row-wise
## myindex
## 0    1.1
## 1    1.1
## 2    1.1
## dtype: float64

Observe that sum on strings concatenates column-wise, whereas row-wise sums include only numeric fields

df.sum(0)
## One  X    3.30
##      Y    3.60
## Two  X    3.33
##      Y    3.66
## dtype: float64
df.sum(1)
## myindex
## 0    4.63
## 1    4.63
## 2    4.63
## dtype: float64

11.8.8 Plotting

11.9 Class: Categories

11.9.1 Creating

From List

Basic (Auto Category Mapping)

Basic syntax returns a Categorical whose categories are inferred from the data and sorted alphabetically; codes 0,1,2,… map to the sorted categories.
In this case, high(0), low(1), medium(2)

temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp)
temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['high', 'low', 'medium']
type( temp_cat )
## <class 'pandas.core.arrays.categorical.Categorical'>

Manual Category Mapping
During creation, we can specify the mapping of codes to categories: low(0), medium(1), high(2)

temp_cat = pd.Categorical(temp, categories=['low','medium','high'])
temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['low', 'medium', 'high']

From Series

  • We can ‘add’ categorical structure to a Series. With these methods, an additional property (.cat) is exposed as the categorical accessor.
  • Through this accessor, you gain access to various properties of the category such as .codes and .categories (but not .get_values(), as the values live in the Series itself).
  • Categories can also be mapped manually when creating the Series; see the sketch at the end of this section.
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Series(temp, dtype='category')
print (type(temp_cat))       # Series object
## <class 'pandas.core.series.Series'>
print (type(temp_cat.cat))   # Categorical Accessor
## <class 'pandas.core.arrays.categorical.CategoricalAccessor'>
  • The method below achieves the same result by using .astype('category').
  • It is useful for adding category structure to an existing series.
temp_ser = pd.Series(temp)
temp_cat = pd.Series(temp).astype('category')
print (type(temp_cat))       # Series object
## <class 'pandas.core.series.Series'>
print (type(temp_cat.cat))   # Categorical Accessor
## <class 'pandas.core.arrays.categorical.CategoricalAccessor'>
temp_cat.cat.categories
## Index(['high', 'low', 'medium'], dtype='object')
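To answer the question above: yes, the category mapping (and ordering) can be specified manually when building a Series, by passing a pd.CategoricalDtype. A minimal sketch:

cat_type = pd.CategoricalDtype(categories=['low','medium','high'], ordered=True)
temp_cat2 = pd.Series(temp, dtype=cat_type)
temp_cat2.cat.categories  ## Index(['low', 'medium', 'high'], dtype='object')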

11.9.1.1 Ordering Category

temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp, categories=['low','medium','high'], ordered=True)
temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['low' < 'medium' < 'high']
temp_cat.codes
## array([0, 2, 1, 2, 2, 0, 1, 1, 2], dtype=int8)
temp_cat[0] < temp_cat[3]
## False

11.9.2 Properties

.categories

first category’s code = 0
second category’s code = 1
third category’s code = 2

temp_cat.categories
## Index(['low', 'medium', 'high'], dtype='object')

.codes

Codes are the actual integer values, stored as an array; e.g. code 2 represents ‘high’.

temp_cat.codes
## array([0, 2, 1, 2, 2, 0, 1, 1, 2], dtype=int8)

11.9.3 Rename Category

Rename To New Category Object

The .rename_categories() method returns a new categorical object with the changed categories

temp = ['low','high','medium','high','high','low','medium','medium','high']
new_temp_cat = temp_cat.rename_categories(['sejuk','sederhana','panas'])
new_temp_cat 
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (3, object): ['sejuk' < 'sederhana' < 'panas']
temp_cat   # original category object categories not changed
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['low' < 'medium' < 'high']

11.9.3.1 Rename Inplace

Observe that the original categories are changed by assigning directly to the .categories property

temp_cat.categories = ['sejuk','sederhana','panas']
temp_cat   # original category object categories is changed
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (3, object): ['sejuk' < 'sederhana' < 'panas']

11.9.4 Adding New Category

This returns a new categorical object with the added categories

temp_cat_more = temp_cat.add_categories(['susah','senang'])
temp_cat_more
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (5, object): ['sejuk' < 'sederhana' < 'panas' < 'susah' < 'senang']

Removing Category

This is not in place; it returns a new categorical object

11.9.4.1 Remove Specific Categor(ies)

Elements whose category is removed become NaN

temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp)
temp_cat_removed = temp_cat.remove_categories('low')
temp_cat_removed
## [NaN, 'high', 'medium', 'high', 'high', NaN, 'medium', 'medium', 'high']
## Categories (2, object): ['high', 'medium']

Remove Unused Category

Since the removed categories are unused, the elements are unaffected

print (temp_cat_more)
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (5, object): ['sejuk' < 'sederhana' < 'panas' < 'susah' < 'senang']
temp_cat_more.remove_unused_categories()
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (3, object): ['sejuk' < 'sederhana' < 'panas']

11.9.5 Add and Remove Categories In One Step - set_categories()

temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp, ordered=True)
temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['high' < 'low' < 'medium']
temp_cat.set_categories(['low','medium','sederhana','susah','senang'])
## ['low', NaN, 'medium', NaN, NaN, 'low', 'medium', 'medium', NaN]
## Categories (5, object): ['low' < 'medium' < 'sederhana' < 'susah' < 'senang']

11.9.6 Categorical Descriptive Analysis

At One Glance

temp_cat.describe()
##             counts     freqs
## categories                  
## high             4  0.444444
## low              2  0.222222
## medium           3  0.333333

Frequency Count

temp_cat.value_counts()
## high      4
## low       2
## medium    3
## dtype: int64

Minimum Category, Maximum Category, and Most Frequent Category

( temp_cat.min(), temp_cat.max(), temp_cat.mode() )
## ('high', 'medium', ['high']
## Categories (3, object): ['high' < 'low' < 'medium'])
## 
## <string>:1: FutureWarning: Categorical.mode is deprecated and will be removed in a future version. Use Series.mode instead.

11.9.7 Other Methods

.get_values()

Since the actual values stored by a categorical object are integer codes, .get_values() returned the values translated from the .codes property. The method has been removed in recent pandas versions; use np.asarray(temp_cat) instead, or simply display the object:

temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['high' < 'low' < 'medium']

11.10 Dummies

  • get_dummies creates one column for each category
  • The underlying data can be string or pd.Categorical
  • It produces a new pd.DataFrame

11.10.1 Sample Data

df = pd.DataFrame (
    {'A': ['A1','A2','A3','A1','A3','A1'], 
     'B': ['B1','B2','B3','B1','B1','B3'],
     'C': ['C1','C2','C3','C1',np.nan,np.nan]})
df
##     A   B    C
## 0  A1  B1   C1
## 1  A2  B2   C2
## 2  A3  B3   C3
## 3  A1  B1   C1
## 4  A3  B1  NaN
## 5  A1  B3  NaN

11.10.2 Dummies on Array-Like Data

pd.get_dummies(df.A)
##    A1  A2  A3
## 0   1   0   0
## 1   0   1   0
## 2   0   0   1
## 3   1   0   0
## 4   0   0   1
## 5   1   0   0

11.10.3 Dummies on DataFrame (multiple columns)

All Columns

pd.get_dummies(df)
##    A_A1  A_A2  A_A3  B_B1  B_B2  B_B3  C_C1  C_C2  C_C3
## 0     1     0     0     1     0     0     1     0     0
## 1     0     1     0     0     1     0     0     1     0
## 2     0     0     1     0     0     1     0     0     1
## 3     1     0     0     1     0     0     1     0     0
## 4     0     0     1     1     0     0     0     0     0
## 5     1     0     0     0     0     1     0     0     0

Selected Columns

cols = ['A','B']
pd.get_dummies(df[cols])
##    A_A1  A_A2  A_A3  B_B1  B_B2  B_B3
## 0     1     0     0     1     0     0
## 1     0     1     0     0     1     0
## 2     0     0     1     0     0     1
## 3     1     0     0     1     0     0
## 4     0     0     1     1     0     0
## 5     1     0     0     0     0     1

11.10.4 Dummies with NaN

By default, NaN values are ignored

pd.get_dummies(df.C)
##    C1  C2  C3
## 0   1   0   0
## 1   0   1   0
## 2   0   0   1
## 3   1   0   0
## 4   0   0   0
## 5   0   0   0

Make NaN a dummy variable

pd.get_dummies(df.C,dummy_na=True)
##    C1  C2  C3  NaN
## 0   1   0   0    0
## 1   0   1   0    0
## 2   0   0   1    0
## 3   1   0   0    0
## 4   0   0   0    1
## 5   0   0   0    1

11.10.5 Specify Prefixes

pd.get_dummies(df.A, prefix='col')
##    col_A1  col_A2  col_A3
## 0       1       0       0
## 1       0       1       0
## 2       0       0       1
## 3       1       0       0
## 4       0       0       1
## 5       1       0       0
pd.get_dummies(df[cols], prefix=['colA','colB'])
##    colA_A1  colA_A2  colA_A3  colB_B1  colB_B2  colB_B3
## 0        1        0        0        1        0        0
## 1        0        1        0        0        1        0
## 2        0        0        1        0        0        1
## 3        1        0        0        1        0        0
## 4        0        0        1        1        0        0
## 5        1        0        0        0        0        1

11.10.6 Dropping First Column

  • Dummies cause a collinearity issue in regression, since one column is redundant.
  • Dropping the first column technically does not lose any information.
pd.get_dummies(df[cols],drop_first=True)
##    A_A2  A_A3  B_B2  B_B3
## 0     0     0     0     0
## 1     1     0     1     0
## 2     0     1     0     1
## 3     0     0     0     0
## 4     0     1     0     0
## 5     0     0     0     1

11.11 DataFrameGroupBy

  • groupby() is a DataFrame method; it returns a DataFrameGroupBy object
  • The DataFrameGroupBy object opens the door to dataframe aggregation and summarization
  • The DataFrameGroupBy object is a very flexible abstraction. In many ways, you can simply treat it as if it were a collection of DataFrames, and it does the difficult things under the hood

11.11.1 Sample Data

company = pd.read_csv('data/company.csv')
company
##    Company Department      Name  Age  Salary  Birthdate
## 0       C1         D1      Yong   45   15000   1/1/1970
## 1       C1         D1      Chew   35   12000   2/1/1980
## 2       C1         D2       Lim   34    8000  2/19/1977
## 3       C1         D3     Jessy   23    2500  3/15/1990
## 4       C1         D3  Hoi Ming   55   25000  4/15/1987
## ..     ...        ...       ...  ...     ...        ...
## 13      C3         D3     Chang   32    7900  7/26/1973
## 14      C3         D1       Ong   44   17500  8/21/1980
## 15      C3         D2      Lily   41   15300  7/17/1990
## 16      C3         D3     Sally   54   21000  7/19/1968
## 17      C3         D3    Esther   37   13500  3/16/1969
## 
## [18 rows x 6 columns]

11.11.2 Creating Groups

Groups can be created on a single column or on multiple columns

com_grp = company.groupby('Company') ## Single Column
com_dep_grp = company.groupby(['Company','Department'])  ## Multiple Column
type(com_dep_grp)
## <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

11.11.3 Properties

Number of Groups

com_dep_grp.ngroups
## 9

Row Numbers Association

The .groups property is a dictionary mapping each group key (identifying the group) to the underlying row indexes of that group

gdict = com_dep_grp.groups       # return Dictionary
print( gdict.keys()   , '\n\n',  # group identifier
       gdict.values()   )        # group row indexes
## dict_keys([('C1', 'D1'), ('C1', 'D2'), ('C1', 'D3'), ('C2', 'D1'), ('C2', 'D2'), ('C2', 'D3'), ('C3', 'D1'), ('C3', 'D2'), ('C3', 'D3')]) 
## 
##  dict_values([Int64Index([0, 1], dtype='int64'), Int64Index([2], dtype='int64'), Int64Index([3, 4, 5], dtype='int64'), Int64Index([6], dtype='int64'), Int64Index([7, 8, 9], dtype='int64'), Int64Index([10, 11, 12], dtype='int64'), Int64Index([14], dtype='int64'), Int64Index([15], dtype='int64'), Int64Index([13, 16, 17], dtype='int64')])

11.11.4 Methods

Number of Rows In Each Group

com_dep_grp.size()  # return panda Series object
## Company  Department
## C1       D1            2
##          D2            1
##          D3            3
## C2       D1            1
##          D2            3
##          D3            3
## C3       D1            1
##          D2            1
##          D3            3
## dtype: int64

11.11.5 Retrieve Rows

Retrieve n-th Row Of Each Group

  • Row numbers are 0-based.
  • For the first row, use .first() or .nth(0).
print( com_dep_grp.nth(0)  , '\n',
       com_dep_grp.first())
##                        Name  Age  Salary   Birthdate
## Company Department                                  
## C1      D1             Yong   45   15000    1/1/1970
##         D2              Lim   34    8000   2/19/1977
##         D3            Jessy   23    2500   3/15/1990
## C2      D1             Anne   18     400   7/15/1997
##         D2          Deborah   30    8600   8/15/1984
##         D3          Michael   38   17000  11/30/1997
## C3      D1              Ong   44   17500   8/21/1980
##         D2             Lily   41   15300   7/17/1990
##         D3            Chang   32    7900   7/26/1973 
##                         Name  Age  Salary   Birthdate
## Company Department                                  
## C1      D1             Yong   45   15000    1/1/1970
##         D2              Lim   34    8000   2/19/1977
##         D3            Jessy   23    2500   3/15/1990
## C2      D1             Anne   18     400   7/15/1997
##         D2          Deborah   30    8600   8/15/1984
##         D3          Michael   38   17000  11/30/1997
## C3      D1              Ong   44   17500   8/21/1980
##         D2             Lily   41   15300   7/17/1990
##         D3            Chang   32    7900   7/26/1973
  • For the last row, use .last() or .nth(-1).
print( com_dep_grp.nth(-1)  , '\n',
       com_dep_grp.last())
##                        Name  Age  Salary   Birthdate
## Company Department                                  
## C1      D1             Chew   35   12000    2/1/1980
##         D2              Lim   34    8000   2/19/1977
##         D3          Sui Wei   56    3000   6/15/1990
## C2      D1             Anne   18     400   7/15/1997
##         D2            Jimmy   46   14000  10/31/1988
##         D3          Bernard   29    9800   12/1/1963
## C3      D1              Ong   44   17500   8/21/1980
##         D2             Lily   41   15300   7/17/1990
##         D3           Esther   37   13500   3/16/1969 
##                         Name  Age  Salary   Birthdate
## Company Department                                  
## C1      D1             Chew   35   12000    2/1/1980
##         D2              Lim   34    8000   2/19/1977
##         D3          Sui Wei   56    3000   6/15/1990
## C2      D1             Anne   18     400   7/15/1997
##         D2            Jimmy   46   14000  10/31/1988
##         D3          Bernard   29    9800   12/1/1963
## C3      D1              Ong   44   17500   8/21/1980
##         D2             Lily   41   15300   7/17/1990
##         D3           Esther   37   13500   3/16/1969

Retrieve N Rows Of Each Group

The example below retrieves the first 2 rows of each group

com_dep_grp.head(2)
##    Company Department      Name  Age  Salary   Birthdate
## 0       C1         D1      Yong   45   15000    1/1/1970
## 1       C1         D1      Chew   35   12000    2/1/1980
## 2       C1         D2       Lim   34    8000   2/19/1977
## 3       C1         D3     Jessy   23    2500   3/15/1990
## 4       C1         D3  Hoi Ming   55   25000   4/15/1987
## ..     ...        ...       ...  ...     ...         ...
## 11      C2         D3   Jeannie   30   12500  12/31/1980
## 13      C3         D3     Chang   32    7900   7/26/1973
## 14      C3         D1       Ong   44   17500   8/21/1980
## 15      C3         D2      Lily   41   15300   7/17/1990
## 16      C3         D3     Sally   54   21000   7/19/1968
## 
## [14 rows x 6 columns]

Retrieve All Rows Of Specific Group

get_group() retrieves all rows within the specified group.

com_dep_grp.get_group(('C1','D3'))
##   Company Department      Name  Age  Salary  Birthdate
## 3      C1         D3     Jessy   23    2500  3/15/1990
## 4      C1         D3  Hoi Ming   55   25000  4/15/1987
## 5      C1         D3   Sui Wei   56    3000  6/15/1990

11.11.6 Single Statistic Per Group

count()

count() counts valid (non-null) data for each field within the group

com_dep_grp.count()  # return panda DataFrame object
##                     Name  Age  Salary  Birthdate
## Company Department                              
## C1      D1             2    2       2          2
##         D2             1    1       1          1
##         D3             3    3       3          3
## C2      D1             1    1       1          1
##         D2             3    3       3          3
##         D3             3    3       3          3
## C3      D1             1    1       1          1
##         D2             1    1       1          1
##         D3             3    3       3          3

sum()

This sums up all numeric columns for each group

com_dep_grp.sum()
##                     Age  Salary
## Company Department             
## C1      D1           80   27000
##         D2           34    8000
##         D3          134   30500
## C2      D1           18     400
##         D2          127   34600
##         D3           97   39300
## C3      D1           44   17500
##         D2           41   15300
##         D3          123   42400

To sum specific columns of each group, use ['columnName'] to select the column.
When a single column is selected, the output is a Series.

com_dep_grp['Age'].sum()
## Company  Department
## C1       D1             80
##          D2             34
##          D3            134
## C2       D1             18
##          D2            127
##          D3             97
## C3       D1             44
##          D2             41
##          D3            123
## Name: Age, dtype: int64

mean()

This averages all numeric columns for each group

com_dep_grp.mean()
##                           Age        Salary
## Company Department                         
## C1      D1          40.000000  13500.000000
##         D2          34.000000   8000.000000
##         D3          44.666667  10166.666667
## C2      D1          18.000000    400.000000
##         D2          42.333333  11533.333333
##         D3          32.333333  13100.000000
## C3      D1          44.000000  17500.000000
##         D2          41.000000  15300.000000
##         D3          41.000000  14133.333333

To average specific columns of each group, use ['columnName'] to select the column.
When a single column is selected, the output is a Series.

com_dep_grp['Age'].mean()
## Company  Department
## C1       D1            40.000000
##          D2            34.000000
##          D3            44.666667
## C2       D1            18.000000
##          D2            42.333333
##          D3            32.333333
## C3       D1            44.000000
##          D2            41.000000
##          D3            41.000000
## Name: Age, dtype: float64

11.11.7 Multi Statistic Per Group

Single Function To Column(s)

  • Instructions for aggregation are provided in the form of a dictionary. Each dictionary key specifies a column name, and its value the function to run.
  • Use lambda x: to customize the calculation on the entire column (x).
  • Python built-in functions (such as sum or max) can be supplied directly, without wrapping the name in a string.
com_dep_grp.agg({
  'Age': sum ,                 ## Total age of the group
  'Salary': lambda x: max(x),  ## Highest salary of the group
  'Birthdate': 'first'         ## First birthday of the group
})
##                     Age  Salary   Birthdate
## Company Department                         
## C1      D1           80   15000    1/1/1970
##         D2           34    8000   2/19/1977
##         D3          134   25000   3/15/1990
## C2      D1           18     400   7/15/1997
##         D2          127   14000   8/15/1984
##         D3           97   17000  11/30/1997
## C3      D1           44   17500   8/21/1980
##         D2           41   15300   7/17/1990
##         D3          123   21000   7/26/1973

Multiple Functions to Column(s)

  • Use a list of function names to specify several functions to be applied to a particular column.
  • Notice that the output columns are a MultiIndex, with the names of the applied functions on level 1.
ag = com_dep_grp.agg({
      'Age': ['mean', sum ],       ## Average age of the group
      'Salary': lambda x: max(x),  ## Highest salary of the group
      'Birthdate': 'first'         ## First birthday of the group
    })
    
print (ag, '\n\n', ag.columns)
##                           Age        Salary   Birthdate
##                          mean  sum <lambda>       first
## Company Department                                     
## C1      D1          40.000000   80    15000    1/1/1970
##         D2          34.000000   34     8000   2/19/1977
##         D3          44.666667  134    25000   3/15/1990
## C2      D1          18.000000   18      400   7/15/1997
##         D2          42.333333  127    14000   8/15/1984
##         D3          32.333333   97    17000  11/30/1997
## C3      D1          44.000000   44    17500   8/21/1980
##         D2          41.000000   41    15300   7/17/1990
##         D3          41.000000  123    21000   7/26/1973 
## 
##  MultiIndex([(      'Age',     'mean'),
##             (      'Age',      'sum'),
##             (   'Salary', '<lambda>'),
##             ('Birthdate',    'first')],
##            )

Column Relabeling

Introduced in pandas 0.25.0, groupby aggregation with relabeling is supported using “named aggregation” with simple tuples

com_dep_grp.agg(
  max_age     = ('Age', max),
  salary_m100 = ('Salary',  lambda x: max(x)+100),  
  first_bd    = ('Birthdate', 'first')
)
##                     max_age  salary_m100    first_bd
## Company Department                                  
## C1      D1               45        15100    1/1/1970
##         D2               34         8100   2/19/1977
##         D3               56        25100   3/15/1990
## C2      D1               18          500   7/15/1997
##         D2               51        14100   8/15/1984
##         D3               38        17100  11/30/1997
## C3      D1               44        17600   8/21/1980
##         D2               41        15400   7/17/1990
##         D3               54        21100   7/26/1973

11.11.8 Iteration

A DataFrameGroupBy object can be thought of as a collection of named groups that can be iterated over

def print_groups (g):
    for name,group in g:
        print (name)
        print (group[:2])
        
print_groups (com_grp)
## C1
##   Company Department  Name  Age  Salary Birthdate
## 0      C1         D1  Yong   45   15000  1/1/1970
## 1      C1         D1  Chew   35   12000  2/1/1980
## C2
##   Company Department     Name  Age  Salary  Birthdate
## 6      C2         D1     Anne   18     400  7/15/1997
## 7      C2         D2  Deborah   30    8600  8/15/1984
## C3
##    Company Department   Name  Age  Salary  Birthdate
## 13      C3         D3  Chang   32    7900  7/26/1973
## 14      C3         D1    Ong   44   17500  8/21/1980
com_grp
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017D5D10C4C0>

11.11.9 Transform

  • Transform is an operation used in combination with a DataFrameGroupBy object
  • transform() returns a new DataFrame object
grp = company.groupby('Company')
grp.size()
## Company
## C1    6
## C2    7
## C3    5
## dtype: int64

transform() applies a function per group, then expands and replicates the result across the rows of the original DataFrame (a practical use is sketched at the end of this section)

grp[['Age','Salary']].transform('sum')
##     Age  Salary
## 0   248   65500
## 1   248   65500
## 2   248   65500
## 3   248   65500
## 4   248   65500
## ..  ...     ...
## 13  208   75200
## 14  208   75200
## 15  208   75200
## 16  208   75200
## 17  208   75200
## 
## [18 rows x 2 columns]
grp.transform( lambda x:x+10 )
##     Age  Salary
## 0    55   15010
## 1    45   12010
## 2    44    8010
## 3    33    2510
## 4    65   25010
## ..  ...     ...
## 13   42    7910
## 14   54   17510
## 15   51   15310
## 16   64   21010
## 17   47   13510
## 
## [18 rows x 2 columns]
## 
## <string>:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function.
## <string>:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function.
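A common practical use of transform() is attaching a group-level statistic to every row, so each row can be compared against its group. A minimal sketch; the column name CompanyAvgSalary is arbitrary:

company['CompanyAvgSalary'] = grp['Salary'].transform('mean')
company[['Company','Name','Salary','CompanyAvgSalary']].head()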

11.12 Fundamental Analysis

11.13 Missing Data

11.13.1 Sample Data

df = pd.DataFrame( np.random.randn(5, 3), 
                   index   =['a', 'c', 'e', 'f', 'h'],
                   columns =['one', 'two', 'three'])
df['four'] = 'bar'
df['five'] = df['one'] > 0
#df
df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
##         one       two     three four   five
## a -0.155909 -0.501790  0.235569  bar  False
## b       NaN       NaN       NaN  NaN    NaN
## c -1.763605 -1.095862 -1.087766  bar  False
## d       NaN       NaN       NaN  NaN    NaN
## e -0.305170 -0.473748 -0.200595  bar  False
## f  0.355197  0.689518  0.410590  bar   True
## g       NaN       NaN       NaN  NaN    NaN
## h -0.564978  0.599391 -0.162936  bar  False

How Much Missing Data Is In Each Column?

Note that the reindex above was displayed but not assigned back, so df itself still has no missing values; the counts below are therefore all zero.

df.count()
## one      5
## two      5
## three    5
## four     5
## five     5
## dtype: int64
len(df.index) - df.count()
## one      0
## two      0
## three    0
## four     0
## five     0
## dtype: int64
df.isnull()
##      one    two  three   four   five
## a  False  False  False  False  False
## c  False  False  False  False  False
## e  False  False  False  False  False
## f  False  False  False  False  False
## h  False  False  False  False  False
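The idiomatic one-liner for counting missing values per column is chaining .isnull() with .sum(). A minimal sketch, assigning the reindexed frame to df2 so it actually contains NaN rows:

df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2.isnull().sum()  ## 3 missing per column (rows b, d and g)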
df.describe()
##             one       two     three
## count  5.000000  5.000000  5.000000
## mean  -0.486893 -0.156498 -0.161028
## std    0.788635  0.772882  0.579752
## min   -1.763605 -1.095862 -1.087766
## 25%   -0.564978 -0.501790 -0.200595
## 50%   -0.305170 -0.473748 -0.162936
## 75%   -0.155909  0.599391  0.235569
## max    0.355197  0.689518  0.410590