Chapter 11 pandas
11.1 Modules Import
import pandas as pd
## Other Libraries
import numpy as np
import datetime as dt
from datetime import datetime
from datetime import date
11.2 Pandas Objects
11.2.1 Types of Objects
- pandas.Timestamp
- pandas.Timedelta
- pandas.Period
- pandas.Interval
- pandas.DateTimeIndex
- pandas.DataFrame
- pandas.Series
11.2.2 Series and DataFrame
Type | Dimension | Size | Value | Constructor |
---|---|---|---|---|
Series | 1 | Immutable | Mutable | pandas.Series( data, index, dtype, copy) |
DataFrame | 2 | Mutable | Mutable | pandas.DataFrame( data, index, columns, dtype, copy) |
Panel | 3 | Mutable | Mutable | pandas.Panel( data, items, major_axis, minor_axis, dtype, copy) |
- data: can be ndarray, list, constants
- index: must be unique and same length as data. Can be integer or string
- dtype: if none, it will be inferred
- copy: copy data. Default False
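Below is a minimal sketch (not from the original text) showing the index, dtype and copy parameters in action; arr and s are illustrative names.
arr = np.array([1, 2, 3])
s = pd.Series(arr, index=['a','b','c'], dtype='float64', copy=True)
arr[0] = 99   ## modify the source after construction
s             ## unaffected, because copy=True
## a    1.0
## b    2.0
## c    3.0
## dtype: float64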
11.3 Class Method
11.3.1 Conversion
11.3.1.1 Convert To Timestamp
Use to_datetime() to create various Timestamp-related objects:
- Convert list of dates to DateTimeIndex
- Convert list of dates to Series of Timestamps
- Convert single date into Timestamp
- Source can be string, date, datetime object
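As a small supplementary sketch (my addition, not the author's example), to_datetime() also accepts a Series, and errors='coerce' turns unparseable values into NaT:
pd.to_datetime(pd.Series(['2011-01-03', 'not-a-date']), errors='coerce')
## 0   2011-01-03
## 1          NaT
## dtype: datetime64[ns]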
Create Single Timestamp object
pd.to_datetime('2011-01-03')
pd.to_datetime(date(2018,4,13))
pd.to_datetime(datetime(2018,3,1,7,30))
## Timestamp('2011-01-03 00:00:00')
## Timestamp('2018-04-13 00:00:00')
## Timestamp('2018-03-01 07:30:00')
11.3.1.2 Convert To DateTimeIndex
data = ['2011-01-03',            # from string
        date(2018,4,13),         # from date
        datetime(2018,3,1,7,30)] # from datetime
dti = pd.to_datetime(data)  ## Convert into DateTimeIndex

type(dti)      ## DateTimeIndex
dti.dtype      ## <M8[ns]
type(dti[1])   ## Elements are Timestamp objects
## <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
## dtype('<M8[ns]')
## <class 'pandas._libs.tslibs.timestamps.Timestamp'>
11.3.2 Creating DateTimeIndex
Use date_range() to automatically generate a range of Timestamps based on start/end/interval. The result is a DateTimeIndex object which contains the range of Timestamps.
11.3.2.1 Hourly
If the start time is not specified, it defaults to 00:00:00.
If the start time is specified, it will be honored on all subsequent Timestamp elements.
When both start and end are specified, the sequence will automatically distribute Timestamps according to the frequency.
print(
    pd.date_range('2018-01-01', periods=3, freq='H'),
    pd.date_range(datetime(2018,1,1,12,30), periods=3, freq='H'),
    pd.date_range(start='2018-01-03-1230', end='2018-01-03-18:30', freq='H'))
## DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00'], dtype='datetime64[ns]', freq='H') DatetimeIndex(['2018-01-01 12:30:00', '2018-01-01 13:30:00', '2018-01-01 14:30:00'], dtype='datetime64[ns]', freq='H') DatetimeIndex(['2018-01-03 12:30:00', '2018-01-03 13:30:00', '2018-01-03 14:30:00',
## '2018-01-03 15:30:00', '2018-01-03 16:30:00', '2018-01-03 17:30:00',
## '2018-01-03 18:30:00'],
## dtype='datetime64[ns]', freq='H')
11.3.2.2 Daily
When the frequency is Day and the time is not specified, the output contains dates only.
When the time is specified, the output will honor the time.
pd.date_range(date(2018,1,2), periods=3, freq='D'),
pd.date_range('2018-01-01-1230', periods=4, freq='D')
## (DatetimeIndex(['2018-01-02', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='D'),)
## DatetimeIndex(['2018-01-01 12:30:00', '2018-01-02 12:30:00', '2018-01-03 12:30:00',
## '2018-01-04 12:30:00'],
## dtype='datetime64[ns]', freq='D')
11.3.2.3 First Day Of Month
Use freq='MS', where M stands for monthly and S stands for Start. If a day is specified, the sequence starts from the first day of the following month.
pd.date_range('2018-01', periods=4, freq='MS'),
pd.date_range('2018-01-09', periods=4, freq='MS'),
pd.date_range('2018-01-09 12:30:00', periods=4, freq='MS')
## (DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01'], dtype='datetime64[ns]', freq='MS'),)
## (DatetimeIndex(['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01'], dtype='datetime64[ns]', freq='MS'),)
## DatetimeIndex(['2018-02-01 12:30:00', '2018-03-01 12:30:00', '2018-04-01 12:30:00',
## '2018-05-01 12:30:00'],
## dtype='datetime64[ns]', freq='MS')
11.3.2.4 Last Day of Month
Sequence always starts from the end of the specified month.
pd.date_range('2018-01', periods=4, freq='M')
pd.date_range('2018-01-09', periods=4, freq='M')
pd.date_range('2018-01-09 12:30:00', periods=4, freq='M')
## DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='M')
## DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='M')
## DatetimeIndex(['2018-01-31 12:30:00', '2018-02-28 12:30:00', '2018-03-31 12:30:00',
## '2018-04-30 12:30:00'],
## dtype='datetime64[ns]', freq='M')
11.3.3 Frequency Table (crosstab)
crosstab() returns a DataFrame object.
crosstab( index = <SeriesObj>, columns = <new_colName> )                  # one dimension table
crosstab( index = <SeriesObj>, columns = <SeriesObj> )                    # two dimension table
crosstab( index = <SeriesObj>, columns = [<SeriesObj1>, <SeriesObj2>] )   # multi dimension table
crosstab( index = <SeriesObj>, columns = <SeriesObj>, margins=True )      # add column and row margins
Sample Data
n = 200
comp = ['C' + i for i in np.random.randint( 1,4, size = n).astype(str)]  # 3x Company
dept = ['D' + i for i in np.random.randint( 1,6, size = n).astype(str)]  # 5x Department
grp  = ['G' + i for i in np.random.randint( 1,3, size = n).astype(str)]  # 2x Groups
value1 = np.random.normal( loc=50 , scale=5 , size = n)
value2 = np.random.normal( loc=20 , scale=3 , size = n)
value3 = np.random.normal( loc=5  , scale=30, size = n)

mydf = pd.DataFrame({
    'comp':comp,
    'dept':dept,
    'grp': grp,
    'value1':value1,
    'value2':value2,
    'value3':value3 })
mydf.head()
## comp dept grp value1 value2 value3
## 0 C1 D4 G2 45.422995 22.467314 -31.945606
## 1 C3 D5 G1 47.139210 20.328063 55.326412
## 2 C2 D3 G2 43.643161 25.207357 12.574508
## 3 C3 D3 G2 50.145878 20.179638 -0.896219
## 4 C3 D4 G2 58.134735 21.226297 -22.531270
One Dimension Frequency Table
pd.crosstab(index=mydf.comp, columns='counter')
## col_0 counter
## comp
## C1 82
## C2 61
## C3 57
Two Dimension Frequency Table
Row dimension uses comp variable, Column dimension uses dept variable.
pd.crosstab(index=mydf.comp, columns=mydf.dept)
## dept D1 D2 D3 D4 D5
## comp
## C1 16 20 16 18 12
## C2 16 11 13 12 9
## C3 15 10 7 17 8
Higher Dimension Table
The crosstab header is a multi-level index when more than one column is specified. The multilevel index consists of all permutations of the grp and dept variables.
tb = pd.crosstab(index=mydf.comp, columns=[mydf.dept, mydf.grp])
tb.columns   ## observe two levels of headers generated
tb           ## two levels of headers
## MultiIndex([('D1', 'G1'),
## ('D1', 'G2'),
## ('D2', 'G1'),
## ('D2', 'G2'),
## ('D3', 'G1'),
## ('D3', 'G2'),
## ('D4', 'G1'),
## ('D4', 'G2'),
## ('D5', 'G1'),
## ('D5', 'G2')],
## names=['dept', 'grp'])
## dept D1 D2 D3 D4 D5
## grp G1 G2 G1 G2 G1 G2 G1 G2 G1 G2
## comp
## C1 7 9 9 11 9 7 9 9 7 5
## C2 12 4 4 7 8 5 6 6 5 4
## C3 8 7 3 7 4 3 7 10 6 2
Select sub-dataframe using multi-level referencing.
print( 'Under D2:\n', tb['D2'], '\n\n',
'Under D2-G2:\n',tb['D2','G1'])
## Under D2:
## grp G1 G2
## comp
## C1 9 11
## C2 4 7
## C3 3 7
##
## Under D2-G2:
## comp
## C1 9
## C2 4
## C3 3
## Name: (D2, G1), dtype: int64
Getting Margin
Extend the crosstab with margins=True to have sums of rows/columns, presented in a new column/row named ‘All’.
tb = pd.crosstab(index=mydf.dept, columns=mydf.grp, margins=True)
tb
## grp G1 G2 All
## dept
## D1 27 20 47
## D2 16 25 41
## D3 21 15 36
## D4 22 25 47
## D5 18 11 29
## All 104 96 200
print(
'Row Sums: \n', tb.loc[:,'All'],
'\n\nColumn Sums:\n', tb.loc['All'])
## Row Sums:
## dept
## D1 47
## D2 41
## D3 36
## D4 47
## D5 29
## All 200
## Name: All, dtype: int64
##
## Column Sums:
## grp
## G1 104
## G2 96
## All 200
## Name: All, dtype: int64
Getting Proportion
Use a matrix operation to divide each row by its respective column sum.
tb/tb.loc['All']
## grp G1 G2 All
## dept
## D1 0.259615 0.208333 0.235
## D2 0.153846 0.260417 0.205
## D3 0.201923 0.156250 0.180
## D4 0.211538 0.260417 0.235
## D5 0.173077 0.114583 0.145
## All 1.000000 1.000000 1.000
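Alternatively, crosstab() itself accepts a normalize= parameter; the sketch below (my addition) should reproduce the G1/G2 proportion columns above without the manual division.
pd.crosstab(index=mydf.dept, columns=mydf.grp, normalize='columns')  ## each column sums to 1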
11.3.4 Concatenation
Sample Data
s1 = pd.Series(['A1','A2','A3','A4'])
s2 = pd.Series(['B1','B2','B3','B4'], name='B')
s3 = pd.Series(['C1','C2','C3','C4'], name='C')
Column-Wise
Combining Multiple Series Into A New DataFrame
- Series name will become column name in DataFrame.
- If Series has no name, default column names in DataFrame will be 0,1,2,3
axis=1 means column-wise
pd.concat([s1,s2,s3,None], axis=1)   ## observe that None is ignored
## 0 B C
## 0 A1 B1 C1
## 1 A2 B2 C2
## 2 A3 B3 C3
## 3 A4 B4 C4
Add Multiple Series Into An Existing DataFrame
- No change to original DataFrame column name
- Added columns from series will have 0,1,2,3,.. column name
df = pd.DataFrame({ 'A': s1, 'B': s2})
df
## A B
## 0 A1 B1
## 1 A2 B2
## 2 A3 B3
## 3 A4 B4
pd.concat([df,s3,s2,s1,None], axis=1)
## A B C B 0
## 0 A1 B1 C1 B1 A1
## 1 A2 B2 C2 B2 A2
## 2 A3 B3 C3 B3 A3
## 3 A4 B4 C4 B4 A4
11.3.5 External Data
html_table Parser
This method requires the html5lib library.
- Reads the web page and creates a list containing one dataframe per html table found
- Scrapes all detectable html tables
- Auto detects column headers
- Auto creates an index using numbers starting from 0
read_html(url) # return list of dataframe(s) that maps to web table(s) structure
df_list = pd.read_html('https://www.malaysiastock.biz/Listed-Companies.aspx?type=S&s1=18') ## read all tables
df = df_list[6]   ## get the specific table

print ('Total Table(s) Found : ', len(df_list), '\n',
       'First Table Found: ', df)
## Total Table(s) Found : 14
## First Table Found: 0 \
## 0 0 - 9 A B C D E F G H I J K L M ...
##
## 1
## 0 Bursa Malaysia Market Watch Top Articles 1. ...
CSV Writing
Syntax
DataFrame.to_csv(
path_or_buf=None, ## if not provided, result is returned as string
sep=',',
na_rep='',
float_format=None,
columns=None, ## list of columns name to write, if not provided, all columns are written
header=True, ## write out column names
index=True, ## write row label
index_label=None,
mode='w',
encoding=None, ## if not provided, default to 'utf-8'
quoting=None, quotechar='"',
line_terminator=None,
chunksize=None,
date_format=None,
doublequote=True,
escapechar=None,
decimal='.')
The example below shows column values containing different special characters. Note that pandas handles these very well by default.
mydf = pd.DataFrame({'Id':[10,20,30,40],
        'Name': ['Aaa','Bbb','Ccc','Ddd'],
        'Funny': ["world's most \clever",
                  "Bloody, damn, good",
                  "many\nmany\nline",
                  'Quoting "is" tough']})
mydf.set_index('Id', inplace=True)
mydf.to_csv('data/csv_test.csv', index=True)
mydf
## Name Funny
## Id
## 10 Aaa world's most \clever
## 20 Bbb Bloody, damn, good
## 30 Ccc many\nmany\nline
## 40 Ddd Quoting "is" tough
This is the file saved to data/csv_test.csv.
All content retained when reading back by Pandas
pd.read_csv('data/csv_test.csv', index_col='Id')
## Name Funny
## Id
## 10 Aaa world's most \clever
## 20 Bbb Bloody, damn, good
## 30 Ccc many\nmany\nline
## 40 Ddd Quoting "is" tough
11.3.5.1 CSV Reading
Syntax
pandas.read_csv(
   'url or filePath',                  # path to file or url
   encoding = 'utf_8',                 # optional: default is 'utf_8'
   index_col = ['colName1', ...],      # optional: specify one or more index columns
   parse_dates = ['dateCol1', ...],    # optional: specify string column(s) to convert to date
   na_values = ['.','na','NA','N/A'],  # optional: values that are considered NA
   names = ['newColName1', ... ],      # optional: overwrite column names
   thousands = '.',                    # optional: thousands separator symbol
   nrows = n,                          # optional: load only first n rows
   skiprows = 0,                       # optional: skip first n rows
   infer_datetime_format = False       # optional: automatically infer the date format
)
For the full list of supported encodings, refer to Python Codec.
Default Import
- index is a sequence of integers 0,1,2…
- only two data types are detected: number (float64/int64) and string (object)
- date is not parsed, hence stays as string
goo = pd.read_csv('data/goog.csv', encoding='utf_8')
print(goo.head(), '\n\n',
      goo.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 61 entries, 0 to 60
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Date 61 non-null object
## 1 Open 61 non-null float64
## 2 High 61 non-null float64
## 3 Low 61 non-null float64
## 4 Close 61 non-null float64
## 5 Volume 61 non-null int64
## dtypes: float64(4), int64(1), object(1)
## memory usage: 3.0+ KB
## Date Open High Low Close Volume
## 0 12/19/2016 790.219971 797.659973 786.270020 794.200012 1225900
## 1 12/20/2016 796.760010 798.650024 793.270020 796.419983 925100
## 2 12/21/2016 795.840027 796.676025 787.099976 794.559998 1208700
## 3 12/22/2016 792.359985 793.320007 788.580017 791.260010 969100
## 4 12/23/2016 790.900024 792.739990 787.280029 789.909973 623400
##
## None
Specify Data Types
- To customize the data type, use the dtype parameter with a dict of definitions.
d_types = {'Volume': str}
pd.read_csv('data/goog.csv', dtype=d_types).info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 61 entries, 0 to 60
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Date 61 non-null object
## 1 Open 61 non-null float64
## 2 High 61 non-null float64
## 3 Low 61 non-null float64
## 4 Close 61 non-null float64
## 5 Volume 61 non-null object
## dtypes: float64(4), object(2)
## memory usage: 3.0+ KB
Parse Datetime
You can specify multiple date-alike columns for parsing.
pd.read_csv('data/goog.csv', parse_dates=['Date']).info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 61 entries, 0 to 60
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Date 61 non-null datetime64[ns]
## 1 Open 61 non-null float64
## 2 High 61 non-null float64
## 3 Low 61 non-null float64
## 4 Close 61 non-null float64
## 5 Volume 61 non-null int64
## dtypes: datetime64[ns](1), float64(4), int64(1)
## memory usage: 3.0 KB
Parse Datetime, Then Set as Index
- Specify names of date column in parse_dates=
- When date is set as index, the type is DateTimeIndex
goo3 = pd.read_csv('data/goog.csv', index_col='Date', parse_dates=['Date'])
goo3.info()
## <class 'pandas.core.frame.DataFrame'>
## DatetimeIndex: 61 entries, 2016-12-19 to 2017-03-17
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Open 61 non-null float64
## 1 High 61 non-null float64
## 2 Low 61 non-null float64
## 3 Close 61 non-null float64
## 4 Volume 61 non-null int64
## dtypes: float64(4), int64(1)
## memory usage: 2.9 KB
11.3.6 Inspection
Structure info
info() is a function that prints information to the screen. It doesn’t return any object.
dataframe.info() # display columns and number of rows (that have no missing data)
goo.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 61 entries, 0 to 60
## Data columns (total 6 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Date 61 non-null object
## 1 Open 61 non-null float64
## 2 High 61 non-null float64
## 3 Low 61 non-null float64
## 4 Close 61 non-null float64
## 5 Volume 61 non-null int64
## dtypes: float64(4), int64(1), object(1)
## memory usage: 3.0+ KB
head
goo.head()
## Date Open High Low Close Volume
## 0 12/19/2016 790.219971 797.659973 786.270020 794.200012 1225900
## 1 12/20/2016 796.760010 798.650024 793.270020 796.419983 925100
## 2 12/21/2016 795.840027 796.676025 787.099976 794.559998 1208700
## 3 12/22/2016 792.359985 793.320007 788.580017 791.260010 969100
## 4 12/23/2016 790.900024 792.739990 787.280029 789.909973 623400
11.4 Class: Timestamp
This is an enhanced version of the datetime standard library.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html#pandas.Timestamp
11.4.1 Constructor
From Number
type(pd.Timestamp(2017,12,11))
## <class 'pandas._libs.tslibs.timestamps.Timestamp'>
pd.Timestamp(year=2017, month=1, day=1)   # date-like numbers
pd.Timestamp(2017,1,1)                    # date-like numbers
pd.Timestamp(2017,12,11,5,45)             # datetime-like numbers
pd.Timestamp(2017,12,11,5,45,55,999)      # + microseconds
pd.Timestamp(2017,12,11,5,45,55,999,8)    # + nanoseconds
## Timestamp('2017-01-01 00:00:00')
## Timestamp('2017-01-01 00:00:00')
## Timestamp('2017-12-11 05:45:00')
## Timestamp('2017-12-11 05:45:55.000999')
## Timestamp('2017-12-11 05:45:55.000999008')
From String
Observe that pandas supports many string input formats.
Year Month Day
pd.Timestamp('2017-12-11')    # date-like string: year-month-day
pd.Timestamp('2017 12 11')    # date-like string: year-month-day
pd.Timestamp('2017 Dec 11')   # date-like string: year-month-day
pd.Timestamp('Dec 11, 2017')  # date-like string: year-month-day
## Timestamp('2017-12-11 00:00:00')
## Timestamp('2017-12-11 00:00:00')
## Timestamp('2017-12-11 00:00:00')
## Timestamp('2017-12-11 00:00:00')
YMD Hour Minute Second Ms
pd.Timestamp('2017-12-11 0545')     ## hour minute
pd.Timestamp('2017-12-11-05:45')
pd.Timestamp('2017-12-11T0545')
pd.Timestamp('2017-12-11 054533')   ## hour minute seconds
pd.Timestamp('2017-12-11 05:45:33')
## Timestamp('2017-12-11 05:45:00')
## Timestamp('2017-12-11 05:45:00')
## Timestamp('2017-12-11 05:45:00')
## Timestamp('2017-12-11 05:45:33')
## Timestamp('2017-12-11 05:45:33')
Timezone can be included in various ways.
pd.Timestamp('2017-01-01T0545Z'),                     # GMT
pd.Timestamp('2017-01-01T0545+9')                     # GMT+9
pd.Timestamp('2017-01-01T0545+0800')                  # GMT+0800
pd.Timestamp('2017-01-01 0545', tz='Asia/Singapore')
## (Timestamp('2017-01-01 05:45:00+0000', tz='UTC'),)
## Timestamp('2017-01-01 05:45:00+0900', tz='pytz.FixedOffset(540)')
## Timestamp('2017-01-01 05:45:00+0800', tz='pytz.FixedOffset(480)')
## Timestamp('2017-01-01 05:45:00+0800', tz='Asia/Singapore')
From datetime / date
pd.Timestamp(date(2017,3,5))                                   # from date
pd.Timestamp(datetime(2017,3,5,4,30))                          # from datetime
pd.Timestamp(datetime(2017,3,5,4,30), tz='Asia/Kuala_Lumpur')  # from datetime, + tz
## Timestamp('2017-03-05 00:00:00')
## Timestamp('2017-03-05 04:30:00')
## Timestamp('2017-03-05 04:30:00+0800', tz='Asia/Kuala_Lumpur')
11.4.2 Attributes
We can tell many things about a Timestamp object. Note that timezone (tz) is a pytz object.
ts = pd.Timestamp('2017-01-01T054533+0800')  # GMT+0800

ts.month
ts.day
ts.year
ts.hour
ts.minute
ts.second
ts.microsecond
ts.nanosecond
ts.dayofweek
ts.dayofyear
ts.daysinmonth
ts.is_leap_year
ts.is_month_end
ts.is_month_start
ts.dayofweek
ts.tz          ## pytz object
type(ts.tz)
## 1
## 1
## 2017
## 5
## 45
## 33
## 0
## 0
## 6
## 1
## 31
## False
## False
## True
## 6
## pytz.FixedOffset(480)
## <class 'pytz._FixedOffset'>
11.4.3 Timezones
Adding Timezones and Clock Shifting
- Use tz_localize to make a timestamp timezone-aware. It will not shift the clock.
- Once the timestamp has a timezone, you can easily shift the clock to another timezone using tz_convert().
ts = pd.Timestamp(2017,1,10,10,34)           ## No timezone
ts
ts1 = ts.tz_localize('Asia/Kuala_Lumpur')    ## Add timezone, retain time
ts1
ts1.tz_convert('UTC')                        ## Convert to diff timezone
## Timestamp('2017-01-10 10:34:00')
## Timestamp('2017-01-10 10:34:00+0800', tz='Asia/Kuala_Lumpur')
## Timestamp('2017-01-10 02:34:00+0000', tz='UTC')
Removing Timezone
Just apply None with tz_localize() to remove the TZ info.
ts = pd.Timestamp(2017,1,10,10,34).tz_localize('Asia/Kuala_Lumpur')
ts
ts.tz_localize(None)   ## no longer TZ aware
## Timestamp('2017-01-10 10:34:00+0800', tz='Asia/Kuala_Lumpur')
## Timestamp('2017-01-10 10:34:00')
11.4.4 Formatting
strftime
Use strftime() to customize the string format. For the complete directive list, refer to the Python strftime documentation.
ts = pd.Timestamp(2017,1,10,15,34,10).tz_localize('Asia/Kuala_Lumpur')
ts
ts.strftime("%Y-%m-%d %b %I:%M:%S %p")
## Timestamp('2017-01-10 15:34:10+0800', tz='Asia/Kuala_Lumpur')
## '2017-01-10 Jan 03:34:10 PM'
isoformat
Use isoformat() to format as an ISO 8601 string (the UTC offset is included for timezone-aware timestamps).
ts  = pd.Timestamp(2017,1,10,15,34,10)
ts1 = ts.tz_localize('Asia/Kuala_Lumpur')

ts.isoformat()    ## without TZ
ts1.isoformat()   ## with TZ
## '2017-01-10T15:34:10'
## '2017-01-10T15:34:10+08:00'
11.4.5 Conversion
Convert To datetime
Use to_pydatetime() to convert into the standard library datetime.datetime. From the datetime object, apply date() to get a datetime.date.
ts = pd.Timestamp(2017,1,10,7,30,52).tz_localize('Asia/Kuala_Lumpur')
ts                          ## Timestamp, with TZ
ts.to_pydatetime()          ## datetime, with TZ
ts.to_pydatetime().date()   ## datetime.date (no TZ)
ts.date()                   ## same as above (no TZ)
## Timestamp('2017-01-10 07:30:52+0800', tz='Asia/Kuala_Lumpur')
## datetime.datetime(2017, 1, 10, 7, 30, 52, tzinfo=<DstTzInfo 'Asia/Kuala_Lumpur' +08+8:00:00 STD>)
## datetime.date(2017, 1, 10)
## datetime.date(2017, 1, 10)
11.4.7 Other Utilities
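This section has no examples in the source; below is a small hedged sketch of a few commonly used Timestamp utilities (my selection, assuming standard pandas behavior).
ts = pd.Timestamp('2017-01-10 15:34:10')
ts.replace(year=2020)   ## Timestamp('2020-01-10 15:34:10'), swap one component
ts.floor('H')           ## Timestamp('2017-01-10 15:00:00')
ts.round('15min')       ## Timestamp('2017-01-10 15:30:00')
ts.normalize()          ## Timestamp('2017-01-10 00:00:00'), midnight of same day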
11.5 Class: DateTimeIndex
11.5.2 Conversion
Convert To datetime.datetime
Use to_pydatetime to convert into python standard datetime.datetime objects.
print('Converted to List:', dti.to_pydatetime(), '\n\n',
'Converted Type:', type(dti.to_pydatetime()))
## Converted to List: [datetime.datetime(2011, 1, 3, 0, 0) datetime.datetime(2018, 4, 13, 0, 0)
## datetime.datetime(2018, 3, 1, 7, 30)]
##
## Converted Type: <class 'numpy.ndarray'>
11.5.3 Structure Conversion
Convert To Series: to_series
This creates a Series where the index and the data have the same value.
#dti = pd.date_range('2018-02', periods=4, freq='M')
dti.to_series()
## 2011-01-03 00:00:00 2011-01-03 00:00:00
## 2018-04-13 00:00:00 2018-04-13 00:00:00
## 2018-03-01 07:30:00 2018-03-01 07:30:00
## dtype: datetime64[ns]
Convert To DataFrame: to_frame()
This converts to a single-column DataFrame where the index carries the same values.
dti.to_frame()
## 0
## 2011-01-03 00:00:00 2011-01-03 00:00:00
## 2018-04-13 00:00:00 2018-04-13 00:00:00
## 2018-03-01 07:30:00 2018-03-01 07:30:00
11.6 Class: Series
Series allows different data types (object class) as its elements.
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
- data: array-like, iterable, dict or scalar
- If dtype is not specified, it will be inferred from data.
11.6.1 Constructor
Empty Series
Passing no data to constructor will result in empty series. By default, empty series dtype is float.
s = pd.Series(dtype='object')
print (s, '\n',
       type(s))
## Series([], dtype: object)
## <class 'pandas.core.series.Series'>
From Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.
pd.Series(99, index = ['a','b','c','d'])
## a 99
## b 99
## c 99
## d 99
## dtype: int64
From array-like
From list
pd.Series(['a','b','c','d','e'])   # from Python list
## 0 a
## 1 b
## 2 c
## 3 d
## 4 e
## dtype: object
From numpy.array
If index is not specified, it defaults to 0 and continues incrementally.
pd.Series(np.array(['a','b','c','d','e']))
## 0 a
## 1 b
## 2 c
## 3 d
## 4 e
## dtype: object
From DateTimeIndex
pd.Series(pd.date_range('2011-1-1','2011-1-3'))
## 0 2011-01-01
## 1 2011-01-02
## 2 2011-01-03
## dtype: datetime64[ns]
11.6.1.1 From Dictionary
The dictionary key will be the index. Order is not sorted.
pd.Series({'a' : 0., 'c' : 5., 'b' : 2.})
## a 0.0
## c 5.0
## b 2.0
## dtype: float64
If an index sequence is specified, then the Series will follow the index order.
Observe that missing data (an index without a value) will be marked as NaN.
pd.Series({'a' : 0., 'c' : 1., 'b' : 2.}, index = ['a','b','c','d'])
## a 0.0
## b 2.0
## c 1.0
## d NaN
## dtype: float64
Specify Index
pd.Series(['a','b','c','d','e'], index=[10,20,30,40,50])
## 10 a
## 20 b
## 30 c
## 40 d
## 50 e
## dtype: object
11.6.1.2 Mix Element Types
dtype will be ‘object’ when there is a mixture of classes.
ser = pd.Series(['a',1,2,3])
print('Object Type : ', type(ser),'\n',
      'Object dType: ', ser.dtype,'\n',
      'Element 1 Type: ',type(ser[0]),'\n',
      'Element 2 Type: ',type(ser[1]))
## Object Type : <class 'pandas.core.series.Series'>
## Object dType: object
## Element 1 Type: <class 'str'>
## Element 2 Type: <class 'int'>
11.6.1.3 Specify Data Types
By default, dtype is inferred from data.
ser1 = pd.Series([1,2,3])
ser2 = pd.Series([1,2,3], dtype="int8")
ser3 = pd.Series([1,2,3], dtype="object")
print(' Inferred: ',ser1.dtype, '\n',
'Specified int8: ',ser2.dtype, '\n',
'Specified object:',ser3.dtype)
## Inferred: int64
## Specified int8: int8
## Specified object: object
11.6.2 Accessing Series
series ( single/list/range_of_row_label/number ) # can cause confusion
series.loc ( single/list/range_of_row_label )
series.iloc( single/list/range_of_row_number )
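A quick sketch (my addition) of why the plain series(...) form "can cause confusion": with an integer index, series[n] matches by label, not by position.
s_int = pd.Series([10, 20, 30], index=[3, 2, 1])
s_int[1]        ## 30 : matched by label 1
s_int.iloc[1]   ## 20 : always positional
s_int.loc[1]    ## 30 : explicit label lookup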
Sample Data
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
s
## a 1
## b 2
## c 3
## d 4
## e 5
## dtype: int64
by Row Number(s)
Single Item. Notice that inputting a number and a list of numbers give different results.
print( 'Referencing by number:',s.iloc[1],'\n\n',
'\nReferencing by list of number:\n',s.iloc[[1]])
## Referencing by number: 2
##
##
## Referencing by list of number:
## b 2
## dtype: int64
Multiple Items
s.iloc[[1,3]]
## b 2
## d 4
## dtype: int64
Range (First 3)
s.iloc[:3]
## a 1
## b 2
## c 3
## dtype: int64
Range (Last 3)
s.iloc[-3:]
## c 3
## d 4
## e 5
## dtype: int64
Range (in between)
s.iloc[2:3]
## c 3
## dtype: int64
by Index(es)
Single Label. Notice the difference referencing input: single index and list of index.
Warning: if index is invalid, this will result in error.
print( s.loc['c'], '\n',
       s[['c']])
## 3
## c 3
## dtype: int64
Multiple Labels
If an index is not found, a KeyError is raised.
s.loc[['k','c']]
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: "['k'] not in index"
Range of Labels
s.loc['b':'d']
## b 2
## c 3
## d 4
## dtype: int64
Filtering
Use logical array to filter
s = pd.Series(range(1,8))
s[s<5]
## 0 1
## 1 2
## 2 3
## 3 4
## dtype: int64
Use where
The where method is an application of the if-then idiom. For each element in the calling Series, if cond is True the element is used; otherwise other is used.
.where(cond, other=nan, inplace=False)
print(s.where(s<4),'\n\n',
      s.where(s<4, other=None) )
## 0 1.0
## 1 2.0
## 2 3.0
## 3 NaN
## 4 NaN
## 5 NaN
## 6 NaN
## dtype: float64
##
## 0 1.0
## 1 2.0
## 2 3.0
## 3 NaN
## 4 NaN
## 5 NaN
## 6 NaN
## dtype: float64
11.6.3 Updating Series
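The source leaves this section empty; a minimal sketch of common update patterns (my assumptions, using the same label-indexed style as elsewhere in this chapter):
s = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
s.loc['a'] = 100   ## update by label
s.iloc[1]  = 200   ## update by position
s[s == 5]  = 0     ## conditional update via boolean mask
s
## a    100
## b    200
## c      3
## d      4
## e      0
## dtype: int64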
11.6.4 Series Attributes
The Data
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'],name='SuperHero')
s
## a 1
## b 2
## c 3
## d 4
## e 5
## Name: SuperHero, dtype: int64
11.6.4.1 The Attributes
print( ' Series Index: ',s.index, '\n',
'Series dType: ', s.dtype, '\n',
'Series Size: ', s.size, '\n',
'Series Shape: ', s.shape, '\n',
'Series Dimension:', s.ndim)
## Series Index: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
## Series dType: int64
## Series Size: 5
## Series Shape: (5,)
## Series Dimension: 1
11.6.5 Instance Methods
Index Manipulation
.rename_axis()
s.rename_axis('haribulan')
## haribulan
## a 1
## b 2
## c 3
## d 4
## e 5
## Name: SuperHero, dtype: int64
.reset_index()
Resetting index will:
- Convert the index to a normal column, named ‘index’
- Renumber the index to 0,1,2,…
- Return a DataFrame (becomes two columns)
s.reset_index()
## index SuperHero
## 0 a 1
## 1 b 2
## 2 c 3
## 3 d 4
## 4 e 5
Structure Conversion
- A Series structure contains value (a numpy array) and its dtype (the data type of the numpy array).
- Use values to retrieve the data as a numpy.ndarray. Use dtype to understand the data type.
s = pd.Series([1,2,3,4,5])
print(' Series value: ', s.values, '\n',
'Series value type: ', type(s.values), '\n',
'Series dtype: ', s.dtype)
## Series value: [1 2 3 4 5]
## Series value type: <class 'numpy.ndarray'>
## Series dtype: int64
Convert To List using .tolist()
pd.Series.tolist(s)
## [1, 2, 3, 4, 5]
11.6.6 Operators
The result of applying operator (arithmetic or logic) to Series object returns a new Series object
Arithmetic Operator
s1 = pd.Series( [100,200,300,400,500] )
s2 = pd.Series( [10, 20, 30, 40, 50] )
Apply To One Series Object
s1 - 100
## 0 0
## 1 100
## 2 200
## 3 300
## 4 400
## dtype: int64
Apply To Two Series Objects
s1 - s2
## 0 90
## 1 180
## 2 270
## 3 360
## 4 450
## dtype: int64
Logic Operator
- Applying a logic operator to a Series returns a new Series of boolean results
- This can be used for Series or DataFrame filtering
bs = pd.Series(range(0,10))
bs>3
## 0 False
## 1 False
## 2 False
## 3 False
## 4 True
## 5 True
## 6 True
## 7 True
## 8 True
## 9 True
## dtype: bool
~((bs>3) & (bs<8) | (bs>7))
## 0 True
## 1 True
## 2 True
## 3 True
## 4 False
## 5 False
## 6 False
## 7 False
## 8 False
## 9 False
## dtype: bool
11.6.7 String Accesor .str
If the underlying data is str type, then pandas exposes various properties and methods through the str accessor.
SeriesObj.str.operatorFunction()
Available Functions
Nearly all Python’s built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:
len() lower() translate() islower() ljust() upper() startswith() isupper() rjust() find() endswith() isnumeric() center() rfind() isalnum() isdecimal() zfill() index() isalpha() split() strip() rindex() isdigit() rsplit() rstrip() capitalize() isspace() partition() lstrip() swapcase() istitle() rpartition()
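A short sketch (my addition) exercising a few of the vectorized methods listed above:
s = pd.Series(['apple', 'Banana', 'cherry'])
s.str.upper()           ## APPLE, BANANA, CHERRY
s.str.len()             ## 5, 6, 6
s.str.startswith('B')   ## False, True, False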
11.6.7.1 Regex Extractor
Extract capture groups in the regex pattern, by default into a DataFrame (expand=True).
Series.str.extract(self, pat, flags=0, expand=True)
- expand=False: if the result is a single column, return a Series instead of a DataFrame.
s = pd.Series(['a1', 'b2', 'c3'])
print(
  ' Extracted Dataframe:\n', s.str.extract(r'([ab])(\d)'),'\n\n',
  'Extracted Dataframe with Names:\n', s.str.extract(r'(?P<Letter>[ab])(\d)'))
## Extracted Dataframe:
## 0 1
## 0 a 1
## 1 b 2
## 2 NaN NaN
##
## Extracted Dataframe with Names:
## Letter 1
## 0 a 1
## 1 b 2
## 2 NaN NaN
The output below is a single column; use expand=False to make the result a Series instead of a DataFrame.
r = s.str.extract(r'[ab](\d)', expand=False)
print( r, '\n\n', type(r) )
## 0 1
## 1 2
## 2 NaN
## dtype: object
##
## <class 'pandas.core.series.Series'>
11.6.7.2 Character Extractor
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte
## 0 Graham Chapman
## 1 John Cleese
## 2 Terry Gilliam
## 3 Eric Idle
## 4 Terry Jones
## 5 Michael Palin
## dtype: object
startswith
monte.str.startswith('T')
## 0 False
## 1 False
## 2 True
## 3 False
## 4 True
## 5 False
## dtype: bool
Slicing
monte.str[0:3]
## 0 Gra
## 1 Joh
## 2 Ter
## 3 Eri
## 4 Ter
## 5 Mic
## dtype: object
Splitting
Split strings around given separator/delimiter in either string or regex.
Series.str.split(self, pat=None, n=-1, expand=False)
- pat: can be string or regex
s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h_i_j'])
s
## 0 a_b_c
## 1 c_d_e
## 2 NaN
## 3 f_g_h_i_j
## dtype: object
str.split()
By default, split will split each item into a list.
s.str.split('_')
## 0 [a, b, c]
## 1 [c, d, e]
## 2 NaN
## 3 [f, g, h, i, j]
## dtype: object
expand=True will return a dataframe instead of a series. By default, expand splits into all possible columns.
print( s.str.split('_', expand=True) )
## 0 1 2 3 4
## 0 a b c None None
## 1 c d e None None
## 2 NaN NaN NaN NaN NaN
## 3 f g h i j
It is possible to limit the number of columns split.
print( s.str.split('_', expand=True, n=1) )
## 0 1
## 0 a b_c
## 1 c d_e
## 2 NaN NaN
## 3 f g_h_i_j
str.rsplit()
rsplit stands for reverse split; it works the same way, except it is reversed.
print( s.str.rsplit('_', expand=True, n=1) )
## 0 1
## 0 a_b c
## 1 c_d e
## 2 NaN NaN
## 3 f_g_h_i j
Case Conversion
SeriesObj.str.upper()
SeriesObj.str.lower()
SeriesObj.str.capitalize()
s = pd.Series(['A', 'B', 'C', 'aAba', 'bBaca', np.nan, 'cCABA', 'dog', 'cat'])
print( s.str.upper(), '\n',
       s.str.capitalize())
## 0 A
## 1 B
## 2 C
## 3 AABA
## 4 BBACA
## 5 NaN
## 6 CCABA
## 7 DOG
## 8 CAT
## dtype: object
## 0 A
## 1 B
## 2 C
## 3 Aaba
## 4 Bbaca
## 5 NaN
## 6 Ccaba
## 7 Dog
## 8 Cat
## dtype: object
Number of Characters
s.str.len()
## 0 1.0
## 1 1.0
## 2 1.0
## 3 4.0
## 4 5.0
## 5 NaN
## 6 5.0
## 7 3.0
## 8 3.0
## dtype: float64
String Indexing
This returns the specified character(s) from each item.
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,'CABA', 'dog', 'cat'])
s.str[0].values     # first char
## array(['A', 'B', 'C', 'A', 'B', nan, 'C', 'd', 'c'], dtype=object)
s.str[0:2].values   # first and second char
## array(['A', 'B', 'C', 'Aa', 'Ba', nan, 'CA', 'do', 'ca'], dtype=object)
11.6.8 Date Accessor .dt
If the underlying data is datetime64 type, then pandas exposes various properties and methods through the dt accessor.
Sample Data
s = pd.Series([
    datetime(2000,1,1,0,0,0),
    datetime(1999,12,15,12,34,55),
    datetime(2020,3,8,5,7,12),
    datetime(2018,1,1,0,0,0),
    datetime(2003,3,4,5,6,7)
])
s
## 0 2000-01-01 00:00:00
## 1 1999-12-15 12:34:55
## 2 2020-03-08 05:07:12
## 3 2018-01-01 00:00:00
## 4 2003-03-04 05:06:07
## dtype: datetime64[ns]
Convert To datetime.datetime
Use to_pydatetime() to convert into a numpy.array of standard library datetime.datetime.
pdt = s.dt.to_pydatetime()
print( type(pdt) )
## <class 'numpy.ndarray'>
pdt
## array([datetime.datetime(2000, 1, 1, 0, 0),
## datetime.datetime(1999, 12, 15, 12, 34, 55),
## datetime.datetime(2020, 3, 8, 5, 7, 12),
## datetime.datetime(2018, 1, 1, 0, 0),
## datetime.datetime(2003, 3, 4, 5, 6, 7)], dtype=object)
Convert To datetime.date
Use dt.date to convert into a pandas.Series of standard library datetime.date.
Is it possible to have a pandas.Series of datetime.datetime? No, because pandas wants to store it as its own Timestamp.
sdt = s.dt.date
print( type(sdt[1] ))
## <class 'datetime.date'>
print( type(sdt))
## <class 'pandas.core.series.Series'>
sdt
## 0 2000-01-01
## 1 1999-12-15
## 2 2020-03-08
## 3 2018-01-01
## 4 2003-03-04
## dtype: object
11.6.8.1 Timestamp Attributes
A Series::DateTime object supports the properties below:
- date
- month
- day
- year
- dayofweek
- dayofyear
- weekday
- weekday_name
- quarter
- daysinmonth
- hour
- minute
Full list below:
https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetimelike-properties
s.dt.date
## 0 2000-01-01
## 1 1999-12-15
## 2 2020-03-08
## 3 2018-01-01
## 4 2003-03-04
## dtype: object
s.dt.month
## 0 1
## 1 12
## 2 3
## 3 1
## 4 3
## dtype: int64
s.dt.dayofweek
## 0 5
## 1 2
## 2 6
## 3 0
## 4 1
## dtype: int64
s.dt.weekday
## 0 5
## 1 2
## 2 6
## 3 0
## 4 1
## dtype: int64
s.dt.quarter
## 0 1
## 1 4
## 2 1
## 3 1
## 4 1
## dtype: int64
s.dt.daysinmonth
## 0 31
## 1 31
## 2 31
## 3 31
## 4 31
## dtype: int64
s.dt.time     # extract time as time object
## 0 00:00:00
## 1 12:34:55
## 2 05:07:12
## 3 00:00:00
## 4 05:06:07
## dtype: object
s.dt.hour     # extract hour as integer
## 0 0
## 1 12
## 2 5
## 3 0
## 4 5
## dtype: int64
s.dt.minute   # extract minute as integer
## 0 0
## 1 34
## 2 7
## 3 0
## 4 6
## dtype: int64
11.7 Class: DataFrame
11.7.1 Constructor
- pd.DataFrame() constructor creates a DataFrame object.
- When index (row label) and column names are not specified (or cannot be induced from the data source), they default to 0,1,2,3…
- Specify Row Label and Column Header with the columns and index parameters in the constructor.
- When columns is specified AND the data source has keys that can be used as column names, pandas will select those column names, in the order specified in the columns parameter.
- index and row label are used interchangeably in this book
Empty DataFrame
By default, An empty dataframe contain no columns and index.
empty_df1 = pd.DataFrame()
empty_df2 = pd.DataFrame()

empty_df1
id(empty_df1)
id(empty_df2)
## Empty DataFrame
## Columns: []
## Index: []
## 1637942427808
## 1637942431216
Even though it is empty, you can still initialize it with Index Columns.
## empty dataframe with columns
empty_df = pd.DataFrame(columns=['A','B','C'])
empty_df   ## index is empty

## empty dataframe with columns and index
empty_df = pd.DataFrame(columns=['A','B','C'], index=[1,2,3])
empty_df
## Empty DataFrame
## Columns: [A, B, C]
## Index: []
## A B C
## 1 NaN NaN NaN
## 2 NaN NaN NaN
## 3 NaN NaN NaN
When initializing multiple empty DataFrames at once (in one line), all the empty DataFrames refer to the same memory location, meaning they contain the same data.
empty_df1 = empty_df2 = pd.DataFrame()
id(empty_df1)
id(empty_df2)
## 1637942429008
## 1637942429008
From Row Oriented Data
DataFrame( [row_list1, row_list2, row_list3] )
DataFrame( [row_list1, row_list2, row_list3], columns = columnName_list )
DataFrame( [row_list1, row_list2, row_list3], index = row_label_list )
- Create DataFrame object from List of Lists
- Use columns to select the columns and preferred order.
data = [ [101, 'Alice',  40000, 2017],
         [102, 'Bob',    24000, 2017],
         [103, 'Charles',31000, 2017]]

## Initialize with default row label and column names
pd.DataFrame(data)
## 0 1 2 3
## 0 101 Alice 40000 2017
## 1 102 Bob 24000 2017
## 2 103 Charles 31000 2017
## Initialize with column names and row index
col_names  = ['empID','name','salary','year']
row_labels = ['r1','r2','r3']
pd.DataFrame(data, columns=col_names, index=row_labels)
## empID name salary year
## r1 101 Alice 40000 2017
## r2 102 Bob 24000 2017
## r3 103 Charles 31000 2017
From Record Oriented Data
DataFrame( [dict1, dict2, dict3] )
DataFrame( [dict1, dict2, dict3], columns = column_list )
DataFrame( [dict1, dict2, dict3], index = row_label_list )
- Create DataFrame object from List of Dicts.
- By default, Column Name will follow Dictionary Key, unless manually specified.
- Certain dict objects may have missing keys; these are assumed as NaN by pandas.
- Use columns to select the columns and preferred order.
data = [
   {"name":"Yong", "id":1, "zkey":101},
   {"name":"Gan",  "id":2 }]   ## missing zkey in this record

## Default Index and include all detected columns
pd.DataFrame(data)
## name id zkey
## 0 Yong 1 101.0
## 1 Gan 2 NaN
## specify Index, must be same length as the DataFrame
pd.DataFrame(data, index=['row1','row2'])
## name id zkey
## row1 Yong 1 101.0
## row2 Gan 2 NaN
## Filter Which Columns To Include, in the preferred order
pd.DataFrame(data, columns=['zkey','name'])
## zkey name
## 0 101.0 Yong
## 1 NaN Gan
From Column Oriented Data
DataFrame( data = {'column1': list1,
'column2': list2,
'column3': list3 } ,
index = row_label_list,
columns = column_list)
- Create DataFrame object from Dict of Lists.
- Constructor will arrange the columns according to the dict keys order by default, unless columns is specified.
- Use columns to select the columns in preferred order. A NaN column is returned if it does not exist.
## columns arrangement defaults to dict's keys
data = {'empID':  [100, 101, 102, 103, 104],
        'year1':  [2017, 2017, 2017, 2018, 2018],
        'year2':  [2027, 2027, 2027, 2028, 2028],
        'salary': [40000, 24000, 31000, 20000, 30000],
        'name':   ['Alice', 'Bob', 'Charles', 'David', 'Eric']}

## Default Index and Include All Columns
pd.DataFrame(data)
## empID year1 year2 salary name
## 0 100 2017 2027 40000 Alice
## 1 101 2017 2027 24000 Bob
## 2 102 2017 2027 31000 Charles
## 3 103 2018 2028 20000 David
## 4 104 2018 2028 30000 Eric
## Select What Columns To Show (and their order). Specify row label.
pd.DataFrame(data, columns=['name','salary','year1','year2','not_exist'], index=data.get('empID'))
## name salary year1 year2 not_exist
## 100 Alice 40000 2017 2027 NaN
## 101 Bob 24000 2017 2027 NaN
## 102 Charles 31000 2017 2027 NaN
## 103 David 20000 2018 2028 NaN
## 104 Eric 30000 2018 2028 NaN
11.7.2 Operators
Sample Data
- Two dataframes are created, each with 3 columns and 3 rows.
- However, there are only two matching column and row names.
- We shall notice that the operator will perform cell-wise, honoring the row/column name.
df1 = pd.DataFrame(data=
      {'idx': ['row1','row2','row3'],
       'x': [10, 20, 30],
       'y': [1,2,3],
       'z': [0.1, 0.2, 0.3]}).set_index('idx')

df2 = pd.DataFrame(data=
      {'idx': ['row1','row2','row4'],
       'x': [13, 23, 33],
       'z': [0.1, 0.2, 0.3],
       'k': [11,21,31]}).set_index('idx')
Addition
Adding Two DataFrame
- When using the + operator, non-matching row/column names will result in NA.
- .add() has the same result as + by default.
- However, add supports a fill_value= parameter. When specified, ONE-SIDED non-matching cells are assumed to have the value specified in fill_value. Only BOTH-SIDED non-matching cells will result in NaN.
df1 + df2
df1.add(df2)
df1.add(df2, fill_value=1)
## k x y z
## idx
## row1 NaN 23.0 NaN 0.2
## row2 NaN 43.0 NaN 0.4
## row3 NaN NaN NaN NaN
## row4 NaN NaN NaN NaN
## k x y z
## idx
## row1 NaN 23.0 NaN 0.2
## row2 NaN 43.0 NaN 0.4
## row3 NaN NaN NaN NaN
## row4 NaN NaN NaN NaN
## k x y z
## idx
## row1 12.0 23.0 2.0 0.2
## row2 22.0 43.0 3.0 0.4
## row3 NaN 31.0 4.0 1.3
## row4 32.0 34.0 NaN 1.3
Adding Series To DataFrame
- Specify the appropriate axis depending on the orientation of the series data.
- Column and Row names are respected in this operation.
- However, fill_value is not available when adding a Series.
- Columns or rows in the Series that are non-matching in the DataFrame will be created in the result. This behavior is similar to adding two DataFrames.
## Series to add into Rows
s3 = pd.Series([1,1,1], index=['row1','row2','row4'])

## Series to add into Columns
s4 = pd.Series([3,3,3], index=['x','y','s'])
print('Original DataFrame:\n',df1,'\n\n',
'Add Series as Rows: \n', df1.add(s3, axis=0), '\n\n',
'Add Series as Columns: \n', df1.add(s4, axis=1))
## Original DataFrame:
## x y z
## idx
## row1 10 1 0.1
## row2 20 2 0.2
## row3 30 3 0.3
##
## Add Series as Rows:
## x y z
## row1 11.0 2.0 1.1
## row2 21.0 3.0 1.2
## row3 NaN NaN NaN
## row4 NaN NaN NaN
##
## Add Series as Columns:
## s x y z
## idx
## row1 NaN 13.0 4.0 NaN
## row2 NaN 23.0 5.0 NaN
## row3 NaN 33.0 6.0 NaN
Subtraction
Refer to .add() above for the fill_value explanation.
df2 - df1
df2.sub(df1, fill_value=1000)
## k x y z
## idx
## row1 NaN 3.0 NaN 0.0
## row2 NaN 3.0 NaN 0.0
## row3 NaN NaN NaN NaN
## row4 NaN NaN NaN NaN
## k x y z
## idx
## row1 -989.0 3.0 999.0 0.0
## row2 -979.0 3.0 998.0 0.0
## row3 NaN 970.0 997.0 999.7
## row4 -969.0 -967.0 NaN -999.7
Operator &
- Non matching columns returns NaN.
- Matching columns with both True value returns True
- Matching columns with both True and False returns False
df1>5
df2<20
(df1>5) & (df2<20)
## x y z
## idx
## row1 True False False
## row2 True False False
## row3 True False False
## x z k
## idx
## row1 True True True
## row2 False True False
## row4 False True False
## k x y z
## idx
## row1 NaN True NaN False
## row2 NaN False NaN False
## row3 NaN False NaN False
## row4 NaN False NaN False
11.7.3 Attributes
df = pd.DataFrame(data)
df
## empID year1 year2 salary name
## 0 100 2017 2027 40000 Alice
## 1 101 2017 2027 24000 Bob
## 2 102 2017 2027 31000 Charles
## 3 103 2018 2028 20000 David
## 4 104 2018 2028 30000 Eric
df.shape            ## tuple (rows, columns)
df.index            ## default index is RangeIndex
df.index.values     ## array of integer
df.columns          ## array of string
df.columns.values   ## array of string
df.values           ## array of list
## (5, 5)
## RangeIndex(start=0, stop=5, step=1)
## array([0, 1, 2, 3, 4], dtype=int64)
## Index(['empID', 'year1', 'year2', 'salary', 'name'], dtype='object')
## array(['empID', 'year1', 'year2', 'salary', 'name'], dtype=object)
## array([[100, 2017, 2027, 40000, 'Alice'],
## [101, 2017, 2027, 24000, 'Bob'],
## [102, 2017, 2027, 31000, 'Charles'],
## [103, 2018, 2028, 20000, 'David'],
## [104, 2018, 2028, 30000, 'Eric']], dtype=object)
11.7.4 Index Manipulation
Sample Data
df
## empID year1 year2 salary name
## 0 100 2017 2027 40000 Alice
## 1 101 2017 2027 24000 Bob
## 2 102 2017 2027 31000 Charles
## 3 103 2018 2028 20000 David
## 4 104 2018 2028 30000 Eric
Convert Column To Index
set_index('column_name', inplace=False)
- Specified column will turn into Index
- inplace=True means don’t create a new dataframe. Modify existing DataFrame.
- inplace=False means return a new DataFrame
df.index                              ## 0,1,2,3,...
df.set_index('empID', inplace=True)
df.index                              ## 100,101,102,etc
df
## RangeIndex(start=0, stop=5, step=1)
## Int64Index([100, 101, 102, 103, 104], dtype='int64', name='empID')
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
## 103 2018 2028 20000 David
## 104 2018 2028 30000 Eric
Convert Index To Column
- Resetting index will re-sequence the index as 0,1,2 etc.
- Old index column will be converted back as normal column, its name retained as column name.
- Operation support inplace option.
df.reset_index(inplace=True)
df
## empID year1 year2 salary name
## 0 100 2017 2027 40000 Alice
## 1 101 2017 2027 24000 Bob
## 2 102 2017 2027 31000 Charles
## 3 103 2018 2028 20000 David
## 4 104 2018 2028 30000 Eric
Updating Index ( .index= )
- This simply replaces all index values with the new values.
- The number of elements in the new index must match the length of the DataFrame, otherwise an error is raised.
- Although the same label is allowed to repeat, this will be deprecated in future.
- This operation is not reversible.
df.index = [201, 202, 203, 204, 205]
df
## empID year1 year2 salary name
## 201 100 2017 2027 40000 Alice
## 202 101 2017 2027 24000 Bob
## 203 102 2017 2027 31000 Charles
## 204 103 2018 2028 20000 David
## 205 104 2018 2028 30000 Eric
Reordering Index (.reindex )
- Returns a new DataFrame according to the order specified. NO ‘inplace’ option.
- Doesn’t work if duplicate labels exist (Error)
- Only matching rows are returned.
- Non-matching index numbers will result in NaN, without Error.
Change the order of Index, always return a new dataframe
df.reindex([203,202,300])
## empID year1 year2 salary name
## 203 102.0 2017.0 2027.0 31000.0 Charles
## 202 101.0 2017.0 2027.0 24000.0 Bob
## 300 NaN NaN NaN NaN NaN
Rename Axis
- Example below renames the axis of the index.
- Use axis=0 for row index (default), use axis=1 for column index.
- Use the inplace option to apply changes to the DataFrame, otherwise it will return a new DataFrame.
df.rename_axis('super_id', axis=0, inplace=True)
df
## empID year1 year2 salary name
## super_id
## 201 100 2017 2027 40000 Alice
## 202 101 2017 2027 24000 Bob
## 203 102 2017 2027 31000 Charles
## 204 103 2018 2028 20000 David
## 205 104 2018 2028 30000 Eric
11.7.5 Subsetting Rows
dataframe.loc[ row_label ] # return series, single row
dataframe.loc[ row_label_list ] # multiple rows
dataframe.loc[ boolean_list ] # multiple rows
dataframe.iloc[ row_number ] # return series, single row
dataframe.iloc[ row_number_list ] # multiple rows
dataframe.iloc[ number_range ] # multiple rows
dataframe.sample(frac=) # frac = 0.6 means sampling 60% of rows randomly
Sample Data
df = pd.DataFrame(data).set_index('empID')
df
df2 = pd.DataFrame(data, index = ['row2','row3','row1','row5','row4'])
df2
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
## 103 2018 2028 20000 David
## 104 2018 2028 30000 Eric
## empID year1 year2 salary name
## row2 100 2017 2027 40000 Alice
## row3 101 2017 2027 24000 Bob
## row1 102 2017 2027 31000 Charles
## row5 103 2018 2028 20000 David
## row4 104 2018 2028 30000 Eric
Using Row Label (loc)
- Non matching single row label will result in KeyError.
- Non matching range of row labels will not trigger error.
Row Label is number
df.loc[101]   # by single row label, return Series
## year1 2017
## year2 2027
## salary 24000
## name Bob
## Name: 101, dtype: object
df.loc[ [100,103] ]   # by multiple row labels, returns DataFrame
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 103 2018 2028 20000 David
df.loc[ 100:103 ]   # by range of row labels, returns DataFrame
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
## 103 2018 2028 20000 David
df.loc[ 999:9999 ]   # invalid range, No Error !!
## Empty DataFrame
## Columns: [year1, year2, salary, name]
## Index: []
df.loc[ 999 ]   # invalid key, KeyError
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 999
Row Label is string
df2.loc[ 'row3' ]   # by single row label, return Series
## empID 101
## year1 2017
## year2 2027
## salary 24000
## name Bob
## Name: row3, dtype: object
df2.loc[ ['row1','row3'] ]   # by multiple row labels, returns DataFrame
## empID year1 year2 salary name
## row1 102 2017 2027 31000 Charles
## row3 101 2017 2027 24000 Bob
df2.loc[ 'row1':'row3' ]   # by range of row labels, return empty DataFrame because there is no row3 after row1
## Empty DataFrame
## Columns: [empID, year1, year2, salary, name]
## Index: []
df2.loc[ 'row1':'row4' ]   # by range of row labels, returns DataFrame
## empID year1 year2 salary name
## row1 102 2017 2027 31000 Charles
## row5 103 2018 2028 20000 David
## row4 104 2018 2028 30000 Eric
df2.loc[ 'row1':'baba' ]   # invalid range, KeyError
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 'baba'
Using Row Number (iloc)
Multiple rows returned as dataframe object
df.iloc[1]         # by single row number
df.iloc[ [0,3] ]   # by row numbers
df.iloc[ 0:3 ]     # by row number range
df.iloc[ 0:999 ]   # invalid range, no error
df.iloc[ 999 ]     # invalid row, IndexError
## year1 2017
## year2 2027
## salary 24000
## name Bob
## Name: 101, dtype: object
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 103 2018 2028 20000 David
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
## 103 2018 2028 20000 David
## 104 2018 2028 30000 Eric
## Error in py_call_impl(callable, dots$args, dots$keywords): IndexError: single positional indexer is out-of-bounds
Using Boolean List
criteria = (df.salary > 30000) & (df.year1==2017)
print (criteria)
print (df.loc[criteria])
## empID
## 100 True
## 101 False
## 102 True
## 103 False
## 104 False
## dtype: bool
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 102 2017 2027 31000 Charles
Using Expression (.query)
.query(expr, inplace=False)
num1 = 31000
num2 = 2017
df.query(f'salary<={num1} and year1=={num2}')
## year1 year2 salary name
## empID
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
Sampling (.sample)
Specify percentage of random rows to be returned.
np.random.seed(15)    ## ensure consistency of result
df.sample(frac=0.7)   ## randomly pick 70% of rows, without replacement
## year1 year2 salary name
## empID
## 102 2017 2027 31000 Charles
## 103 2018 2028 20000 David
## 104 2018 2028 30000 Eric
## 101 2017 2027 24000 Bob
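Besides frac=, sample() also accepts n= for an exact row count and replace=True for sampling with replacement; a small sketch (my addition):
df.sample(n=2, random_state=15)                 ## exactly 2 random rows
df.sample(n=6, replace=True, random_state=15)   ## n larger than len(df) requires replace=True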
11.7.6 Row Manipulation
Sample Data
df
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
## 103 2018 2028 20000 David
## 104 2018 2028 30000 Eric
Drop Rows (.drop)
.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
By Row Label(s)
df.drop(index=100)                           # drop single row
df.drop(index=[100,103], columns='salary')   # drop selected rows and columns
## year1 year2 salary name
## empID
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
## 103 2018 2028 20000 David
## 104 2018 2028 30000 Eric
## year1 year2 name
## empID
## 101 2017 2027 Bob
## 102 2017 2027 Charles
## 104 2018 2028 Eric
11.7.7 Column Manipulation
Sample Data
df = pd.DataFrame(data)
df
## empID year1 year2 salary name
## 0 100 2017 2027 40000 Alice
## 1 101 2017 2027 24000 Bob
## 2 102 2017 2027 31000 Charles
## 3 103 2018 2028 20000 David
## 4 104 2018 2028 30000 Eric
Renaming Columns
Method 1 : Rename All Columns (.columns =)
Construct the new column names; ensure there are no missing column names, which would result in an error.
new_columns = ['empID', 'year.1','year.2','salary', 'glamour']
df.columns = new_columns
df.head(2)
## empID year.1 year.2 salary glamour
## 0 100 2017 2027 40000 Alice
## 1 101 2017 2027 24000 Bob
Method 2 : Renaming Specific Column (.rename (columns=) )
- Change column name through rename function.
- Support inplace option.
- Specify only columns that require name change.
- Non matching columns will not return error. That is great.
df.rename( columns={'year.1':'year1', 'year.2':'year2'}, inplace=True)
df.head(2)
## empID year1 year2 salary glamour
## 0 100 2017 2027 40000 Alice
## 1 101 2017 2027 24000 Bob
Reordering Columns
- Reordering columns is basically returning a new DataFrame with the specified column list in the order preferred.
- Refer to Sub-setting Column section.
Create New Column
- New Column will be created instantly using [] notation
- DO NOT USE dot notation to create a new column, because it is a view-only attribute
df['year3'] = df.year1
df
## empID year1 year2 salary glamour year3
## 0 100 2017 2027 40000 Alice 2017
## 1 101 2017 2027 24000 Bob 2017
## 2 102 2017 2027 31000 Charles 2017
## 3 103 2018 2028 20000 David 2018
## 4 104 2018 2028 30000 Eric 2018
Dropping Columns (.drop)
dataframe.drop( columns='column_name', inplace=True/False) # delete single column
dataframe.drop( columns=list_of_colnames, inplace=True/False) # delete multiple column
dataframe.drop( index='row_label', inplace=True/False) # delete single row
dataframe.drop( index= list_of_row_labels, inplace=True/False) # delete multiple rows
inplace=True means column will be deleted from original dataframe. Default is False, which return a copy of dataframe
By Column Name(s)
df.drop( columns='year1')             # drop single column
df.drop( columns=['year1','year2'])   # drop multiple columns
## empID year2 salary glamour year3
## 0 100 2027 40000 Alice 2017
## 1 101 2027 24000 Bob 2017
## 2 102 2027 31000 Charles 2017
## 3 103 2028 20000 David 2018
## 4 104 2028 30000 Eric 2018
## empID salary glamour year3
## 0 100 40000 Alice 2017
## 1 101 24000 Bob 2017
## 2 102 31000 Charles 2017
## 3 103 20000 David 2018
## 4 104 30000 Eric 2018
By Column Number(s)
Use dataframe.columns to produce an interim list of column names.
df   ## before dropping columns
df.drop( columns=df.columns[[1,2]])   # drop second and third columns
df.drop( columns=df.columns[0:3] )    # drop first, second and third columns
## empID year1 year2 salary glamour year3
## 0 100 2017 2027 40000 Alice 2017
## 1 101 2017 2027 24000 Bob 2017
## 2 102 2017 2027 31000 Charles 2017
## 3 103 2018 2028 20000 David 2018
## 4 104 2018 2028 30000 Eric 2018
## empID salary glamour year3
## 0 100 40000 Alice 2017
## 1 101 24000 Bob 2017
## 2 102 31000 Charles 2017
## 3 103 20000 David 2018
## 4 104 30000 Eric 2018
## salary glamour year3
## 0 40000 Alice 2017
## 1 24000 Bob 2017
## 2 31000 Charles 2017
## 3 20000 David 2018
## 4 30000 Eric 2018
11.7.8 Subsetting Columns
Select Single Column Return Series
dataframe.columnName # single column, name based, return Series object
dataframe[ single_col_name ] # single column, name based, return Series object
Select Single/Multiple Columns Return DataFrame
dataframe [ list_of_col_names ] # name based, return Dataframe object
dataframe.loc[ : , single_col_name ] # single column, series
dataframe.loc[ : , col_name_list ] # multiple columns, dataframe
dataframe.loc[ : , col_name_range ] # multiple columns, dataframe
dataframe.iloc[ : , col_number ] # single column, series
dataframe.iloc[ : , col_number_list ] # multiple columns, dataframe
dataframe.iloc[ : , number_range ] # multiple columns, dataframe
By Number (.iloc) and Label (.loc)
Get Single Column (.loc, iloc)
Selecting a single column always returns a panda::Series, except when using [[ ]] notation.
df = pd.DataFrame(data)

df.name
df['name']
df.loc[:, 'name']
df.iloc[:, 3]
## 0 Alice
## 1 Bob
## 2 Charles
## 3 David
## 4 Eric
## Name: name, dtype: object
## 0 Alice
## 1 Bob
## 2 Charles
## 3 David
## 4 Eric
## Name: name, dtype: object
## 0 Alice
## 1 Bob
## 2 Charles
## 3 David
## 4 Eric
## Name: name, dtype: object
## 0 40000
## 1 24000
## 2 31000
## 3 20000
## 4 30000
## Name: salary, dtype: int64
Get Multiple Columns (.loc, iloc)
Selecting multiple columns returns a panda::DataFrame object.
Example below returns DataFrame with Single Column
df[['name']]   # return one column dataframe
df[['name','year1']]
df.loc[:,['name','year1']]
## name
## 0 Alice
## 1 Bob
## 2 Charles
## 3 David
## 4 Eric
## name year1
## 0 Alice 2017
## 1 Bob 2017
## 2 Charles 2017
## 3 David 2018
## 4 Eric 2018
## name year1
## 0 Alice 2017
## 1 Bob 2017
## 2 Charles 2017
## 3 David 2018
## 4 Eric 2018
Select Range of Columns
df.loc [ : , 'year1':'year2']    ## by range of column names
df.loc[ : , ['empID','year2']]   ## select two columns only
df.iloc[ : , 1:4]                ## by range of column number
df.iloc[ : , [0,3]]              ## select two columns only
## year1 year2
## 0 2017 2027
## 1 2017 2027
## 2 2017 2027
## 3 2018 2028
## 4 2018 2028
## empID year2
## 0 100 2027
## 1 101 2027
## 2 102 2027
## 3 103 2028
## 4 104 2028
## year1 year2 salary
## 0 2017 2027 40000
## 1 2017 2027 24000
## 2 2017 2027 31000
## 3 2018 2028 20000
## 4 2018 2028 30000
## empID salary
## 0 100 40000
## 1 101 24000
## 2 102 31000
## 3 103 20000
## 4 104 30000
By Column Name (.filter, .reindex)
.filter(items=None, like=None, regex=None, axis=1)
Filter by like
- Substring matching on column names; always returns a DataFrame.
df.filter( like='year', axis='columns')   ## or axis = 1
## year1 year2
## 0 2017 2027
## 1 2017 2027
## 2 2017 2027
## 3 2018 2028
## 4 2018 2028
df.filter( like='name')
## name
## 0 Alice
## 1 Bob
## 2 Charles
## 3 David
## 4 Eric
Filter by items
- a list of exact column names; names that do not exist are silently ignored
df.filter( items=('year1','year2', 'name', 'not_exist'), axis=1)
## year1 year2 name
## 0 2017 2027 Alice
## 1 2017 2027 Bob
## 2 2017 2027 Charles
## 3 2018 2028 David
## 4 2018 2028 Eric
Filter by regex - regular expression matching on column names.
Select column names that contain a digit:
df.filter(regex='\d')   ## default axis=1 for a DataFrame
## year1 year2
## 0 2017 2027
## 1 2017 2027
## 2 2017 2027
## 3 2018 2028
## 4 2018 2028
.reindex(columns = .. )
- Returns a new DataFrame; there is no inplace option for reordering columns
- Non-matching column names result in NaN values
new_colorder = [ 'salary', 'year1', 'baba']
df.reindex(columns = new_colorder)   ## non-matching column name returns NaN
## salary year1 baba
## 0 40000 2017 NaN
## 1 24000 2017 NaN
## 2 31000 2017 NaN
## 3 20000 2018 NaN
## 4 30000 2018 NaN
By Data Types (.select_dtypes)
Always returns a pandas DataFrame, even when only a single column matches. Allowed types are:
- number (integer and float)
- integer / int64
- float
- datetime
- timedelta
- category
- bool
- object (string)
Use .dtypes.value_counts() to inspect the available data types.
df.dtypes.value_counts()                        ## inspect available data types
df.select_dtypes( exclude='integer')            ## exclude integer cols
df.select_dtypes( include=['number','object'])  ## include number and object cols
## int64 4
## object 1
## dtype: int64
## name
## 0 Alice
## 1 Bob
## 2 Charles
## 3 David
## 4 Eric
## empID year1 year2 salary name
## 0 100 2017 2027 40000 Alice
## 1 101 2017 2027 24000 Bob
## 2 102 2017 2027 31000 Charles
## 3 103 2018 2028 20000 David
## 4 104 2018 2028 30000 Eric
11.7.9 Concatenating
Concatenating Two DataFrames
- When ignore_index=True, pandas creates a new 0,1,2,3… index in the result.
- Ignoring the index is recommended if the source indexes are not unique.
- Non-matching columns or rows are added to the result (missing values filled with NaN).
my_df = pd.DataFrame(
    data = {'Id':   [10,20,30],
            'Name': ['Aaa','Bbb','Ccc']})   # .set_index('Id')

my_df_new = pd.DataFrame(
    data = {'Id':   [40,50],
            'Name': ['Ddd','Eee'],
            'Age':  [12,13]})               # .set_index('Id')

my_df_append  = pd.concat( [my_df, my_df_new])
my_df_noindex = pd.concat( [my_df, my_df_new], ignore_index=True)
print("Original DataFrame:\n", my_df,
"\n\nTo Be Appended DataFrame:\n", my_df_new,
"\n\nAppended DataFrame (index maintained):\n", my_df_append,
"\n\nAppended DataFrame (index ignored):\n", my_df_noindex)
## Original DataFrame:
## Id Name
## 0 10 Aaa
## 1 20 Bbb
## 2 30 Ccc
##
## To Be Appended DataFrame:
## Id Name Age
## 0 40 Ddd 12
## 1 50 Eee 13
##
## Appended DataFrame (index maintained):
## Id Name Age
## 0 10 Aaa NaN
## 1 20 Bbb NaN
## 2 30 Ccc NaN
## 0 40 Ddd 12.0
## 1 50 Eee 13.0
##
## Appended DataFrame (index ignored):
## Id Name Age
## 0 10 Aaa NaN
## 1 20 Bbb NaN
## 2 30 Ccc NaN
## 3 40 Ddd 12.0
## 4 50 Eee 13.0
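Concatenation can also be column-wise. A minimal sketch using the frames above: with axis=1, the frames are aligned on their index labels and placed side by side, with non-matching labels filled with NaN.
pd.concat( [my_df, my_df_new], axis=1)   ## columns of both frames, aligned on index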
11.7.10 Slicing
Sample Data
df = pd.DataFrame(data).set_index('empID')
df
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
## 103 2018 2028 20000 David
## 104 2018 2028 30000 Eric
Getting One Cell
By Row Label and Column Name (loc)
dataframe.loc [ row_label , col_name ] # by row label and column names
dataframe.loc [ bool_list , col_name ]   # by boolean list and column name
dataframe.iloc[ row_number, col_number ] # by row and column number
df.loc [ 102, 'year1' ]   ## row label 102, column 'year1'
df.iloc[ 2, 0 ]           ## same result as above
## 2017
## 2017
Getting Range of Cells
Specify rows and columns (by individual or range)
dataframe.loc [ list/range_of_row_labels , list/range_col_names ] # by row label and column names
dataframe.iloc[ list/range_row_numbers, list/range_col_numbers ] # by row number
By Index and Column Name (loc)
df.loc[ [101,103], ['name','year1'] ]   # by list of row labels and column names
df.loc[ 101:104, 'year1':'year2' ]      # by range of row labels and column names
## name year1
## empID
## 101 Bob 2017
## 103 David 2018
## year1 year2
## empID
## 101 2017 2027
## 102 2017 2027
## 103 2018 2028
## 104 2018 2028
By Boolean Row and Column Names (loc)
b = df.year1==2017
df.loc[ b ]
## year1 year2 salary name
## empID
## 100 2017 2027 40000 Alice
## 101 2017 2027 24000 Bob
## 102 2017 2027 31000 Charles
By Row and Column Number (iloc)
print (df.iloc[ [1,4], [0,3]],'\n' ) # by individual rows/columns
## year1 name
## empID
## 101 2017 Bob
## 104 2018 Eric
print (df.iloc[ 1:4 , 0:3], '\n') # by range
## year1 year2 salary
## empID
## 101 2017 2027 24000
## 102 2017 2027 31000
## 103 2018 2028 20000
11.7.11 Chained Indexing
Chained indexing creates a copy of the DataFrame; assignments made through the chain modify that copy, not the original data.
dataframe.loc [...] [...]
dataframe.iloc [...] [...]
Recommendation: never use chained indexing.
df = pd.DataFrame(
    {'empID':  [100, 101, 102, 103, 104],
     'year1':  [2017, 2017, 2017, 2018, 2018],
     'name':   ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
     'year2':  [2001, 1907, 2003, 1998, 2011],
     'salary': [40000, 24000, 31000, 20000, 30000]},
    columns = ['year1','salary','year2','empID','name']).set_index(['empID'])
df
## year1 salary year2 name
## empID
## 100 2017 40000 2001 Alice
## 101 2017 24000 1907 Bob
## 102 2017 31000 2003 Charles
## 103 2018 20000 1998 David
## 104 2018 30000 2011 Eric
df.loc[100]['year1'] = 2000
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df   ## notice row label 100 was not updated, because the assignment modified a copy (chained indexing)
## year1 salary year2 name
## empID
## 100 2017 40000 2001 Alice
## 101 2017 24000 1907 Bob
## 102 2017 31000 2003 Charles
## 103 2018 20000 1998 David
## 104 2018 30000 2011 Eric
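To actually update the original DataFrame, do the row and column selection in a single .loc call instead (a minimal sketch):
df.loc[100, 'year1'] = 2000   ## one-step label access modifies df in place, no warning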
11.7.12 Find and Replace
Slicing selects rectangular blocks of cells. Use mask or where to select and replace specific cell(s); these functions respect column and row names.
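A minimal sketch of both, assuming the employee DataFrame above with its numeric salary column:
df['salary'].where( df['salary'] >= 25000, other=0 )   ## keep cells meeting the condition; replace the rest with 0
df['salary'].mask ( df['salary'] >= 25000, other=0 )   ## the inverse: replace cells meeting the condition with 0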
11.7.13 Iteration
Iterating Rows (.iterrows)
df = pd.DataFrame(data=
    {'empID': [100, 101, 102, 103, 104],
     'Name':  ['Alice', 'Bob', 'Charles', 'David', 'Eric'],
     'Year':  [1999, 1988, 2001, 2010, 2020]}).set_index(['empID'])

for idx, row in df.iterrows():
    print(idx, row.Name)
## 100 Alice
## 101 Bob
## 102 Charles
## 103 David
## 104 Eric
Iterating Columns (.items)
for label, content in df.items():
    print('Column Label:', label, '\n\n',
          'Column Content (Series):\n', content, '\n\n')
## Column Label: Name
##
## Column Content (Series):
## empID
## 100 Alice
## 101 Bob
## 102 Charles
## 103 David
## 104 Eric
## Name: Name, dtype: object
##
##
## Column Label: Year
##
## Column Content (Series):
## empID
## 100 1999
## 101 1988
## 102 2001
## 103 2010
## 104 2020
## Name: Year, dtype: int64
11.7.14 Data Structure
Inspect Structure
Find out the column names and data types in one summary. The output is for display only, not a data object.
df.info()   # return text output
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 5 entries, 100 to 104
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Name 5 non-null object
## 1 Year 5 non-null int64
## dtypes: int64(1), object(1)
## memory usage: 120.0+ bytes
df.dtypes.value_counts()   # return Series
## object 1
## int64 1
## dtype: int64
Format Conversion
.to_dict()
df.to_dict('dict')      ## dict of dicts, keyed by column
## {'Name': {100: 'Alice', 101: 'Bob', 102: 'Charles', 103: 'David', 104: 'Eric'}, 'Year': {100: 1999, 101: 1988, 102: 2001, 103: 2010, 104: 2020}}
df.to_dict('index')     ## dict of dicts, keyed by row index
## {100: {'Name': 'Alice', 'Year': 1999}, 101: {'Name': 'Bob', 'Year': 1988}, 102: {'Name': 'Charles', 'Year': 2001}, 103: {'Name': 'David', 'Year': 2010}, 104: {'Name': 'Eric', 'Year': 2020}}
df.to_dict('list')      ## dict of lists
## {'Name': ['Alice', 'Bob', 'Charles', 'David', 'Eric'], 'Year': [1999, 1988, 2001, 2010, 2020]}
df.to_dict('records')   ## list of dicts
## [{'Name': 'Alice', 'Year': 1999}, {'Name': 'Bob', 'Year': 1988}, {'Name': 'Charles', 'Year': 2001}, {'Name': 'David', 'Year': 2010}, {'Name': 'Eric', 'Year': 2020}]
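Going the other way, the 'index'-oriented dict can be rebuilt into a DataFrame (a minimal sketch):
pd.DataFrame.from_dict( df.to_dict('index'), orient='index')   ## index restored from the outer dict keys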
11.8 Class: MultiIndex
A MultiIndex gives a DataFrame several levels of column (or row) headers.
11.8.1 The Data
df = pd.DataFrame({
    'myindex': [0, 1, 2],
    'One_X':   [1.1, 1.1, 1.1],
    'One_Y':   [1.2, 1.2, 1.2],
    'Two_X':   [1.11, 1.11, 1.11],
    'Two_Y':   [1.22, 1.22, 1.22]})
df.set_index('myindex', inplace=True)
df
## One_X One_Y Two_X Two_Y
## myindex
## 0 1.1 1.2 1.11 1.22
## 1 1.1 1.2 1.11 1.22
## 2 1.1 1.2 1.11 1.22
11.8.2 Creating MultiIndex Object
Create From Tuples
A MultiIndex can easily be created from tuples:
- Step 1: Create a MultiIndex object by splitting each column name into a tuple
- Step 2: Assign the MultiIndex object to the DataFrame's columns property.
my_tuples  = [tuple(c.split('_')) for c in df.columns]
df.columns = pd.MultiIndex.from_tuples(my_tuples)

print(' Column Headers :\n\n', my_tuples,
      '\n\nNew Columns: \n\n', df.columns,
      '\n\nTwo Layers Header DF:\n\n', df)
## Column Headers :
##
## [('One', 'X'), ('One', 'Y'), ('Two', 'X'), ('Two', 'Y')]
##
## New Columns:
##
## MultiIndex([('One', 'X'),
## ('One', 'Y'),
## ('Two', 'X'),
## ('Two', 'Y')],
## )
##
## Two Layers Header DF:
##
## One Two
## X Y X Y
## myindex
## 0 1.1 1.2 1.11 1.22
## 1 1.1 1.2 1.11 1.22
## 2 1.1 1.2 1.11 1.22
11.8.3 MultiIndex Object
Levels
- A MultiIndex object contains multiple levels; each level (header) is an Index object.
- Use MultiIndex.get_level_values() to get the entire header for the desired level.
print(df.columns.get_level_values(0), '\n',
      df.columns.get_level_values(1))
## Index(['One', 'One', 'Two', 'Two'], dtype='object')
## Index(['X', 'Y', 'X', 'Y'], dtype='object')
MultiIndex.levels
returns the unique values of each level.
print(df.columns.levels[0], '\n',
      df.columns.levels[1])
## Index(['One', 'Two'], dtype='object')
## Index(['X', 'Y'], dtype='object')
11.8.4 Selecting Column(s)
Sample Data
import itertools

### Create The Columns Tuples
level0_sex = ['Male','Female','Pondan']
level1_age = ['Medium','High','Low']
my_columns = list(itertools.product(level0_sex, level1_age))

test_df = pd.DataFrame([
    [ 1,  2,  3,  4,  5,  6,  7,  8,  9],
    [11, 12, 13, 14, 15, 16, 17, 18, 19],
    [21, 22, 23, 24, 25, 26, 27, 28, 29]], index=['row1','row2','row3'])

### Create MultiIndex From Tuples
test_df.columns = pd.MultiIndex.from_tuples(my_columns)
print( test_df )
## Male Female Pondan
## Medium High Low Medium High Low Medium High Low
## row1 1 2 3 4 5 6 7 8 9
## row2 11 12 13 14 15 16 17 18 19
## row3 21 22 23 24 25 26 27 28 29
Select Level0 Header(s)
Use [L0] notation, where L0 is a list of header names.
print( test_df[['Male','Pondan']], '\n\n',   ## Include multiple Level0 Headers
       test_df['Male'],            '\n\n',   ## Include single Level0 Header
       test_df.Male )                        ## Same as above
## Male Pondan
## Medium High Low Medium High Low
## row1 1 2 3 7 8 9
## row2 11 12 13 17 18 19
## row3 21 22 23 27 28 29
##
## Medium High Low
## row1 1 2 3
## row2 11 12 13
## row3 21 22 23
##
## Medium High Low
## row1 1 2 3
## row2 11 12 13
## row3 21 22 23
Using .loc[]
Use .loc[ :, L0 ], where L0 is a list of header names.
print( test_df.loc[:, ['Male','Pondan']], '\n\n',   ## Multiple Level0 Headers
       test_df.loc[:, 'Male'] )                     ## Single Level0 Header
## Male Pondan
## Medium High Low Medium High Low
## row1 1 2 3 7 8 9
## row2 11 12 13 17 18 19
## row3 21 22 23 27 28 29
##
## Medium High Low
## row1 1 2 3
## row2 11 12 13
## row3 21 22 23
Selecting Level 1 Header(s)
Use .loc[ :, (All, L1) ], where L1 is a list of header names.
All = slice(None)
print( test_df.loc[ : , (All, 'High')], '\n\n',     ## Single L1 header
       test_df.loc[ : , (All, ['High','Low'])] )    ## Multiple L1 headers
## Male Female Pondan
## High High High
## row1 2 5 8
## row2 12 15 18
## row3 22 25 28
##
## Male Female Pondan Male Female Pondan
## High High High Low Low Low
## row1 2 5 8 3 6 9
## row2 12 15 18 13 16 19
## row3 22 25 28 23 26 29
Select Level 0 and Level1 Headers
Use .loc[ :, (L0, L1) ], where L0 and L1 are lists of header names.
test_df.loc[ : , (['Male','Pondan'], ['Medium','High'])]
## Male Pondan
## Medium High Medium High
## row1 1 2 7 8
## row2 11 12 17 18
## row3 21 22 27 28
Select single L0,L1 Header
Use .loc[ :, (L0, L1) ]; the result is a Series.
Use .loc[ :, (L0, [L1]) ]; the result is a DataFrame.
print( test_df.loc[ : , ('Female', 'High')], '\n\n',
       test_df.loc[ : , ('Female', ['High'])] )
## row1 5
## row2 15
## row3 25
## Name: (Female, High), dtype: int64
##
## Female
## High
## row1 5
## row2 15
## row3 25
11.8.5 Headers Ordering
Note that the column order specified in a [ ] selection is not respected. This can be remedied by sorting or rearranging.
Sort Headers
Use .sort_index() on the DataFrame to sort the headers. Note that sorting by level 1 jumbles up the level 0 headers.
test_df_sorted_l0 = test_df.sort_index(axis=1, level=0)
test_df_sorted_l1 = test_df.sort_index(axis=1, level=1, ascending=False)
print(test_df, '\n\n', test_df_sorted_l0, '\n\n', test_df_sorted_l1)
## Male Female Pondan
## Medium High Low Medium High Low Medium High Low
## row1 1 2 3 4 5 6 7 8 9
## row2 11 12 13 14 15 16 17 18 19
## row3 21 22 23 24 25 26 27 28 29
##
## Female Male Pondan
## High Low Medium High Low Medium High Low Medium
## row1 5 6 4 2 3 1 8 9 7
## row2 15 16 14 12 13 11 18 19 17
## row3 25 26 24 22 23 21 28 29 27
##
## Pondan Male Female Pondan Male Female Pondan Male Female
## Medium Medium Medium Low Low Low High High High
## row1 7 1 4 9 3 6 8 2 5
## row2 17 11 14 19 13 16 18 12 15
## row3 27 21 24 29 23 26 28 22 25
Rearranging Headers
Use .reindex() to arrange columns in a specific order. The example below shows how to control the order of the level 1 headers.
cats = ['Low','Medium','High']
test_df.reindex(cats, level=1, axis=1)
## Male Female Pondan
## Low Medium High Low Medium High Low Medium High
## row1 3 1 2 6 4 5 9 7 8
## row2 13 11 12 16 14 15 19 17 18
## row3 23 21 22 26 24 25 29 27 28
11.8.6 Stacking and Unstacking
df.stack()
## One Two
## myindex
## 0 X 1.1 1.11
## Y 1.2 1.22
## 1 X 1.1 1.11
## Y 1.2 1.22
## 2 X 1.1 1.11
## Y 1.2 1.22
Stacking Columns to Rows
Stacking with DataFrame.stack(level_no) moves the given column header level into the rows (wide to long).
print('Stacking Header Level 0: \n\n', df.stack(0),
'\n\nStacking Header Level 1: \n\n', df.stack(1))
## Stacking Header Level 0:
##
## X Y
## myindex
## 0 One 1.10 1.20
## Two 1.11 1.22
## 1 One 1.10 1.20
## Two 1.11 1.22
## 2 One 1.10 1.20
## Two 1.11 1.22
##
## Stacking Header Level 1:
##
## One Two
## myindex
## 0 X 1.1 1.11
## Y 1.2 1.22
## 1 X 1.1 1.11
## Y 1.2 1.22
## 2 X 1.1 1.11
## Y 1.2 1.22
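Unstacking is the inverse: it moves a row level back into the column headers. A minimal sketch using the frame above:
stacked = df.stack(0)   ## header level 0 (One/Two) moves into the rows
stacked.unstack()       ## the innermost row level moves back to the columns, nested under X/Y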
11.8.7 Exploratory Analysis
Sample Data
df
## One Two
## X Y X Y
## myindex
## 0 1.1 1.2 1.11 1.22
## 1 1.1 1.2 1.11 1.22
## 2 1.1 1.2 1.11 1.22
All Stats in One - .describe()
df.describe(include='number') # default
df.describe(include='object') # display for non-numeric columns
df.describe(include='all') # display both numeric and non-numeric
When applied to a DataFrame object, describe shows all basic statistics for all numeric columns:
- Count (non-NA)
- Unique (for string)
- Top (for string)
- Frequency (for string)
- Percentile
- Mean
- Min / Max
- Standard Deviation
For Numeric Columns only
You can customize the percentiles required. Notice the 0.5 percentile is always included even if not specified.
df.describe()
## One Two
## X Y X Y
## count 3.0 3.0 3.00 3.00
## mean 1.1 1.2 1.11 1.22
## std 0.0 0.0 0.00 0.00
## min 1.1 1.2 1.11 1.22
## 25% 1.1 1.2 1.11 1.22
## 50% 1.1 1.2 1.11 1.22
## 75% 1.1 1.2 1.11 1.22
## max 1.1 1.2 1.11 1.22
df.describe(percentiles=[0.9,0.3,0.2,0.1])
## One Two
## X Y X Y
## count 3.0 3.0 3.00 3.00
## mean 1.1 1.2 1.11 1.22
## std 0.0 0.0 0.00 0.00
## min 1.1 1.2 1.11 1.22
## 10% 1.1 1.2 1.11 1.22
## 20% 1.1 1.2 1.11 1.22
## 30% 1.1 1.2 1.11 1.22
## 50% 1.1 1.2 1.11 1.22
## 90% 1.1 1.2 1.11 1.22
## max 1.1 1.2 1.11 1.22
For both Numeric and Object
df.describe(include='all')
## One Two
## X Y X Y
## count 3.0 3.0 3.00 3.00
## mean 1.1 1.2 1.11 1.22
## std 0.0 0.0 0.00 0.00
## min 1.1 1.2 1.11 1.22
## 25% 1.1 1.2 1.11 1.22
## 50% 1.1 1.2 1.11 1.22
## 75% 1.1 1.2 1.11 1.22
## max 1.1 1.2 1.11 1.22
min/max/mean/median
df.min()   # default axis=0, column-wise
## One X 1.10
## Y 1.20
## Two X 1.11
## Y 1.22
## dtype: float64
df.min(axis=1)   # axis=1, row-wise
## myindex
## 0 1.1
## 1 1.1
## 2 1.1
## dtype: float64
Observe that sum on strings concatenates column-wise, whereas row-wise sum only adds up the numeric fields.
df.sum(0)
## One X 3.30
## Y 3.60
## Two X 3.33
## Y 3.66
## dtype: float64
df.sum(1)
## myindex
## 0 4.63
## 1 4.63
## 2 4.63
## dtype: float64
11.9 Class: Categories
11.9.1 Creating
From List
Basic (Auto Category Mapping)
The basic syntax returns a Categorical whose codes 0,1,2,… map to the sorted unique values.
In this case: high(0), low(1), medium(2).
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp)
temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['high', 'low', 'medium']
type( temp_cat )
## <class 'pandas.core.arrays.categorical.Categorical'>
Manual Category Mapping
During creation, we can specify the mapping of codes to categories: low(0), medium(1), high(2).
temp_cat = pd.Categorical(temp, categories=['low','medium','high'])
temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['low', 'medium', 'high']
From Series
- We can 'add' categorical structure to a Series. With these methods, an additional property (.cat) is exposed as the categorical accessor
- Through this accessor you gain access to category properties such as .codes and .categories; the values themselves live in the Series
- To map categories manually, build the Series from an explicit Categorical, e.g. pd.Series(pd.Categorical(temp, categories=['low','medium','high']))
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Series(temp, dtype='category')
print (type(temp_cat))       # Series object
## <class 'pandas.core.series.Series'>
print (type(temp_cat.cat)) # Categorical Accessor
## <class 'pandas.core.arrays.categorical.CategoricalAccessor'>
- The method below achieves the same result using .astype('category')
- It is useful for adding category structure to an existing Series.
temp_ser = pd.Series(temp)
temp_cat = temp_ser.astype('category')
print (type(temp_cat))       # Series object
## <class 'pandas.core.series.Series'>
print (type(temp_cat.cat)) # Categorical Accessor
## <class 'pandas.core.arrays.categorical.CategoricalAccessor'>
temp_cat.cat.categories
## Index(['high', 'low', 'medium'], dtype='object')
11.9.1.1 Ordering Category
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp, categories=['low','medium','high'], ordered=True)
temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['low' < 'medium' < 'high']
temp_cat.codes
## array([0, 2, 1, 2, 2, 0, 1, 1, 2], dtype=int8)
temp_cat[0] < temp_cat[3]
## False
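The False above can be surprising: indexing a Categorical with a scalar returns the plain value ('low'), so the comparison falls back to ordinary string ordering, where 'low' < 'high' is False. To compare using the category order, operate on the Categorical itself (a minimal sketch):
temp_cat < 'high'   ## element-wise comparison using the order low < medium < high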
11.9.2 Properties
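A minimal sketch of the commonly used properties, using temp_cat from above:
temp_cat.categories   ## Index of category labels
temp_cat.codes        ## int8 array mapping each element to its category
temp_cat.ordered      ## True when the categories are ordered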
11.9.3 Rename Category
Rename To New Category Object
The .rename_categories() method returns a new Categorical object with the changed categories.
temp = ['low','high','medium','high','high','low','medium','medium','high']
new_temp_cat = temp_cat.rename_categories(['sejuk','sederhana','panas'])
new_temp_cat
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (3, object): ['sejuk' < 'sederhana' < 'panas']
temp_cat   # original category object's categories not changed
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['low' < 'medium' < 'high']
11.9.3.1 Rename Inplace
Observe that the original categories are changed by assigning directly to .categories.
temp_cat.categories = ['sejuk','sederhana','panas']
temp_cat   # original category object's categories are changed
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (3, object): ['sejuk' < 'sederhana' < 'panas']
11.9.4 Adding New Category
This returns a new Categorical object with the added categories.
temp_cat_more = temp_cat.add_categories(['susah','senang'])
temp_cat_more
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (5, object): ['sejuk' < 'sederhana' < 'panas' < 'susah' < 'senang']
Removing Category
This is not in place; it returns a new Categorical object.
11.9.4.1 Remove Specific Categor(ies)
Elements whose category is removed become NaN.
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp)
temp_cat_removed = temp_cat.remove_categories('low')
temp_cat_removed
## [NaN, 'high', 'medium', 'high', 'high', NaN, 'medium', 'medium', 'high']
## Categories (2, object): ['high', 'medium']
Remove Unused Category
Since the removed categories are not used by any element, the elements are unaffected.
print (temp_cat_more)
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (5, object): ['sejuk' < 'sederhana' < 'panas' < 'susah' < 'senang']
temp_cat_more.remove_unused_categories()
## ['sejuk', 'panas', 'sederhana', 'panas', 'panas', 'sejuk', 'sederhana', 'sederhana', 'panas']
## Categories (3, object): ['sejuk' < 'sederhana' < 'panas']
11.9.5 Add and Remove Categories In One Step - .set_categories()
temp = ['low','high','medium','high','high','low','medium','medium','high']
temp_cat = pd.Categorical(temp, ordered=True)
temp_cat
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['high' < 'low' < 'medium']
temp_cat.set_categories(['low','medium','sederhana','susah','senang'])
## ['low', NaN, 'medium', NaN, NaN, 'low', 'medium', 'medium', NaN]
## Categories (5, object): ['low' < 'medium' < 'sederhana' < 'susah' < 'senang']
11.9.6 Categorical Descriptive Analysis
At One Glance
temp_cat.describe()
## counts freqs
## categories
## high 4 0.444444
## low 2 0.222222
## medium 3 0.333333
Minimum Category, Maximum Category, and Most Frequent Category (mode)
( temp_cat.min(), temp_cat.max(), temp_cat.mode() )
## ('high', 'medium', ['high']
## Categories (3, object): ['high' < 'low' < 'medium'])
##
## <string>:1: FutureWarning: Categorical.mode is deprecated and will be removed in a future version. Use Series.mode instead.
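Following the FutureWarning above, the non-deprecated spelling wraps the Categorical in a Series first (a sketch):
pd.Series(temp_cat).mode()   ## Series.mode replaces the deprecated Categorical.mode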
11.9.7 Other Methods
.get_values()
Since the actual values stored by a Categorical object are integer codes, get_values() returns the values translated from the .codes property.
temp_cat   # array
## ['low', 'high', 'medium', 'high', 'high', 'low', 'medium', 'medium', 'high']
## Categories (3, object): ['high' < 'low' < 'medium']
11.10 Dummies
- get_dummies creates a column for each category
- The underlying data can be string or pd.Categorical
- It produces a new pd.DataFrame
11.10.1 Sample Data
df = pd.DataFrame(
    {'A': ['A1','A2','A3','A1','A3','A1'],
     'B': ['B1','B2','B3','B1','B1','B3'],
     'C': ['C1','C2','C3','C1',np.nan,np.nan]})
df
## A B C
## 0 A1 B1 C1
## 1 A2 B2 C2
## 2 A3 B3 C3
## 3 A1 B1 C1
## 4 A3 B1 NaN
## 5 A1 B3 NaN
11.10.2 Dummies on Array-Like Data
pd.get_dummies(df.A)
## A1 A2 A3
## 0 1 0 0
## 1 0 1 0
## 2 0 0 1
## 3 1 0 0
## 4 0 0 1
## 5 1 0 0
11.10.3 Dummies on DataFrame (multiple columns)
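Passing several columns at once encodes each of them, prefixing every dummy with its source column name. A minimal sketch, assuming cols (used in the later examples) lists the columns to encode:
cols = ['A','B']
pd.get_dummies(df[cols])   ## one dummy column per category, per selected column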
11.10.4 Dummies with na
By default, NaN values are ignored.
pd.get_dummies(df.C)
## C1 C2 C3
## 0 1 0 0
## 1 0 1 0
## 2 0 0 1
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
Make NaN a dummy variable:
pd.get_dummies(df.C, dummy_na=True)
## C1 C2 C3 NaN
## 0 1 0 0 0
## 1 0 1 0 0
## 2 0 0 1 0
## 3 1 0 0 0
## 4 0 0 0 1
## 5 0 0 0 1
11.10.5 Specify Prefixes
pd.get_dummies(df.A, prefix='col')
## col_A1 col_A2 col_A3
## 0 1 0 0
## 1 0 1 0
## 2 0 0 1
## 3 1 0 0
## 4 0 0 1
## 5 1 0 0
pd.get_dummies(df[cols], prefix=['colA','colB'])
## colA_A1 colA_A2 colA_A3 colB_B1 colB_B2 colB_B3
## 0 1 0 0 1 0 0
## 1 0 1 0 0 1 0
## 2 0 0 1 0 0 1
## 3 1 0 0 1 0 0
## 4 0 0 1 1 0 0
## 5 1 0 0 0 0 1
11.10.6 Dropping First Column
- Dummies cause a collinearity issue for regression, since one column is redundant.
- Technically, dropping a column does not lose any information.
pd.get_dummies(df[cols], drop_first=True)
## A_A2 A_A3 B_B2 B_B3
## 0 0 0 0 0
## 1 1 0 1 0
## 2 0 1 0 1
## 3 0 0 0 0
## 4 0 1 0 0
## 5 0 0 0 1
11.11 DataFrameGroupBy
- groupby() is a DataFrame method; it returns a DataFrameGroupBy object
- The DataFrameGroupBy object opens the door to DataFrame aggregation and summarization
- DataFrameGroupBy is a very flexible abstraction: in many ways you can simply treat it as a collection of DataFrames, and it does the difficult things under the hood
11.11.1 Sample Data
company = pd.read_csv('data/company.csv')
company
## Company Department Name Age Salary Birthdate
## 0 C1 D1 Yong 45 15000 1/1/1970
## 1 C1 D1 Chew 35 12000 2/1/1980
## 2 C1 D2 Lim 34 8000 2/19/1977
## 3 C1 D3 Jessy 23 2500 3/15/1990
## 4 C1 D3 Hoi Ming 55 25000 4/15/1987
## .. ... ... ... ... ... ...
## 13 C3 D3 Chang 32 7900 7/26/1973
## 14 C3 D1 Ong 44 17500 8/21/1980
## 15 C3 D2 Lily 41 15300 7/17/1990
## 16 C3 D3 Sally 54 21000 7/19/1968
## 17 C3 D3 Esther 37 13500 3/16/1969
##
## [18 rows x 6 columns]
11.11.2 Creating Groups
Groups can be created on a single column or on multiple columns.
com_grp     = company.groupby('Company')                 ## Single Column
com_dep_grp = company.groupby(['Company','Department'])  ## Multiple Columns
type(com_dep_grp)
## <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
11.11.3 Properties
Row Numbers Association
The .groups property is a dictionary mapping each group key (which identifies the group) to the underlying row indexes of that group.
gdict = com_dep_grp.groups      # return Dictionary
print( gdict.keys(), '\n\n',    # group identifiers
       gdict.values() )         # group row indexes
## dict_keys([('C1', 'D1'), ('C1', 'D2'), ('C1', 'D3'), ('C2', 'D1'), ('C2', 'D2'), ('C2', 'D3'), ('C3', 'D1'), ('C3', 'D2'), ('C3', 'D3')])
##
## dict_values([Int64Index([0, 1], dtype='int64'), Int64Index([2], dtype='int64'), Int64Index([3, 4, 5], dtype='int64'), Int64Index([6], dtype='int64'), Int64Index([7, 8, 9], dtype='int64'), Int64Index([10, 11, 12], dtype='int64'), Int64Index([14], dtype='int64'), Int64Index([15], dtype='int64'), Int64Index([13, 16, 17], dtype='int64')])
11.11.5 Retrieve Rows
Retrieve n-th Row Of Each Group
- Row number is 0-based
- For the first row, use .first() or .nth(0)
print( com_dep_grp.nth(0) , '\n',
com_dep_grp.first())
## Name Age Salary Birthdate
## Company Department
## C1 D1 Yong 45 15000 1/1/1970
## D2 Lim 34 8000 2/19/1977
## D3 Jessy 23 2500 3/15/1990
## C2 D1 Anne 18 400 7/15/1997
## D2 Deborah 30 8600 8/15/1984
## D3 Michael 38 17000 11/30/1997
## C3 D1 Ong 44 17500 8/21/1980
## D2 Lily 41 15300 7/17/1990
## D3 Chang 32 7900 7/26/1973
## Name Age Salary Birthdate
## Company Department
## C1 D1 Yong 45 15000 1/1/1970
## D2 Lim 34 8000 2/19/1977
## D3 Jessy 23 2500 3/15/1990
## C2 D1 Anne 18 400 7/15/1997
## D2 Deborah 30 8600 8/15/1984
## D3 Michael 38 17000 11/30/1997
## C3 D1 Ong 44 17500 8/21/1980
## D2 Lily 41 15300 7/17/1990
## D3 Chang 32 7900 7/26/1973
- For the last row, use .last() or .nth(-1)
print( com_dep_grp.nth(-1) , '\n',
com_dep_grp.last())
## Name Age Salary Birthdate
## Company Department
## C1 D1 Chew 35 12000 2/1/1980
## D2 Lim 34 8000 2/19/1977
## D3 Sui Wei 56 3000 6/15/1990
## C2 D1 Anne 18 400 7/15/1997
## D2 Jimmy 46 14000 10/31/1988
## D3 Bernard 29 9800 12/1/1963
## C3 D1 Ong 44 17500 8/21/1980
## D2 Lily 41 15300 7/17/1990
## D3 Esther 37 13500 3/16/1969
## Name Age Salary Birthdate
## Company Department
## C1 D1 Chew 35 12000 2/1/1980
## D2 Lim 34 8000 2/19/1977
## D3 Sui Wei 56 3000 6/15/1990
## C2 D1 Anne 18 400 7/15/1997
## D2 Jimmy 46 14000 10/31/1988
## D3 Bernard 29 9800 12/1/1963
## C3 D1 Ong 44 17500 8/21/1980
## D2 Lily 41 15300 7/17/1990
## D3 Esther 37 13500 3/16/1969
Retrieve N Rows Of Each Group
The example below retrieves 2 rows from each group.
com_dep_grp.head(2)
## Company Department Name Age Salary Birthdate
## 0 C1 D1 Yong 45 15000 1/1/1970
## 1 C1 D1 Chew 35 12000 2/1/1980
## 2 C1 D2 Lim 34 8000 2/19/1977
## 3 C1 D3 Jessy 23 2500 3/15/1990
## 4 C1 D3 Hoi Ming 55 25000 4/15/1987
## .. ... ... ... ... ... ...
## 11 C2 D3 Jeannie 30 12500 12/31/1980
## 13 C3 D3 Chang 32 7900 7/26/1973
## 14 C3 D1 Ong 44 17500 8/21/1980
## 15 C3 D2 Lily 41 15300 7/17/1990
## 16 C3 D3 Sally 54 21000 7/19/1968
##
## [14 rows x 6 columns]
11.11.6 Single Statistic Per Group
count()
count() counts valid (non-null) data for each field within the group.
com_dep_grp.count()   # return pandas DataFrame object
## Name Age Salary Birthdate
## Company Department
## C1 D1 2 2 2 2
## D2 1 1 1 1
## D3 3 3 3 3
## C2 D1 1 1 1 1
## D2 3 3 3 3
## D3 3 3 3 3
## C3 D1 1 1 1 1
## D2 1 1 1 1
## D3 3 3 3 3
sum()
This sums up all numeric columns for each group
com_dep_grp.sum()
## Age Salary
## Company Department
## C1 D1 80 27000
## D2 34 8000
## D3 134 30500
## C2 D1 18 400
## D2 127 34600
## D3 97 39300
## C3 D1 44 17500
## D2 41 15300
## D3 123 42400
To sum specific columns of each group, use ['columnName'] to select the column.
When a single column is selected, the output is a Series.
com_dep_grp['Age'].sum()
## Company Department
## C1 D1 80
## D2 34
## D3 134
## C2 D1 18
## D2 127
## D3 97
## C3 D1 44
## D2 41
## D3 123
## Name: Age, dtype: int64
mean()
This averages all numeric columns for each group.
com_dep_grp.mean()
## Age Salary
## Company Department
## C1 D1 40.000000 13500.000000
## D2 34.000000 8000.000000
## D3 44.666667 10166.666667
## C2 D1 18.000000 400.000000
## D2 42.333333 11533.333333
## D3 32.333333 13100.000000
## C3 D1 44.000000 17500.000000
## D2 41.000000 15300.000000
## D3 41.000000 14133.333333
To average specific columns of each group, use ['columnName'] to select the column.
When a single column is selected, the output is a Series.
com_dep_grp['Age'].mean()
## Company Department
## C1 D1 40.000000
## D2 34.000000
## D3 44.666667
## C2 D1 18.000000
## D2 42.333333
## D3 32.333333
## C3 D1 44.000000
## D2 41.000000
## D3 41.000000
## Name: Age, dtype: float64
11.11.7 Multi Statistic Per Group
Single Function To Column(s)
- Instructions for aggregation are provided as a dictionary: keys specify the column names, values the function(s) to run
- Use lambda x: to customize the calculation on the entire column (x)
- Python built-in function names can be supplied without wrapping them in a string 'function'
com_dep_grp.agg({'Age':       sum,                ## Total age of the group
                 'Salary':    lambda x: max(x),   ## Highest salary of the group
                 'Birthdate': 'first'})           ## First birthdate of the group
## Age Salary Birthdate
## Company Department
## C1 D1 80 15000 1/1/1970
## D2 34 8000 2/19/1977
## D3 134 25000 3/15/1990
## C2 D1 18 400 7/15/1997
## D2 127 14000 8/15/1984
## D3 97 17000 11/30/1997
## C3 D1 44 17500 8/21/1980
## D2 41 15300 7/17/1990
## D3 123 21000 7/26/1973
Multiple Function to Column(s)
- Use a list of function names to apply multiple functions to a particular column
- Notice the output columns form a MultiIndex, with the applied function names on level 1
ag = com_dep_grp.agg({
        'Age':       ['mean', sum],      ## Mean and total age of the group
        'Salary':    lambda x: max(x),   ## Highest salary of the group
        'Birthdate': 'first'})           ## First birthdate of the group
print (ag, '\n\n', ag.columns)
## Age Salary Birthdate
## mean sum <lambda> first
## Company Department
## C1 D1 40.000000 80 15000 1/1/1970
## D2 34.000000 34 8000 2/19/1977
## D3 44.666667 134 25000 3/15/1990
## C2 D1 18.000000 18 400 7/15/1997
## D2 42.333333 127 14000 8/15/1984
## D3 32.333333 97 17000 11/30/1997
## C3 D1 44.000000 44 17500 8/21/1980
## D2 41.000000 41 15300 7/17/1990
## D3 41.000000 123 21000 7/26/1973
##
## MultiIndex([( 'Age', 'mean'),
## ( 'Age', 'sum'),
## ( 'Salary', '<lambda>'),
## ('Birthdate', 'first')],
## )
Column Relabelling
Introduced in pandas 0.25.0, groupby aggregation with relabelling is supported using "named aggregation" with simple tuples:
com_dep_grp.agg(
    max_age     = ('Age', max),
    salary_m100 = ('Salary', lambda x: max(x)+100),
    first_bd    = ('Birthdate', 'first')
)
## max_age salary_m100 first_bd
## Company Department
## C1 D1 45 15100 1/1/1970
## D2 34 8100 2/19/1977
## D3 56 25100 3/15/1990
## C2 D1 18 500 7/15/1997
## D2 51 14100 8/15/1984
## D3 38 17100 11/30/1997
## C3 D1 44 17600 8/21/1980
## D2 41 15400 7/17/1990
## D3 54 21100 7/26/1973
11.11.8 Iteration
A DataFrameGroupBy object can be thought of as a collection of named groups.
def print_groups (g):
    for name, group in g:
        print (name)
        print (group[:2])

print_groups (com_grp)
## C1
## Company Department Name Age Salary Birthdate
## 0 C1 D1 Yong 45 15000 1/1/1970
## 1 C1 D1 Chew 35 12000 2/1/1980
## C2
## Company Department Name Age Salary Birthdate
## 6 C2 D1 Anne 18 400 7/15/1997
## 7 C2 D2 Deborah 30 8600 8/15/1984
## C3
## Company Department Name Age Salary Birthdate
## 13 C3 D3 Chang 32 7900 7/26/1973
## 14 C3 D1 Ong 44 17500 8/21/1980
com_grp
## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017D5D10C4C0>
11.11.9 Transform
- Transform is an operation used in combination with a DataFrameGroupBy object
- transform() returns a new DataFrame object
grp = company.groupby('Company')
grp.size()
## Company
## C1 6
## C2 7
## C3 5
## dtype: int64
transform() applies a function to each group, then expands and replicates the result to multiple rows matching the original DataFrame.
grp[['Age','Salary']].transform('sum')
## Age Salary
## 0 248 65500
## 1 248 65500
## 2 248 65500
## 3 248 65500
## 4 248 65500
## .. ... ...
## 13 208 75200
## 14 208 75200
## 15 208 75200
## 16 208 75200
## 17 208 75200
##
## [18 rows x 2 columns]
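A typical use of this row-aligned expansion (a sketch, with SalaryShare as a hypothetical new column): each employee's salary as a fraction of their company's total.
company['SalaryShare'] = company['Salary'] / grp['Salary'].transform('sum')   ## hypothetical column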
grp.transform( lambda x: x+10 )
## Age Salary
## 0 55 15010
## 1 45 12010
## 2 44 8010
## 3 33 2510
## 4 65 25010
## .. ... ...
## 13 42 7910
## 14 54 17510
## 15 51 15310
## 16 64 21010
## 17 47 13510
##
## [18 rows x 2 columns]
##
## <string>:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function.
## <string>:1: FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function.
11.13 Missing Data
11.13.1 Sample Data
df = pd.DataFrame( np.random.randn(5, 3),
                   index=['a', 'c', 'e', 'f', 'h'],
                   columns=['one', 'two', 'three'])
df['four'] = 'bar'
df['five'] = df['one'] > 0
df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
## one two three four five
## a -0.155909 -0.501790 0.235569 bar False
## b NaN NaN NaN NaN NaN
## c -1.763605 -1.095862 -1.087766 bar False
## d NaN NaN NaN NaN NaN
## e -0.305170 -0.473748 -0.200595 bar False
## f 0.355197 0.689518 0.410590 bar True
## g NaN NaN NaN NaN NaN
## h -0.564978 0.599391 -0.162936 bar False
How Much Missing Data Is In Each Column? Note that the reindex above was not assigned back to df, so df itself still has no missing values; hence the counts below are all 5 and isnull() is all False.
df.count()
## one 5
## two 5
## three 5
## four 5
## five 5
## dtype: int64
len(df.index) - df.count()
## one 0
## two 0
## three 0
## four 0
## five 0
## dtype: int64
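An equivalent, more idiomatic count of missing values per column (a sketch):
df.isnull().sum()   ## count of NaN per column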
df.isnull()
## one two three four five
## a False False False False False
## c False False False False False
## e False False False False False
## f False False False False False
## h False False False False False
df.describe()
## one two three
## count 5.000000 5.000000 5.000000
## mean -0.486893 -0.156498 -0.161028
## std 0.788635 0.772882 0.579752
## min -1.763605 -1.095862 -1.087766
## 25% -0.564978 -0.501790 -0.200595
## 50% -0.305170 -0.473748 -0.162936
## 75% -0.155909 0.599391 0.235569
## max 0.355197 0.689518 0.410590