Hopefully you have already checked out my first project, Pace Check, where I briefly analyzed one of my runs tracked with the Strava app. This time, we are going to analyze my longest run: a half marathon across the Golden Gate Bridge!
Completing this run was a long-term goal of mine, and a significant personal challenge to overcome. About three and a half years ago I shattered two bones in my left leg, necessitating two surgeries. As a result, it is very difficult and painful to train. But that's enough complaining. Let's get started!
Instead of downloading my running data like last time, I decided to connect to Strava through the developer API. For this project I used the ready-to-go Python client, stravaio. Once I created a Strava API application, I exchanged its credentials for an access token (this is secret, so I won't be sharing it here) and passed the token to the client. Now I can use the client to access my Strava account information, like recent runs.
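If you have not generated an access token before, stravaio can run the OAuth handshake for you. Below is a minimal sketch of that step, assuming stravaio's strava_oauth2 helper works as its README describes (it opens a browser window for Strava's consent screen), with MY_STRAVA_CLIENT_ID and MY_STRAVA_CLIENT_SECRET standing in for your API application's credentials:
from stravaio import strava_oauth2
# trade the app's client id/secret for a token via Strava's OAuth consent flow
token = strava_oauth2(client_id=MY_STRAVA_CLIENT_ID, client_secret=MY_STRAVA_CLIENT_SECRET)
access_token = token['access_token'] # the secret token passed to StravaIO below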
from stravaio import StravaIO
client = StravaIO(access_token=access_token)
recent_runs_list = client.get_logged_in_athlete_activities(after='last month')
Fetched 6, the latests is on 2020-06-11 01:48:10+00:00
# take a look at one of the runs
recent_runs_list[0]
{'achievement_count': 11,
'athlete': {'id': 53889752},
'athlete_count': 1,
'average_speed': 3.022,
'average_watts': None,
'comment_count': 0,
'commute': False,
'device_watts': None,
'distance': 10511.1,
'elapsed_time': 3492,
'elev_high': 27.0,
'elev_low': 2.8,
'end_latlng': [37.81, -122.41],
'external_id': '8FBE8D55-6D19-4BF3-B487-1FDCBC80C30F',
'flagged': False,
'gear_id': None,
'has_kudoed': False,
'id': 3482935194,
'kilojoules': None,
'kudos_count': 0,
'manual': False,
'map': {'id': 'a3482935194',
'polyline': None,
'summary_polyline': 'waweF~tbjVGc@OKa@Cw@DKDIFELER@NAPGLWPIAk@Fi@r@INId@Q^IT[rBEd@?RG^?v@In@@f@ETGr@BbAI`@Cz@I`@I|@BL?RMz@BpAMJF\\?VHv@TvCBZEr@Bl@Nt@ATTfBNn@L~AJ|CR~@Fx@NbAAPBv@Hr@NzCJdADrADn@BtBDp@PvAd@hB`@dAd@z@HVx@`BHf@?lBKfAo@pC[bAAr@CHIHWLYRm@V_@HcANw@By@LyACeBM}@Yc@W_A}@[e@O]O{@G}AFk@La@D]G@Oz@?x@H~@DT^z@bAxALLr@f@t@^v@PZDpAA~AMT@Fr@@nAC^[xBC`A{@jB@n@Od@NIrAsBRUH@PJ\\n@^^LPNHh@l@LFv@Lh@AJBZVd@f@FN\\t@b@t@NN\\Zt@\\d@^HDj@JHABRA\\Q|@KJKDORg@vAOv@Gb@Ax@NzBIRS\\KB[Dc@RQDgARu@FULEVLp@Bb@HbAOpAMN_@FQLa@JKFGFKn@HdAHb@RdDJh@@fCBPHtBZlDDbADTDhARjBFrAFv@FVHNLHz@JL?TEFBBHT~CJz@LnC`@vEBhANvAFxAAh@HbBARWFWAWB_A\\WT_@t@g@t@EJ?RFdA?j@DZ^vAh@d@HLBPJvAT|FR~AFn@ANEJEBSOKQSqA[gGEqBIkAEa@i@qBQmA@OB_@HOj@u@`@a@d@m@JKZGXMv@EFI?IQaBk@uLWgCYaEImBMmAKoCK}AG}AGw@EsAYkDWqBa@cF?iAKmACMt@kBj@u@j@eAn@y@fBcDJ[Da@?aAOqBQsA?m@BI`B}@DOV@HVVVd@ZXOTYVOPG~@IFEBIi@mE@QCYMw@QoBC{@Bi@AkAB[Ke@C}@Ks@CgAg@}F?{@ESC{@C[GUKUGE]DWAq@@e@Hm@Rg@Fw@@k@\\_AXM@KIAMK_@Ag@F]Zm@J[Js@[wDSe@Om@m@}AQWu@q@QS]u@Io@GMCQG{@WcAKq@Bk@Ao@Ey@Km@EwBQgBE}BEa@YaAYeDE_AKe@EsASyBE{AGi@EaBMuAIiB?qAVoDVcBHKHAp@Bb@Cj@Yb@MNYZQ'},
'max_speed': 13.3,
'max_watts': None,
'moving_time': 3478,
'name': 'Afternoon Run',
'photo_count': 0,
'private': False,
'start_date': datetime.datetime(2020, 5, 19, 23, 41, 31, tzinfo=tzutc()),
'start_date_local': datetime.datetime(2020, 5, 19, 16, 41, 31, tzinfo=tzutc()),
'start_latlng': [37.81, -122.41],
'timezone': '(GMT-08:00) America/Los_Angeles',
'total_elevation_gain': 52.3,
'total_photo_count': 0,
'trainer': False,
'type': 'Run',
'upload_id': 3719530643,
'weighted_average_watts': None,
'workout_type': 0}
# each run is returned as a custom SummaryActivity object
type(recent_runs_list[0])
swagger_client.models.summary_activity.SummaryActivity
# we want to make each run a dictionary to make it easier to navigate
recent_runs = [run.to_dict() for run in recent_runs_list]
OK, so in the past month I have completed a handful of runs, but which one do we want to analyze? Let's throw these in a dataframe so it is easy to view and see which run was the longest.
import pandas as pd
pd.options.display.max_columns = 50 #show more columns so we can see all of them
# have pandas parse each nested dictionary into a flat dataframe
recent_runs = pd.json_normalize(recent_runs)
recent_runs
 | id | external_id | upload_id | name | distance | moving_time | elapsed_time | total_elevation_gain | elev_high | elev_low | type | start_date | start_date_local | timezone | start_latlng | end_latlng | achievement_count | kudos_count | comment_count | athlete_count | photo_count | total_photo_count | trainer | commute | manual | private | flagged | workout_type | average_speed | max_speed | has_kudoed | gear_id | kilojoules | average_watts | device_watts | max_watts | weighted_average_watts | athlete.id | map.id | map.polyline | map.summary_polyline |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3482935194 | 8FBE8D55-6D19-4BF3-B487-1FDCBC80C30F | 3719530643 | Afternoon Run | 10511.1 | 3478 | 3492 | 52.3 | 27.0 | 2.8 | Run | 2020-05-19 23:41:31+00:00 | 2020-05-19 16:41:31+00:00 | (GMT-08:00) America/Los_Angeles | [37.81, -122.41] | [37.81, -122.41] | 11 | 0 | 0 | 1 | 0 | 0 | False | False | False | False | False | 0 | 3.022 | 13.3 | False | None | None | None | None | None | None | 53889752 | a3482935194 | None | waweF~tbjVGc@OKa@Cw@DKDIFELER@NAPGLWPIAk@Fi@r@... |
1 | 3531392033 | C1844A33-AC45-4635-A3E4-E255C9EA3BB3 | 3770681834 | Afternoon Run | 10371.6 | 3303 | 3308 | 44.3 | 26.4 | 2.8 | Run | 2020-05-28 22:49:49+00:00 | 2020-05-28 15:49:49+00:00 | (GMT-08:00) America/Los_Angeles | [37.81, -122.41] | [37.81, -122.41] | 14 | 0 | 0 | 1 | 0 | 0 | False | False | False | False | False | 0 | 3.140 | 5.4 | False | None | None | None | None | None | None | 53889752 | a3531392033 | None | aaweFjvbjVEq@I]SKQ?cALUHy@[MJKPQf@?j@EP_@dAUbA... |
2 | 3539038874 | E3A4EF8B-4F1B-4B08-A14F-2CA099AD95DB | 3778792319 | Afternoon Run | 10627.6 | 3473 | 3477 | 44.5 | 26.4 | 2.8 | Run | 2020-05-30 21:32:42+00:00 | 2020-05-30 14:32:42+00:00 | (GMT-08:00) America/Los_Angeles | [37.81, -122.41] | [37.81, -122.41] | 6 | 0 | 0 | 1 | 0 | 0 | False | False | False | False | False | 0 | 3.060 | 5.9 | False | None | None | None | None | None | None | 53889752 | a3539038874 | None | }aweF`sbjVQIUPg@FMDs@rAo@Lc@d@O^On@OVe@jB[rAGn... |
3 | 3569837822 | C7E58F91-7061-4501-9156-4BED2951D572 | 3811441999 | Afternoon Run | 21600.7 | 7599 | 7616 | 172.7 | 81.6 | 2.4 | Run | 2020-06-05 23:03:35+00:00 | 2020-06-05 16:03:35+00:00 | (GMT-08:00) America/Los_Angeles | [37.81, -122.41] | [37.81, -122.41] | 12 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | 0 | 2.843 | 14.0 | False | None | None | None | None | None | None | 53889752 | a3569837822 | None | gbweF~ubjVMkAmBPkAhCWX]H[n@a@`BUxAm@dGHn@Q`F]x... |
4 | 3588262984 | 72185A79-B7F0-48DB-92CC-4146885D1F50 | 3830875861 | Afternoon Run | 5878.5 | 1830 | 1833 | 32.5 | 26.9 | 3.3 | Run | 2020-06-09 00:25:17+00:00 | 2020-06-08 17:25:17+00:00 | (GMT-08:00) America/Los_Angeles | [37.81, -122.41] | [37.81, -122.41] | 7 | 0 | 0 | 1 | 0 | 0 | False | False | False | False | False | 0 | 3.212 | 5.4 | False | None | None | None | None | None | None | 53889752 | a3588262984 | None | qaweFtubjVKi@g@WO?s@^UVi@B_@h@Uv@e@dASv@e@nA[|... |
5 | 3600201601 | 96F1DBC2-E315-41A4-997A-6E1C264F0596 | 3843441386 | Evening Run | 4156.6 | 1340 | 1344 | 44.2 | 27.1 | 3.6 | Run | 2020-06-11 01:48:10+00:00 | 2020-06-10 18:48:10+00:00 | (GMT-08:00) America/Los_Angeles | [37.81, -122.41] | [37.81, -122.42] | 0 | 1 | 0 | 1 | 0 | 0 | False | False | False | False | False | 0 | 3.102 | 5.9 | False | None | None | None | None | None | None | 53889752 | a3600201601 | None | {mweFnicjVJ~@CVMx@QbCI^]p@M`@DNHBRTPhALtCHb@B`... |
Let's also plot a bar chart of the distance for each run.
recent_runs.set_index('id').distance.plot(kind='bar',
                                          figsize=(12,6),
                                          legend=True);
Clearly the run with id 3569837822 covered a much greater distance than the other runs, so this must be the one we are looking for.
# filter the recent runs dataframe to the run with the max distance, dropping its empty (all-NaN) columns
longest_run = recent_runs[recent_runs.distance == recent_runs.distance.max()].dropna(axis=1).iloc[0]
longest_run
id 3569837822
external_id C7E58F91-7061-4501-9156-4BED2951D572
upload_id 3811441999
name Afternoon Run
distance 21600.7
moving_time 7599
elapsed_time 7616
total_elevation_gain 172.7
elev_high 81.6
elev_low 2.4
type Run
start_date 2020-06-05 23:03:35+00:00
start_date_local 2020-06-05 16:03:35+00:00
timezone (GMT-08:00) America/Los_Angeles
start_latlng [37.81, -122.41]
end_latlng [37.81, -122.41]
achievement_count 12
kudos_count 3
comment_count 0
athlete_count 1
photo_count 0
total_photo_count 0
trainer False
commute False
manual False
private False
flagged False
workout_type 0
average_speed 2.843
max_speed 14
has_kudoed False
athlete.id 53889752
map.id a3569837822
map.summary_polyline gbweF~ubjVMkAmBPkAhCWX]H[n@a@`BUxAm@dGHn@Q`F]x...
Name: 3, dtype: object
It seems that the API gives us a lot more fields than we had when we downloaded the data directly in Pace Check. The big difference is that the direct download was already in GPX format and had the geospatial information built in.
The API does give us the geospatial information here, but it is not obvious where. The Strava developer docs explain that routes are encoded using the Google encoded polyline algorithm:
“Polyline encoding is a lossy compression algorithm that allows you to store a series of coordinates as a single string.”
This algorithm provides an efficient way to represent a series of points: the first point is encoded in full, and each subsequent point is encoded as an offset from the point before it.
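To get a feel for how this works, we can round-trip the reference example from Google's documentation with the polyline Python package (the points and the encoded string below come straight from Google's docs):
import polyline
# three points: the first is encoded in full, the rest as offsets from their predecessors
points = [(38.5, -120.2), (40.7, -120.95), (43.252, -126.453)]
encoded = polyline.encode(points)
print(encoded) # '_p~iF~ps|U_ulLnnqC_mqNvxq`@' per Google's documentation
assert polyline.decode(encoded) == points # decoding recovers the original points
Our run's encoded route is stored under map.summary_polyline.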
polyline_str = longest_run['map.summary_polyline']
polyline_str
'gbweF~ubjVMkAmBPkAhCWX]H[n@a@`BUxAm@dGHn@Q`F]xAMv@Pp@b@rGKvDDj@h@fDT~DTlBD`B`@lBJtAd@tBDxAXz@AlERZp@vCbBlENv@EvDOTUpAs@fBGr@_Ax@uBf@QbANnAe@|AYVK`@BXPf@lA~@DrAX`@t@AZaA^B\\TfAA~Aj@p@bBAd@d@tBjCrBNd@MlAe@j@i@hCGtALrAQrAg@j@yALmCj@I`@RpEwAr@UXClAr@`KRbBr@bOVfCFTTLhBLLPpBp\\?ZKNiCp@{@zA\\zAAjAx@xJVlGHf@h@h@h@lAd@rBp@dAvAlAb@rAj@tFn@~J`@lI`@dBjArBX|AmAzWa@dDOlDUfBFl@`@pAW`BUn@yBtBq@rAs@b@WMoBBOMa@qAWABjBk@zBoBrDi@LuCxCa@xACv@P~AeA|Cc@`Eq@j@Yx@Ft@Vb@HBIq@JQJBh@hA@f@y@`A{CbAqAN_A\\u@O{BPo@MoE@eANw@p@_AR_CRuQr@{Gt@uDJW]WmEMk@OVGIq@Vw@v@{AvDGRD@uAeB{AMy@ZeAbBkAf@ZIy@RqJb@iBZmAJaTz@uD?aATgCPm@zCYJe@Y]@i@UYg@I?c@J_DrBgBNOy@Z}Bf@cAl@Yh@FZ^D`@Gx@w@dCbCmBxBeA|Cg@nKc@bPaAz@UdAOjE[tDMnA_@~AJxCUrECbDc@fJW|Hg@EETKzGYfBW\\P~JaAtMc@fC[`A_@vCuBDS_@e@OeA[BWXXx@e@sBTy@f@UZ_@BaA\\oCXaAf@q@]sAAe@tAcDxAeBnAs@nBoF]IYJiAfAcAbBoAr@aCj@c@SCs@Zo@l@cDf@eAtCwBhBu@r@i@lBgCnBwD^qAHkFF]n@s@GqAHaCCr@HNINB]g@sGGuDAcJu@}Hm@sHBo@YwDBi@NSHk@Ws@OaBL{A_AqGAiB{@oEI}CW}B{@`AZy@DYIcAf@qDTs@v@yA^]r@QRi@a@_EM}Cq@oHKkDi@_G[aHuAqRGaDN_Av@eBpCuEz@uBLaAa@cFBkAvFsArBAJY_@}Ba@kDSuC?wBSsBGqEK{AQc@SiEgD^Uk@WEiC@gAl@WWEiAb@oBJaBKyAMs@aA}B{BgCSm@ImA]eBq@}KMeGu@kDHqAcBmSOs@h@sENiCfA?r@k@~Cm@'
Encoded, this looks like complete nonsense, but we can easily decode it with the same polyline package.
import polyline
route = polyline.decode(polyline_str)
route[:10]
[(37.8066, -122.40752),
(37.80667, -122.40714),
(37.80722, -122.40723),
(37.8076, -122.40792),
(37.80772, -122.40805),
(37.80787, -122.4081),
(37.80801, -122.40834),
(37.80818, -122.40883),
(37.80829, -122.40928),
(37.80852, -122.41059)]
Success! These look like latitude-longitude pairs. Let's throw them into a GeoDataFrame and plot it.
from shapely.geometry import LineString
import geopandas as gpd
longest_run = gpd.GeoDataFrame(longest_run.to_frame().T, # make the run a single-record dataframe again
                               geometry=[LineString([p[::-1] for p in route])], # build a LineString, reversing each pair so points are (lng, lat) rather than (lat, lng)
                               crs='EPSG:4326') # WGS84 latitude-longitude coordinates
longest_run.plot();
Well, that looks about right! Let's throw it on a folium map.
import folium
m = folium.Map(location=[longest_run.centroid.y.iloc[0], longest_run.centroid.x.iloc[0]], # center the map on the route's centroid
               tiles='CartoDB Positron',
               zoom_start=13)
folium.GeoJson(data=longest_run.geometry).add_to(m)
m
Fantastic. That is definitely the route I took. You can see that I went across the bridge and back. Clearly there is some measurement error, as it looks like I was running over the ocean around the middle of the bridge.
The downside of this geospatial information is that we don't have information about each ping, like we did in the last project. I would like to know how metrics like pace and elevation varied throughout the run, but longest_run only has information about the run as a whole.
To get the information about each individual ping recorded during my run, we have to use the Streams endpoint of the Strava API.
run = client.get_activity_streams(id=3569837822,
                                  athlete_id=53889752)
run = pd.DataFrame(run.to_dict())
run
 | time | distance | altitude | velocity_smooth | heartrate | moving | grade_smooth | lat | lng |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0.0 | 4.7 | 0.0 | 172 | False | -1.4 | 37.806600 | -122.407517 |
1 | 5 | 13.5 | 4.4 | 2.7 | 172 | True | -2.3 | 37.806618 | -122.407366 |
2 | 8 | 21.4 | 4.4 | 2.7 | 172 | True | -1.8 | 37.806646 | -122.407257 |
3 | 10 | 30.0 | 4.0 | 3.3 | 172 | True | -1.3 | 37.806672 | -122.407184 |
4 | 13 | 39.8 | 4.0 | 3.7 | 172 | True | -1.5 | 37.806677 | -122.407131 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2909 | 7604 | 21565.7 | 4.2 | 3.7 | 182 | True | 0.6 | 37.807304 | -122.410505 |
2910 | 7607 | 21573.1 | 4.3 | 3.3 | 182 | True | 0.7 | 37.807224 | -122.410500 |
2911 | 7609 | 21580.1 | 4.3 | 2.9 | 182 | True | 0.7 | 37.807185 | -122.410498 |
2912 | 7612 | 21587.2 | 4.4 | 2.8 | 182 | True | 0.5 | 37.807128 | -122.410467 |
2913 | 7614 | 21592.9 | 4.4 | 2.6 | 181 | True | 0.8 | 37.807069 | -122.410445 |
2914 rows × 9 columns
That was easy! We now have a nice clean dataframe with speed already calculated, and the latitude and longitude of each point handed to us explicitly.
run.altitude.plot();
The altitude plot shows smaller spikes on the sides (over Fort Mason and back) and the large spike in the middle (the Golden Gate Bridge). It is pretty shocking how much the altitude changes across the bridge.
run.grade_smooth.plot();
This dataframe has a few new features. One is called grade_smooth, which I am guessing is the rate of change of the altitude, an approximation of its derivative. We can approximate this calculation ourselves using the pandas methods .diff() to take the difference between consecutive values and then .rolling() to group up consecutive points and aggregate them (in this case I took a running average). Mathematically speaking, we are calculating the numerator of the difference quotient $\frac{f(x+h) - f(x)}{h}$, neglecting the denominator by assuming the distance I covered between consecutive points is roughly constant.
run.altitude.diff(1).rolling(10).mean().plot();
The two plots look nearly identical, so we are probably correct about what grade_smooth is.
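Out of curiosity, a fuller approximation would account for the denominator as well: grade is conventionally rise over run, expressed as a percent. Here is a sketch of that calculation, assuming grade_smooth really is a percent grade and using the cumulative distance column as the run:
# rise over run: change in altitude divided by distance covered between pings, in percent
# mask out stationary pings to avoid dividing by a zero change in distance
moved = run.distance.diff() > 0
grade_approx = run.altitude.diff()[moved] / run.distance.diff()[moved] * 100
grade_approx.rolling(10).mean().plot();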
Recently I saw that Strava offered an interesting metric to paying users: Grade Adjusted Pace.
“Grade Adjusted Pace estimates an equivalent pace when running on flat land, allowing the runner to compare hilly and flat runs more easily. Because running uphill requires extra effort, the Grade Adjusted Pace will be faster than the actual pace run. When running downhill, the Grade Adjusted Pace will be slower than the actual pace.”
Interesting… I wonder how much the grade I am running on actually affects my pace. Let's observe the relationship between velocity_smooth and grade_smooth.
import seaborn as sns
sns.pairplot(run.loc[run.moving, ['grade_smooth', 'velocity_smooth']], height=5);
sns.regplot(x="grade_smooth", y="velocity_smooth", data=run[run.moving]);
There doesn't seem to be a particularly strong correlation between the two features. Let's take a deeper dive into the relationship with statsmodels.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Fit a simple regression model
results = smf.ols('velocity_smooth ~ grade_smooth', data=run[run.moving]).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        velocity_smooth   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.014
Method:                 Least Squares   F-statistic:                     42.86
Date:                Mon, 15 Jun 2020   Prob (F-statistic):           6.92e-11
Time:                        19:59:19   Log-Likelihood:                -3528.4
No. Observations:                2899   AIC:                             7061.
Df Residuals:                    2897   BIC:                             7073.
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        2.8809      0.015    189.675      0.000       2.851       2.911
grade_smooth    -0.0317      0.005     -6.547      0.000      -0.041      -0.022
==============================================================================
Omnibus:                     2315.926   Durbin-Watson:                   0.537
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            96762.680
Skew:                           3.453   Prob(JB):                         0.00
Kurtosis:                      30.448   Cond. No.                         3.14
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As one might expect, the coefficient for grade_smooth is negative, meaning the steeper the grade, the slower I usually run. However, the resulting $R^2$ is very low, meaning that the grade I am running on explains very little of the variance in my pace. Still, the low p-value lets us reject the null hypothesis that grade has no effect on my pace. In conclusion, the grade I am running on does negatively affect my pace, but not in a consistently predictable way.
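To put that coefficient in perspective, here is a quick back-of-the-envelope check, assuming grade_smooth is expressed in percent:
# predicted change in speed at a steep 10% grade: 10 * (-0.0317) ≈ -0.32 m/s
print(10 * results.params.grade_smooth)
# compare against my average speed while moving, in m/s
print(run.velocity_smooth[run.moving].mean())
Even on a steep 10% grade, the model predicts I lose only about 0.3 m/s, roughly a tenth of my typical speed.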
We can compute our own grade-adjusted pace by compensating the real pace for how much we predict the grade affects it. The model attributes grade * coefficient of my speed to the grade, so subtracting that term estimates the equivalent flat-ground pace: uphill stretches get adjusted to be faster than the actual pace and downhill stretches slower, matching Strava's description above.
# the grade's predicted effect on pace: grade * OLS coef (-0.0317)
predicted_change_in_pace = run.grade_smooth*results.params.grade_smooth
# subtract the predicted effect to estimate the equivalent flat-ground pace
run['grade_adjusted_pace'] = run.velocity_smooth - predicted_change_in_pace
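Note that velocity_smooth, and hence our grade-adjusted version of it, is a speed in meters per second rather than a runner's pace. If you would rather read these in familiar units, a small hypothetical helper converts to minutes per kilometer:
def pace_min_per_km(speed_m_per_s):
    # 1000 m per km divided by speed gives seconds per km; divide by 60 for minutes
    return 1000 / speed_m_per_s / 60
print(pace_min_per_km(run.velocity_smooth[run.moving].mean())) # average actual pace
print(pace_min_per_km(run.grade_adjusted_pace[run.moving].mean())) # average grade-adjusted pace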
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(15,10))
# z-score the grade so it plots on a scale comparable to the predicted pace change
((run.grade_smooth - run.grade_smooth.mean()) / run.grade_smooth.std()
).plot(ax=ax, alpha=.75, label='Grade (normalized)');
predicted_change_in_pace.plot(ax=ax, alpha=.75, label='Predicted change in pace');
ax.legend(prop={'size': 15});
When plotted together, though, there seems to be little difference between my actual pace and the grade-adjusted pace.
fig, ax = plt.subplots(figsize=(15,10))
run.grade_adjusted_pace.plot(ax=ax, alpha=.75);
run.velocity_smooth.plot(ax=ax, alpha=.75);
ax.legend(prop={'size': 15});
fig, ax = plt.subplots(figsize=(15,10))
(run.grade_adjusted_pace - run.velocity_smooth).plot(ax=ax, alpha=.75);
ax.set_title('grade_adjusted_pace - velocity_smooth');
Having been the one who actually did this run, I think the strong winds affected my pace much more than the grade I was running on (it happened to be one of the windiest days I have seen). It would be interesting to compare pace with wind speed data, but I am not sure where I could find it at the granularity I would require (down to the location, with speed and direction).