
Cover, Historizing and Analyzing Open Bikesharing Data by bild.element. All rights reserved. (SVG images from www.svgrepo.com were used under the CC0 license.)
DKAN is a global open-source data management platform that provides the infrastructure for large public open-data projects. Among other cities, the city of Cologne started the initiative "Offene Daten Köln" in 2012 together with DKAN to provide a vast collection of databases and resources to the public. These databases are contributed by various communal and private services such as public transport, public pools and tourist information, to name a few. Within this data, the local public transport service (KVB) and a private bike sharing company (Nextbike) provide a dataset that comprises a georeferenced list of all bike stations and of all bikes that are not in use at a given point in time. The database is updated every 15 minutes and can be accessed via an API [1] which is provided by the bike sharing company. I decided to explore this data as a personal project and to assess the potential it has in an open information context, also with a view to other datasets. At this point I want to add that I am not affiliated with KVB, Nextbike or any other company or institution mentioned in this essay. Links to the database and my research sources are provided below.
Query (Request) and ETL
Nextbike provides the data on bike positions and stations via its API in XML format. The database is updated every 15 minutes. To collect the data at equal intervals over a specific time frame, I scheduled the automatic execution of a brief Python script that calls the API, saves the data, marks it with a date and time stamp and appends it to the existing dataset. A request was performed every 20 minutes for seven days. The script also comprised minimal ETL steps, such as flattening the data and removing redundancies, to take load off the semantic model further downstream. The collected data was then imported into MS Power BI and structured into a common model. This workflow was chosen to fit the concise and exemplary nature of the project.
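The collection step described above can be sketched roughly as follows. This is a minimal illustration and not the script used for the project: the XML layout (`place` elements containing `bike` elements, with attributes such as `uid`, `lat`, `lng` and `number`), the column names and the CSV output are assumptions, and the feed URL is left to the caller.

```python
import csv
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path
from urllib.request import urlopen

# Hypothetical column layout for the flattened dataset.
FIELDS = ["timestamp", "bike_number", "place_uid", "place_name", "lat", "lng"]


def fetch_xml(url):
    """Download the raw XML feed; the URL is supplied by the caller."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def flatten(xml_text, timestamp):
    """Turn nested place/bike elements into one flat row per bike.

    Assumes each <place> carries the coordinates and each child <bike>
    carries a 'number' attribute.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for place in root.iter("place"):
        for bike in place.iter("bike"):
            rows.append({
                "timestamp": timestamp,
                "bike_number": bike.get("number"),
                "place_uid": place.get("uid"),
                "place_name": place.get("name"),
                "lat": place.get("lat"),
                "lng": place.get("lng"),
            })
    return rows


def append_rows(rows, path):
    """Append rows to a CSV file, writing the header only for a new file."""
    path = Path(path)
    write_header = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


def snapshot(url, path):
    """Fetch one snapshot, stamp it with the current UTC time and store it."""
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    rows = flatten(fetch_xml(url), timestamp)
    append_rows(rows, path)
    return len(rows)
```

Calling `snapshot` from any job scheduler every 20 minutes would yield the 72 snapshots per day mentioned in this essay.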
(Reverse) Geocoding with Nominatim
The locations of single bikes are provided as their distinct coordinates or, if applicable, as the coordinates of the respective station. To gain structured insight from plain coordinates, it is necessary to retrieve geographical information about the location, i.e. the names of the corresponding street, quarter or city. This process is called reverse geocoding. To perform high-volume requests for a total of <10^6 location entries within the scope (and with the budget) of a private data analysis project, I employed a locally served instance of Nominatim [2], an open-source geographic search API, which allowed me to access OpenStreetMap data. Geocoding not only offers descriptive information about a location but also allows for a meaningful aggregation of geographic data.
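A reverse geocoding request against a local Nominatim instance might look like the sketch below. It uses Nominatim's `/reverse` endpoint with the `lat`, `lon`, `format`, `zoom` and `addressdetails` parameters. The base URL `http://localhost:8080` and the selection of address keys are assumptions, and which keys (`road`, `suburb`, `city`) actually appear in a response depends on how the area is tagged in OpenStreetMap.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of the locally served Nominatim instance.
BASE_URL = "http://localhost:8080"


def reverse_url(lat, lon, base_url=BASE_URL, zoom=18):
    """Build the URL for a reverse geocoding request."""
    query = urlencode({
        "lat": lat,
        "lon": lon,
        "format": "jsonv2",
        "zoom": zoom,
        "addressdetails": 1,
    })
    return f"{base_url}/reverse?{query}"


def extract_address(result):
    """Pick street, suburb and city from a parsed response.

    Missing keys yield None, since not every location has all of them.
    """
    address = result.get("address", {})
    return {
        "street": address.get("road"),
        "suburb": address.get("suburb"),
        "city": address.get("city"),
    }


def reverse_geocode(lat, lon, base_url=BASE_URL):
    """Query the instance and return the reduced address record."""
    with urlopen(reverse_url(lat, lon, base_url), timeout=10) as response:
        return extract_address(json.load(response))
```

Since many entries share the coordinates of a station, looking up each distinct coordinate pair only once would reduce the number of requests considerably.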
Each recorded timepoint comprised approximately 2,000 new entries. Since I collected 72 timepoints per day (one every 20 minutes), we can extrapolate this to approximately 144,000 new locations added to our dataset daily. This illustrates the high degree of granularity we can expect in the retrieved data. At this point one should always ask: "Which level of information is necessary to convey relevant insight from the data without losing valuable detail?" For geographical data this often means: "How far can I zoom out and still get the information I need?" For this project I wanted to assess the smallest possible units that I could still assign to different generalized, anonymous groups of users in terms of age, residence and other general demographics. The smallest meaningful unit that can roughly be ascribed to a certain demographic is a suburb/neighborhood [3]. This is why I chose to analyze patterns and relations between the 86 different suburbs in Cologne.
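The volume estimate and the aggregation to suburb level can be illustrated with a few lines of Python. This is only a sketch: the row keys `timestamp` and `suburb` are hypothetical names for the columns of the geocoded dataset.

```python
from collections import Counter

MINUTES_PER_DAY = 24 * 60
REQUEST_INTERVAL_MINUTES = 20
ENTRIES_PER_TIMEPOINT = 2_000  # approximate figure from the text

# 1440 / 20 = 72 timepoints per day
timepoints_per_day = MINUTES_PER_DAY // REQUEST_INTERVAL_MINUTES
# 2000 * 72 = 144000 entries per day
entries_per_day = ENTRIES_PER_TIMEPOINT * timepoints_per_day


def bikes_per_suburb(rows):
    """Count free bikes per (timestamp, suburb) pair."""
    counts = Counter()
    for row in rows:
        counts[(row["timestamp"], row["suburb"])] += 1
    return counts
```

Aggregating this way reduces each timepoint from roughly 2,000 rows to at most 86, one per suburb.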
Reporting and Visualization
For modelling and reporting I chose the industry standard MS Power BI. I first decided to analyze the data with regard to easily accessible and distinct measures such as the number of free-standing bikes, the number of bookings, et cetera. The historization of the data of course also allowed for a time-based assessment of these measures. However, information about time and location naturally also gives insight into the direction of users, i.e. the relation between origin and destination. This allowed me to further assess relationships between locations, such as common destinations from a single suburb. I compiled the resulting analysis in an MS Power BI report, which you can access below. (Note that I reserve the right to correct and change both content and functionality of this report at any time.)
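One possible way to derive origin and destination from historized snapshots is sketched below. It is an assumption about how such a measure could be computed, not the logic of the report: a bike that disappears from the list of free bikes and later reappears, or that shows up in a different suburb, is counted as one journey. The function and field names are hypothetical, and relocations by service staff would be counted as journeys as well.

```python
from collections import Counter


def infer_trips(snapshots):
    """Infer journeys from time-ordered snapshots.

    snapshots: list of (timestamp, {bike_id: suburb}) tuples, sorted by time,
    each mapping the free bikes at that timepoint to their suburb.
    """
    last_seen = {}  # bike_id -> (timestamp, suburb, snapshot index)
    trips = []
    for index, (timestamp, positions) in enumerate(snapshots):
        for bike_id, suburb in positions.items():
            previous = last_seen.get(bike_id)
            if previous is not None:
                prev_time, prev_suburb, prev_index = previous
                was_missing = index - prev_index > 1
                if was_missing or suburb != prev_suburb:
                    trips.append({
                        "bike_id": bike_id,
                        "origin": prev_suburb,
                        "destination": suburb,
                        "last_seen": prev_time,
                        "reappeared": timestamp,
                    })
            last_seen[bike_id] = (timestamp, suburb, index)
    return trips


def origin_destination_counts(trips):
    """Count journeys per (origin, destination) pair."""
    return Counter((trip["origin"], trip["destination"]) for trip in trips)
```

With a 20-minute sampling interval, a journey that starts and ends in the same suburb between two consecutive snapshots remains invisible to this approach.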
Closing Remarks
At this stage, the project primarily serves a descriptive purpose. It is mostly exploratory and intended to establish a solid foundation for future analyses. The projection of bike journeys into relationships between different suburbs paves the way for exploring various potential directions.
First, suburbs can be associated with different structural and demographic attributes. Therefore, it is feasible to relate bike distributions to user demographics or attributes of the urban structure.
This might be realized by incorporating suburb statistics provided by the city of Cologne [4].
Second, the integration of Nominatim, which builds on OpenStreetMap (OSM) data, allows for queries of geographic and structural information that was contributed by users. In combination with the time dimension of the dataset, this might enable the distinction and comparison between residential districts, recreational zones and work-related areas. At this point I want to mention a similar project by Cologne Intelligence [5]. There, the authors assessed potential routes for moved bikes by employing OSM information about streets and paths that are documented as accessible by bike.
In conclusion, the amount and accessibility of open data resources offer remarkable potential for complex analyses. The variety of datasets provided by "Offene Daten Köln" and similar initiatives promises numerous relationships to explore. Open datasets might also extend the scope of professional data analysis projects, and it goes without saying that they serve as a great training tool for aspiring data analysts.
Links and Sources
- [1] https://offenedaten-koeln.de/dataset/standorte-fahrradverleih-koeln-kvb-rad
- [2] https://nominatim.org/
- [3] By suburb I refer to the smaller official divisions of city districts, commonly called "Veedel" in the local dialect.
- [4] https://www.stadt-koeln.de/artikel/62998/index.html
- [5] https://www.cologne-intelligence.de/blog/bikesharing-eine-geodatenanalyse-von-nextbike-in-koeln