OpenStreetMap Data Case Study¶

Map Area¶

Bengaluru, Karnataka, India ¶

As a kid, I lived close to Bengaluru in India and have been away for more than 10 years now. I want to take this opportunity to explore how the city has grown and also contribute by cleaning up the data in Open Street Map and provide some insights on the 'Garden City', Bengaluru.

Problems encountered in the map¶

Applying the code from the case study worked on earlier for Data Wrangling, I wrote a subset of data to csv and SQL relational tables. I used SQL queries to explore the data and problem areas.

Below are some of the problems I encountered:

Street names with the entire address
Ex: Survey 5/2, Whitefield Main Rd, SH 35, Sathya Sai Layout, Whitefield, Sathya Sai Layout, Whitefield, Bengaluru, India
Area names in the city field
Ex: K.R Puram
Inconsistent and multiple phone numbers with different formats
Ex: +918022382894/96
Incorrect postal codes
*Ex: 600005

Resolving those problems¶

Street names¶

As the city, state, country or postal code doesn't belong in the street field, all those names have been removed from the street field so that it contains details about the street address alone.

Using the above example, the cleaned street field would contain - Survey 5/2, Whitefield Main Rd, SH 35, Sathya Sai Layout, Whitefield, Sathya Sai Layout, Whitefield

Besides street names, house number field also had details on street address in the data set. However, I noticed that in those cases, those particular details were missing from the street field. To maintain data integrity and continuity, they were left untouched.

All the abbreviations in the street fields have been mapped to their corresponding full forms as well.

City¶

The name of the city has officially changed to Bengaluru. Due to this there is a nice mix of both the old and the new name in the city field. However, looking at the numbers, 'Bangalore' stood out as the clear winner and hence I've continued to use the same in my process.

I've also noticed that an area or a locality's name gets used in that field as the city. In most cases, since this doesn't exist on the street name, I've continued to use the area name and appended the city name to it.

Using the above example, the cleaned city name would be - K.R Puram, Bangalore

Phone numbers¶

Out of all the fields explored, phone numbers was the most inconsistent both in terms of format and listing multiple phone numbers. Different kinds of characters (/.,&) were used to separate each phone number.

I've used the standard below while cleaning up this field

Fixed lines - +91-80-xxxx xxxx
Mobile/Cell - +91-xxxxxxxxxx

Multiple phone numbers will be listed as comma separated values in phone number field.

Using the above example, the cleaned phone number field would contain - +91-80-2238 2894, +91-80-2238 2896

Postal Code¶

While there were only two formats in the postal code field, what I found interesting later on was that there were valid postal codes that do not belong to Bangalore. I cleaned them up as well so that only valid postal codes for Bangalore city would get populated.

Overview of the data¶

This provides some statistics of the data, right from the size of the raw XML data set to the finished tables and dives into the data to explore some basic data distribution and statistics.

File Sizes¶

Bangalore OSM - 614MB
nodes.csv - 231MB
nodes_tags.csv - 3.16MB
ways.csv - 38.7MB
ways_nodes.csv - 84.6MB
ways_tags.csv - 22.8MB

Number of Unique Users¶

There are 1855 unique users in nodes and ways.

select count(distinct [user]) AS Users
from (
select [user]
from [Udacity].[dbo].nodes
union 
select [user]
from [Udacity].[dbo].ways)u

Users
1855

Number of Unique Users who contributed towards tags¶

Out of which, 1700 have contributed towards node and way tags.

select count(distinct [user]) AS Users
from (
select [user]
from [Udacity].[dbo].nodes n inner join
dbo.nodes_tags nt on n.id = nt.id
UNION 
select [user]
from [Udacity].[dbo].ways w inner join
dbo.ways_tags wt on w.id = wt.id)u

Users
1700

Contributor Statistics¶

It looks like the top 10 contributors account for only about 30% of the total contributions. That shows how diverse the contributions are from varios users.

select count(distinct id) AS Contributions, [user]
from (
select nt.id, [user]
from dbo.nodes n inner join
dbo.nodes_tags nt on n.id = nt.id
UNION 
select wt.id, [user]
from dbo.ways w inner join
dbo.ways_tags wt on w.id = wt.id)u
group by [user] with rollup
order by Contributions DESC

Contributions	User
684838	NULL
24351	premkumar
23890	jasvinderkaur
23451	akhilsai
23046	saikumar
20588	vamshikrishna
19569	shekarn
17434	himalay
17334	PlaneMad
17276	hareesh11
17221	himabindhu

User Distribution (Top 10)¶

This shows the top 10 areas in which most number of users have contributed towards in descending order.

select TOP 10 count(distinct [user]) AS Users, Value AS Postalcode
from (
select [user],value
from dbo.nodes n 
inner join dbo.nodes_tags nt on n.id = nt.id
where nt.[key] = 'postcode'
UNION 
select [user],value
from [Udacity].[dbo].ways w 
inner join dbo.ways_tags wt on w.id = wt.id
where wt.[key] = 'postcode')u
group by Value 
order by Users DESC

Users	Postalcode
56	560037
45	560076
42	560066
42	560102
42	560001
40	560068
40	560095
38	560038
38	560043
38	560078

Number of Nodes¶

select count(*) AS Nodes from dbo.nodes

Nodes
2864255

Number of Ways¶

select count(*) AS Ways from dbo.ways

Ways
656294

Most contributed to areas¶

select tags.value, count(*) as count 
from (select * FROM nodes_tags UNION ALL 
      select * FROM ways_tags) tags
where tags.[key] = 'city'
group by tags.value
order by count DESC

Area	Count
Banashankari,Bangalore	29
Marathahalli, Bangalore	29
Whitefield, Bangalore	20
Whitefield,Bangalore	17
Basaveshwara Nagar, Bangalore	17
Gandhi Nagar, Bangalore	15
Hoodi, Mahadevapura, Bangalore	10
Koramangala, Bangalore	10
Mathikere, Bangalore	9
Mahadevapura, Bangalore	9
Indira Nagar, Bangalore	8

Walkability¶

By looking at the number of sidewalks available, I wanted to see how walk friendly the city is. Below are some statistics grouped on the sides on which the walk ways are available on.

select count(distinct  id) AS Sidewalks, value AS Value from dbo.ways_tags 
where  [key] = 'sidewalk' and value != 'none' and value != 'no' 
group by value
order by Sidewalks DESC

Sidewalks	Value
117	both
76	left
26	right
5	separate

Given the size of the city, I believe the number of walk ways could be improved.

Bike lanes¶

I was surprised to find tags for bike lanes and I had to pull some statistics for this as well. Unsurprisingly, there are very few bike lanes in the city and a lot could be improved.

select  count(*) as Bikelanes, value 
from (
select * from dbo.nodes_tags
where [KEY] = 'cycleway'
UNION
select * from dbo.ways_tags
where [KEY] = 'cycleway'
)b
group by value
order by Bikelanes desc

Bikelanes	value
437	no
21	lane
10	shared_lane
1	track

Top 10 Amenities¶

Below are the top 10 amenities in the nodes and ways for the city of Bangalore. It is no surprise that restaurants top the list of amenities in the city.

SELECT  top 10 COUNT(*) as count, value 
from (
select * from dbo.nodes_tags
where [KEY] = 'amenity'
UNION
select * from dbo.ways_tags
where [KEY] = 'amenity'
)q
group by value
order by count desc

Count	Value
1665	restaurant
1040	place_of_worship
773	atm
732	bank
691	school
571	hospital
521	pharmacy
491	fast_food
350	cafe
317	fuel

Health & Security¶

select count(*) as num,value from (
select nt2.value from dbo.nodes_tags nt1
inner join dbo.nodes_tags nt2 on nt1.id=nt2.id
where nt1.value IN ('police', 'hospital') and nt2.[key] = 'amenity'
UNION ALL
select wt2.value from dbo.ways_tags wt1
inner join dbo.ways_tags wt2 on wt1.id = wt2.id
where wt1. value IN ('police', 'hospital') and wt2.[key] = 'amenity'
)q
group by value
order by num DESC

count	value
580	hospital
111	police

It is also a good idea to find out how the police stations and hospitals are distributed around the city.

Leisure¶

Lastly, since Bangalore gets tagged as the 'Garden city', I wanted to look at the number of parks and gardens available in the city.

SELECT  top 10 count(*) as count, value 
from (
select * from dbo.nodes_tags
where [KEY] = 'leisure'
UNION
select * from dbo.ways_tags
where [KEY] = 'leisure'
)q
group by value
order by count desc

count	value
729	park
282	sports_centre
275	pitch
157	playground
131	garden
80	swimming_pool
49	common
23	fitness_centre
18	track
18	stadium

As you can see, there are a lot of leisure activities and parks top the list. I also looked for any references to trees in the dataset and just found two instances in nodes_tages where trees where referred to but I have to conclude that the information at this point is inadequate.

Additional Ideas¶

There is still a lot that could be done statistically by looking at different levels of data in the nodes and ways. Below are some of the ideas that I had but didn't want to elaborate as the answers to these questions have no element of surprise with this data set (although these might be useful for other cities).

Different cuisines of restaurants could be explored. In the case of Bangalore, the most prevalent cuisine is still different Indian cuisines and the rest lag way behind them.
Different religions and their places of worship.
Amenities were not included in the data analysis and I noticed that most of the amenities like restaurants and their cuisines, cafes could use some clean up from what I noticed.

Below is a sample SQL query that could be used to get to the above questions simply by swapping the 'value' and 'key' in the query.

select count(*) as num, Value from (
select nt2.value from dbo.nodes_tags nt1
inner join dbo.nodes_tags nt2 on nt1.id=nt2.id
where nt1.value = 'place_of_worship' and nt2.[key] = 'religion'
UNION ALL
select wt2.value from dbo.ways_tags wt1
inner join dbo.ways_tags wt2 on wt1.id = wt2.id
where wt1. value = 'place_of_worship' and wt2.[key] = 'religion'
)q
group by value
order by num DESC

The distribution of schools, hospitals and police stations in the city could be analyzed to provide some insights on how accessible education, health and security is in the city.

Below is a sample query and resultset that provides distance from the city center.

--POINT(12.9716 77.5946) Bangalore city center taken from Google 
DECLARE @citycenter geography = 'POINT(12.9716 77.5946)'
select distinct n.id, round(@citycenter.STDistance('POINT('+lat+' '+ lon+')')/1000,2) AS 'Distance from the city center(in kms)' from dbo.nodes_tags nt
inner join dbo.nodes n on nt.id = n.id 
where  nt.value like '%police%'
order by [Distance from the city center(in kms)] ASC

id	Distance from the city center(in kms)
263155937	0.25
4489908738	0.31
660666061	0.32
2577231853	0.35
428843493	0.43

I also noticed that the accuracy of some of the data is questionable and an option for users to 'Report a Problem' which would inactivate a particular tag for example would make the data more reliable and accurate.

In addition to benefits, these additional ideas could bring in some problems as well. Below are some benefits and anticipated problems by following these ideas, in my opinion.

Benefits¶

The data would be more cleaner and readable
Better and moer accurate analysis

Anticipated problems¶

Cleaning the data of lesser known or recognized amenities like cafes or restaurants could be a challenge as their data cannot be verified easily
'Report a problem' feature or anything similar takes time to receive feedback and ultimately clean data

Conclusion¶

After looking at data for the city of Bangalore, I believe it could use a lot more contributions from users. For the purposes of this exercise, I believe the data set has been cleaned. Although, I've added a few more ideas for cleaning and analyzing. Building up on these ideas will make for a more complete data cleanser, which will ultimately make the data on OpenStreetMap.org more accurate and readable for the city of Bangalore.

While cleaning up the postal codes and phone numbers of the city, I used a national standard website as a resource to clean and standardize data. Such sources would work out to be more reliable and readable for users and could be used in cleaning up more of OpenStreetMap data.