Tuesday, July 23, 2019

Capstone Project - Business Prospect and Venues Data Analysis in Bali, Indonesia



1. Introduction

Bali is one of Indonesia's main tourism destinations and has seen a significant rise in tourists since the 1980s. Tourism-related business makes up 80% of its economy. The island is renowned for its highly developed arts, including traditional and modern dance, sculpture, painting, leather, metalworking, and music. It offers many business opportunities, with potential customers among both local residents and tourists.

Before opening a business in a specific location, owners need to research the neighborhood and weigh factors such as nearby residential areas, tourist attractions, office buildings, similar competitors, public facilities, and average rental prices.

2. Objective

This project gives restaurant business owners and investors insight for comparing the neighborhoods in the districts of Bali. It also helps them choose the best-suited locations based on the top ten most common venues surrounding each one.

Methods:
  • Web scraping the list of districts and their populations in Bali from Wikipedia.
  • Extracting the top trending venues using the Foursquare API.
  • Forming neighborhood clusters based on venue categories using the unsupervised k-means clustering algorithm.
  • Understanding the similarities and differences between districts to gain more insight and to conclude which neighborhood is best suited for business prospects.

3. Data Reference and Library

3.1. District list in Bali

I will extract a list of districts and their populations in Bali from the Wikipedia page here. I use the read_html() function to extract the HTML table from the page, then use the pandas library to load it into a dataframe, perform the necessary data clean-up, and work with the result.
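As a rough sketch of this step (not the notebook's actual code), the snippet below wraps read_html() in a small function and cleans a population column. The function names, the column name "Population", and the sample rows are assumptions made for illustration. Note that read_html() needs an HTML parser such as lxml to be installed.

```python
import pandas as pd


def load_district_table(url: str, table_index: int = 0) -> pd.DataFrame:
    """Fetch the HTML tables on a page and return the one at table_index."""
    # pd.read_html returns a list of DataFrames, one per <table> it finds.
    tables = pd.read_html(url)
    return tables[table_index]


def clean_population(df: pd.DataFrame, column: str = "Population") -> pd.DataFrame:
    """Turn a text population column such as '1,234' into integers."""
    out = df.copy()
    out[column] = (
        out[column]
        .astype(str)
        .str.replace(r"\[.*?\]", "", regex=True)  # drop footnote markers like [2]
        .str.replace(r"[^\d]", "", regex=True)    # drop thousands separators
        .astype(int)
    )
    return out


# Placeholder rows only; these are not the figures from the Wikipedia table.
raw = pd.DataFrame({"District": ["A", "B"], "Population": ["1,234", "5.678[2]"]})
cleaned = clean_population(raw)
print(cleaned)
```

In practice, load_district_table() would be called with the Wikipedia URL and the index of the district table, and the result passed to clean_population().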

3.2. Geolocation of each district in Bali

The geolocation data contains the longitude and latitude of each district in Bali, stored in a csv file that I uploaded to GitHub here. I load the data into a pandas dataframe and merge it with the district information above, then use the geolocation information to get the top venues from the Foursquare API.

3.3. Library 

  • Pandas for dataframes and other dataset manipulation.
  • Numpy for scientific computation.
  • Requests to call the Foursquare API.
  • KMeans from sklearn for clustering.
  • Matplotlib for plotting.
  • Seaborn for bar graph plotting.
  • Folium for map plotting.

4. Methodology and Process Execution

I use two different datasets. The first comes from web scraping Wikipedia to extract the district names and populations. I use the read_html() function and store the result in a pandas dataframe. There are nine districts in Bali, as shown below:

The second dataset holds the geolocation of the districts and sub-districts, which I read from a csv file in the remote GitHub repository. I use the read_csv() function to read the data and store it in a pandas dataframe. There are 57 sub-districts in Bali; here is the format of the dataframe:

After merging both datasets and removing the duplicate columns, the result has this format:
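A minimal sketch of that merge step, using placeholder rows rather than the actual data. The column names ("Name", "District", "Sub_District", "Latitude", "Longitude") are assumptions; the real files may use different ones.

```python
import pandas as pd

# Placeholder district table (stands in for the Wikipedia result).
districts = pd.DataFrame({
    "Name": ["Kota Denpasar", "Kabupaten Buleleng"],
    "Population": [100, 200],  # dummy numbers, not real populations
})

# Placeholder geolocation table (stands in for the csv file).
geo = pd.DataFrame({
    "District": ["Kota Denpasar", "Kota Denpasar", "Kabupaten Buleleng"],
    "Sub_District": ["Sub A", "Sub B", "Sub C"],
    "Latitude": [-8.60, -8.70, -8.10],    # illustrative coordinates
    "Longitude": [115.20, 115.25, 115.10],
})

# Each sub-district row picks up its district's population. The key columns
# have different names, so the redundant "Name" column is dropped afterwards.
merged = geo.merge(
    districts,
    left_on="District",
    right_on="Name",
    how="left",
    validate="many_to_one",  # fail loudly if a district name is duplicated
).drop(columns="Name")

print(merged)
```

A left join keeps every sub-district row even if its district name has no match in the population table, which makes mismatched spellings easy to spot as missing values.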

I used the Python folium library to visualize the geography of Bali island and its sub-districts, creating a map from the merged dataset as shown below:
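One way such a map could be built is sketched below, assuming the merged dataframe has "District", "Sub_District", "Latitude", and "Longitude" columns (hypothetical names). The helper names and the sample coordinates are also illustrative only.

```python
import pandas as pd


def map_center(df, lat_col="Latitude", lon_col="Longitude"):
    """Return [mean latitude, mean longitude], used to center the map."""
    return [float(df[lat_col].mean()), float(df[lon_col].mean())]


def build_map(df, zoom_start=9):
    """Build a folium map with one circle marker per sub-district."""
    import folium  # imported here so map_center works without folium installed

    m = folium.Map(location=map_center(df), zoom_start=zoom_start)
    for row in df.itertuples(index=False):
        folium.CircleMarker(
            location=[row.Latitude, row.Longitude],
            radius=5,
            popup=f"{row.Sub_District}, {row.District}",
            fill=True,
        ).add_to(m)
    return m


# Placeholder points; not the real sub-district coordinates.
points = pd.DataFrame({
    "District": ["District X", "District Y"],
    "Sub_District": ["Sub A", "Sub B"],
    "Latitude": [-8.0, -9.0],
    "Longitude": [114.5, 115.5],
})
center = map_center(points)
print(center)
```

Calling build_map(points) in a notebook returns a map object that renders inline; it can also be written to a file with its save() method.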

I used the Python seaborn library to create a bar graph of the district populations, using a different color for districts with a total population of more than 500,000 people:
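The coloring rule can be sketched as follows. This is an assumption about how the chart might be produced, not the notebook's code: the column names, color names, and sample numbers are placeholders, and the bars are recolored after plotting rather than through a seaborn palette.

```python
import pandas as pd

THRESHOLD = 500_000  # districts above this population get the highlight color


def bar_colors(population, threshold=THRESHOLD, high="tab:red", low="tab:blue"):
    """Return one color per bar: `high` if population > threshold, else `low`."""
    return [high if p > threshold else low for p in population]


def plot_population(df):
    """Draw the bar graph; needs matplotlib and seaborn to be installed."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(10, 5))
    sns.barplot(data=df, x="District", y="Population", color="lightgray", ax=ax)
    # Recolor each bar; this assumes the bars are drawn in the row order of df.
    for patch, color in zip(ax.patches, bar_colors(df["Population"])):
        patch.set_facecolor(color)
    ax.tick_params(axis="x", rotation=45)
    return ax


# Dummy populations for illustration only.
sample = pd.DataFrame({
    "District": ["District X", "District Y", "District Z"],
    "Population": [400_000, 600_000, 500_000],
})
colors = bar_colors(sample["Population"])
print(colors)
```

A district with exactly 500,000 people keeps the default color here, because the rule is "more than 500,000".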
I used the Foursquare API to explore venues in the focus districts: 'Kabupaten Buleleng', 'Kabupaten Karangasem', and 'Kota Denpasar'. Each API call uses a radius of 10 kilometers and a limit of 100 venues. Here is a snapshot of the API call results:
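A call like this might look as follows. The sketch assumes the legacy Foursquare v2 "venues/explore" endpoint and its client_id/client_secret authentication; the version date, function names, and the hand-written sample payload are illustrative and not taken from the notebook.

```python
import requests

# Assumption: the legacy Foursquare v2 explore endpoint.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def explore_params(lat, lng, client_id, client_secret,
                   version="20190701", radius=10_000, limit=100):
    """Build query parameters; radius is in meters (10 km = 10,000 m)."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }


def parse_venues(payload):
    """Flatten an explore-style response into a list of venue dicts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0]["items"] if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        rows.append({
            "name": venue.get("name"),
            "category": categories[0].get("name"),
            "lat": venue.get("location", {}).get("lat"),
            "lng": venue.get("location", {}).get("lng"),
        })
    return rows


def fetch_venues(lat, lng, client_id, client_secret):
    """Call the API for one location and return the parsed venues."""
    params = explore_params(lat, lng, client_id, client_secret)
    resp = requests.get(EXPLORE_URL, params=params, timeout=30)
    resp.raise_for_status()
    return parse_venues(resp.json())


# Hand-written payload that mimics the response shape; not real API output.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Warung",
               "categories": [{"name": "Restaurant"}],
               "location": {"lat": -8.65, "lng": 115.22}}},
]}]}}
rows = parse_venues(sample_payload)
params = explore_params(-8.65, 115.22, "ID", "SECRET")
print(rows)
```

Keeping the parsing separate from the HTTP call makes it possible to check the flattening logic without credentials or network access.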

Foursquare returned 20 venues. Here is the head of the merged table of sub-districts and venues:

Then I created a table that lists the top five venue categories for each sub-district. Here are the first records:
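One common way to build such a ranking is to one-hot encode the venue categories, average them per sub-district, and keep the most frequent ones. The sketch below follows that approach with placeholder rows; whether the notebook does exactly this is an assumption, as are the column names.

```python
import pandas as pd

# Placeholder venue rows; not the actual Foursquare results.
venues = pd.DataFrame({
    "Sub_District": ["Sub A", "Sub A", "Sub A", "Sub A", "Sub B", "Sub B", "Sub B"],
    "Category": ["Restaurant", "Restaurant", "Bar", "Beach", "Hotel", "Hotel", "Bar"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Category"], dtype=float)
onehot.insert(0, "Sub_District", venues["Sub_District"])

# The mean of the one-hot columns is each category's share of the venues.
freq = onehot.groupby("Sub_District").mean()


def top_categories(row, n=5):
    """Return up to n category names, most frequent first."""
    ranked = row[row > 0].sort_values(ascending=False, kind="stable")
    return ranked.index[:n].tolist()


top = {name: top_categories(row) for name, row in freq.iterrows()}
print(top)
```

Categories with no venues in a sub-district are dropped, so a sub-district with few venues can have fewer than five entries.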
I used the k-means algorithm to cluster the sub-districts. K-means is one of the most common clustering methods for unsupervised learning. I first ran k-means to group the sub-districts into 3 clusters. Below is my merged table with the cluster label for each sub-district.
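The clustering step can be sketched with scikit-learn's KMeans as below. The frequency table is made up for illustration (each row is a sub-district, each column a venue category share), and the random_state and n_init values are assumptions chosen to make the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up category shares per sub-district; not the study's data.
freq = pd.DataFrame(
    [
        [0.6, 0.4, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.9, 0.1],
        [0.0, 0.1, 0.0, 0.9],
        [0.0, 0.0, 0.1, 0.9],
    ],
    index=["Sub A", "Sub B", "Sub C", "Sub D", "Sub E", "Sub F"],
    columns=["Restaurant", "Bar", "Beach", "Hotel"],
)

# Fit k-means with 3 clusters, as in the text.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(freq.to_numpy())

# Attach the cluster label to each sub-district.
labels = pd.Series(kmeans.labels_, index=freq.index, name="Cluster")
print(labels)
```

The label numbers themselves are arbitrary; what matters is which sub-districts share a label. In this toy table the restaurant-and-bar rows, the beach rows, and the hotel rows each end up together.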

5. Results

I plotted the clustering results on a map of Bali:

Based on the clustering table, there are 3 sub-districts that have more restaurants and bars, as shown below:

6. Discussion

As mentioned in the introduction, Bali offers a large market and many business opportunities because it is a famous tourist destination, with potential customers among both local residents and tourists. The population of each of the 9 districts (called Kabupaten/Kota) gives an indication of the potential target market. For more detail, we can go down to the sub-districts (called Kecamatan) and look more closely at the venue categories.

I got the district list from Wikipedia and the sub-district geolocations from a csv file. After merging both into a dataframe, I plotted the populations as a bar plot, using a different color for districts with more than 500,000 people. The next step was to use the Foursquare API to get a list of available venues for the focus districts.

One challenge I experienced with the Foursquare API is that few venues are available, even with the radius parameter set to 10 km.

I ended the study by visualizing the data and the clustering information on the Bali map. In future studies, rental details per sub-district could be added to give business owners more insight when they consider opening a business in a specific location.

7. Conclusion

The clustering and comparison give people more insight when they want to open a business in a specific location. With this insight, business owners can make better decisions by using the reports and platforms where such information is provided.

Beyond business owners and investors, the results will also be useful for the local government to understand the business diversity and prospects in their area.


8. Reference

Visit my GitHub for details of the Python notebook here, or alternatively use my Jupyter notebook here.
