Mom Wants to Buy A House

12 min readFeb 9, 2021

In the second half of 2020, Mom told me she wanted to sell our current house and buy a smaller one. “This house is too big since your dad passed away and your siblings have moved out. I want a smaller house, preferably a single storey for cheaper maintenance. But I don’t want to move far away from here and I’d like to have a similar space as our current house” she said.

My current house specifications: 2 storeys, 110 square meter land, 160 square meter building, 5 bedrooms, 2 bathrooms, and a carport for a single car.

So I started looking on property marketplaces to find a good deal matched with Mom’s requirements. OLX was the first my “go-to” (and was the only one). I was devastated by the number of listings they have. There are about 450 thousands property listings in my city. Even with a strict filter applied, there are still around 15 thousand houses that could potentially be my mom’s new house.
How could I filter more? How do I find the best-valued house? I need to find a way to find the best possible one.

Web Scrapping

First I have to make a web scraper. There is no way I could analyze 15 thousand listings on hundreds of paginated web pages online. I found a good simple video scrapping on Amazon with Python. I copied the methods and adjusted them based on my need.

When scrapping, I found out that not all information I need is presented on the search result screen. I need to dive inside every listing page and scrape more information there. For that, I created a two layers scrapper within my scrapper robot.

The disadvantage of my method is the time consumption. If you know how Selenium works, you will understand that my robot is imitated human behavior by opening a web page through a browser. Then, when the page opened, BeautifulSoup will get the data from raw HTML code. So, to scrape 15 thousand items, the robot needs to open those 15 thousand pages plus few hundred search result pages.

With some simplifications, like not loading images and JavaScript when loading pages, my robot can scrape at speed of 50 pages per minute. To scrap 15,000+ 500 (remember I need to scrape URLs from search page result first), I need around 5 hours 10 minutes.

I am not going to scrape all those at first because my robot has not been tested yet. Imagine running the robot for 4 and a half hours then suddenly there is a crash. Therefore, I tried it only to scrape 600 listings for an initial experiment, also to make data processing easier later.

After some debugs, I finally able to convert them into an Excel file. Why do I need an Excel file? For me, although I know Pandas, it is still easier to evaluate the result and see the data wholly to determine the next step. Also, I can easily pass along the raw scraped data if anyone wanted it. Of course, the first thing I noticed when I see my Excel file I have this very thought: Data is so messy.

2. Data Cleaning

I have only collected 11 columns in my data, they are ID, Title, Price, Floor, Building Size, Land Size, Bedrooms, Bathrooms, Location, Address, Date Posted, and URL. My collected data was as messy as below when first scraped.

Here some information about the data:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               600 non-null    object 
 1   Title            600 non-null    object 
 2   Harga            600 non-null    object 
 3   Lantai           563 non-null    float64
 4   Luas Bangunan    600 non-null    int64  
 5   Luas Tanah       600 non-null    int64  
 6   Kamar Tidur      600 non-null    object 
 7   Kamar Mandi      600 non-null    object 
 8   Lokasi           600 non-null    object 
 9   Alamat           590 non-null    object 
 10  Tanggal Posting  600 non-null    object 
 11  URL              600 non-null    object 
dtypes: float64(1), int64(2), object(9)
memory usage: 56.4+ KB

Now let me give you the highlight of the problems of each column and how I wanted it to be:

ID: A string with combinations of letters and numbers
What I want: Only numbers with Integer type data
Price(Harga): A string with combinations of letters and numbers. Numbers are denoted with Indonesian style thousands separator
What I want: Only numbers with Integer type data without thousand separators
Floor (Lantai): Float data type with NaNs among them
What I want: Integers data type, with NaN replaced by 0
Bedrooms(Kamar Tidur) and Bathrooms(Kamar Mandi): String data type with ‘>’ marker in some entries
What I want: Integer data type and ‘>’ removed
Location (Lokasi): Districts name followed by city name
What I want: Only district names
Address (Alamat): Some NaNs exist
What I want: Replace NaNs with ‘Tidak ada alamat’ string
Date Posted (Tanggal Posting): A string, inputted with Indonesian months without year value, if the posting date near current day it shows ‘Today’, ‘Yesterday’ or ‘6 days ago’ in Indonesian
What I want: Datetime data type with dd/mm/yyyy format

When those columns are fixed, now let’s see the data’s anomalies.

Looks like Price (Harga) digits are too much, it will be ugly later in the chart if I don’t shorten it. So, I will simplify it into million forms. The other anomaly is on the Floor(Lantai) column. The maximum number is 22. Never in my life, I see a 22 storey house. I assume this might be a typo, the advertiser intended to input 2 but accidentally pressed another 2. Another scenario might be an apartment on the 22nd floor. Let’s look at this special house deeper.

If you know my area, you will be sure that there is no such high-rise building near that address. So, I will be going to change the floor from 22 to 2.

Next step, I need to make sure that the ID is unique. After filtering, I found a couple of IDs are duplicated. One of them must be deleted.

I need to do the same for Title columns. You know the property broker tend to spam their properties into marketplaces to get more views. I need to filter them out from my data.

Finally, my data is ready. It looks clean and ready to be analyzed =)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 590 entries, 0 to 599
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               590 non-null    int64 
 1   Title            590 non-null    object
 2   Harga            590 non-null    int32 
 3   Lantai           590 non-null    int64 
 4   Luas Bangunan    590 non-null    int64 
 5   Luas Tanah       590 non-null    int64 
 6   Kamar Tidur      590 non-null    int64 
 7   Kamar Mandi      590 non-null    int64 
 8   Lokasi           590 non-null    object
 9   Alamat           590 non-null    object
 10  Tanggal Posting  590 non-null    object
 11  URL              590 non-null    object
dtypes: int32(1), int64(6), object(5)
memory usage: 57.6+ KB

3. Wrangling and Analysis

Here we arrived at the fun part. The data is cleaned and now we are ready to get some insight from it.

First, let’s see the mean of house prices in each district. Mom is currently staying in Pondok Aren district, and she prefers the nearer area. Unfortunately, the house prices in that district are averagely higher than in other areas. Rationally thinking if she wants to stick in this district she must prepare more money, or we should work hard to find a bargain within the market.

Now let’s turn our attention to the number of house listings throughout the city. Pondok Aren takes almost half a portion of the listings, almost double compared to the closest rank Serpong. Other districts are less than or shy of 50 listings.

Now let’s find a bargain house. Houses are mostly measured by the area of land or building. Rationally thinking, the cheaper the ratio of Price to Land or Price to Building, the better deal it should be. But in this case, an overly cheap house is suspicious could be considered suspicious. It could be fake advertising, or the house condition is so bad. What I want to find here is a good condition house but at a bargain price. Too bad the website did not score the house conditions in their platform, so I have to do it on my own. Alright, first let’s see the Price to Land for each of the houses we have.

I can see the data mostly have a positive correlation, which makes sense, the larger the land, the more expensive the house is. I know that Ciputat and Ciledug are the two cheapest areas. But the graph shows us Pondok Aren and Serpong have some deals that look like a bargain too. How do I know that? Well, I simply look at the dots that tend to be more in the right side position compared to majority dot clusters, but not too right because it is where suspicious deals are located. It is not so good to classify them in the graph, I will look into this in more detail shortly after. Now let's turn the attention to Price to Building ratio.

The Price to Building ratios looks very similar to the Price to Land ratios. Even you can see some outliers of Price to Land ratios are an outlier of Price to Building ratios. This makes perfect sense since both Price to Land and Price to Building ratios are strongly correlated.

Since Mom is looking for houses with small land area, I would only use Price to Building ratios. I would not happy having land with a smaller house in it, given that the land is already small.

I now can start to identify bargain deals, I need to classify those shiny looking deals from defective ones. This means I need to give labels for every deal in data.

I will do it by simply label out the lowest 10% of the Price to Building ratio’s standard deviation as a suspicious deal. This means that if the ratio is too far away from the overall mean by more than 90% of the standard deviation, that is a suspicious deal. Let’s look at the distribution.

count    590.000000
mean      15.942947
std        5.800557
min        3.100000
25%       12.072557
50%       15.500000
75%       18.856826
max       65.000000

Overall, the price for each square meter is 15.9 million rupiahs. With the standard deviation is 5.8 million rupiahs, suspicious deals will be fall in price below 10.7 million rupiahs per square meter. If you are not happy with the mean as the middle value, the mode of distribution is 15 million rupiahs. So the threshold will be 9.8 million rupiahs. Alternatively, you can use the median as the middle point as well. I will go for mode as I think the price of properties determined by the convention of local area sellers. My decision is very debatable, but for this case, the difference will not much as the mean, median, and mode are in the 15–15.9 million range.

Now let’s define the upper threshold for our bargain deals. The mode can be used as a good threshold as it’s a middle point of the deal, making any price deal below it is a good deal. But I would like to push it down a little. I would go for 20% below mode as my threshold. I think below 20% of the difference will not enough monetary incentive for me to see the deal as a bargain. So my upper threshold will be 13.8 million per square meter of building.

For the fun part, I will apply the same threshold for the upper part of the list. So anything beyond 20% of standard deviation from mode will be labeled as ‘expensive’ and beyond 90% as ‘very expensive’.

Another perspective of labeling, I will also add another label for district specified deals. This will help me identify a good deal in an expensive district like Pondok Aren or Serpong Utara. I can already imagine if I use only the overall data mode and thresholds, my result will be dominated by listings from cheaper districts. Now let’s see how the data will look like.

Serpong is the district that has the most bargain proportions while Pondok Aren has a greater quantity in bargain deals(Mom will be happy to hear this). Ciputat which has the lowest average price shows the biggest proportion of suspicious deals.

For district specified deals or local deals, each district has more expensive-very expensive deals compared to suspicious-bargain, except Serpong Utara. Ciledug surprisingly has no suspicious deals within it.

Now my data looks like below:

The data above is not sorted. Therefore there is no means of those which appear at top 10. To make it has some meanings, I will sort the data by Price per Square Meter Building(Harga Per Meter Bangunan). One thing I can notice that the Deals and Local Deals column can have a different value. It makes sense since the price standard for each district is different.

Top 10 data, sorted by Price per Building meter

Now the table is sorted, as I expected, top listings will be considered suspicious since they belong to >90% standard deviation distance from the middle point. Since I want to avoid suspicious deals, I have to cut them out from the data. So the data will look like below:

Now the suspicious data were out, I can see bargain deals filling in. But those on the Local Deals column still have some Normal deals. This is happening because the middle point of the Ciputat and Ciledug district is lower than the overall middle point. So although the deals are considered bargain deals by overall standard, they are just normal deals compared to the local district.

This list can be considered good in my opinion. Those considered bargain on overall standard but normal on the local standard can be expected to have higher quality than a double bargain deal.

But for the sake of perfection, just a single bargain on the Deals column is not enough! I want a double bargain!

Here is my ultimate list! Let's call it a Double Bargain list. Now you will notice that all Ciledug and Ciputat listings have been kicked out in this list. All those listed are fit in terms of pricing related to building areas. Any of those properties are guaranteed below the normal price.

See below for double bargain stats:

I have 77 potential houses ranging between 9.8–13.8 million rupiah per square meter of building, 76–450 square meter of building on 850 million to 5.8 billion rupiahs in price. Wait a minute! who said Mom can afford a 5.8 billion house?

Well, it’s time for us to apply the last filter: price constraint. Mom’s budget at maximum is only 1.8 billion rupiahs, so I will just throw whatever beyond that.

So this is my finalist! I now have only 18 properties to look and those are the best we have within my Mom’s criteria. I hope Mom will be satisfied with my work =)

Hi! thank you for reading this far. I know my article is so limited in techniques with a lot of assumptions I poured there. With all that I have learned so far, this is the best I can produce by now. Hopefully, my next writings will be far improved. I would really appreciate it if you can leave a comment on how I can make this article better in any way. I am more than thrilled to hear from you. If you want to approach me directly, please just reach my Twitter at @itskido.

If you want to see the technical scratch, find it below:

Bargain Houses

Problems for Tanggal Posting's column: data marked as 'today' and 'yesterday' for today's and yesterday's posting date…

nbviewer.jupyter.org

Mom Wants to Buy A House

Bargain Houses

Problems for Tanggal Posting's column: data marked as 'today' and 'yesterday' for today's and yesterday's posting date…

Written by Ahmad Izzudin