Exploratory Data Analysis on the Apple App Store Dataset

7 min read · Mar 8, 2021

Overview:

Apple is one of the fastest-growing technology companies in the world. The simple, ever-evolving design of its products has attracted millions of people. The App Store is a platform Apple designed that lets users browse and download applications on Apple’s mobile devices. It is optimized so that users can download an app of interest with a single tap, which in turn encourages them to add more functionality to their Apple mobile devices.

With the increasing usage of the iPhone and other Apple mobile devices, many app developers see the App Store as a growing opportunity. According to Statista, there were nearly 4.4 million apps available in the Apple App Store as of July 2020. Although mobile apps can bring huge profits to their developers, the field is still very competitive. For this exploratory analysis, I use an Apple App Store dataset to gain some insight into which kinds of apps can be turned into profitable opportunities.

Dataset:

I used a Kaggle dataset named Mobile App Statistics (Apple iOS app store). It contains details on more than 7,000 mobile applications in the Apple App Store and was extracted from the iTunes Search API in July 2017. The dataset consists of two CSV files: appleStore.csv, which holds the details of each application, and appleStore_description.csv, which holds the description of each application. For this analysis, I merged the two files on the App ID they share. Because the remaining columns of appleStore_description.csv duplicate columns in appleStore.csv, I kept only its app description column for the merge. After the merge, I set the App ID as the index of the DataFrame and kept all other attributes. The resulting DataFrame has 7197 rows and 16 columns, with no missing values and no duplicated rows.
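The merge described above can be sketched with pandas roughly as follows. This is a minimal illustration rather than the exact code behind the analysis: the helper name `merge_app_tables` is hypothetical, and the two small DataFrames are made-up rows standing in for the CSV files.

```python
import pandas as pd


def merge_app_tables(apps: pd.DataFrame, desc: pd.DataFrame) -> pd.DataFrame:
    """Attach app_desc to the app details and index the result by App ID."""
    # Keep only the key and the description so duplicated columns are not merged twice.
    desc_only = desc[["id", "app_desc"]]
    # validate="one_to_one" raises an error if an id is repeated in either table.
    merged = apps.merge(desc_only, on="id", how="inner", validate="one_to_one")
    return merged.set_index("id")


# Made-up rows for illustration only (not taken from the dataset).
apps = pd.DataFrame({
    "id": [1, 2, 3],
    "track_name": ["A", "B", "C"],
    "size_bytes": [100, 200, 300],
    "price": [0.0, 1.99, 0.0],
})
desc = pd.DataFrame({
    "id": [1, 2, 3],
    "track_name": ["A", "B", "C"],
    "size_bytes": [100, 200, 300],
    "app_desc": ["first", "second", "third"],
})

merged = merge_app_tables(apps, desc)
missing_values = int(merged.isnull().sum().sum())  # total count of missing cells
duplicate_rows = int(merged.duplicated().sum())    # count of fully duplicated rows
```

With the real files, the two inputs would come from `pd.read_csv` on appleStore.csv and appleStore_description.csv, and the same two checks would confirm that there are no missing values or duplicated rows.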

The contents of appleStore.csv are as follows:

“id”: App ID
“track_name”: App name
“size_bytes”: Size (in bytes)
“currency”: Currency type
“price”: Price amount
“rating_count_tot”: User rating count (for all versions)
“rating_count_ver”: User rating count (for the current version)
“user_rating”: Average user rating (for all versions)
“user_rating_ver”: Average user rating (for the current version)
“ver”: Latest version code
“cont_rating”: Content rating
“prime_genre”: Primary genre
“sup_devices.num”: Number of supported devices
“ipadSc_urls.num”: Number of screenshots shown for display
“lang.num”: Number of supported languages
“vpp_lic”: VPP device-based licensing enabled

The contents of appleStore_description.csv are as follows:

“id”: App ID
“track_name”: App name
“size_bytes”: Size (in bytes)
“app_desc”: App description

Guiding Questions:

To figure out which apps in the Apple App Store have the potential to be successful, I came up with the following guiding questions.

What are the app statistics for different groups?
Are paid apps better than free apps?
What are some possible factors that contribute to higher user ratings?

App Statistics for Different Groups

I started my analysis by looking at the app statistics for different groups. I first counted how many apps are in each category and plotted the counts as a bar chart. The five largest categories are Games, Entertainment, Education, Photos & Videos, and Utilities. One interesting finding is that the largest category, Games, is more than six times the size of the second largest, Entertainment.
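Counting apps per category, as described above, comes down to a single `value_counts` call on the `prime_genre` column. The sketch below uses a handful of made-up rows, so the ratio it produces is illustrative only.

```python
import pandas as pd

# Made-up genre labels for illustration; the real column is prime_genre.
apps = pd.DataFrame({
    "prime_genre": ["Games"] * 6 + ["Entertainment"] * 2 + ["Education"],
})

# value_counts sorts categories from most to least frequent.
genre_counts = apps["prime_genre"].value_counts()

# How many times larger the biggest category is than the runner-up.
ratio = genre_counts.iloc[0] / genre_counts.iloc[1]
```

If matplotlib is installed, `genre_counts.plot.bar()` would draw the corresponding bar chart.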

Digging deeper into the categories, I plotted a pie chart of their distribution to see what share of the App Store the largest categories account for. I kept the five largest categories and grouped everything else under Others. As the pie chart below shows, the largest category, Games, makes up 53.7% of the apps in the App Store. The second and third largest, Entertainment and Education, make up 7.4% and 6.3% respectively. The Others group, which covers everything outside the top five, accounts for 24.3%. Games is clearly the category with the most apps; however, it is likely to be a very competitive one, since the number of direct competitors is large.
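One way to build the data behind such a pie chart is to keep the top categories and fold the rest into a single Others slice. The helper `top_n_with_others` is a hypothetical name, and the counts are made up for illustration.

```python
import pandas as pd


def top_n_with_others(counts: pd.Series, n: int = 5) -> pd.Series:
    """Return percentage shares of the n largest categories plus an 'Others' slice."""
    counts = counts.sort_values(ascending=False)
    top = counts.iloc[:n].copy()
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top["Others"] = rest  # everything outside the top n
    return top / top.sum() * 100


# Made-up counts for illustration only.
counts = pd.Series({
    "Games": 50,
    "Entertainment": 20,
    "Education": 15,
    "Utilities": 10,
    "Weather": 5,
})

# n=2 here only because the toy data has five categories; the article keeps five.
shares = top_n_with_others(counts, n=2)
```

The resulting Series can be passed to a pie-chart function directly, since the values already sum to 100.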

I also split the categories by whether the app is free or paid. Three things are worth noting in these two bar plots. First, the number of free games is, surprisingly, higher than the number of paid games. Second, almost all Social & Networking apps are free. Third, a closer look at the tail of the paid apps plot shows that all the Shopping and Catalog apps are free as well.
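The free versus paid split per category can be tabulated with `pd.crosstab`. The sketch assumes that an app counts as free when its `price` is 0; the `pricing` column is a helper introduced here, and the rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
apps = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Games", "Shopping", "Shopping", "Utilities"],
    "price": [0.0, 0.0, 2.99, 0.0, 0.0, 0.99],
})

# Assumption: price == 0 means free, anything above 0 means paid.
apps["pricing"] = apps["price"].gt(0).map({True: "paid", False: "free"})

# Rows are genres, columns are "free" and "paid", cells are app counts.
by_pricing = pd.crosstab(apps["prime_genre"], apps["pricing"])

# Genres in which no app is paid.
all_free = by_pricing.index[by_pricing["paid"] == 0].tolist()
```

Plotting each column of `by_pricing` separately would give the two bar plots discussed above.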

Lastly, I looked at the average user rating of each category. The average ratings for Productivity, Music, and Photos & Videos are higher than those of the other categories, while Finance and Catalog have the lowest. I think a common characteristic of the top three categories is that they all give users some degree of joy and satisfaction, whereas the bottom two may sometimes cause dissatisfaction. This might suggest that apps which bring users pleasure and satisfaction tend to receive higher ratings.
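The per-category average can be computed with a `groupby` on `prime_genre`. The rows below are made up, so the ranking they produce only illustrates the mechanics.

```python
import pandas as pd

# Made-up rows for illustration only.
apps = pd.DataFrame({
    "prime_genre": ["Music", "Music", "Finance", "Finance", "Games"],
    "user_rating": [4.5, 4.0, 2.0, 3.0, 4.0],
})

# Mean rating per genre, sorted from highest to lowest.
mean_rating = (
    apps.groupby("prime_genre")["user_rating"]
    .mean()
    .sort_values(ascending=False)
)

best_genre = mean_rating.index[0]
worst_genre = mean_rating.index[-1]
```

Categories with very few apps can swing the mean a lot, so it may be worth checking the group sizes alongside the averages.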

Are Paid Apps Better Than Free Apps?

In this section, I explore paid and free apps in more detail. To start, I plotted user rating against price to see whether a higher price goes with a higher rating. According to the scatter plot below, there seems to be no significant relationship between price and rating. From price and user rating alone, we cannot conclude that pricier apps are better than free apps.
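A scatter plot can be complemented by a correlation coefficient. The sketch below computes the Pearson correlation between `price` and `user_rating`; the helper name `price_rating_correlation` is hypothetical and the rows are made up, so the value says nothing about the real dataset.

```python
import pandas as pd


def price_rating_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between price and user_rating."""
    return float(df["price"].corr(df["user_rating"]))


# Made-up rows for illustration only.
apps = pd.DataFrame({
    "price": [0.0, 0.0, 0.99, 1.99, 4.99],
    "user_rating": [4.5, 3.0, 4.0, 3.5, 4.0],
})

pearson = price_rating_correlation(apps)
```

A coefficient close to 0 would be consistent with the scatter plot showing no clear linear relationship, although it cannot rule out a non-linear one.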

I then plotted the ratings of paid apps and free apps in a bar chart and found a small difference between the two groups: paid apps are rated slightly higher than free apps.

There is no attribute in this dataset for the number of users, but I think the total rating count can serve as an indicator of each app’s popularity. If an app has a higher rating count, it probably has a greater number of users. Therefore, I plotted the total rating counts of the paid apps and free apps. The chart shows that the total rating count for free apps is significantly higher than for paid apps. From this, we may infer that free apps usually have a larger user base and are thus more likely to attract users than paid apps.
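Both comparisons, ratings and rating counts for free versus paid apps, can be produced in one grouped aggregation. The sketch assumes that `price` equal to 0 means free and that the rating counts are summed per group; the article does not say which aggregate was plotted, so the median is included as an alternative. The rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
apps = pd.DataFrame({
    "price": [0.0, 0.0, 0.0, 1.99, 4.99],
    "user_rating": [4.0, 3.0, 3.5, 4.0, 4.5],
    "rating_count_tot": [1000, 3000, 2000, 400, 200],
})

# Assumption: price == 0 means free, anything above 0 means paid.
apps["pricing"] = apps["price"].gt(0).map({True: "paid", False: "free"})

summary = apps.groupby("pricing").agg(
    n_apps=("price", "count"),                          # apps per group
    mean_rating=("user_rating", "mean"),                # average user rating
    total_rating_count=("rating_count_tot", "sum"),     # summed rating counts
    median_rating_count=("rating_count_tot", "median"), # typical count per app
)
```

Because there are usually many more free apps than paid ones, a summed count favours the free group by construction; the per-app median is a fairer comparison of how much attention a typical app receives.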

What Are Some Possible Factors That Contribute to Higher User Ratings?

To find factors that could lead to a high user rating, I went through the contents of the dataset and selected five attributes: the number of languages supported, the number of devices supported, the number of screenshots displayed, the size of the app, and the total rating count. I first plotted the first four of these against the user rating. The results are in the subplots shown below.

In these subplots, the number of languages supported, the number of screenshots displayed in the App Store, and the size of the app in bytes all clearly increase as the user rating increases. Although each of the three dips slightly towards the perfect rating, the lines trend upward overall. Thus, the number of languages supported, the number of screenshots displayed, and the size of the app may affect the user rating. By contrast, the number of devices supported seems to have very little effect on the user rating, as that plot shows no obvious upward or downward trend.
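Lines like the ones described above could be obtained by averaging each attribute per user rating value. That is an assumption about how the subplots were built, not something the article states, and the rows below are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
apps = pd.DataFrame({
    "user_rating": [3.0, 3.0, 4.0, 4.0, 4.5],
    "lang.num": [1, 3, 4, 6, 10],
    "sup_devices.num": [37, 38, 37, 38, 37],
    "ipadSc_urls.num": [2, 4, 5, 5, 4],
    "size_bytes": [50e6, 70e6, 150e6, 250e6, 300e6],
})

features = ["lang.num", "sup_devices.num", "ipadSc_urls.num", "size_bytes"]

# Assumption: each line shows the mean of one attribute per rating value.
per_rating = apps.groupby("user_rating")[features].mean()

# True when the mean language count never decreases as the rating rises.
languages_trend_up = bool(per_rating["lang.num"].is_monotonic_increasing)
```

Each column of `per_rating` corresponds to one subplot, with the rating values on the x-axis.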

Next, I plotted the total rating count against the user rating. The scatter plot on the left has some outliers, so to make the trend clearer I plotted another graph on the right that excludes the extremes and adds a red trendline. The trendline is almost horizontal, indicating that the total rating count may have little effect on the user rating.
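Excluding extremes and fitting a trendline can be sketched as below. The article does not say how the extremes were removed, so the quantile cutoff is an assumption, as is the use of a straight-line least-squares fit via `np.polyfit`. The helper name `trendline_without_extremes` is hypothetical and the rows are made up.

```python
import numpy as np
import pandas as pd


def trendline_without_extremes(df, x="rating_count_tot", y="user_rating", quantile=0.99):
    """Drop rows whose x value lies above the given quantile, then fit y = slope * x + intercept."""
    cutoff = df[x].quantile(quantile)
    trimmed = df[df[x] <= cutoff]
    # Degree-1 least-squares fit; polyfit returns the slope first, then the intercept.
    slope, intercept = np.polyfit(trimmed[x], trimmed[y], deg=1)
    return trimmed, float(slope), float(intercept)


# Made-up rows for illustration only; the last row is an extreme outlier.
apps = pd.DataFrame({
    "rating_count_tot": [10, 20, 30, 40, 50, 1_000_000],
    "user_rating": [4.0, 4.0, 4.0, 4.0, 4.0, 1.0],
})

# A lower quantile is used here because the toy sample has only six rows.
trimmed, slope, intercept = trendline_without_extremes(apps, quantile=0.9)
```

A slope near zero corresponds to the almost horizontal trendline described above. The cutoff is a judgment call, and a different threshold may change the fitted line.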

In this section, I found that the possible factors contributing to a higher app rating are the number of languages supported, the number of screenshots displayed, and the size of the app. More specifically, an app may be more likely to earn a higher rating if it supports more than three languages or if its developers show more than three screenshots of the app in the App Store.

Source used: Statista