Motivation: Vast amounts of data contain hidden structure (patterns).

Method: Topic Modeling (LDA).

Software: Python backend (Gensim); HTML/JavaScript frontend (Google Compact Language Detector, Google Charts).


Most iTunes apps have English-language descriptions, so our analysis focuses exclusively on the English-language segment of the repository. This deliberate choice simplifies the analysis, makes the results easier to interpret, and lets us apply language-specific techniques such as lemmatization.
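As a rough illustration of restricting the data to its English-language segment, the sketch below keeps only the records tagged as English. It assumes that each app record already carries a language code from an earlier detection step. The record layout, the field names (`lang`, `description`), the sample apps and the helper `english_only` are all hypothetical and not taken from the project.

```python
# Hypothetical toy records; the real data and its field names may differ.
# The "lang" code is assumed to come from a prior language-detection step.
apps = [
    {"name": "Weather Now", "lang": "en", "description": "Hourly forecasts and radar maps."},
    {"name": "Wetter Heute", "lang": "de", "description": "Stuendliche Vorhersagen und Radarkarten."},
    {"name": "Budget Buddy", "lang": "en", "description": "Track your spending and savings goals."},
    {"name": "Recetas Hoy", "lang": "es", "description": "Recetas sencillas para cada dia."},
]


def english_only(records, lang_key="lang"):
    """Return only the records whose language code is 'en'."""
    return [record for record in records if record.get(lang_key) == "en"]


english_apps = english_only(apps)
for app in english_apps:
    print(app["name"])
```

Filtering before any further processing means that later language-specific steps, such as lemmatization, only ever see English text.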


The predefined iTunes categories are limited to 22 general-purpose areas.

The chart below shows the language distribution excluding English (66 languages). German, French and Spanish apps dominate. We exclude these non-English applications from the rest of the analysis.
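A language distribution of the kind described above could be tallied along the following lines. This is a minimal sketch under stated assumptions: the input is a plain list of detected language codes, and both the sample codes and the helper `non_english_distribution` are made up for illustration. It does not reproduce the project's actual counts.

```python
from collections import Counter


def non_english_distribution(lang_codes):
    """Count apps per language, skipping English, most frequent first."""
    counts = Counter(code for code in lang_codes if code != "en")
    return counts.most_common()


# Hypothetical detected language codes, one per app description.
sample_codes = ["en", "de", "fr", "en", "de", "es", "de", "fr"]
distribution = non_english_distribution(sample_codes)
print(distribution)  # [('de', 3), ('fr', 2), ('es', 1)]
```

The resulting list of (language, count) pairs is already sorted by frequency, which is a convenient shape for feeding a bar or pie chart.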


Results of the LDA model (produced with Gensim) for 66 topics: