Empirical Software Engineering (2022) 27: 113

How do Android developers improve non-functional properties of software?

James Callan1 · Oliver Krauss2 · Justyna Petke1 · Federica Sarro1

Accepted: 11 February 2022 / Published online: 30 May 2022
© The Author(s) 2022

Abstract
Nowadays there is an increased pressure on mobile app developers to take non-functional properties into account. An app that is too slow or uses much bandwidth will decrease user satisfaction, and thus can lead to users simply abandoning the app. Although automated software improvement techniques exist for traditional software, these are not as prevalent in the mobile domain. Moreover, it is yet unknown if the same software changes would be as effective. With that in mind, we mined overall 100 Android repositories to find out how developers improve execution time, memory consumption, bandwidth usage and frame rate of mobile apps. We categorised non-functional property (NFP) improving commits related to performance to see how existing automated software improvement techniques can be improved. Our results show that although NFP improving commits related to performance are rare, such improvements appear throughout the development lifecycle. We found altogether 560 NFP commits out of a total of 74,408 commits analysed. Memory consumption is sacrificed most often when improving execution time or bandwidth usage, although similar types of changes can improve multiple non-functional properties at once. Code deletion is the most frequently utilised strategy except for frame rate, where increase in concurrency is the dominant strategy. We find that automated software improvement techniques for the mobile domain can benefit from the addition of SQL query improvement, caching and asset manipulation.
Moreover, we provide a classifier which can drastically reduce the manual effort to analyse NFP improving commits.

Keywords Non-functional property optimisation · Android optimisation · Mining Android · Execution time · Bandwidth · Frame rate · Memory

Communicated by: Tse-Hsun (Peter) Chen, Cor-Paul Bezemer, André van Hoorn, Catia Trubiani and Weiyi Shang

This article belongs to the Topical Collection: Software Performance

James Callan
[email protected]

Extended author information available on the last page of the article.

1 Introduction

Several studies have shown that non-functional performance characteristics have a strong impact on user satisfaction (Inukollu et al. 2014; Lim et al. 2014; Khalid et al. 2014; Liu et al. 2014; Ferrucci et al. 2015; Kim et al. 2015; Martins et al. 2018; Gao et al. 2018, 2020). Performance issues are especially important in the mobile, resource-constrained domain. Banerjee and Roychoudhury (2017) analysed 170,000 user reviews of mobile applications, and classified reasons for user downvotes. Three out of five identified categories related to non-functional properties. Khalid et al. (2014)'s study of iOS applications also showed that unresponsive, resource-heavy applications, and those with network-related issues were among the top most frequent sources of complaints. Issues related to non-functional software properties lead not only to downvotes, but also to a large number of uninstallations (Banerjee and Roychoudhury 2017), rendering them a priority in the mobile software development process.

Most of the aforementioned work provides analysis on what the issues are, rather than how these can be solved. Only recently Hort et al. (2021) carried out a comprehensive survey of existing research on performance optimisation for mobile applications. These included offloading, antipattern detection, refactoring, prefetching, choice of a different programming language, reordering of I/O calls and changes to hardware components (Hort et al. 2021). Their survey shows that tool support for automated improvement of non-functional properties has been scarce, mainly targeting energy consumption, and if available, targeting one property of choice, sometimes having adverse effects on the others (Hort et al. 2021).

Among automated approaches for improvement of non-functional properties for traditional software, search-based approaches have gained popularity in the last few years. For example, Wu et al.
(2015) automatically applied changes directly to source code to produce a Pareto front of program variants that improve memory consumption and runtime efficiency. Basios et al. (2018) and Burles et al. (2015) both modified the data structures used by software to improve non-functional properties by finding less resource-demanding combinations. These and other techniques have been successful at optimising traditional software, though they are yet to be applied more widely in the mobile domain. The advantage of using search-based tools is that they are well-suited for multi-objective optimisation, something that is missing in the mobile application improvement field (Hort et al. 2021).

We posit that software repositories offer researchers a wealth of information about the behaviour and techniques used by actual developers. These can be used to find patterns that can be mimicked by search-based software engineering approaches for optimisation of non-functional software properties (Harman and Jones 2001). Although several previous studies focus on performance bugs in traditional software, such as the study by Jin et al. (2012a), only a few studies on mining performance improving commits in the Android domain exist (e.g., Das et al. 2016; Moura et al. 2015). Moreover, those do not provide fine-grained enough information to guide developers of search-based software development tooling. Previous studies were also concerned with finding general patterns across as many projects as possible, and thus employed a sampling strategy that would alleviate the expensive manual analysis cost. This leads to under-approximation of the true number of non-functional property improving changes.

To fill this gap we mined the most popular Android repositories, using single-keyword search and analysing all returned results, to find patterns that could be utilised in search-based automated software improvement tooling.
We focus on four non-functional properties in particular: execution time, memory consumption, bandwidth usage, and frame rate. We chose these as they are most related to mobile app performance, a key issue for users, as shown by previous studies (Banerjee and Roychoudhury 2016; Hort et al. 2021).

First, we mined the repositories of the 20 most popularly downloaded mobile applications, according to Fossdroid, and manually examined the resultant 3,132 commits, finding 229 were actually NFP improving ones.1 Although this process should give us a good overview of non-functional property improving strategies for performance, it only allows for analysis of a relatively small number of repositories. However, the detailed analysis provides us with a corpus of data on which we can train a classifier that could help gather and analyse more data. Therefore, we devised such a classifier and analysed a further randomly selected set of 80 repositories, manually analysing 495 commits found, which added 331 non-functional property improving commits to our dataset. We categorised all the commits found, to help us identify emerging patterns. We also report on whether current automated improvement tools already allow for such transformations to be found, and if not, whether such tools could be extended to provide new, useful software transformations. Finally, we examined features of the repositories we analysed. This is to provide recommendations for software developers as to the types of mobile applications in which non-functional improvements are likely to be found.

Our results show that non-functional property improvements to app performance are rare: from 74,408 commits mined across 100 repositories, only 560 were deemed to improve execution time, memory consumption, frame rate or bandwidth (229 identified by manual search and 331 by using a classifier). However, we can still draw interesting conclusions about their nature. In particular:

– In 10.7% of cases, developers were willing to sacrifice one non-functional property over another, while in 6.5% of cases developers were able to improve upon multiple properties at once. This shows the need for tooling that can handle multi-objective optimisation.
– The strongest indicators for the number of non-functional property improving commits in a repository were the total number of commits, number of contributors and number of stars.
– Current search-based improvement tooling mimics 5 out of the 23 non-functional improvement strategies found.
– Future automated techniques for improvement of non-functional properties could be enhanced by incorporating automated caching, SQL query, and image transformations. We propose detailed transformation patterns to aid researchers and developers in the design and adoption of such strategies.

Overall our results provide recommendations for software engineers, aiming to provide better tooling for automated software improvement; and for researchers, providing patterns of how developers improve mobile applications' non-functional properties related to mobile app performance, as well as a classifier that can help with future mining studies in this domain. All our data and scripts are freely available to allow for reproduction, replication and extension of our work.

The rest of this paper is organised as follows: Section 2 describes our methodology; Section 3 presents our results; in Section 4 we discuss implications of our study for software engineering research and practice; Section 5 presents related work; Section 6 presents threats to validity; Section 7 concludes the paper. Appendix A contains additional materials.

1 In comparison, Moura et al. (2015) found 371 energy-aware commits from a sample of 2,189 curated commits. It should be noted these span different numbers of repositories, and different keywords, corresponding to relevant non-functional software properties.

2 Methodology

In order to answer how Android developers improve the performance-related non-functional properties of software (performance NFPs), and how we can use this knowledge to potentially devise new software transformations for tools for automated software improvement, we mine open-source Android projects for commits that improve four non-functional software properties (NFPs): execution time, memory consumption, bandwidth usage, and frame rate. Along with energy efficiency, previous research shows these are often found in user reviews (Banerjee and Roychoudhury 2017; Khalid et al. 2014), yet have not been extensively tackled in the literature (Hort et al. 2021).2

We aim to answer the following research questions:

RQ1 With what prevalence do developers improve performance NFPs of Android apps?
NFPs of mobile applications impact user satisfaction, however it is not clear to what extent Android developers change their code to improve performance NFPs. The aim of this question is twofold: understanding if there exist NFP commits in Android open-source repositories to extract general patterns from, and understanding their characteristics.

RQ2 How and when do Android developers improve an app's performance NFPs?
We want to know at which stage in software development performance NFP improving commits occur, whether these are considered as standalone improvements, and whether these improve multiple NFPs or prioritise one whilst possibly sacrificing another. These should give us an overview of current Android development practice with respect to performance NFP improvement.

RQ3 What type of code changes do Android developers make to improve an app's performance NFPs?
We want to also investigate what sort of changes developers make to source code to improve its performance NFPs.
Examining these changes will allow us to compare current search-based improvement techniques to real-world commits and make suggestions for how these techniques can be improved.

To answer these research questions we have manually curated a corpus of 560 non-functional property improving commits, which were collected by analysing a total of 74,408 commits mined from 100 open-source Android repositories. In the following section we explain our collection procedure in detail. We have made this corpus publicly available to allow for replication and extension of our work.

2.1 Overview of Methodology

Below we present the methodology used to create our corpus. It consists of three steps:

Keyword mining: In this step we collect a set of performance NFP improving commits by filtering them first based on keywords and then by manual analysis.

Classifier mining: In this step we expand this set by using a classifier trained on the commit messages gathered in the previous step.

2 We omit energy commits, as very similar studies targeting these have already been conducted (e.g., Moura et al. 2015), with Banerjee and Roychoudhury (2016) already implementing a refactoring tool for energy bugs.

Categorisation: In this step we attempt to manually group the commits into categories. These categories allow us to find common patterns used to improve the four non-functional properties of interest: runtime, memory consumption, bandwidth and frame rate.

2.2 Corpus

In the first step, we mined the twenty most popularly downloaded Android applications according to Fossdroid,3 and extracted a total of 28,028 commits. As it would have been infeasible to manually inspect such a large set to identify NFP improving commits, we adopted a semi-automatic approach that examines every commit message based on keyword search (as detailed in Section 2.3). This led us to a total of 3,132 commits, which were then manually analysed in order to label them as performance NFP improving commits or not. A final set of 229 NFP improving commits was deemed to improve one of the four non-functional properties of interest. We note that in previous work Moura et al. (2015) opted for two-word key-phrases rather than keywords to massively narrow down the number of commits to manually analyse. Das et al. (2016) only mined commits from the main modules of applications, missing any changes to back-end modules. We opted not to take these actions, to avoid missing possibly useful software transformations, by mining all commits with generic keywords.

In the second step, we leverage this curated set of NFP improving commits to train a classifier able to automatically identify such commits. This allowed us to automatically analyse a much larger set of commits (46,378), mined from 80 randomly selected F-Droid repositories, and filter out irrelevant (i.e., not NFP improving) commits with a precision of 95%, as detailed in Section 2.4. Specifically, we used the classifier to automatically identify 331 additional NFP improving commits by randomly sampling F-Droid.
We initially found a total of 495 commits, which were then manually validated by two of the authors to make sure they improve any of the four non-functional properties of interest. This manual check led to the identification of 331 performance NFP improving commits.

The final size of our manually curated corpus thus consists of 560 NFP improving commits (229 from the first and 331 from the second step). We then manually categorised these commits by the type of change which was made to improve the NFP, by analysing their commit messages and diffs. This resulted in 23 categories of improvement types being found.

Next, we detail how we mine NFP improving commits by using keyword search (Section 2.3) and the classifier (Section 2.4), as well as how we manually validate the NFP improving commits and categorise them (Section 2.5).

2.3 Step 1: Identifying NFP Improving Commits Based on Keyword Search

We mined 28,028 commits from the twenty most popularly downloaded applications according to Fossdroid (as of 18/03/2020), a website which offers an alternative user interface to the standard F-Droid web page. These applications are diverse in nature (e.g., gaming applications, streaming applications, browsers) and size, having between 13 and 6,157 commits. Details of each application repository can be found in Tables 1 and 5. Whilst the repositories of these applications are hosted on a variety of platforms (GitHub, GitLab, etc.),

Table 1 Properties of repositories mined based on keyword search
Repository | Type of App | Comm. | Stars | Age (days) | Contrib. | Forks | KLoC
[Per-repository values are garbled in the extracted text; the repositories include Aeons End, AFH Downloader, Android CUPS Print, ANNO 1404, Editor, F-Droid, FOSS Browser, Frozen Bubble, G-Droid, Gloomy Dungeons, Mighty Knight and NewPipe.]

all repositories use the git version control system. The git log command was used to generate a list of commit messages, which was then parsed and searched for sets of relevant commits that suggest improvements to the following four non-functional properties:

– Time: Decreasing the amount of time needed for computation.
– Memory Consumption: Decreasing the amount of RAM used.
– Bandwidth Usage: Reduction of the load on the network.
– Frame Rate: Decreasing frame rendering and display time.

In order to identify relevant performance NFP improving commits, each repository was mined by searching every commit message for a series of keywords (or parts of words in some cases, e.g. "effic" to capture all words similar to "efficient", "inefficient", etc.) associated with the particular property, following a three-step process, as described below, and then manually validated.

Initial Selection An initial set of keywords was generated by a combination of our knowledge of relevant terminology (which we have gained by writing NFP improving commits ourselves) and the examination of the language used in commit messages written by others. We then augmented this set with 15 keywords4 used in previous work conducting similar analysis (Jin et al. 2012b; Mazuera-Rozo et al. 2020; Das et al. 2016; Linares-Vásquez et al. 2015; Chen et al. 2019b). Any commit containing any of these keywords was selected for manual evaluation. Every selected commit message was manually evaluated to see if it actually suggests that an NFP has been improved or not. This approach aims to highlight as many commits as possible that could improve non-functional properties and therefore results in many false positives being manually evaluated. This helps to reduce the number of false negatives and allows us to detect as many relevant commits as possible.

4 The keywords taken from previous work were: 'wait', 'tim', 'stuck', 'react', 'latenc', 'throughput', 'suboptimal', 'bloat', 'utilization', 'ANR', 'OOM', 'bottleneck', 'hot-spot', 'length', 'consumption'.

Keyword Expansion Synonyms for all keywords were searched for using SEThesaurus (Chen et al. 2019a), a natural language processing (NLP) tool for finding synonyms in an SE context. Terminology found during manual evaluation of commits which suggests improvement but was not present in the initial keyword set was added to a new keyword set. Another search took place with the new keywords in the same way as the original search. The keywords used can be found in Table 2.

Keyword Validation To validate the keywords we conducted a text analysis by tokenising and lemmatising all words over all commit messages. The resulting 12,230 tokens were grouped according to whether their commits were relevant (229), irrelevant (3,132 - 229), or filtered out (28,028 - 3,132), based on the keywords used. These tokens were then ranked by how often they occur in each group. From these rankings we attempted to identify possible keywords that we may have missed. First, we removed all tokens that occur fewer than 10 times in the commits identified as improving performance NFPs: this resulted in the identification of 76 tokens which could potentially be used as keywords. Then we further filtered out tokens by focusing only on those that occur in the relevant group as often as or more often than in the irrelevant group. This step allowed us to filter out words such as 'and', which are common in all commits. Of the 6 remaining tokens, three were already included as keywords (i.e., memory, faster, and leak). The remaining three were save, reduce and low. These three terms may be considered as additional keywords to identify additional NFP commits, yet their use would increase the already high manual effort needed to inspect the selected commits. In fact, in our study, these three keywords (save, reduce, low) relate to 111 filtered-out commits. After manual inspection, we found that of these 111 commits only a single one could be identified as relevant; this commit also contains the word 'mem' instead of 'memory', suggesting that keyword search may miss commits that use abbreviations like this or contain misspellings of keywords. However, as most commits contain more than one keyword, the keyword set used herein can capture the majority of those commits too. As only three more relevant keywords were identified out of the 12,230 unique tokens present in the relevant commits, and they led to the identification of only one additional relevant commit out of 111, we are confident that the set of keywords used to conduct our study is comprehensive and effective.

Table 2 Keywords used to search for commit types, from the Initial Selection and Keyword Expansion stages. Note that extensions of keywords are also captured during search, e.g., speeding, performance, and others
Property | Keywords
Execution Time | speed, time, perform, slow, fast, optimi, wait, tim, stuck, react, latenc, throughput, suboptimal, utilization, ANR, bottleneck, hot-spot, length, effic
Memory | memory, leak, size, cache, buffer, bloat, consumption, OOM, space, storage
Bandwidth | network, bandwidth, size, download, upload, socket
Frame Rate | frame, lag, respons, latenc, hang
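The keyword filtering step can be sketched as follows. This is an illustrative sketch, not the authors' actual mining script: the keyword sets below are a representative subset of Table 2, and the helper names are our own. In practice the commit messages would be produced by the git log command (e.g. `git log --pretty=format:%s`) and fed into a filter like this one.

```python
# Substring keywords per NFP, a subset of Table 2. Partial words such as
# "optimi" deliberately match "optimise", "optimization", etc.
KEYWORDS = {
    "execution_time": ["speed", "perform", "slow", "fast", "optimi", "wait",
                       "tim", "stuck", "react", "latenc", "throughput"],
    "memory": ["memory", "leak", "cache", "buffer", "bloat", "oom"],
    "bandwidth": ["network", "bandwidth", "download", "upload", "socket"],
    "frame_rate": ["frame", "lag", "respons", "hang"],
}

def match_properties(message):
    """Return the set of NFPs whose keywords occur in a commit message."""
    text = message.lower()
    return {prop for prop, kws in KEYWORDS.items()
            if any(kw in text for kw in kws)}

def filter_commits(messages):
    """Keep only commits whose message hits at least one keyword,
    pairing each kept message with the properties it suggests."""
    return [(m, props) for m in messages if (props := match_properties(m))]

candidates = filter_commits([
    "Fix memory leak in image viewer",
    "Update translations",
    "Speed up database queries",
])
# "Update translations" hits no keyword and is filtered out; the other
# two are selected for manual evaluation.
```

Commits surviving this filter would then go to the manual validation stage described above; the filter is intentionally permissive, trading false positives for a low false-negative rate.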

Table 3 Decision tree classification of NFP improving commits allows an accurate classification (0.80 recall) with a tolerable level of irrelevant commits mixed in (0.73 precision)
Class | Precision | Recall | F1
Relevant | 0.73 | 0.80 | 0.76
Irrelevant | 0.95 | 0.92 | 0.93

Furthermore, the first author of this paper manually analysed the resultant commit set. Some commit messages were found to be ambiguous as to whether or not they offer any improvement. Developers sometimes write commit messages about what they have done but not why they have done it. Such commits were also independently analysed by another author. If the second author also found the commit to be ambiguous and not explicitly labelled as an improvement, it was discarded. We also discarded those commits which were merges with a single child commit, as they were considered duplicates. We refer to the final set of manually curated commits gathered in this step as the "manual set".

2.4 Step 2: Identifying NFP Improving Commits Based on Automated Classification

While in the previous step we used keyword search to narrow down the number of commits for manual investigation, in this step we explore the use of an automated classifier, which leverages the manual set obtained from Step 1.

The classifier we propose has been trained with the classified data from Step 1, i.e., all commits manually excluded after the keyword search are labelled as irrelevant, while all commits included are labelled as relevant.5 In addition, we have included 368 commits manually identified as relevant to execution time in previous work (Mazuera-Rozo et al. 2020) in the relevant commit set. We train the classifier using only the commit messages of the commits.6

In order to search for an accurate prediction model, we investigated a total of 20 classification algorithms exploiting 6 different settings for feature selection. The settings were derived from the featurization of text tokens via TF/IDF, Bag of Words (Yamauchi et al.
2014), and an adapted version of Bag of Words where only words occurring with a discriminative significance in either the irrelevant or relevant group were used in the feature vector. Next, we present only the best result of these attempts, while more information about the training of the classifier can be found in Appendix A. The best classifier was achieved via stemming as a pre-processing step and TF/IDF for featurization, using a Decision Tree classifier. We assessed its effectiveness via cross-validation by using 10 hold-out repetitions (80%/20% train/test split), each time using a different seed. The results show a good level of classification with a precision of 73% and recall of 80% for the relevant class (see Table 3).

In order to show the reduction in manual effort required when using our classifier, we ran it on two datasets. Table 4 shows a comparison of commits identified via keyword

5 We decided to group all relevant commits into one single group, as preliminary analysis showed that attempting to classify the commits into multiple classes (i.e., execution time, memory, bandwidth and frame rate) produces classes that are too small for building an accurate classification model (the Recall in all groups was less than 0.1).

6 We considered also using issue messages to identify commits. However, the analysis of our keyword-mined (KM) data set showed that only 13% of commits had associated issues. Most (52%) of those issues were associated with 10 or more commits, meaning that only a small fraction of their messages and comments would be related to the commit that we are interested in.
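The training and evaluation setup described above (stemming, TF/IDF featurization, a Decision Tree, and 10 repeated 80%/20% hold-out splits with different seeds) can be sketched as follows. This is a minimal sketch assuming scikit-learn is available; the toy commit messages and the crude suffix-stripping stand-in for a real stemmer (e.g. Porter) are illustrative only, not the paper's actual data or pre-processing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

def stem(text):
    # Crude stand-in for a real stemmer: trims common English suffixes.
    return " ".join(w.rstrip("es").rstrip("ing") for w in text.lower().split())

# Toy labelled data standing in for the manually labelled commit messages.
messages = ["fix memory leak", "speed up parsing", "reduce network usage",
            "update readme", "bump version", "fix typo in docs"] * 10
labels = [1, 1, 1, 0, 0, 0] * 10  # 1 = NFP improving, 0 = irrelevant

# Stemming as pre-processing, then TF/IDF featurization.
X = TfidfVectorizer().fit_transform([stem(m) for m in messages])

# 10 hold-out repetitions with an 80%/20% split, each with a different seed.
precisions, recalls = [], []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=seed, stratify=labels)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    precisions.append(precision_score(y_te, pred, zero_division=0))
    recalls.append(recall_score(y_te, pred, zero_division=0))

mean_precision = sum(precisions) / len(precisions)
mean_recall = sum(recalls) / len(recalls)
```

Averaging precision and recall over the repetitions, as done here, gives the kind of relevant-class figures reported in Table 3.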

Table 4 Comparing our keyword search to our classification-based approach on two datasets. The 368 number of relevant commits for the Mazuera-Rozo et al. dataset was taken from their work /master/bug-fixing-commits-performance.csv. We note that the authors report 380 in their paper, but 11 commits don't exist anymore
Columns: Our Keyword Search, Classifier; rows: Total, Relevant, Additional, for our dataset and the Mazuera-Rozo et al. (2020) dataset. [Cell values are garbled in the extracted text.]

search or via the classifier. For the dataset from Mazuera-Rozo et al. (2020), we applied the keywords from Table 2 after compiling the git logs from the repositories used in their dataset. The table shows that keyword search requires a much higher manual effort, as the search returns several thousand commits (3,132 for our dataset and 32,308 for Mazuera-Rozo et al.) containing only a few relevant commits (229 and 368, respectively). The classifier returns only 669 commits, with 219 of the manually identified ones contained (only 10 missed), and an additional 440 commits that may be relevant but were filtered out by the keyword search.

As the cross-validation confirms the effectiveness of the classifier, we re-train it on the entire available dataset in order to classify performance NFP improving commits on unseen data, thus further validating our classifier in a real usage scenario, and extending our corpus of NFP improving commits with the commits correctly classified as such.

To this end, we randomly selected 80 repositories from F-Droid and used the classifier to automatically classify all 46,378 commits extracted from these repositories.7 Details of the repositories are provided in Table 5. The classifier identified 495 relevant commits. Two of the authors manually analysed these commits, as they did in Step 1, to check whether the commits classified as relevant are actually NFP improving commits, i.e., true positives.
They found that only 164 commits were false positives, giving a manually evaluated precision of 66.87% for this classifier, and 331 commits were added to our corpus.8

To further verify our classifier, we evaluated its performance on 5 randomly selected repositories from the set that was mined with the classifier. We performed keyword mining (KM) on these repositories in order to identify the false negatives of the classifier mining (CM). Of the 5 repositories selected, 3 were found by both CM and KM to contain no performance NFP improving commits. In the repositories where commits were found, one was found to have 5 performance NFP improving commits compared to 3 found by the classifier, and in the other the same 3 commits were found by both approaches. These repositories are all small, yet representative of many of the repositories which were mined. In order to evaluate the classifier on a larger repository with many commits, we also ran it on the Koreader repository, where the most CM commits were found (147 overall, see Table 7). We manually analysed all commits

7 We had to set a limit on the number of repositories due to the manual effort required to analyse the precision of the classifier.

8 We note that a lot of these were small repositories, as the 40 repositories in which NFP improving commits were found had altogether 39,420 commits, while the other 39 had altogether 6,958 commits.
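The manually evaluated precision reported above can be checked arithmetically, taking the 495 classifier-flagged commits reported earlier (a sketch; all numbers are from the text above):

```python
# Precision of the classifier after manual validation:
# precision = true positives / all commits flagged as relevant.
flagged = 495          # commits the classifier marked as relevant
false_positives = 164  # rejected during manual validation
true_positives = flagged - false_positives  # commits added to the corpus

precision = true_positives / flagged
# true_positives == 331, precision ≈ 0.6687 (i.e. 66.87%)
```

This also confirms the consistency of the 331 commits added to the corpus with the 66.87% precision figure.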

Table 5 Properties of classifier-mined repositories
Name | Commits | Stars | Age (days) | ...
[Per-repository values are garbled in the extracted text; the rows include Audioanchor among others.]
