-
SMOOTHIE: A Theory of Hyper-parameter Optimization for Software Analytics
Authors:
Rahul Yedida,
Tim Menzies
Abstract:
Hyper-parameter optimization is the black art of tuning a learner's control parameters. In software analytics, a repeated result is that such tuning can result in dramatic performance improvements. Despite this, hyper-parameter optimization is often applied rarely or poorly in software analytics, perhaps because the CPU cost of exploring all those parameter options can be prohibitive.
We theorize that learners generalize better when the loss landscape is ``smooth''. This theory is useful since the influence on ``smoothness'' of different hyper-parameter choices can be tested very quickly (e.g. for a deep learner, after just one epoch).
To test this theory, this paper implements and tests SMOOTHIE, a novel hyper-parameter optimizer that guides its optimizations via considerations of ``smoothness''. The experiments of this paper test SMOOTHIE on numerous SE tasks including (a) GitHub issue lifetime prediction; (b) detecting false alarms in static code warnings; (c) defect prediction; and (d) a set of standard ML datasets. In all these experiments, SMOOTHIE out-performed state-of-the-art optimizers. Better yet, SMOOTHIE ran 300% faster than the prior state of the art. We hence conclude that this theory (that hyper-parameter optimization is best viewed as a ``smoothing'' function for the decision landscape) is both theoretically interesting and practically very useful.
To support open science and other researchers working in this area, all our scripts and datasets are available on-line at https://github.com/yrahul3910/smoothness-hpo/.
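To make the core idea concrete, here is a minimal sketch of a smoothness-guided hyper-parameter screen; the toy logistic model and the variance-of-loss proxy are assumptions for illustration, not the SMOOTHIE implementation:

```python
# Illustration only: rank hyper-parameter candidates by a cheap "smoothness"
# proxy measured after one very short training run. NOT the SMOOTHIE code.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def one_pass_logistic(X, y, lr):
    """One SGD pass (one 'epoch') of a tiny logistic model."""
    w = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        w -= lr * (sigmoid(xi @ w) - yi) * xi
    return w

def loss(X, y, w):
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def smoothness_proxy(X, y, lr, trials=20, eps=0.05, seed=0):
    """Lower loss variance in a small neighbourhood ~= smoother landscape."""
    rng = np.random.default_rng(seed)
    w = one_pass_logistic(X, y, lr)
    return np.std([loss(X, y, w + eps * rng.standard_normal(w.shape))
                   for _ in range(trials)])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
    for lr in (0.001, 0.01, 0.1, 1.0):            # candidate hyper-parameters
        print(f"lr={lr:<6} smoothness proxy = {smoothness_proxy(X, y, lr):.4f}")
```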
Submitted 17 January, 2024;
originally announced January 2024.
-
Mining Temporal Attack Patterns from Cyberthreat Intelligence Reports
Authors:
Md Rayhanur Rahman,
Brandon Wroblewski,
Quinn Matthews,
Brantley Morgan,
Tim Menzies,
Laurie Williams
Abstract:
Defending against cyberattacks requires practitioners to operate on high-level adversary behavior. Cyberthreat intelligence (CTI) reports on past cyberattack incidents describe the chain of malicious actions with respect to time. To avoid repeating cyberattack incidents, practitioners must proactively identify and defend against recurring chains of actions, which we refer to as temporal attack patterns. Automatically mining the patterns among actions provides structured and actionable information on the adversary behavior of past cyberattacks. The goal of this paper is to aid security practitioners in prioritizing and proactively defending against cyberattacks by mining temporal attack patterns from cyberthreat intelligence reports. To this end, we propose ChronoCTI, an automated pipeline for mining temporal attack patterns from CTI reports of past cyberattacks. To construct ChronoCTI, we build a ground-truth dataset of temporal attack patterns and apply state-of-the-art large language models, natural language processing, and machine learning techniques. We apply ChronoCTI to a set of 713 CTI reports, where we identify 124 temporal attack patterns, which we categorize into nine pattern categories. We find that the most prevalent pattern category is tricking victim users into executing malicious code to initiate the attack, followed by bypassing the anti-malware system in the victim network. Based on the observed patterns, we advocate that organizations train users in cybersecurity best practices, introduce immutable operating systems with limited functionality, and enforce multi-user authentication. Moreover, we advocate that practitioners leverage the automated mining capability of ChronoCTI and design countermeasures against the recurring attack patterns.
Submitted 3 January, 2024;
originally announced January 2024.
-
Trading Off Scalability, Privacy, and Performance in Data Synthesis
Authors:
Xiao Ling,
Tim Menzies,
Christopher Hazard,
Jack Shu,
Jacob Beel
Abstract:
Synthetic data has recently been widely applied in the real world. One typical example is the creation of synthetic data for privacy-sensitive datasets; in this scenario, synthetic data substitutes for the real data containing private information and is released for public testing of machine learning models. Another typical example is over-sampling of imbalanced data, where synthetic data is generated in the region of the minority samples to balance the positive and negative ratio when training machine learning models. In this study, we concentrate on the first example and introduce (a) the Howso engine and (b) our proposed random-projection-based synthetic data generation framework. We evaluate these two algorithms on privacy preservation and accuracy, and compare them to two state-of-the-art synthetic data generation algorithms, DataSynthesizer and Synthetic Data Vault. We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score. On the other hand, our proposed random-projection-based framework generates synthetic data with the highest accuracy score and scales the fastest.
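As a rough illustration of what a random-projection-based generator might look like, the sketch below projects real data into a random low-dimensional space, perturbs it there, and maps it back via the pseudo-inverse; these steps are assumptions for illustration, not the paper's exact framework:

```python
# Illustrative sketch of random-projection-style data synthesis
# (an assumption for illustration; not the paper's exact framework).
import numpy as np

def synthesize(real, n_synth, k=None, noise=0.1, seed=0):
    """Project real data to a random low-dimensional space, perturb it there,
    then map back with the pseudo-inverse of the projection."""
    rng = np.random.default_rng(seed)
    n, d = real.shape
    k = k or max(2, d // 2)
    P = rng.standard_normal((d, k)) / np.sqrt(k)     # random projection matrix
    low = real @ P                                   # shape (n, k)
    idx = rng.integers(0, n, size=n_synth)           # resample projected rows
    low_synth = low[idx] + noise * low.std(axis=0) * rng.standard_normal((n_synth, k))
    return low_synth @ np.linalg.pinv(P)             # back to the original d dimensions

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    real = rng.normal(loc=[0, 5, -2, 1], scale=[1, 2, 0.5, 3], size=(500, 4))
    synth = synthesize(real, n_synth=500)
    print("real  means:", real.mean(axis=0).round(2))
    print("synth means:", synth.mean(axis=0).round(2))
```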
Submitted 8 December, 2023;
originally announced December 2023.
-
Partial Orderings as Heuristic for Multi-Objective Model-Based Reasoning
Authors:
Andre Lustosa,
Tim Menzies
Abstract:
Model-based reasoning is becoming increasingly common in software engineering. The process of building and analyzing models helps stakeholders to understand the ramifications of their software decisions. But complex models can confuse and overwhelm stakeholders when those models have too many candidate solutions. We argue here that a technique based on partial orderings lets humans find acceptable solutions via a binary chop needing $O(\log N)$ queries (or less). This paper checks the value of this approach via the iSNEAK partial ordering tool. Pre-experimentally, we were concerned that (a) our automated methods might produce models that were unacceptable to humans; and that (b) our human-in-the-loop methods might actually overlook significant optimizations. Hence, we checked the acceptability of the solutions found by iSNEAK via a human-in-the-loop double-blind evaluation study of 20 Brazilian programmers. We also checked whether iSNEAK misses significant optimizations (in a corpus of 16 SE models with up to 1,000 attributes) by comparing it against two rival technologies: the genetic algorithms preferred by the interactive search-based SE community, and the sequential model optimizers developed by the SE configuration community~\citep{flash_vivek}. iSNEAK's solutions were found to be human acceptable (and those solutions took far less time to generate, with far fewer questions to any stakeholder). Significantly, our methods work well even for multi-objective models with competing goals (in this work we explore models with four to five goals). These results motivate more work on partial ordering for many-goal model-based problems.
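The $O(\log N)$ claim is easiest to see with a toy binary chop over candidate solutions, where a stand-in "oracle" plays the role of the human stakeholder; this is an illustration only and does not reproduce iSNEAK's partial-ordering heuristics:

```python
# Toy binary chop over candidate solutions: each question halves the pool,
# so roughly log2(N) questions suffice. Illustration only, not iSNEAK.
import random

def binary_chop(candidates, ask, seed=0):
    """Split the pool in two, show one representative of each half to the
    oracle, keep the preferred half, and repeat until one candidate remains."""
    rng = random.Random(seed)
    pool = list(candidates)
    questions = 0
    while len(pool) > 1:
        rng.shuffle(pool)
        half = len(pool) // 2
        left, right = pool[:half], pool[half:]
        questions += 1
        pool = left if ask(left[0], right[0]) else right
    return pool[0], questions

if __name__ == "__main__":
    rng = random.Random(1)
    configs = [{"id": i, "utility": rng.random()} for i in range(1000)]
    oracle = lambda a, b: a["utility"] > b["utility"]   # stand-in for the human
    best, asked = binary_chop(configs, oracle)
    print(f"asked {asked} questions; picked a solution with utility {best['utility']:.3f}")
```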
Submitted 29 October, 2023;
originally announced October 2023.
-
SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning
Authors:
Xueqi Yang,
Mariusz Jakubowski,
Kelly Kang,
Haojie Yu,
Tim Menzies
Abstract:
As software projects rapidly evolve, software artifacts become more complex and the defects hidden within them become harder to identify. Emerging Transformer-based approaches, though achieving remarkable performance, struggle with long code sequences because their self-attention mechanism scales quadratically with sequence length. This paper introduces SparseCoder, an innovative approach that incorporates sparse attention and a learned token pruning (LTP) method (adapted from natural language processing) to address this limitation. Extensive experiments on a large-scale dataset for vulnerability detection demonstrate the effectiveness and efficiency of SparseCoder, whose cost scales linearly rather than quadratically on long code sequences, in comparison to CodeBERT and RoBERTa. We further achieve a 50% FLOPs reduction with a negligible performance drop of less than 1% compared to the Transformer baseline by leveraging sparse attention. Moreover, SparseCoder goes beyond making "black-box" decisions by elucidating the rationale behind those decisions: code segments that contribute to the final decision can be highlighted with importance scores, offering an interpretable, transparent analysis tool for the software engineering landscape.
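To illustrate the flavour of token pruning, here is a toy, threshold-based variant over random embeddings; it is an assumption for illustration, not the SparseCoder implementation (which learns the pruning during training):

```python
# Toy attention-score-based token pruning: tokens that receive below-average
# attention are dropped before the next (quadratic-cost) layer. Illustration
# of the general idea only; not the SparseCoder code.
import numpy as np

def attention_weights(Q, K):
    """Standard scaled dot-product attention weights (rows sum to 1)."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def prune_tokens(X):
    """Keep only tokens whose average received attention exceeds uniform."""
    A = attention_weights(X, X)          # (n_tokens, n_tokens)
    importance = A.mean(axis=0)          # attention received by each token
    keep = importance >= 1.0 / X.shape[0]
    return X[keep], importance, keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((512, 64))   # 512 code tokens, 64-dim embeddings
    pruned, importance, keep = prune_tokens(tokens)
    print(f"kept {keep.sum()} of {len(keep)} tokens for the next layer")
```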
Submitted 10 October, 2023;
originally announced October 2023.
-
Model Review: A PROMISEing Opportunity
Authors:
Tim Menzies
Abstract:
To make models more understandable and correctable, I propose that the PROMISE community pivot to the problem of model review. Over the years, there have been many reports that very simple models can perform exceptionally well. Yet where are the researchers asking "say, does that mean that we could make software analytics simpler and more comprehensible?" This is an important question, since humans often have difficulty accurately assessing complex models (leading to unreliable and sometimes dangerous results). Prior PROMISE results have shown that data mining can effectively summarize large models/data sets into simpler and smaller ones. Therefore, the PROMISE community has the skills and experience needed to redefine, simplify, and improve the relationship between humans and AI.
Submitted 6 September, 2023; v1 submitted 3 September, 2023;
originally announced September 2023.
-
On the Benefits of Semi-Supervised Test Case Generation for Simulation Models
Authors:
Xiao Ling,
Tim Menzies
Abstract:
Testing complex simulation models can be expensive and time-consuming. Current state-of-the-art methods that explore this problem are fully supervised; i.e. they require that all examples are labeled. On the other hand, the GenClu system (introduced in this paper) takes a semi-supervised approach; i.e. (a) only a small subset of information is actually labeled (via simulation) and (b) those labels are then spread across the rest of the data. When applied to five open-source simulation models of cyber-physical systems, GenClu's test generation can be multiple orders of magnitude faster than the prior state of the art. Further, when assessed via mutation testing, tests generated by GenClu were as good as or better than anything else tested here. Hence, we recommend semi-supervised methods over prior methods (evolutionary search and fully-supervised learning).
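The general semi-supervised recipe (label a few examples via the expensive simulator, then spread those labels to the rest) can be sketched with off-the-shelf tools; the synthetic data and LabelSpreading model below are stand-ins for illustration, not the GenClu algorithm:

```python
# Generic semi-supervised recipe: label ~2% of the rows (pretend this is the
# expensive simulation step), then spread those labels to the rest.
# Illustration only; not the GenClu algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_classification(n_samples=2000, n_features=10, random_state=0)

rng = np.random.default_rng(0)
y = np.full(len(y_true), -1)                            # -1 marks "unlabeled"
labeled = rng.choice(len(y_true), size=40, replace=False)
y[labeled] = y_true[labeled]                            # the only "simulated" labels

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
unlabeled = y == -1
print("accuracy on the never-labeled rows:",
      round(accuracy_score(y_true[unlabeled], model.transduction_[unlabeled]), 3))
```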
Submitted 1 December, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised Learning for Software Analytics
Authors:
Huy Tu,
Tim Menzies
Abstract:
In many domains, there are many examples but far fewer labels for those examples; e.g. we may have access to millions of lines of source code, but only a handful of warnings about that code. In those domains, semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data. Standard SSL algorithms use ``weak'' knowledge (i.e. knowledge not based on specific SE expertise), such as co-training two learners and using confident labels from one to train the other. Another approach to SSL in software analytics is to use ``strong'' knowledge that draws on SE expertise. For example, an often-used heuristic in SE is that unusually large artifacts contain undesired properties (e.g. more bugs). This paper argues that such ``strong'' algorithms perform better than those standard, weaker, SSL algorithms. We show this by learning models from labels generated using weak SSL or our ``stronger'' FRUGAL algorithm. In four domains (distinguishing security-related bug reports; mitigating bias in decision-making; predicting issue close time; and reducing false alarms in static code warnings), FRUGAL required only 2.5% of the data to be labeled yet out-performed standard semi-supervised learners that relied on (e.g.) domain-independent graph theory concepts. Hence, for future work, we strongly recommend the use of strong heuristics for semi-supervised learning for SE applications. To better support other researchers, our scripts and data are on-line at https://github.com/HuyTu7/FRUGAL.
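As a sketch of how a "strong" SE heuristic can pseudo-label data, the example below labels unusually large artifacts as problematic and uses a handful of real labels only to pick the size threshold; the threshold search and synthetic data are assumptions for illustration, not the FRUGAL code:

```python
# Sketch of "strong heuristic" pseudo-labeling: unusually large artifacts are
# presumed buggy; the few real labels only tune the size cut-off.
# Illustration only; not the FRUGAL implementation.
import numpy as np

def pseudo_label_by_size(size, labeled_idx, labels, percentiles=range(50, 96, 5)):
    """Pick the size percentile that best agrees with the few real labels,
    then mark everything above that threshold as 'defective' (1)."""
    best_p, best_acc = None, -1.0
    for p in percentiles:
        cut = np.percentile(size, p)
        acc = ((size[labeled_idx] > cut).astype(int) == labels).mean()
        if acc > best_acc:
            best_p, best_acc = p, acc
    return (size > np.percentile(size, best_p)).astype(int), best_p

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    loc = rng.lognormal(mean=5, sigma=1, size=4000)          # artifact sizes (e.g. LOC)
    truth = (loc > np.percentile(loc, 80)).astype(int)       # hidden ground truth...
    flip = rng.random(4000) < 0.10
    truth[flip] = 1 - truth[flip]                            # ...with 10% label noise
    few = rng.choice(4000, size=100, replace=False)          # ~2.5% labeled
    pseudo, p = pseudo_label_by_size(loc, few, truth[few])
    print(f"chosen percentile: {p}; agreement with truth: {(pseudo == truth).mean():.2f}")
```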
Submitted 3 February, 2023;
originally announced February 2023.
-
Don't Lie to Me: Avoiding Malicious Explanations with STEALTH
Authors:
Lauren Alvarez,
Tim Menzies
Abstract:
STEALTH is a method for using an AI-generated model without suffering from malicious attacks (i.e. lying) or associated unfairness issues. After recursively bi-clustering the data, the STEALTH system asks the AI model a limited number of queries about class labels. STEALTH asks so few queries (one per data cluster) that malicious algorithms (a) cannot detect its operation, nor (b) know when to lie.
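A minimal sketch of the querying pattern (recursive 2-means bi-clustering, then one query per leaf cluster) appears below; the clustering depth, leaf size, and stand-in "black-box" model are assumptions for illustration, not the STEALTH implementation:

```python
# Sketch of "one query per cluster": recursively bi-cluster the data, then ask
# the (possibly malicious) model for a single label per leaf and copy it to
# every row in that leaf. Illustration only; not the STEALTH code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def leaves(X, idx, min_size=64, depth=0):
    """Recursive 2-means clustering; yields the index sets of the leaf clusters."""
    if len(idx) <= min_size:
        yield idx
        return
    halves = KMeans(n_clusters=2, n_init=5, random_state=depth).fit_predict(X[idx])
    for h in (0, 1):
        yield from leaves(X, idx[halves == h], min_size, depth + 1)

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
black_box = LogisticRegression(max_iter=500).fit(X, y)     # stand-in AI model

queries, labels = 0, np.empty(len(X), dtype=int)
for leaf in leaves(X, np.arange(len(X))):
    centre = leaf[np.argmin(((X[leaf] - X[leaf].mean(axis=0)) ** 2).sum(axis=1))]
    labels[leaf] = black_box.predict(X[[centre]])[0]       # one query per leaf
    queries += 1

print(f"{queries} queries for {len(X)} rows; "
      f"agreement with querying every row: {(labels == black_box.predict(X)).mean():.2f}")
```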
Submitted 25 January, 2023;
originally announced January 2023.
-
Learning from Very Little Data: On the Value of Landscape Analysis for Predicting Software Project Health
Authors:
Andre Lustosa,
Tim Menzies
Abstract:
When data is scarce, software analytics can make many mistakes. For example, consider learning predictors for open-source project health (e.g. the number of closed pull requests in twelve months' time). The training data for this task may be very small (e.g. five years of data, collected monthly, is just 60 rows of training data). The models generated from such tiny data sets can make many prediction errors.
Those errors can be tamed by a landscape analysis that selects better learner control parameters. Our niSNEAK tool (a) clusters the data to find the general landscape of the hyperparameters; then (b) explores a few representatives from each part of that landscape. niSNEAK is both faster and more effective than prior state-of-the-art hyperparameter optimization algorithms (e.g. FLASH, HYPEROPT, OPTUNA).
The configurations found by niSNEAK have far less error than other methods. For example, for project health indicators such as $C$ = number of commits, $I$ = number of closed issues, and $R$ = number of closed pull requests, niSNEAK's 12-month prediction errors are I=0%, R=33%, and C=47%.
Based on the above, we recommend landscape analytics (e.g. niSNEAK) especially when learning from very small data sets. This paper only explores the application of niSNEAK to project health. That said, we see nothing in principle that prevents the application of this technique to a wider range of problems.
To assist other researchers in repeating, improving, or even refuting our results, all our scripts and data are available on GitHub at https://github.com/zxcv123456qwe/niSneak
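One way to picture the two-step "map the landscape, then probe a few representatives" idea is the toy hyper-parameter search below; the random-forest learner, candidate grid, and k-means clustering are stand-ins for illustration, not the niSNEAK tool:

```python
# Toy "landscape analysis": cluster many candidate configurations, then only
# evaluate one representative per cluster on the (tiny) training data.
# Illustration only; not the niSNEAK tool.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=120, n_features=6, noise=10, random_state=0)  # tiny data

rng = np.random.default_rng(0)
candidates = np.column_stack([
    rng.integers(10, 150, 400),      # n_estimators
    rng.integers(1, 12, 400),        # max_depth
    rng.integers(2, 20, 400),        # min_samples_split
])

# (a) map the landscape of configurations...
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(candidates)

# (b) ...then evaluate only one representative per cluster.
best_cfg, best_score = None, -np.inf
for c in range(8):
    cfg = candidates[clusters == c][0]
    model = RandomForestRegressor(n_estimators=int(cfg[0]), max_depth=int(cfg[1]),
                                  min_samples_split=int(cfg[2]), random_state=0)
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_cfg, best_score = cfg, score

print("chosen configuration:", best_cfg, "cross-val score:", round(best_score, 3))
```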
Submitted 11 October, 2023; v1 submitted 16 January, 2023;
originally announced January 2023.
-
A Tale of Two Cities: Data and Configuration Variances in Robust Deep Learning
Authors:
Guanqin Zhang,
Jiankun Sun,
Feng Xu,
H. M. N. Dilum Bandara,
Shiping Chen,
Yulei Sui,
Tim Menzies
Abstract:
Deep neural networks (DNNs) are widely used in many industries such as image recognition, supply chain, medical diagnosis, and autonomous driving. However, prior work has shown that the high accuracy of a DNN model does not imply high robustness (i.e., consistent performance on new and future datasets), because the input data and external environment (e.g., software and model configurations) of a deployed model are constantly changing. Hence, ensuring the robustness of deep learning is not an option but a priority for enhancing business and consumer confidence. Previous studies mostly focus on the data aspect of model variance. In this article, we systematically summarize DNN robustness issues and formulate them in a holistic view through two important aspects, i.e., data and software configuration variances in DNNs. We also provide a predictive framework to generate representative variances (counterexamples) by considering both data and configurations for robust learning through the lens of search-based optimization.
Submitted 25 November, 2022; v1 submitted 17 November, 2022;
originally announced November 2022.
-
When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors
Authors:
Suvodeep Majumder,
Joymallya Chakraborty,
Tim Menzies
Abstract:
Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects and, even there, those methods have been tested on just a handful of projects.
This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those using 100% of the data.
That said, co-training needs to be used cautiously, since the specific choice of co-training method must be carefully selected based on a user's specific goals. Also, we warn that a commonly-used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions, while adding much to the run time (11 hours vs. 1.8 hours).
It is an open question, worthy of future work, to test if these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the codes used are available at https://github.com/ai-se/Semi-Supervised.
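For readers unfamiliar with the technique, the block below is a textbook co-training loop in which two different learners exchange their most confident pseudo-labels; the synthetic data and learner choices are arbitrary illustrations, not any of the 55 methods benchmarked in the paper:

```python
# Textbook co-training sketch: two different learners iteratively teach each
# other via confident pseudo-labels. Illustration only; not the paper's code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

X, y_true = make_classification(n_samples=3000, n_features=12, random_state=0)
rng = np.random.default_rng(0)
y = np.full(len(y_true), -1)                               # -1 == unlabeled
seeds = rng.choice(len(y_true), size=75, replace=False)    # ~2.5% real labels
y[seeds] = y_true[seeds]

learners = [LogisticRegression(max_iter=500), GaussianNB()]
for _ in range(5):                                         # a few co-training rounds
    known = y != -1
    for clf in learners:
        clf.fit(X[known], y[known])
    for clf in learners:                                   # each learner labels the pool
        pool = np.where(y == -1)[0]
        if len(pool) == 0:
            break
        proba = clf.predict_proba(X[pool])
        confident = pool[np.argsort(-proba.max(axis=1))[:50]]   # 50 most confident rows
        y[confident] = clf.predict(X[confident])

final = LogisticRegression(max_iter=500).fit(X[y != -1], y[y != -1])
print("pseudo-labels assigned:", int((y != -1).sum()) - len(seeds))
print("accuracy of the final model:", round(accuracy_score(y_true, final.predict(X)), 3))
```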
Submitted 15 February, 2024; v1 submitted 10 November, 2022;
originally announced November 2022.
-
Do I really need all this work to find vulnerabilities? An empirical case study comparing vulnerability detection techniques on a Java application
Authors:
Sarah Elder,
Nusrat Zahan,
Rui Shu,
Monica Metro,
Valeri Kozarev,
Tim Menzies,
Laurie Williams
Abstract:
CONTEXT: Applying vulnerability detection techniques is one of many tasks using the limited resources of a software project.
OBJECTIVE: The goal of this research is to assist managers and other decision-makers in making informed choices about the use of software vulnerability detection techniques through an empirical study of the efficiency and effectiveness of four techniques on a Java-based web application.
METHOD: We apply four different categories of vulnerability detection techniques - systematic manual penetration testing (SMPT), exploratory manual penetration testing (EMPT), dynamic application security testing (DAST), and static application security testing (SAST) - to an open-source medical records system.
RESULTS: We found the most vulnerabilities using SAST. However, EMPT found more severe vulnerabilities. With each technique, we found unique vulnerabilities not found using the other techniques. The efficiency of manual techniques (EMPT, SMPT) was comparable to or better than the efficiency of automated techniques (DAST, SAST) in terms of Vulnerabilities per Hour (VpH).
CONCLUSIONS: The vulnerability detection technique practitioners should select may vary based on the goals and available resources of the project. If the goal of an organization is to find "all" vulnerabilities in a project, they need to use as many techniques as their resources allow.
Submitted 2 August, 2022;
originally announced August 2022.
-
How to Find Actionable Static Analysis Warnings: A Case Study with FindBugs
Authors:
Rahul Yedida,
Hong Jin Kang,
Huy Tu,
Xueqi Yang,
David Lo,
Tim Menzies
Abstract:
Automatically generated static code warnings suffer from a large number of false alarms. Hence, developers only take action on a small percentage of those warnings. To better predict which static code warnings should not be ignored, we suggest that analysts need to look deeper into their algorithms to find choices that better suit the particulars of their specific problem. Specifically, we show here that effective predictors of such warnings can be created by methods that locally adjust the decision boundary (between actionable warnings and others). These methods yield a new high-water mark for recognizing actionable static code warnings. For eight open-source Java projects (cassandra, jmeter, commons, lucene-solr, maven, ant, tomcat, derby) we achieve perfect test results on 4/8 datasets and, overall, a median AUC (area under the true negatives, true positives curve) of 92%.
Submitted 23 December, 2022; v1 submitted 21 May, 2022;
originally announced May 2022.
-
Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)
Authors:
Rui Shu,
Tianpei Xia,
Huy Tu,
Laurie Williams,
Tim Menzies
Abstract:
Background: Most of the existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In such a paradigm, models need a large amount of labeled data to learn the useful relationships between selected features and the target class. However, such labeled data can be scarce and expensive to acquire. Goal: To help security practitioners train useful security classification models when few labeled training data and many unlabeled training data are available. Method: We propose an adaptive framework called Dapper, which optimizes 1) semi-supervised learning algorithms that assign pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the dataset's classes are highly imbalanced, Dapper also adaptively integrates and optimizes a data oversampling method called SMOTE. We use Bayesian optimization to search the large hyperparameter space of these tuning targets. Result: We evaluate Dapper with three security datasets, i.e., the Twitter spam dataset, the malware URLs dataset, and the CIC-IDS-2017 dataset. Experimental results indicate that we can use as little as 10% of the original labeled data yet achieve classification performance close to, or even better than, using 100% of the labeled data in a supervised way. Conclusion: Based on those results, we recommend using hyperparameter optimization with semi-supervised learning when dealing with shortages of labeled security data.
Submitted 2 May, 2022;
originally announced May 2022.
-
Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue
Authors:
Rui Shu,
Tianpei Xia,
Laurie Williams,
Tim Menzies
Abstract:
Background: Machine learning techniques have been widely used and demonstrate promising performance in many software security tasks such as software vulnerability prediction. However, the class ratio within software vulnerability datasets is often highly imbalanced (since the percentage of observed vulnerabilities is usually very low). Goal: To help security practitioners address class imbalance in software security data and build better prediction models from resampled datasets. Method: We introduce an approach called Dazzle, an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP with Bayesian optimization. We use Dazzle to generate minority-class samples that resample the original imbalanced training dataset. Results: We evaluate Dazzle with three software security datasets, i.e., Moodle vulnerable files, Ambari bug reports, and JavaScript function code. We show that Dazzle is practical to use and demonstrates promising improvement over existing state-of-the-art oversampling techniques such as SMOTE (e.g., with an average improvement of about 60% over SMOTE in recall across all datasets). Conclusion: Based on this study, we suggest the use of optimized GANs as an alternative method for addressing class imbalance in security vulnerability data.
Submitted 2 May, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
How to Improve Deep Learning for Software Analytics (a case study with code smell detection)
Authors:
Rahul Yedida,
Tim Menzies
Abstract:
To reduce technical debt and make code more maintainable, it is important to be able to warn programmers about code smells. State-of-the-art code smell detectors use deep learners, without much exploration of alternatives within that technology.
One promising alternative for software analytics and deep learning is GHOST (from TSE'21) that relies on a combination of hyper-parameter optimization of feedforward neural networks and a novel oversampling technique to deal with class imbalance.
The prior study from TSE'21 proposing this novel "fuzzy sampling" was somewhat limited in that the method was tested on defect prediction, but nothing else. Like defect prediction, code smell detection datasets have a class imbalance (which motivated "fuzzy sampling"). Hence, in this work we test if fuzzy sampling is useful for code smell detection.
The results of this paper show that we can achieve better than state-of-the-art results on code smell detection with fuzzy oversampling. For example, for "feature envy", we were able to achieve over 99% AUC across all our datasets, and on 8/10 datasets for "misplaced class". While our specific results refer to code smell detection, they do suggest lessons for other kinds of analytics. For example: (a) try better preprocessing before trying complex learners; (b) include simpler learners as a baseline in software analytics; and (c) try "fuzzy sampling" as one such baseline.
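The sketch below shows one plausible reading of "fuzzy" oversampling: each minority-class example is surrounded by concentric rings of synthetic copies, with fewer copies on the outer rings. The ring radii and counts are assumptions for illustration; this is not the GHOST implementation:

```python
# One plausible reading of "fuzzy" oversampling: concentric rings of synthetic
# minority-class points, with fewer points the further the ring lies from the
# original example. Illustration only; not the GHOST implementation.
import numpy as np

def fuzzy_oversample(X_min, copies_per_ring=(4, 2, 1), radius=0.1, seed=0):
    rng = np.random.default_rng(seed)
    synthetic = []
    for ring, copies in enumerate(copies_per_ring):
        r = radius * (ring + 1)                       # ring radius grows...
        for _ in range(copies):                       # ...while the copy count shrinks
            noise = rng.standard_normal(X_min.shape)
            noise *= r / np.linalg.norm(noise, axis=1, keepdims=True)
            synthetic.append(X_min + noise)
    return np.vstack(synthetic)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    minority = rng.standard_normal((20, 5))           # 20 minority-class rows
    extra = fuzzy_oversample(minority)
    print("synthetic minority rows generated:", len(extra))   # 20 * (4 + 2 + 1) = 140
```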
Submitted 27 March, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
-
An Isolated Stellar-Mass Black Hole Detected Through Astrometric Microlensing
Authors:
Kailash C. Sahu,
Jay Anderson,
Stefano Casertano,
Howard E. Bond,
Andrzej Udalski,
Martin Dominik,
Annalisa Calamida,
Andrea Bellini,
Thomas M. Brown,
Marina Rejkuba,
Varun Bajaj,
Noe Kains,
Henry C. Ferguson,
Chris L. Fryer,
Philip Yock,
Przemek Mroz,
Szymon Kozlowski,
Pawel Pietrukowicz,
Radek Poleski,
Jan Skowron,
Igor Soszynski,
Michael K. Szymanski,
Krzysztof Ulaczyk,
Lukasz Wyrzykowski,
Richard Barry
, et al. (68 additional authors not shown)
Abstract:
We report the first unambiguous detection and mass measurement of an isolated stellar-mass black hole (BH). We used the Hubble Space Telescope (HST) to carry out precise astrometry of the source star of the long-duration (t_E~270 days), high-magnification microlensing event MOA-2011-BLG-191/OGLE-2011-BLG-0462 (hereafter designated as MOA-11-191/OGLE-11-462), in the direction of the Galactic bulge. HST imaging, conducted at eight epochs over an interval of six years, reveals a clear relativistic astrometric deflection of the background star's apparent position. Ground-based photometry of MOA-11-191/OGLE-11-462 shows a parallactic signature of the effect of the Earth's motion on the microlensing light curve. Combining the HST astrometry with the ground-based light curve and the derived parallax, we obtain a lens mass of 7.1 +/- 1.3 Msun and a distance of 1.58 +/- 0.18 kpc. We show that the lens emits no detectable light, which, along with having a mass higher than is possible for a white dwarf or neutron star, confirms its BH nature. Our analysis also provides an absolute proper motion for the BH. The proper motion is offset from the mean motion of Galactic-disk stars at similar distances by an amount corresponding to a transverse space velocity of ~45 km/s, suggesting that the BH received a 'natal kick' from its supernova explosion. Previous mass determinations for stellar-mass BHs have come from radial-velocity measurements of Galactic X-ray binaries, and from gravitational radiation emitted by merging BHs in binary systems in external galaxies. Our mass measurement is the first for an isolated stellar-mass BH using any technique.
Submitted 22 July, 2022; v1 submitted 31 January, 2022;
originally announced January 2022.
-
DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt Identification using Semi-Supervised Learning
Authors:
Huy Tu,
Tim Menzies
Abstract:
Keeping track of and managing Self-Admitted Technical Debts (SATDs) is important for maintaining a healthy software project. The current active-learning SATD recognition tool requires manual inspection of 24% of the test comments, on average, to reach 90% recall. Among all the test comments, only about 5% are SATDs. Human experts are thus required to read almost five times as many comments as there are SATDs, which indicates the inefficiency of the tool. Moreover, human experts are still prone to error: 95% of the false-positive labels from previous work were actually true positives.
To solve the above problems, we propose DebtFree, a two-mode framework based on unsupervised learning for identifying SATDs. In mode 1, when the existing training data is unlabeled, DebtFree starts with an unsupervised learner to automatically pseudo-label the programming comments in the training data. In contrast, in mode 2, where labels are available for the corresponding training data, DebtFree starts with a pre-processor that identifies the comments most likely to be SATDs in the test dataset. Then, our machine learning model is employed to assist human experts in manually identifying the remaining SATDs. Our experiments on 10 software projects show that both modes yield a statistically significant improvement in effectiveness over the state-of-the-art automated and semi-automated models. Specifically, DebtFree can reduce the labeling effort by 99% in mode 1 (unlabeled training data) and up to 63% in mode 2 (labeled training data), while improving the current active learner's F1 by almost 100% (relative).
Submitted 25 January, 2022;
originally announced January 2022.
-
What Not to Test (for Cyber-Physical Systems)
Authors:
Xiao Ling,
Tim Menzies
Abstract:
For simulation-based systems, finding a set of test cases with the least cost by exploring multiple goals is a complex task. Domain-specific optimization goals (e.g. maximize output variance) are useful for guiding the rapid selection of test cases via mutation. But evaluating the selected test cases via mutation (i.e. whether they can distinguish the current program from the mutated systems) is a different goal from those domain-specific optimizations. While the optimization goals can be used to guide the mutation analysis, that guidance should be viewed as a weak indicator since, by focusing too much on the optimization goals, it can hurt the mutation-effectiveness goal. Based on the above, this paper proposes DoLesS (Domination with Least Squares Approximation), which selects minimal and effective test cases by averaging over a coarse-grained grid of the information gained from multiple optimization goals. DoLesS applies an inverted least squares approximation approach to find a minimal set of tests that can distinguish better from worse parts of the optimization goals. When tested on multiple simulation-based systems, DoLesS performs as well as, or better than, the prior state of the art, while running 80-360 times faster on average (seconds instead of hours).
Submitted 5 May, 2023; v1 submitted 2 December, 2021;
originally announced December 2021.
-
Fair Enough: Searching for Sufficient Measures of Fairness
Authors:
Suvodeep Majumder,
Joymallya Chakraborty,
Gina R. Bai,
Kathryn T. Stolee,
Tim Menzies
Abstract:
Testing machine learning software for ethical bias has become a pressing current concern. In response, recent research has proposed a plethora of new fairness metrics, for example, the dozens of fairness metrics in the IBM AIF360 toolkit. This raises the question: How can any fairness tool satisfy such a diverse range of goals? While we cannot completely simplify the task of fairness testing, we can certainly reduce the problem. This paper shows that many of those fairness metrics effectively measure the same thing. Based on experiments using seven real-world datasets, we find that (a) 26 classification metrics can be clustered into seven groups, and (b) four dataset metrics can be clustered into three groups. Further, each reduced set may actually predict different things. Hence, it is no longer necessary (or even possible) to satisfy all fairness metrics. In summary, to simplify the fairness testing problem, we recommend the following steps: (1) determine what type of fairness is desirable (and we offer a handful of such types); then (2) look up those types in our clusters; then (3) just test for one item per cluster.
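The clustering step can be pictured with a small synthetic experiment: score every metric across many scenarios, then group metrics whose scores are highly correlated. The latent-factor data below is an assumption for illustration, not the paper's seven real-world datasets:

```python
# Sketch of "many metrics measure the same thing": cluster metrics by the
# correlation of their scores across many scenarios. Synthetic data only.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
n_scenarios, n_metrics = 200, 10
latent = rng.standard_normal((n_scenarios, 3))     # 3 hidden "things being measured"
which = rng.integers(0, 3, n_metrics)              # each metric tracks one latent factor
scores = latent[:, which] + 0.1 * rng.standard_normal((n_scenarios, n_metrics))

corr = np.corrcoef(scores.T)                       # metric-by-metric correlation
dist = np.clip(1 - np.abs(corr), 0, None)          # similar metrics -> small distance
condensed = dist[np.triu_indices(n_metrics, k=1)]
groups = fcluster(linkage(condensed, method="average"), t=0.5, criterion="distance")
print("metric groups:", groups)                    # metrics sharing a factor cluster together
```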
Submitted 21 March, 2022; v1 submitted 25 October, 2021;
originally announced October 2021.
-
SNEAK: Faster Interactive Search-based SE
Authors:
Andre Lustosa,
Jaydeep Patel,
Venkata Sai Teja Malapati,
Tim Menzies
Abstract:
When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE), where humans can influence the search process. This paper argues that when optimizing a model with a human in the loop, data mining methods such as our SNEAK tool (which recurses into divisions of the data) perform better than standard iSBSE methods (which mutate multiple candidate solutions over many generations). For our case studies, SNEAK runs faster, asks fewer questions, achieves better solutions (that are within 3% of the best solutions seen in our sample space), and scales to large problems (in our experiments, models with 1000 variables can be explored with half a dozen interactions where, each time, we ask only four questions). Accordingly, we recommend SNEAK as a baseline against which future iSBSE work should be compared. To facilitate that, all our scripts are online at https://github.com/ai-se/sneak.
Submitted 16 January, 2023; v1 submitted 6 October, 2021;
originally announced October 2021.
-
PyTorrent: A Python Library Corpus for Large-scale Language Models
Authors:
Mehdi Bahrami,
N. C. Shrikanth,
Shade Ruangwan,
Lei Liu,
Yuji Mizobuchi,
Masahiro Fukuyori,
Wei-Peng Chen,
Kazuki Munakata,
Tim Menzies
Abstract:
A large-scale collection of both semantic and natural language resources is essential to leverage active software engineering research areas such as code reuse and code comprehensibility. Existing machine learning models ingest data from open-source repositories (like GitHub projects) and forum discussions (like Stackoverflow.com), whereas, in this showcase, we took a step backward to orchestrate a corpus titled PyTorrent that contains 218,814 Python package libraries from the PyPI and Anaconda environments. This is because earlier studies have shown that much of the code is redundant and that Python packages from these environments are better in quality and well-documented. PyTorrent enables users (such as data scientists, students, etc.) to build off-the-shelf machine learning models directly, without spending months of effort on large infrastructure. The dataset, schema, and a pretrained language model are available at: https://github.com/fla-sil/PyTorrent
Submitted 4 October, 2021;
originally announced October 2021.
-
FairMask: Better Fairness via Model-based Rebalancing of Protected Attributes
Authors:
Kewen Peng,
Joymallya Chakraborty,
Tim Menzies
Abstract:
Context: Machine learning software can generate models that inappropriately discriminate against specific protected social groups (e.g., groups based on gender, ethnicity, etc.). Motivated by those results, software engineering researchers have proposed many methods for mitigating those discriminatory effects. While those methods are effective in mitigating bias, few of them can explain the root cause of that bias.
Objective: We aim at better detection and mitigation of algorithmic discrimination in machine learning software problems.
Method: Here we propose xFAIR, a model-based extrapolation method, that is capable of both mitigating bias and explaining the cause. In our xFAIR approach, protected attributes are represented by models learned from the other independent variables (and these models offer extrapolations over the space between existing examples). We then use the extrapolation models to relabel protected attributes later seen in testing data or deployment time. Our approach aims to offset the biased predictions of the classification model via rebalancing the distribution of protected attributes.
Results: The experiments of this paper show that, without compromising (original) model performance, xFAIR can achieve significantly better group and individual fairness (as measured in different metrics) than benchmark methods. Moreover, when compared to another instance-based rebalancing method, our model-based approach shows faster runtime and thus better scalability.
Conclusion: Algorithmic decision bias can be removed via extrapolation that smooths away outlier points. As evidence for this, our proposed xFAIR not only performs better (as measured by fairness and performance metrics) than two state-of-the-art fairness algorithms, but also, as noted above, runs faster than instance-based rebalancing.
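A minimal sketch of the model-based relabeling idea appears below: learn the protected attribute from the other independent variables, then overwrite that column at test time with the extrapolated values. The column names, learner, and data are made up for illustration; this is not the xFAIR code:

```python
# Sketch of model-based relabeling: predict the protected attribute from the
# other columns, then replace it at test time with those predictions before
# the main classifier sees the data. Illustration only; not xFAIR.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
train = pd.DataFrame({
    "hours":     rng.integers(20, 60, 1000),
    "education": rng.integers(8, 20, 1000),
    "sex":       rng.integers(0, 2, 1000),       # protected attribute (hypothetical)
})

others = ["hours", "education"]
extrapolator = DecisionTreeClassifier(max_depth=3, random_state=0)
extrapolator.fit(train[others], train["sex"])     # model the protected attribute

test = train.sample(200, random_state=1).copy()
test["sex"] = extrapolator.predict(test[others])  # smooth away outlier rows
print(test.head())
```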
Submitted 27 October, 2022; v1 submitted 3 October, 2021;
originally announced October 2021.
-
An Expert System for Redesigning Software for Cloud Applications
Authors:
Rahul Yedida,
Rahul Krishna,
Anup Kalia,
Tim Menzies,
Jin Xiao,
Maja Vukovic
Abstract:
Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices.
Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are "brittle"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals.
In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization.
To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.
Submitted 27 June, 2022; v1 submitted 29 September, 2021;
originally announced September 2021.
-
FRUGAL: Unlocking SSL for Software Analytics
Authors:
Huy Tu,
Tim Menzies
Abstract:
Standard software analytics often involves having a large amount of labeled data in order to commission models with acceptable performance. However, prior work has shown that such requirements can be expensive, taking several weeks to label thousands of commits, and cannot always be met when traversing new research problems and domains. Unsupervised learning is a promising direction for learning hidden patterns within unlabelled data, but it has only been extensively studied in defect prediction. Moreover, unsupervised learning can be ineffective by itself and has not been explored in other domains (e.g., static analysis and issue close time).
Motivated by this literature gap and these technical limitations, we present FRUGAL, a tuned semi-supervised method that builds on a simple optimization scheme and does not require sophisticated methods (e.g., deep learners) or expensive labeling (e.g., 100% manually labelled data). FRUGAL optimizes the unsupervised learner's configurations (via a simple grid search) while validating our design decision of labelling just 2.5% of the data before prediction.
As shown by the experiments of this paper, FRUGAL outperforms the state-of-the-art adoptable static code warning recognizer and issue close time predictor, while reducing the cost of labelling by a factor of 40 (from 100% to 2.5%). Hence we assert that FRUGAL can save considerable effort in data labelling, especially when validating prior work or researching new problems.
Based on this work, we suggest that proponents of complex and expensive methods should always baseline such methods against simpler and cheaper alternatives. For instance, a semi-supervised learner like FRUGAL can serve as a baseline for state-of-the-art software analytics.
Submitted 22 August, 2021;
originally announced August 2021.
-
Crowdsourcing the State of the Art(ifacts)
Authors:
Maria Teresa Baldassarre,
Neil Ernst,
Ben Hermann,
Tim Menzies,
Rahul Yedida
Abstract:
In any field, finding the "leading edge" of research is an on-going challenge. Researchers cannot appease reviewers, and educators cannot teach to the leading edge of their field, if no one agrees on what the state of the art is.
Using a novel crowdsourced "reuse graph" approach, we propose here a new method to learn this state of the art. Our reuse graphs require less effort to build and verify than other community monitoring methods (e.g. artifact tracks or citation-based searches). Based on a study of 170 papers from software engineering (SE) conferences in 2020, we have found over 1,600 instances of reuse; i.e., reuse is rampant in SE research. Prior pessimism about a lack of reuse in SE research may have been a result of using the wrong methods to measure the wrong things.
Submitted 15 August, 2021;
originally announced August 2021.
-
FairBalance: How to Achieve Equalized Odds With Data Pre-processing
Authors:
Zhe Yu,
Joymallya Chakraborty,
Tim Menzies
Abstract:
This research seeks to benefit the software engineering community by providing a simple yet effective pre-processing approach for achieving equalized-odds fairness in machine learning software. Fairness issues have attracted increasing attention since machine learning software is increasingly used for high-stakes and high-risk decisions. Among all the existing fairness notions, this work specifically targets "equalized odds", given its advantage in always allowing perfect classifiers. Equalized odds requires that members of every demographic group do not receive disparate mistreatment. Prior works either optimize for an equalized-odds-related metric during the learning process, treating it as a black box, or manipulate the training data following some intuition. This work studies the root cause of the violation of equalized odds and how to tackle it. We found that equalizing the class distribution in each demographic group with sample weights is a necessary condition for achieving equalized odds without modifying the normal training process. In addition, an important partial condition for equalized odds (zero average odds difference) can be guaranteed when the class distributions are weighted to be not only equal but also balanced (1:1). Based on these analyses, we propose FairBalance, a pre-processing algorithm which balances the class distribution in each demographic group by assigning calculated weights to the training data. On eight real-world datasets, our empirical results show that, at low computational overhead, FairBalance can significantly improve equalized odds without much, if any, damage to utility. FairBalance also outperforms existing state-of-the-art approaches in terms of equalized odds. To facilitate reuse, reproduction, and validation, we have made our scripts available at https://github.com/hil-se/FairBalance.
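The weighting scheme itself is simple enough to sketch in a few lines: inside every demographic group, weight each class so that both classes carry equal total weight. The toy data and function below are for illustration; the real implementation is in the linked repository:

```python
# Sketch of group-wise class balancing: within each demographic group, both
# classes end up with equal total sample weight. Illustration only.
import pandas as pd

def fair_balance_weights(df, group_col, label_col):
    weights = pd.Series(1.0, index=df.index)
    for _, sub in df.groupby(group_col):
        counts = sub[label_col].value_counts()
        for label, cnt in counts.items():
            # weight inversely proportional to the (group, class) cell size
            weights.loc[sub.index[sub[label_col] == label]] = len(sub) / (len(counts) * cnt)
    return weights

if __name__ == "__main__":
    df = pd.DataFrame({"sex":   ["m"] * 80 + ["f"] * 20,
                       "label": [1] * 60 + [0] * 20 + [1] * 5 + [0] * 15})
    w = fair_balance_weights(df, "sex", "label")
    # each group's total weight is now split evenly across the two classes
    print(df.assign(weight=w).groupby(["sex", "label"]).weight.sum())
```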
Submitted 26 April, 2023; v1 submitted 17 July, 2021;
originally announced July 2021.
-
Fairer Software Made Easier (using "Keys")
Authors:
Tim Menzies,
Kewen Peng,
Andre Lustosa
Abstract:
Can we simplify explanations for software analytics? Maybe. Recent results show that systems often exhibit a "keys effect"; i.e. a few key features control the rest. Just to state the obvious: for systems controlled by a few keys, explanation and control are just a matter of running a handful of "what-if" queries across the keys. By exploiting the keys effect, it should be possible to dramatically simplify even complex explanations, such as those required for ethical AI systems.
Submitted 11 July, 2021;
originally announced July 2021.
-
Lessons learned from hyper-parameter tuning for microservice candidate identification
Authors:
Rahul Yedida,
Rahul Krishna,
Anup Kalia,
Tim Menzies,
Jin Xiao,
Maja Vukovic
Abstract:
When optimizing software for the cloud, monolithic applications need to be partitioned into many smaller *microservices*. While many tools have been proposed for this task, we warn that the evaluation of those approaches has been incomplete; e.g. minimal prior exploration of hyperparameter optimization. Using a set of open source Java EE applications, we show here that (a) such optimization can significantly improve microservice partitioning; and that (b) an open issue for future work is how to find which optimizer works best for different problems. To facilitate that future work, see [https://github.com/yrahul3910/ase-tuned-mono2micro](https://github.com/yrahul3910/ase-tuned-mono2micro) for a reproduction package for this research.
△ Less
Submitted 10 August, 2021; v1 submitted 11 June, 2021;
originally announced June 2021.
-
Preference Discovery in Large Product Lines
Authors:
Andre Lustosa,
Tim Menzies
Abstract:
When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE) where humans can influence the search process. Current iSBSE methods can lead to cognitive fatigue (when they overwhelm humans with too many overly elaborate questi…
▽ More
When AI tools can generate many solutions, some human preference must be applied to determine which solution is relevant to the current project. One way to find those preferences is interactive search-based software engineering (iSBSE) where humans can influence the search process. Current iSBSE methods can lead to cognitive fatigue (when they overwhelm humans with too many overly elaborate questions). WHUN is an iSBSE algorithm that avoids that problem. Due to its recursive clustering procedure, WHUN only pesters humans for $O(\log_2 N)$ interactions. Further, each interaction is mediated via a feature selection procedure that reduces the number of questions asked. When compared to prior state-of-the-art iSBSE systems, WHUN runs faster, asks fewer questions, and achieves better solutions that are within $0.1\%$ of the best solutions seen in our sample space. More importantly, WHUN scales to large problems (in our experiments, models with 1000 variables can be explored with half a dozen interactions where, each time, we ask only four questions). Accordingly, we recommend WHUN as a baseline against which future iSBSE work should be compared. To facilitate that, all our scripts are online at https://github.com/ai-se/whun.
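The following toy loop sketches how a recursive-clustering iSBSE search can stay within an $O(\log_2 N)$ interaction budget: each round clusters the surviving candidates in two, asks one question, and keeps the preferred half. The oracle function stands in for the human, and WHUN's per-question feature selection step is omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
candidates = rng.integers(0, 2, size=(256, 12))   # hypothetical solutions x binary features

def oracle_prefers(a, b):
    """Stand-in for the human: prefer the representative with more features enabled."""
    return a.sum() >= b.sum()

pool, questions = candidates, 0
while len(pool) > 4:
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pool)
    reps = km.cluster_centers_.round().astype(int)        # one representative per cluster
    keep = 0 if oracle_prefers(reps[0], reps[1]) else 1   # one question to the human
    pool = pool[km.labels_ == keep]
    questions += 1

print(f"{questions} interactions, {len(pool)} candidates left")
```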
△ Less
Submitted 16 January, 2023; v1 submitted 7 June, 2021;
originally announced June 2021.
-
VEER: Enhancing the Interpretability of Model-based Optimizations
Authors:
Kewen Peng,
Christian Kaltenecker,
Norbert Siegmund,
Sven Apel,
Tim Menzies
Abstract:
Many software systems can be tuned for multiple objectives (e.g., faster runtime, less required memory, less network traffic or energy consumption, etc.). Optimizers built for different objectives suffer from "model disagreement"; i.e., they have different (or even opposite) insights and tactics on how to optimize a system. Model disagreement is rampant (at least for configuration problems). Yet p…
▽ More
Many software systems can be tuned for multiple objectives (e.g., faster runtime, less required memory, less network traffic or energy consumption, etc.). Optimizers built for different objectives suffer from "model disagreement"; i.e., they have different (or even opposite) insights and tactics on how to optimize a system. Model disagreement is rampant (at least for configuration problems). Yet prior to this paper, it has barely been explored. This paper shows that model disagreement can be mitigated via VEER, a one-dimensional approximation to the N-objective space. Since it explores a simpler goal space, VEER runs very fast (here, across eleven configuration problems). Even for our largest problem (with tens of thousands of possible configurations), VEER finds optimizations that are as good or better, with zero model disagreements, three orders of magnitude faster (since its one-dimensional output no longer needs the sorting procedure). Based on the above, we recommend VEER as a very fast method to solve complex configuration problems, while at the same time avoiding model disagreement.
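The core trick, collapsing N objectives into a single score that one regressor can learn, can be sketched as below; the synthetic objectives and the "distance to heaven"-style scalarization are illustrative assumptions rather than VEER's exact construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
configs = rng.random((500, 6))                 # hypothetical configuration options
objectives = np.column_stack([                 # two objectives, both to be minimized
    configs @ rng.random(6),                   # e.g. runtime
    (1 - configs) @ rng.random(6),             # e.g. memory
])

# Collapse the objective space to one dimension: normalize each objective to
# [0, 1] and average them, so candidates can be ranked by a single number.
lo, hi = objectives.min(axis=0), objectives.max(axis=0)
scalar = ((objectives - lo) / (hi - lo)).mean(axis=1)

surrogate = DecisionTreeRegressor().fit(configs, scalar)     # one model, one goal
best = configs[np.argmin(surrogate.predict(configs))]
print("recommended configuration:", best.round(2))
```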
△ Less
Submitted 12 February, 2023; v1 submitted 4 June, 2021;
originally announced June 2021.
-
Bias in Machine Learning Software: Why? How? What to do?
Authors:
Joymallya Chakraborty,
Suvodeep Majumder,
Tim Menzies
Abstract:
Increasingly, software is making autonomous decisions in case of criminal sentencing, approving credit cards, hiring employees, and so on. Some of these decisions show bias and adversely affect certain social groups (e.g. those defined by sex, race, age, marital status). Many prior works on bias mitigation take the following form: change the data or learners in multiple ways, then see if any of th…
▽ More
Increasingly, software is making autonomous decisions in areas such as criminal sentencing, approving credit cards, hiring employees, and so on. Some of these decisions show bias and adversely affect certain social groups (e.g. those defined by sex, race, age, marital status). Many prior works on bias mitigation take the following form: change the data or learners in multiple ways, then see if any of that improves fairness. Perhaps a better approach is to postulate root causes of bias and then apply some resolution strategy. This paper postulates that the root causes of bias are the prior decisions that affect (a) what data was selected and (b) the labels assigned to those examples. Our Fair-SMOTE algorithm removes biased labels and rebalances internal distributions such that, for each value of the sensitive attribute, examples are equal in both the positive and negative classes. On testing, it was seen that this method was just as effective at reducing bias as prior approaches. Further, models generated via Fair-SMOTE achieve higher performance (measured in terms of recall and F1) than other state-of-the-art fairness improvement algorithms. To the best of our knowledge, measured in terms of number of analyzed learners and datasets, this study is one of the largest studies on bias mitigation yet presented in the literature.
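A simplified sketch of the rebalancing step appears below: it merely duplicates rows until every (sensitive attribute value, class) cell is the same size, whereas Fair-SMOTE synthesizes new examples SMOTE-style and also removes suspect labels; the credit-style columns and data are hypothetical.

```python
import pandas as pd

def balance_cells(df, group_col, label_col, random_state=0):
    """Oversample rows (with replacement) so every (group, class) cell
    has as many rows as the largest cell."""
    cells = [g for _, g in df.groupby([group_col, label_col])]
    target = max(len(c) for c in cells)
    grown = [c.sample(target, replace=True, random_state=random_state) for c in cells]
    return pd.concat(grown).sample(frac=1, random_state=random_state)   # shuffle

# Hypothetical credit-decision data: 'sex' is the sensitive attribute.
df = pd.DataFrame({"income":   [30, 85, 42, 60, 75, 28, 55, 90],
                   "sex":      [0, 0, 0, 1, 1, 1, 1, 1],
                   "approved": [1, 0, 0, 0, 0, 0, 1, 1]})
balanced = balance_cells(df, "sex", "approved")
print(balanced.groupby(["sex", "approved"]).size())   # every cell now equal in size
```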
△ Less
Submitted 9 July, 2021; v1 submitted 25 May, 2021;
originally announced May 2021.
-
Assessing the Early Bird Heuristic (for Predicting Project Quality)
Authors:
N. C. Shrikanth,
Tim Menzies
Abstract:
Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, perhaps a model learned from that region would suffice for the rest of the project.
To support this claim, we offer a case study with 240 projects, where we find that…
▽ More
Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, perhaps a model learned from that region would suffice for the rest of the project.
To support this claim, we offer a case study with 240 projects, where we find that the information in those projects "clumps" towards the earliest parts of the project. A quality prediction model learned from just the first 150 commits works as well as, or better than, state-of-the-art alternatives. Using just this "early bird" data, we can build models very quickly and very early in the project life cycle. Moreover, using this early bird method, we have shown that a simple model (with just a few features) generalizes to hundreds of projects.
Based on this experience, we suspect that prior work on generalizing quality models may have needlessly complicated an inherently simple process. Further, prior work that focused on later life-cycle data needs to be revisited since its conclusions were drawn from relatively uninformative regions.
Replication note: all our data and scripts are available here: https://github.com/snaraya7/early-bird
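A minimal sketch of the "early bird" recipe, using synthetic commit-level data in place of a real project history: train on the first 150 commits, then check the model against everything that follows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical commit-level data: one row per commit in time order, with a few
# process features (lines added/deleted, files touched) and a "buggy" label.
rng = np.random.default_rng(0)
n = 1000
commits = pd.DataFrame({
    "la": rng.poisson(40, n), "ld": rng.poisson(15, n), "nf": rng.poisson(3, n),
})
commits["buggy"] = (commits["la"] + commits["ld"] > 60).astype(int)

early, late = commits.iloc[:150], commits.iloc[150:]    # the "early bird" split
features = ["la", "ld", "nf"]
model = LogisticRegression(max_iter=1000).fit(early[features], early["buggy"])
print("accuracy on later commits:", round(model.score(late[features], late["buggy"]), 2))
```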
△ Less
Submitted 11 January, 2023; v1 submitted 23 May, 2021;
originally announced May 2021.
-
Mining Scientific Workflows for Anomalous Data Transfers
Authors:
Huy Tu,
George Papadimitriou,
Mariam Kiran,
Cong Wang,
Anirban Mandal,
Ewa Deelman,
Tim Menzies
Abstract:
Modern scientific workflows are data-driven and are often executed on distributed, heterogeneous, high-performance computing infrastructures. Anomalies and failures in the workflow execution cause loss of scientific productivity and inefficient use of the infrastructure. Hence, detecting, diagnosing, and mitigating these anomalies are immensely important for reliable and performant scientific work…
▽ More
Modern scientific workflows are data-driven and are often executed on distributed, heterogeneous, high-performance computing infrastructures. Anomalies and failures in the workflow execution cause loss of scientific productivity and inefficient use of the infrastructure. Hence, detecting, diagnosing, and mitigating these anomalies are immensely important for reliable and performant scientific workflows. Since these workflows rely heavily on high-performance network transfers that require strict QoS constraints, accurately detecting anomalous network performance is crucial to ensure reliable and efficient workflow execution. To address this challenge, we have developed X-FLASH, a network anomaly detection tool for faulty TCP workflow transfers. X-FLASH incorporates novel hyperparameter tuning and data mining approaches for improving the performance of the machine learning algorithms to accurately classify the anomalous TCP packets. X-FLASH leverages XGBoost as an ensemble model and couples it with a sequential optimizer, FLASH, borrowed from search-based software engineering, to learn the optimal model parameters. X-FLASH found configurations that outperformed the existing approach by up to 28\%, 29\%, and 40\% (relative improvement) for F-measure, G-score, and recall, in less than 30 evaluations. Given (1) the large improvements and (2) the simplicity of the tuning, we recommend that future research adopt such tuning studies as a new standard, at least in the area of scientific workflow anomaly detection.
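The FLASH-style sequential tuning loop can be sketched as follows: evaluate a few random configurations, fit a CART surrogate over them, and repeatedly evaluate the configuration the surrogate predicts to be best. To keep the sketch dependency-light it tunes scikit-learn's gradient boosting rather than XGBoost, and the grid and budget are illustrative assumptions.

```python
import numpy as np
from itertools import product
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Hypothetical tuning grid for a boosted-tree learner.
grid = [dict(n_estimators=n, max_depth=d, learning_rate=lr)
        for n, d, lr in product([50, 100, 200], [2, 4, 6], [0.05, 0.1, 0.3])]

def score(cfg):
    model = GradientBoostingClassifier(random_state=0, **cfg)
    return cross_val_score(model, X, y, cv=3, scoring="f1").mean()

rng = np.random.default_rng(0)
tried = [int(i) for i in rng.choice(len(grid), 4, replace=False)]   # small random seed set
results = {i: score(grid[i]) for i in tried}

for _ in range(10):                                        # FLASH-style loop
    # CART surrogate: predict the score of untried configs, then evaluate the best guess.
    surrogate = DecisionTreeRegressor().fit(
        [list(grid[i].values()) for i in results], list(results.values()))
    untried = [i for i in range(len(grid)) if i not in results]
    guesses = surrogate.predict([list(grid[i].values()) for i in untried])
    pick = untried[int(np.argmax(guesses))]
    results[pick] = score(grid[pick])

best = max(results, key=results.get)
print("best config:", grid[best], "F1 ~", round(results[best], 3))
```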
△ Less
Submitted 22 March, 2021;
originally announced March 2021.
-
Structuring a Comprehensive Software Security Course Around the OWASP Application Security Verification Standard
Authors:
Sarah Elder,
Nusrat Zahan,
Val Kozarev,
Rui Shu,
Tim Menzies,
Laurie Williams
Abstract:
Lack of security expertise among software practitioners is a problem with many implications. First, there is a deficit of security professionals to meet current needs. Additionally, even practitioners who do not plan to work in security may benefit from increased understanding of security. The goal of this paper is to aid software engineering educators in designing a comprehensive software securit…
▽ More
Lack of security expertise among software practitioners is a problem with many implications. First, there is a deficit of security professionals to meet current needs. Additionally, even practitioners who do not plan to work in security may benefit from increased understanding of security. The goal of this paper is to aid software engineering educators in designing a comprehensive software security course by sharing an experience running a software security course for the eleventh time. Through all eleven years of running the software security course, the course objectives have been comprehensive - ranging from security testing, to secure design and coding, to security requirements, to security risk management. For the first time in this eleventh year, a theme of the course assignments was to map vulnerability discovery to the security controls of the Open Web Application Security Project (OWASP) Application Security Verification Standard (ASVS). Based upon student performance on a final exploratory penetration testing project, this mapping may have increased students' depth of understanding of a wider range of security topics. The students efficiently detected 191 unique and verified vulnerabilities of 28 different Common Weakness Enumeration (CWE) types during a three-hour period in the OpenMRS project, an electronic health record application in active use.
△ Less
Submitted 8 March, 2021;
originally announced March 2021.
-
Old but Gold: Reconsidering the value of feedforward learners for software analytics
Authors:
Rahul Yedida,
Xueqi Yang,
Tim Menzies
Abstract:
There has been an increased interest in the use of deep learning approaches for software analytics tasks. State-of-the-art techniques leverage modern deep learning techniques such as LSTMs, yielding competitive performance, albeit at the price of longer training times.
Recently, Galke and Scherp [18] showed that at least for image recognition, a decades-old feedforward neural network can match t…
▽ More
There has been an increased interest in the use of deep learning approaches for software analytics tasks. State-of-the-art techniques leverage modern deep learning techniques such as LSTMs, yielding competitive performance, albeit at the price of longer training times.
Recently, Galke and Scherp [18] showed that at least for image recognition, a decades-old feedforward neural network can match the performance of modern deep learning techniques. This motivated us to try the same in the SE literature. Specifically, in this paper, we apply feedforward networks with some preprocessing to two analytics tasks: issue close time prediction, and vulnerability detection. We test the hypothesis put forward by Galke and Scherp [18], that feedforward networks suffice for many analytics tasks (which we call the "Old but Gold" hypothesis), on these two tasks. For three out of five datasets from these tasks, we achieve new high-water mark results (that out-perform the prior state-of-the-art results) and for a fourth data set, Old but Gold performed as well as the recent state of the art. Furthermore, the Old but Gold results were obtained orders of magnitude faster than prior work. For example, for issue close time, Old but Gold found good predictors in 90 seconds (as opposed to the newer methods, which took 6 hours to run).
Our results support the "Old but Gold" hypothesis and lead to the following recommendation: try simpler alternatives before more complex methods. At the very least, this will produce a baseline result against which researchers can compare some other, supposedly more sophisticated, approach. And in the best case, they will obtain useful results that are as good as anything else, in a small fraction of the effort.
To support open science, all our scripts and data are available on-line at https://github.com/fastidiouschipmunk/simple.
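The "old but gold" baseline itself is just a small feedforward network; a minimal sketch (omitting the paper's preprocessing and tuning, and using synthetic data in place of the SE datasets) is below.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for an issue-close-time or vulnerability dataset.
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.8], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# A plain feedforward network: scale the inputs, two modest hidden layers.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=1),
)
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```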
△ Less
Submitted 5 February, 2022; v1 submitted 15 January, 2021;
originally announced January 2021.
-
Faster SAT Solving for Software with Repeated Structures (with Case Studies on Software Test Suite Minimization)
Authors:
Jianfeng Chen,
Xipeng Shen,
Tim Menzies
Abstract:
Theorem provers has been used extensively in software engineering for software testing or verification. However, software is now so large and complex that additional architecture is needed to guide theorem provers as they try to generate test suites. The SNAP test suite generator (introduced in this paper) combines the Z3 theorem prover with the following tactic: cluster some candidate tests, then…
▽ More
Theorem provers have been used extensively in software engineering for software testing or verification. However, software is now so large and complex that additional architecture is needed to guide theorem provers as they try to generate test suites. The SNAP test suite generator (introduced in this paper) combines the Z3 theorem prover with the following tactic: cluster some candidate tests, then search for valid tests by proposing small mutations to the cluster centroids. This technique effectively removes repeated structures in the tests since many repeated structures can be replaced with one centroid. In practice, SNAP is remarkably effective. For 27 real-world programs with up to half a million variables, SNAP found test suites which were 10 to 750 times smaller than those found by the prior state-of-the-art. Also, SNAP ran orders of magnitude faster and (unlike prior work) generated 100% valid tests.
△ Less
Submitted 7 January, 2021;
originally announced January 2021.
-
Early Life Cycle Software Defect Prediction. Why? How?
Authors:
N. C. Shrikanth,
Suvodeep Majumder,
Tim Menzies
Abstract:
Many researchers assume that, for software analytics, "more data is better." We write to show that, at least for learning defect predictors, this may not be true. To demonstrate this, we analyzed hundreds of popular GitHub projects. These projects ran for 84 months and contained 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, d…
▽ More
Many researchers assume that, for software analytics, "more data is better." We write to show that, at least for learning defect predictors, this may not be true. To demonstrate this, we analyzed hundreds of popular GitHub projects. These projects ran for 84 months and contained 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, defect predictors learned from just the first 150 commits (and first four months of data) perform just as well as anything else. This means that, at least for the projects studied here, after the first few months, we need not continually update our defect prediction models. We hope these results inspire other researchers to adopt a "simplicity-first" approach to their work. Some domains require a complex and data-hungry analysis. But before assuming complexity, it is prudent to check the raw data, looking for "shortcuts" that can simplify the analysis.
△ Less
Submitted 8 February, 2021; v1 submitted 25 November, 2020;
originally announced November 2020.
-
Omni: Automated Ensemble with Unexpected Models against Adversarial Evasion Attack
Authors:
Rui Shu,
Tianpei Xia,
Laurie Williams,
Tim Menzies
Abstract:
Background: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing st…
▽ More
Background: Machine learning-based security detection models have become prevalent in modern malware and intrusion detection systems. However, previous studies show that such models are susceptible to adversarial evasion attacks. In this type of attack, inputs (i.e., adversarial examples) are specially crafted by intelligent malicious adversaries, with the aim of being misclassified by existing state-of-the-art models (e.g., deep neural networks). Once the attackers can fool a classifier to think that a malicious input is actually benign, they can render a machine learning-based malware or intrusion detection system ineffective. Goal: To help security practitioners and researchers build a more robust model against non-adaptive, white-box, and non-targeted adversarial evasion attacks through the idea of an ensemble model. Method: We propose an approach called Omni, the main idea of which is to explore methods that create an ensemble of "unexpected models"; i.e., models whose control hyperparameters have a large distance to the hyperparameters of an adversary's target model, with which we then make an optimized weighted ensemble prediction. Result: In studies with five types of adversarial evasion attacks (FGSM, BIM, JSMA, DeepFool, and Carlini-Wagner) on five security datasets (NSL-KDD, CIC-IDS-2017, CSE-CIC-IDS2018, CICAndMal2017, and the Contagio PDF dataset), we show Omni is a promising approach as a defense strategy against adversarial attacks when compared with other baseline treatments. Conclusion: When employing ensemble defense against adversarial evasion attacks, we suggest creating an ensemble with unexpected models that are distant from the attacker's expected model (i.e., target model) through methods such as hyperparameter optimization.
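A compressed sketch of the "unexpected models" idea: sample candidate hyperparameter settings, keep those farthest from the configuration an attacker is assumed to target, and combine them in a validation-weighted soft vote. The fixed learner, two-hyperparameter space, and assumed target configuration are all simplifications of this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

target_cfg = np.array([100, 8])   # hypothetical attacker-expected (n_estimators, max_depth)
rng = np.random.default_rng(0)
candidates = np.column_stack([rng.integers(20, 400, 40), rng.integers(2, 20, 40)])

# "Unexpected" models: the candidate configs farthest from the attacker's
# expected configuration (distances measured in a normalized config space).
span = candidates.max(axis=0) - candidates.min(axis=0)
dist = np.linalg.norm((candidates - target_cfg) / span, axis=1)
unexpected = candidates[np.argsort(dist)[-5:]]

models, weights = [], []
for n_est, depth in unexpected:
    m = RandomForestClassifier(n_estimators=int(n_est), max_depth=int(depth),
                               random_state=0).fit(X_tr, y_tr)
    models.append(m)
    weights.append(m.score(X_val, y_val))   # weight each member by validation accuracy

# Weighted soft-vote ensemble prediction.
proba = sum(w * m.predict_proba(X_val) for w, m in zip(weights, models)) / sum(weights)
print("ensemble accuracy:", round((proba.argmax(axis=1) == y_val).mean(), 3))
```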
△ Less
Submitted 12 October, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
Empirical Standards for Software Engineering Research
Authors:
Paul Ralph,
Nauman bin Ali,
Sebastian Baltes,
Domenico Bianculli,
Jessica Diaz,
Yvonne Dittrich,
Neil Ernst,
Michael Felderer,
Robert Feldt,
Antonio Filieri,
Breno Bernard Nicolau de França,
Carlo Alberto Furia,
Greg Gay,
Nicolas Gold,
Daniel Graziotin,
Pinjia He,
Rashina Hoda,
Natalia Juristo,
Barbara Kitchenham,
Valentina Lenarduzzi,
Jorge MartÃnez,
Jorge Melegati,
Daniel Mendez,
Tim Menzies,
Jefferson Molleri
, et al. (18 additional authors not shown)
Abstract:
Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around resear…
▽ More
Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.
△ Less
Submitted 4 March, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Revisiting Process versus Product Metrics: a Large Scale Analysis
Authors:
Suvodeep Majumder,
Pranav Mody,
Tim Menzies
Abstract:
Numerous methods can build predictive models from software data. However, what methods and conclusions should we endorse as we move from analytics in-the-small (dealing with a handful of projects) to analytics in-the-large (dealing with hundreds of projects)?
To answer this question, we recheck prior small-scale results (about process versus product metrics for defect prediction and the granular…
▽ More
Numerous methods can build predictive models from software data. However, what methods and conclusions should we endorse as we move from analytics in-the-small (dealing with a handful of projects) to analytics in-the-large (dealing with hundreds of projects)?
To answer this question, we recheck prior small-scale results (about process versus product metrics for defect prediction and the granularity of metrics) using 722,471 commits from 700 Github projects. We find that some analytics in-the-small conclusions still hold when scaling up to analytics in-the-large. For example, like prior work, we see that process metrics are better predictors for defects than product metrics (best process/product-based learners respectively achieve recalls of 98\%/44\% and AUCs of 95\%/54\%, median values).
That said, we warn that it is unwise to trust metric importance results from analytics in-the-small studies since those change dramatically when moving to analytics in-the-large. Also, when reasoning in-the-large about hundreds of projects, it is better to use predictions from multiple models (since single model predictions can become confused and exhibit a high variance).
△ Less
Submitted 26 October, 2021; v1 submitted 21 August, 2020;
originally announced August 2020.
-
Simpler Hyperparameter Optimization for Software Analytics: Why, How, When?
Authors:
Amritanshu Agrawal,
Xueqi Yang,
Rishabh Agrawal,
Xipeng Shen,
Tim Menzies
Abstract:
How to make software analytics simpler and faster? One method is to match the complexity of analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve for improving the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by just DODGE-ing aw…
▽ More
How can we make software analytics simpler and faster? One method is to match the complexity of analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by just DODGE-ing away from things tried before. But when is it wise to use DODGE and when must we use more complex (and much slower) optimizers? To answer this, we applied hyperparameter optimization to 120 SE data sets that explored bad smell detection, predicting Github issue close time, bug report analysis, defect prediction, and dozens of other non-SE problems. We find that DODGE works best for data sets with low "intrinsic dimensionality" (D = 3) and very poorly for higher-dimensional data (D over 8). Nearly all the SE data seen here was intrinsically low-dimensional, indicating that DODGE is applicable for many SE analytics tasks.
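A loose sketch of the DODGE tactic (not the authors' code): after each evaluation, option values that produced a result within epsilon of something already seen are down-weighted, steering the random search away from redundant regions of the hyperparameter space. The option grid, learner, and epsilon below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=2)

# Hypothetical option space for a decision-tree learner.
options = {"max_depth": [2, 4, 8, 16, None],
           "min_samples_leaf": [1, 2, 4, 8],
           "criterion": ["gini", "entropy"]}
weight = {k: np.ones(len(v)) for k, v in options.items()}   # sampling weights per option value

rng = np.random.default_rng(2)
seen, best, eps = [], (None, -1.0), 0.05

for _ in range(20):
    idx = {k: rng.choice(len(v), p=weight[k] / weight[k].sum()) for k, v in options.items()}
    cfg = {k: options[k][i] for k, i in idx.items()}
    score = cross_val_score(DecisionTreeClassifier(random_state=2, **cfg), X, y, cv=3).mean()
    redundant = any(abs(score - s) < eps for s in seen)   # result too close to one already seen?
    for k, i in idx.items():                              # dodge away from redundant regions
        weight[k][i] *= 0.5 if redundant else 2.0
    seen.append(score)
    if score > best[1]:
        best = (cfg, score)

print("best config:", best[0], "accuracy ~", round(best[1], 3))
```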
△ Less
Submitted 22 April, 2021; v1 submitted 13 August, 2020;
originally announced August 2020.
-
On the Value of Oversampling for Deep Learning in Software Defect Prediction
Authors:
Rahul Yedida,
Tim Menzies
Abstract:
One truism of deep learning is that the automatic feature engineering (seen in the first layers of those networks) excuses data scientists from performing tedious manual feature engineering prior to running DL. For the specific case of deep learning for defect prediction, we show that that truism is false. Specifically, when we preprocess data with a novel oversampling technique called fuzzy sampl…
▽ More
One truism of deep learning is that the automatic feature engineering (seen in the first layers of those networks) excuses data scientists from performing tedious manual feature engineering prior to running DL. For the specific case of deep learning for defect prediction, we show that that truism is false. Specifically, when we preprocess data with a novel oversampling technique called fuzzy sampling, as part of a larger pipeline called GHOST (Goal-oriented Hyper-parameter Optimization for Scalable Training), then we can do significantly better than the prior DL state of the art in 14/20 defect data sets. Our approach yields state-of-the-art results while also training deep learners significantly faster. These results present a cogent case for the use of oversampling prior to applying deep learning on software defect prediction datasets.
△ Less
Submitted 20 April, 2021; v1 submitted 9 August, 2020;
originally announced August 2020.
-
How Different is Test Case Prioritization for Open and Closed Source Projects?
Authors:
Xiao Ling,
Rishabh Agrawal,
Tim Menzies
Abstract:
Improved test case prioritization means that software developers can detect and fix more software faults sooner than usual. But is there one "best" prioritization algorithm? Or do different kinds of projects deserve special kinds of prioritization? To answer these questions, this paper applies nine prioritization schemes to 31 projects that range from (a) highly rated open-source Github projects t…
▽ More
Improved test case prioritization means that software developers can detect and fix more software faults sooner than usual. But is there one "best" prioritization algorithm? Or do different kinds of projects deserve special kinds of prioritization? To answer these questions, this paper applies nine prioritization schemes to 31 projects that range from (a) highly rated open-source Github projects to (b) computational science software to (c) a closed-source project. We find that prioritization approaches that work best for open-source projects can work worst for the closed-source project (and vice versa). From these experiments, we conclude that (a) it is ill-advised to always apply one prioritization scheme to all projects since (b) prioritization requires tuning to different project types.
△ Less
Submitted 20 February, 2021; v1 submitted 2 August, 2020;
originally announced August 2020.
-
Making Fair ML Software using Trustworthy Explanation
Authors:
Joymallya Chakraborty,
Kewen Peng,
Tim Menzies
Abstract:
Machine learning software is being used in many applications (finance, hiring, admissions, criminal justice) having a huge social impact. But sometimes the behavior of this software is biased and it shows discrimination based on some sensitive attributes such as sex, race, etc. Prior works concentrated on finding and mitigating bias in ML models. A recent trend is using instance-based model-agnost…
▽ More
Machine learning software is being used in many applications (finance, hiring, admissions, criminal justice) that have a huge social impact. But sometimes the behavior of this software is biased and it shows discrimination based on some sensitive attributes such as sex, race, etc. Prior works concentrated on finding and mitigating bias in ML models. A recent trend is using instance-based, model-agnostic explanation methods such as LIME to find bias in model predictions. Our work concentrates on finding shortcomings of current bias measures and explanation methods. We show how our proposed method based on K nearest neighbors can overcome those shortcomings and find the underlying bias of black-box models. Our results are more trustworthy and helpful for practitioners. Finally, we describe our future framework combining explanation and planning to build fair software.
△ Less
Submitted 18 August, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
Defect Reduction Planning (using TimeLIME)
Authors:
Kewen Peng,
Tim Menzies
Abstract:
Software comes in releases. An implausible change to software is something that has never been changed in prior releases. When planning how to reduce defects, it is better to use plausible changes, i.e., changes with some precedence in the prior releases.
To demonstrate these points, this paper compares several defect reduction planning tools. LIME is a local sensitivity analysis tool that can r…
▽ More
Software comes in releases. An implausible change to software is something that has never been changed in prior releases. When planning how to reduce defects, it is better to use plausible changes, i.e., changes with some precedence in the prior releases.
To demonstrate these points, this paper compares several defect reduction planning tools. LIME is a local sensitivity analysis tool that can report the fewest changes needed to alter the classification of some code module (e.g., from "defective" to "non-defective"). TimeLIME is a new tool, introduced in this paper, that improves LIME by restricting its plans to just those attributes which change the most within a project.
In this study, we compared the performance of LIME and TimeLIME and several other defect reduction planning algorithms. The generated plans were assessed via (a) the similarity scores between the proposed code changes and the real code changes made by developers; and (b) the improvement scores seen within projects that followed the plans. For nine project trials, we found that TimeLIME outperformed all other algorithms (in 8 out of 9 trials). Hence, we strongly recommend using past releases as a source of knowledge for computing fixes for new releases (using TimeLIME).
Apart from these specific results about planning defect reductions and TimeLIME, the more general point of this paper is that our community should be more careful about using off-the-shelf AI tools, without first applying SE knowledge. In this case study, it was not difficult to augment a standard AI algorithm with SE knowledge (that past releases are a good source of knowledge for planning defect reductions). As shown here, once that SE knowledge is applied, this can result in dramatically better systems.
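The "plausible change" filter can be sketched in a few lines: measure how often each attribute actually changed between releases, and only let the planner recommend changes to attributes above some change-rate threshold. The release tables and threshold below are toy assumptions, not the tool's real data.

```python
import pandas as pd

# Hypothetical static-code-attribute tables for two consecutive releases,
# indexed by module name.
release1 = pd.DataFrame({"loc": [120, 300, 80], "cbo": [4, 9, 2], "wmc": [10, 25, 6]},
                        index=["a.java", "b.java", "c.java"])
release2 = pd.DataFrame({"loc": [150, 420, 82], "cbo": [4, 9, 3], "wmc": [14, 31, 6]},
                        index=["a.java", "b.java", "c.java"])

# How often did each attribute actually change between releases?
change_rate = (release1 != release2).mean()
plausible = change_rate[change_rate >= 0.5].index.tolist()

print("attributes a plan may touch:", plausible)   # here: ['loc', 'wmc']
# A LIME-style plan would then be filtered so it only recommends changes
# to the attributes in `plausible`.
```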
△ Less
Submitted 15 February, 2021; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Predicting Health Indicators for Open Source Projects (using Hyperparameter Optimization)
Authors:
Tianpei Xia,
Wei Fu,
Rui Shu,
Rishabh Agrawal,
Tim Menzies
Abstract:
Software developed on public platform is a source of data that can be used to make predictions about those projects. While the individual developing activity may be random and hard to predict, the developing behavior on project level can be predicted with good accuracy when large groups of developers work together on software projects.
To demonstrate this, we use 64,181 months of data from 1,159…
▽ More
Software developed on public platforms is a source of data that can be used to make predictions about those projects. While individual development activity may be random and hard to predict, development behavior at the project level can be predicted with good accuracy when large groups of developers work together on software projects.
To demonstrate this, we use 64,181 months of data from 1,159 GitHub projects to make various predictions about the recent status of those projects (as of April 2020). We find that traditional estimation algorithms make many mistakes. Algorithms like $k$-nearest neighbors (KNN), support vector regression (SVR), random forest (RFT), linear regression (LNR), and regression trees (CART) have high error rates. But that error rate can be greatly reduced using hyperparameter optimization.
To the best of our knowledge, this is the largest study yet conducted, using recent data for predicting multiple health indicators of open-source projects.
△ Less
Submitted 17 March, 2022; v1 submitted 12 June, 2020;
originally announced June 2020.
-
Assessing Practitioner Beliefs about Software Engineering
Authors:
N. C. Shrikanth,
William Nichols,
Fahmid Morshed Fahid,
Tim Menzies
Abstract:
Software engineering is a highly dynamic discipline. Hence, as times change, so too might our beliefs about core processes in this field. This paper checks some five beliefs that originated in the past decades that comment on the relationships between (i) developer productivity; (ii) software quality and (iii) years of developer experience. Using data collected from 1,356 developers in the period…
▽ More
Software engineering is a highly dynamic discipline. Hence, as times change, so too might our beliefs about core processes in this field. This paper checks five beliefs, originating in past decades, that comment on the relationships between (i) developer productivity; (ii) software quality and (iii) years of developer experience. Using data collected from 1,356 developers in the period 1995 to 2006, we found support for only one of the five beliefs, titled "Quality entails productivity". We found no clear support for four other beliefs based on programming languages and software developers. However, from the sporadic evidence for the four other beliefs, we learned that a narrow scope could mislead practitioners into believing that certain effects hold in their day-to-day work. Lastly, through an aggregated assessment of the five beliefs, we find that programming languages act as a confounding factor for developer productivity and software quality. Thus the overall message of this work is that it is both important and possible to revisit old beliefs in SE. Researchers and practitioners should routinely retest old beliefs.
△ Less
Submitted 24 May, 2021; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Learning to Recognize Actionable Static Code Warnings (is Intrinsically Easy)
Authors:
Xueqi Yang,
Jianfeng Chen,
Rahul Yedida,
Zhe Yu,
Tim Menzies
Abstract:
Static code warning tools often generate warnings that programmers ignore. Such tools can be made more useful via data mining algorithms that select the "actionable" warnings; i.e. the warnings that are usually not ignored.
In this paper, we look for actionable warnings within a sample of 5,675 actionable warnings seen in 31,058 static code warnings from FindBugs. We find that data mining algori…
▽ More
Static code warning tools often generate warnings that programmers ignore. Such tools can be made more useful via data mining algorithms that select the "actionable" warnings; i.e. the warnings that are usually not ignored.
In this paper, we look for actionable warnings within a sample of 5,675 actionable warnings seen in 31,058 static code warnings from FindBugs. We find that data mining algorithms can find actionable warnings with remarkable ease. Specifically, a range of data mining methods (deep learners, random forests, decision tree learners, and support vector machines) all achieved very good results (recalls and AUC(TNR, TPR) measures usually over 95% and false alarms usually under 5%).
Given that all these learners succeeded so easily, it is appropriate to ask if there is something about this task that is inherently easy. We report that while our data sets have up to 58 raw features, those features can be approximated by less than two underlying dimensions. For such intrinsically simple data, many different kinds of learners can generate useful models with similar performance.
Based on the above, we conclude that learning to recognize actionable static code warnings is easy, using a wide range of learning algorithms, since the underlying data is intrinsically simple. If we had to pick one particular learner for this task, we would suggest linear SVMs (since, at least in our sample, that learner ran relatively quickly and achieved the best median performance) and we would not recommend deep learning (since this data is intrinsically very simple).
△ Less
Submitted 10 January, 2021; v1 submitted 31 May, 2020;
originally announced June 2020.
-
Fairway: A Way to Build Fair ML Software
Authors:
Joymallya Chakraborty,
Suvodeep Majumder,
Zhe Yu,
Tim Menzies
Abstract:
Machine learning software is increasingly being used to make decisions that affect people's lives. But sometimes, the core part of this software (the learned model), behaves in a biased manner that gives undue advantages to a specific group of people (where those groups are determined by sex, race, etc.). This "algorithmic discrimination" in the AI software systems has become a matter of serious c…
▽ More
Machine learning software is increasingly being used to make decisions that affect people's lives. But sometimes, the core part of this software (the learned model) behaves in a biased manner that gives undue advantages to a specific group of people (where those groups are determined by sex, race, etc.). This "algorithmic discrimination" in AI software systems has become a matter of serious concern in the machine learning and software engineering community. There has been work done to find "algorithmic bias" or "ethical bias" in software systems. Once bias is detected in an AI software system, mitigating that bias is extremely important. In this work, we (a) explain how ground-truth bias in training data affects machine learning model fairness and how to find that bias in AI software, and (b) propose a method, Fairway, which combines pre-processing and in-processing approaches to remove ethical bias from training data and the trained model. Our results show that we can find bias and mitigate bias in a learned model, without greatly damaging the predictive performance of that model. We propose that (1) testing for bias and (2) bias mitigation should be a routine part of the machine learning software development life cycle. Fairway offers much support for these two purposes.
△ Less
Submitted 6 October, 2020; v1 submitted 23 March, 2020;
originally announced March 2020.
-
How to Improve AI Tools (by Adding in SE Knowledge): Experiments with the TimeLIME Defect Reduction Tool
Authors:
Kewen Peng,
Tim Menzies
Abstract:
AI algorithms are being used with increased frequency in SE research and practice. Such algorithms are usually commissioned and certified using data from outside the SE domain. Can we assume that such algorithms can be used ''off-the-shelf'' (i.e. with no modifications)? To say that another way, are there special features of SE problems that suggest a different and better way to use AI tools?
To…
▽ More
AI algorithms are being used with increased frequency in SE research and practice. Such algorithms are usually commissioned and certified using data from outside the SE domain. Can we assume that such algorithms can be used ''off-the-shelf'' (i.e. with no modifications)? To say that another way, are there special features of SE problems that suggest a different and better way to use AI tools?
To answer these questions, this paper reports experiments with TimeLIME, a variant of the LIME explanation algorithm from KDD'16. LIME can offer recommendations on how to change static code attributes in order to reduce the number of defects in the next software release. That version of LIME used an internal weighting tool to decide what attributes to include/exclude in those recommendations. TimeLIME improves on that weighting scheme using the following SE knowledge: software comes in releases; an implausible change to software is something that has never been changed in prior releases; so it is better to use plausible changes, i.e. changes with some precedent in the prior releases. By restricting recommendations to just the frequently changed attributes, TimeLIME can produce (a) dramatically better explanations of what causes defects and (b) much better recommendations on how to fix buggy code.
Apart from these specific results about defect reduction and TimeLIME, the more general point of this paper is that our community should be more careful about using off-the-shelf AI tools, without first applying SE knowledge. As shown here, it may not be a complex matter to apply that knowledge. Further, once that SE knowledge is applied, this can result in dramatically better systems.
△ Less
Submitted 15 March, 2020;
originally announced March 2020.
-
The Changing Nature of Computational Science Software
Authors:
Huy Tu,
Rishabh Agrawal,
Tim Menzies
Abstract:
How should software engineering be adapted for Computational Science (CS)? If we understood that, then we could better support software sustainability, verifiability, reproducibility, comprehension, and usability for CS community. For example, improving the maintainability of the CS code could lead to: (a) faster adaptation of scientific project simulations to new and efficient hardware (multi-cor…
▽ More
How should software engineering be adapted for Computational Science (CS)? If we understood that, then we could better support software sustainability, verifiability, reproducibility, comprehension, and usability for the CS community. For example, improving the maintainability of the CS code could lead to: (a) faster adaptation of scientific project simulations to new and efficient hardware (multi-core and heterogeneous systems); (b) better support for larger teams to co-ordinate (through integration with interdisciplinary teams); and (c) an extended capability to model complex phenomena.
In order to better understand computational science, this paper uses quantitative evidence (from 59 CS projects in Github) to check 13 published beliefs about CS. These beliefs reflect on (a) the nature of scientific challenges; (b) the implications of limitations of computer hardware; and (c) the cultural environment of scientific software development. Using this new data from Github, we found that only a minority of those older beliefs can be endorsed. More than half of the pre-existing beliefs are dubious, which leads us to conclude that the nature of CS software development is changing.
Further, going forward, this has implications for (1) what kinds of tools we would propose to better support computational science and (2) research directions for both communities.
△ Less
Submitted 12 March, 2020;
originally announced March 2020.
-
Identifying Self-Admitted Technical Debts with Jitterbug: A Two-step Approach
Authors:
Zhe Yu,
Fahmid Morshed Fahid,
Huy Tu,
Tim Menzies
Abstract:
Keeping track of and managing Self-Admitted Technical Debts (SATDs) are important to maintaining a healthy software project. This requires much time and effort from human experts to identify the SATDs manually. The current automated solutions do not have satisfactory precision and recall in identifying SATDs to fully automate the process. To solve the above problems, we propose a two-step framewor…
▽ More
Keeping track of and managing Self-Admitted Technical Debts (SATDs) are important to maintaining a healthy software project. This requires much time and effort from human experts to identify the SATDs manually. The current automated solutions do not have satisfactory precision and recall in identifying SATDs to fully automate the process. To solve the above problems, we propose a two-step framework called Jitterbug for identifying SATDs. Jitterbug first identifies the "easy to find" SATDs automatically with close to 100% precision using a novel pattern recognition technique. Subsequently, machine learning techniques are applied to assist human experts in manually identifying the remaining "hard to find" SATDs with reduced human effort. Our simulation studies on ten software projects show that Jitterbug can identify SATDs more efficiently (with less human effort) than the prior state-of-the-art methods.
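A condensed sketch of the two-step idea: a few high-precision patterns catch the "easy" SATDs, then a text classifier ranks the remaining comments for human review. The comments, patterns, and the use of step-1 hits as provisional labels are simplifying assumptions of this sketch; the real Jitterbug updates its second step incrementally from human feedback.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical code comments to screen for self-admitted technical debt (SATD).
comments = [
    "TODO: remove this hack before release",
    "update the cache when the config changes",
    "FIXME this workaround breaks on Windows",
    "compute checksum of the payload",
    "temporary fix, needs a proper solution later",
    "parse the header fields",
]

# Step 1: a handful of strong patterns finds the "easy" SATDs with high precision.
easy = re.compile(r"\b(todo|fixme|hack|xxx)\b", re.IGNORECASE)
is_easy = [bool(easy.search(c)) for c in comments]

# Step 2: use the step-1 hits as provisional labels, then rank the remaining
# comments by predicted SATD probability so a human reviews the likeliest first.
vec = TfidfVectorizer().fit(comments)
clf = LogisticRegression().fit(vec.transform(comments), is_easy)
rest = [i for i, hit in enumerate(is_easy) if not hit]
ranked = sorted(rest, key=lambda i: -clf.predict_proba(vec.transform([comments[i]]))[0, 1])
print("human review order:", [comments[i] for i in ranked])
```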
△ Less
Submitted 16 October, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Assessing Practitioner Beliefs about Software Defect Prediction
Authors:
N. C. Shrikanth,
Tim Menzies
Abstract:
Just because software developers say they believe in "X", that does not necessarily mean that "X" is true. As shown here, there exist numerous beliefs listed in the recent Software Engineering literature which are only supported by small portions of the available data. Hence we ask what is the source of this disconnect between beliefs and evidence?. To answer this question we look for evidence for…
▽ More
Just because software developers say they believe in "X", that does not necessarily mean that "X" is true. As shown here, there exist numerous beliefs listed in the recent Software Engineering literature which are only supported by small portions of the available data. Hence we ask: what is the source of this disconnect between beliefs and evidence? To answer this question we look for evidence for ten beliefs within 300,000+ changes seen in dozens of open-source projects. Some of those beliefs had strong support across all the projects; specifically, "A commit that involves more added and removed lines is more bug-prone" and "Files with fewer lines contributed by their owners (who contribute most changes) are bug-prone". Most of the widely-held beliefs studied are only sporadically supported in the data; i.e. large effects can appear in project data and then disappear in subsequent releases. Such sporadic support explains why developers believe things that were relevant to their prior work, but not necessarily their current work. Our conclusion will be that we need to change the nature of the debate within Software Engineering. Specifically, while it is important to report the effects that hold right now, it is also important to report on what effects change over time.
△ Less
Submitted 8 April, 2020; v1 submitted 20 December, 2019;
originally announced December 2019.
-
Sequential Model Optimization for Software Process Control
Authors:
Tianpei Xia,
Rui Shu,
Xipeng Shen,
Tim Menzies
Abstract:
Many methods have been proposed to estimate how much effort is required to build and maintain software. Much of that research assumes a ``classic'' waterfall-based approach rather than contemporary projects (where the developing process may be more iterative than linear in nature). Also, much of that work tries to recommend a single method-- an approach that makes the dubious assumption that one m…
▽ More
Many methods have been proposed to estimate how much effort is required to build and maintain software. Much of that research assumes a ``classic'' waterfall-based approach rather than contemporary projects (where the developing process may be more iterative than linear in nature). Also, much of that work tries to recommend a single method-- an approach that makes the dubious assumption that one method can handle the diversity of software project data.
To address these drawbacks, we apply a configuration technique called ``ROME'' (Rapid Optimizing Methods for Estimation), which uses sequential model-based optimization (SMO) to find what combination of effort estimation techniques works best for a particular data set. We test this method using data from 1161 classic waterfall projects and 120 contemporary projects (from Github). In terms of magnitude of relative error and standardized accuracy, we find that ROME achieves better performance than existing state-of-the-art methods for both classic and contemporary problems. In addition, we conclude that we should not recommend one method for estimation. Rather, it is better to search through a wide range of different methods to find what works best for local data.
To the best of our knowledge, this is the largest effort estimation experiment yet attempted and the only one to test its methods on classic and contemporary projects.
△ Less
Submitted 17 February, 2020; v1 submitted 9 December, 2019;
originally announced December 2019.
-
Simpler Hyperparameter Optimization for Software Analytics: Why, How, When?
Authors:
Amritanshu Agrawal,
Xueqi Yang,
Rishabh Agrawal,
Rahul Yedida,
Xipeng Shen,
Tim Menzies
Abstract:
How can we make software analytics simpler and faster? One method is to match the complexity of analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by "DODGE-ing"; i.e. simply st…
▽ More
How can we make software analytics simpler and faster? One method is to match the complexity of analysis to the intrinsic complexity of the data being explored. For example, hyperparameter optimizers find the control settings for data miners that improve the predictions generated via software analytics. Sometimes, very fast hyperparameter optimization can be achieved by "DODGE-ing"; i.e. simply steering away from settings that lead to similar conclusions. But when is it wise to use that simple approach and when must we use more complex (and much slower) optimizers? To answer this, we applied hyperparameter optimization to 120 SE data sets that explored bad smell detection, predicting Github issue close time, bug report analysis, defect prediction, and dozens of other non-SE problems. We find that the simple DODGE works best for data sets with low "intrinsic dimensionality" (u ~ 3) and very poorly for higher-dimensional data (u > 8). Nearly all the SE data seen here was intrinsically low-dimensional, indicating that DODGE is applicable for many SE analytics tasks.
△ Less
Submitted 22 April, 2021; v1 submitted 9 December, 2019;
originally announced December 2019.
-
Methods for Stabilizing Models across Large Samples of Projects (with case studies on Predicting Defect and Project Health)
Authors:
Suvodeep Majumder,
Tianpei Xia,
Rahul Krishna,
Tim Menzies
Abstract:
Despite decades of research, SE lacks widely accepted models (that offer precise quantitative stable predictions) about what factors most influence software quality. This paper provides a promising result showing such stable models can be generated using a new transfer learning framework called "STABILIZER". Given a tree of recursively clustered projects (using project meta-data), STABILIZER promo…
▽ More
Despite decades of research, SE lacks widely accepted models (that offer precise quantitative stable predictions) about what factors most influence software quality. This paper provides a promising result showing such stable models can be generated using a new transfer learning framework called "STABILIZER". Given a tree of recursively clustered projects (using project meta-data), STABILIZER promotes a model upwards if it performs best in the lower clusters (stopping when the promoted model performs worse than the models seen at a lower level).
The number of models found by STABILIZER is minimal: one for defect prediction (756 projects) and less than a dozen for project health (1628 projects). Hence, via STABILIZER, it is possible to find a few projects which can be used for transfer learning and make conclusions that hold across hundreds of projects at a time. Further, the models produced in this manner offer predictions that perform as well or better than the prior state-of-the-art.
To the best of our knowledge, STABILIZER is orders of magnitude faster than the prior state-of-the-art transfer learners which seek to find conclusion stability, and these case studies are the largest demonstration of the generalizability of quantitative predictions of project quality yet reported in the SE literature.
In order to support open science, all our scripts and data are online at https://github.com/Anonymous633671/STABILIZER.
△ Less
Submitted 21 March, 2022; v1 submitted 6 November, 2019;
originally announced November 2019.
-
How to Better Distinguish Security Bug Reports (using Dual Hyperparameter Optimization)
Authors:
Rui Shu,
Tianpei Xia,
Jianfeng Chen,
Laurie Williams,
Tim Menzies
Abstract:
Background: In order that the general public is not vulnerable to hackers, security bug reports need to be handled by small groups of engineers before being widely discussed. But learning how to distinguish the security bug reports from other bug reports is challenging since they may occur rarely. Data mining methods that can find such scarce targets require extensive optimization effort.
Goal:…
▽ More
Background: In order that the general public is not vulnerable to hackers, security bug reports need to be handled by small groups of engineers before being widely discussed. But learning how to distinguish the security bug reports from other bug reports is challenging since they may occur rarely. Data mining methods that can find such scarce targets require extensive optimization effort.
Goal: The goal of this research is to aid practitioners as they struggle to optimize methods that try to distinguish between rare security bug reports and other bug reports.
Method: Our proposed method, called Swift, is a dual optimizer that optimizes both learner and pre-processor options. Since this is a large space of options, Swift uses a technique called epsilon-dominance that learns how to avoid operations that do not significantly improve performance.
Result: When compared to recent state-of-the-art results (from FARSEC which is published in TSE'18), we find that Swift's dual optimization of both pre-processor and learner is more useful than optimizing each of them individually. For example, in a study of security bug reports from the Chromium dataset, the median recalls of FARSEC and Swift were 15.7% and 77.4%, respectively. For another example, in experiments with data from the Ambari project, the median recalls improved from 21.5% to 85.7% (FARSEC to Swift).
Conclusion: Overall, our approach can quickly optimize models that achieve better recalls than the prior state-of-the-art. These increases in recall are associated with moderate increases in false positive rates (from 8% to 24%, median). For future work, these results suggest that dual optimization is both practical and useful.
△ Less
Submitted 17 March, 2021; v1 submitted 4 November, 2019;
originally announced November 2019.
-
Whence to Learn? Transferring Knowledge in Configurable Systems using BEETLE
Authors:
Rahul Krishna,
Vivek Nair,
Pooyan Jamshidi,
Tim Menzies
Abstract:
As software systems grow in complexity and the space of possible configurations increases exponentially, finding the near-optimal configuration of a software system becomes challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, collecting enough sample configurations can be very expensive since each such sample require…
▽ More
As software systems grow in complexity and the space of possible configurations increases exponentially, finding the near-optimal configuration of a software system becomes challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, collecting enough sample configurations can be very expensive since each such sample requires configuring, compiling, and executing the entire system using a complex test suite. When learning on new data is too expensive, it is possible to use \textit{Transfer Learning} to "transfer" old lessons to the new context. Traditional transfer learning has a number of challenges, specifically, (a) learning from excessive data takes excessive time, and (b) the performance of the models built via transfer can deteriorate as a result of learning from a poor source. To resolve these problems, we propose a novel transfer learning framework called BEETLE, which is a "bellwether"-based transfer learner that focuses on identifying and learning from the most relevant source from amongst the old data. This paper evaluates BEETLE with 57 different software configuration problems based on five software systems (a video encoder, an SAT solver, a SQL database, a high-performance C-compiler, and a streaming data analytics tool). In each of these cases, BEETLE found configurations that are as good as or better than those found by other state-of-the-art transfer learners while requiring only a fraction ($\frac{1}{7}$th) of the measurements needed by those other methods. Based on these results, we say that BEETLE is a new high-water mark in optimally configuring software.
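The bellwether hunt at the heart of BEETLE can be sketched as follows: train a performance model on each candidate source environment, test it on the others, and keep the source that transfers best; later targets are then configured using that one source. The synthetic environments and decision-tree surrogate are assumptions of this sketch, not the paper's setup.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

def make_env(shift):
    """Hypothetical (configuration -> performance) data for one environment."""
    X = rng.random((200, 5))
    y = X @ np.array([3.0, 1.0, 0.5, 0.1, 0.05]) + shift * X[:, 0] ** 2
    return X, y

sources = {name: make_env(s) for name, s in [("env_a", 0.1), ("env_b", 0.2), ("env_c", 2.0)]}

# Bellwether hunt: train on each source, test on every other source, and keep
# the source whose model transfers best (highest median R^2 elsewhere).
def transfer_score(src):
    Xs, ys = sources[src]
    model = DecisionTreeRegressor(random_state=0).fit(Xs, ys)
    return np.median([r2_score(yt, model.predict(Xt))
                      for name, (Xt, yt) in sources.items() if name != src])

bellwether = max(sources, key=transfer_score)
print("bellwether source:", bellwether)
# New targets would then be optimized using a model learned from `bellwether` alone.
```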
Submitted 25 March, 2020; v1 submitted 1 November, 2019;
originally announced November 2019.
-
Understanding Static Code Warnings: an Incremental AI Approach
Authors:
Xueqi Yang,
Zhe Yu,
Junjie Wang,
Tim Menzies
Abstract:
Knowledge-based systems reason over some knowledge base. Hence, an important issue for such systems is how to acquire the knowledge needed for their inference. This paper assesses active learning methods for acquiring knowledge for "static code warnings".
Static code analysis is a widely-used method for detecting bugs and security vulnerabilities in software systems. As software becomes more complex, analysis tools also report lists of increasingly complex warnings that developers need to address on a daily basis. Such static code analysis tools are usually over-cautious; i.e. they often offer many warnings about spurious issues. Previous research shows that about 35% to 91% of warnings reported as bugs by static analysis tools are actually unactionable (i.e., warnings that would not be acted on by developers because they are falsely suggested as bugs).
Experienced developers know which errors are important and which can be safely ignored. How can we capture that experience? This paper reports on an incremental AI tool that watches humans reading false alarm reports. Using an incremental support vector machine mechanism, this AI tool can quickly learn to distinguish spurious false alarms from more serious matters that deserve further attention.
In this work, nine open-source projects are used to evaluate our proposed model on features extracted by previous researchers, identifying actionable warnings in the priority order given by our algorithm. We observe that our model can identify over 90% of actionable warnings while telling humans they can safely ignore 70 to 80% of the warnings.
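The following is a minimal Python sketch of such a human-in-the-loop triage loop, under several assumptions: a pre-computed feature matrix, a fixed batch size, and an SVM that is simply refit each round (rather than incrementally updated, as in the paper).

import numpy as np
from sklearn.svm import LinearSVC

def triage_warnings(X, ask_human, rounds=20, batch=10, seed=0):
    # X: feature matrix of warnings; ask_human(i) -> 1 (actionable) or 0 (ignore).
    rng = np.random.default_rng(seed)
    labelled = {int(i): ask_human(int(i))
                for i in rng.choice(len(X), size=batch, replace=False)}
    for _ in range(rounds):
        rest = [i for i in range(len(X)) if i not in labelled]
        if not rest:
            break
        if len(set(labelled.values())) < 2:          # need both classes before fitting
            picks = rng.choice(rest, size=min(batch, len(rest)), replace=False)
        else:
            idx = list(labelled)
            svm = LinearSVC().fit(X[idx], [labelled[i] for i in idx])
            order = np.argsort(-svm.decision_function(X[rest]))   # likely-actionable first
            picks = [rest[j] for j in order[:batch]]
        labelled.update({int(i): ask_human(int(i)) for i in picks})
    return labelled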
Submitted 22 October, 2020; v1 submitted 4 November, 2019;
originally announced November 2019.
-
Assessing Expert System-Assisted Literature Reviews With a Case Study
Authors:
Zhe Yu,
Jeffrey C. Carver,
Gregg Rothermel,
Tim Menzies
Abstract:
Given the large number of publications in software engineering, frequent literature reviews are required to keep current on work in specific areas. One tedious task in literature reviews is finding relevant studies amongst thousands of non-relevant search results. In theory, expert systems can assist in finding relevant work, but those systems have primarily been tested in simulations rather than in application to actual literature reviews. Hence, few researchers have faith in such expert systems. Accordingly, using a realistic case study, this paper assesses how well our state-of-the-art expert system can help with literature reviews.
The assessed literature review aimed at identifying test case prioritization techniques for automated UI testing, specifically from 8,349 papers on IEEE Xplore. This corpus was studied with an expert system that incorporates an incrementally updated human-in-the-loop active learning tool. Using that expert system, in three hours, we found 242 relevant papers from which we identified 12 techniques representing the state-of-the-art in test case prioritization when source code information is not available. These results were then validated by six other graduate students manually exploring the same corpus. Without the expert system, this task would have required 53 hours and would have found 27 additional papers. That is, our expert system achieved 90% recall with 6% of the human effort cost when compared to a conventional manual method. Significantly, the same 12 state-of-the-art test case prioritization techniques were identified by both the expert system and the manual method. That is, the 27 papers missed by the expert system would not have changed the conclusion of the literature review.
Hence, if this result generalizes, it endorses the use of our expert system to assist in literature reviews.
Submitted 8 April, 2022; v1 submitted 16 September, 2019;
originally announced September 2019.
-
Better Technical Debt Detection via SURVEYing
Authors:
Fahmid M. Fahid,
Zhe Yu,
Tim Menzies
Abstract:
Software analytics can be improved by surveying; i.e. rechecking and (possibly) revising the labels offered by prior analysis. Surveying is a time-consuming task and effective surveyors must carefully manage their time. Specifically, they must balance the cost of further surveying against the additional benefits of that extra effort. This paper proposes SURVEY0, an incremental Logistic Regression estimation method that implements cost/benefit analysis. Some classifier is used to rank the as-yet-unvisited examples according to how interesting they might be. Humans then review the most interesting examples, after which their feedback is used to update an estimate of how many interesting examples remain. This paper evaluates SURVEY0 in the context of self-admitted technical debt. As software projects mature, they can accumulate "technical debt"; i.e. developer decisions which are sub-optimal and decrease the overall quality of the code. Such decisions are often commented on by programmers in the code; i.e. it is self-admitted technical debt (SATD). Recent results show that text classifiers can automatically detect such debt. We find that we can significantly outperform prior results by SURVEYing the data. Specifically, for ten open-source JAVA projects, we can find 83% of the technical debt via SURVEY0 using just 16% of the comments (and if higher levels of recall are required, SURVEY0 can adjust towards that with some additional effort).
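One simple way to operationalize such a cost/benefit stopping rule is sketched below; it is an illustrative assumption, not SURVEY0's actual estimator: sum the classifier's predicted probabilities over the unreviewed comments to estimate how much debt remains, and stop once the target recall appears to be reached.

def estimated_remaining(classifier, X_unreviewed):
    # classifier must expose predict_proba; column 1 = probability of "is technical debt".
    return float(classifier.predict_proba(X_unreviewed)[:, 1].sum())

def should_stop(classifier, X_unreviewed, found_so_far, target_recall=0.83):
    # Stop once the debt found so far looks like >= target_recall of the estimated total.
    est_total = found_so_far + estimated_remaining(classifier, X_unreviewed)
    return est_total > 0 and found_so_far >= target_recall * est_total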
Submitted 20 May, 2019;
originally announced May 2019.
-
TERMINATOR: Better Automated UI Test Case Prioritization
Authors:
Zhe Yu,
Fahmid M. Fahid,
Tim Menzies,
Gregg Rothermel,
Kyle Patrick,
Snehit Cherian
Abstract:
Automated UI testing is an important component of the continuous integration process of software development. A modern web-based UI is an amalgam of reports from dozens of microservices written by multiple teams. Queries on a page that opens up another will fail if any of that page's microservices fails. As a result, the overall cost for automated UI testing is high since the UI elements cannot be tested in isolation. For example, the entire automated UI testing suite at LexisNexis takes around 30 hours (3-5 hours on the cloud) to execute, which slows down the continuous integration process.
To mitigate this problem and give developers faster feedback on their code, test case prioritization techniques are used to reorder the automated UI test cases so that more failures can be detected earlier. Given that much of the automated UI testing is "black box" in nature, very little information (only the test case descriptions and testing results) can be utilized to prioritize these automated UI test cases. Hence, this paper evaluates 17 "black box" test case prioritization approaches that do not rely on source code information. Among these, we propose a novel TCP approach that dynamically re-prioritizes the test cases when new failures are detected, by applying and adapting a state-of-the-art framework from the total recall problem. Experimental results on LexisNexis automated UI testing data show that our new approach (which we call TERMINATOR) outperformed prior state-of-the-art approaches in terms of failure detection rates with negligible CPU overhead.
Submitted 18 June, 2019; v1 submitted 16 May, 2019;
originally announced May 2019.
-
Better Security Bug Report Classification via Hyperparameter Optimization
Authors:
Rui Shu,
Tianpei Xia,
Laurie Williams,
Tim Menzies
Abstract:
When security bugs are detected, they should be (a)~discussed privately by security software engineers; and (b)~not mentioned to the general public until security patches are available. Software engineers usually report bugs to a bug tracking system and label them as security bug reports (SBRs) or non-security bug reports (NSBRs); SBRs have a higher priority than NSBRs since they must be fixed before attackers can exploit them. Yet suspected security bug reports are often publicly disclosed because of mislabelling issues (i.e., security bug reports mislabelled as non-security bug reports). The goal of this paper is to aid software developers to better classify bug reports that identify security vulnerabilities as security bug reports, through parameter tuning of learners and data pre-processors. Previous work has applied text analytics and machine learning learners to classify which reported bugs are security related. We improve on that work, as shown by our analysis of five open source projects. We apply hyperparameter optimization to (a)~the control parameters of a learner; and (b)~the data pre-processing methods that handle the case where the target class is a small fraction of all the data. We show that optimizing the pre-processor is more useful than optimizing the learners. We also show that improvements gained from our approach can be very large. For example, using the same data sets as recently analyzed by our baseline approach, we show that adjusting the data pre-processing results in improvements to classification recall of 35% to 65% (median to max) with a moderate increase in false positive rate.
Submitted 16 May, 2019;
originally announced May 2019.
-
Predicting Breakdowns in Cloud Services (with SPIKE)
Authors:
Jianfeng Chen,
Joymallya Chakraborty,
Philip Clark,
Kevin Haverlock,
Snehit Cherian,
Tim Menzies
Abstract:
Maintaining web services is a mission-critical task where any down-time means loss of revenue and reputation (of being a reliable service provider). In the current competitive web services market, such a loss of reputation causes extensive loss of future revenue. To address this issue, we developed SPIKE, a data mining tool which can predict upcoming service breakdowns half an hour into the future. Such predictions let an organization alert and assemble a tiger team to address the problem (e.g. by reconfiguring cloud hardware in order to reduce the likelihood of that breakdown). SPIKE utilizes (a) regression tree learning (with CART); (b) synthetic minority over-sampling (to handle how rare spikes are in our data); (c) hyperparameter optimization (to learn the best settings for our local data); and (d) a technique we call "topology sampling", where training vectors are built from extensive details of an individual node plus summary details on all of its neighbors. In the experiments reported here, SPIKE predicted service spikes 30 minutes into the future with recall and precision of 75% and above. Also, SPIKE performed better than other widely-used learning methods (neural nets, random forests, logistic regression).
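A minimal Python sketch of the learner side of this recipe is shown below; the feature layout, the spike threshold, and the naive duplication-based oversampling (a crude stand-in for synthetic minority over-sampling) are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_spike_model(X, y, spike_threshold, copies=10):
    # X: per-node metrics over the last half hour; y: load measured 30 minutes later.
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    rare = y >= spike_threshold                    # spikes are rare in the training data
    X_aug = np.vstack([X] + [X[rare]] * copies)    # duplicate the rare rows
    y_aug = np.concatenate([y] + [y[rare]] * copies)
    return DecisionTreeRegressor(min_samples_leaf=5).fit(X_aug, y_aug)   # CART regression tree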
Submitted 14 June, 2019; v1 submitted 15 May, 2019;
originally announced May 2019.
-
Software Engineering for Fairness: A Case Study with Hyperparameter Optimization
Authors:
Joymallya Chakraborty,
Tianpei Xia,
Fahmid M. Fahid,
Tim Menzies
Abstract:
We assert that it is the ethical duty of software engineers to strive to reduce software discrimination. This paper discusses how that might be done. This is an important topic since machine learning software is increasingly being used to make decisions that affect people's lives. Potentially, the application of that software will result in fairer decisions because (unlike humans) machine learning software is not biased. However, recent results show that the software within many data mining packages exhibits "group discrimination"; i.e. their decisions are inappropriately affected by "protected attributes" (e.g., race, gender, age, etc.).
There has been much prior work on validating the fairness of machine-learning models (by recognizing when such software discrimination exists). But after detection, comes mitigation. What steps can ethical software engineers take to reduce discrimination in the software they produce?
This paper shows that making \textit{fairness} a goal during hyperparameter optimization can (a) preserve the predictive power of a model learned from a data miner while also (b) generating fairer results. To the best of our knowledge, this is the first application of hyperparameter optimization as a tool for software engineers to generate fairer software.
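A minimal sketch of "fairness as a goal" during tuning might look like the following; the learner, the tiny search grid, and the accuracy-minus-disparity score are illustrative assumptions, not the paper's exact setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def disparity(pred, group):
    # |P(pred=1 | group=0) - P(pred=1 | group=1)| over a validation set.
    pred, group = np.asarray(pred), np.asarray(group)
    return abs(pred[group == 0].mean() - pred[group == 1].mean())

def fair_tune(X_tr, y_tr, X_va, y_va, group_va, weight=1.0):
    best_model, best_score = None, -np.inf
    for C in (0.01, 0.1, 1.0, 10.0):               # assumed, tiny hyperparameter grid
        model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
        pred = model.predict(X_va)
        score = accuracy_score(y_va, pred) - weight * disparity(pred, group_va)
        if score > best_score:
            best_model, best_score = model, score
    return best_model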
Submitted 30 October, 2019; v1 submitted 14 May, 2019;
originally announced May 2019.
-
Building Very Small Test Suites (with Snap)
Authors:
Jianfeng Chen,
Xipeng Shen,
Tim Menzies
Abstract:
Software is now so large and complex that additional architecture is needed to guide theorem provers as they try to generate test suites. For example, the SNAP test suite generator (introduced in this paper) combines the Z3 theorem prover with the following tactic: sample around the average values seen in a few randomly selected valid tests. This tactic is remarkably effective. For 27 real-world programs with up to half a million variables, SNAP found test suites which were 10 to 750 times smaller than those found by the prior state-of-the-art. Also, SNAP ran orders of magnitude faster and (unlike prior work) generated 100% valid tests.
Submitted 14 July, 2020; v1 submitted 13 May, 2019;
originally announced May 2019.
-
Better Data Labelling with EMBLEM (and how that Impacts Defect Prediction)
Authors:
Huy Tu,
Zhe Yu,
Tim Menzies
Abstract:
Standard automatic methods for recognizing problematic development commits can be greatly improved via the incremental application of human+artificial expertise. In this approach, called EMBLEM, an AI tool first explores the software development process to label commits that are most problematic. Humans then apply their expertise to check those labels (perhaps resulting in the AI updating the support vectors within their SVM learner). We recommend this human+AI partnership for several reasons. When a new domain is encountered, EMBLEM can learn better ways to label which commits refer to real problems. Also, in studies with 9 open source software projects, labelling via EMBLEM's incremental application of human+AI is at least an order of magnitude cheaper than existing methods ($\approx$ eight times). Further, EMBLEM is very effective. For the data sets explored here, EMBLEM's better labelling significantly improved $P_{opt}20$ and G-score performance in nearly all the projects studied here.
Submitted 7 April, 2020; v1 submitted 5 May, 2019;
originally announced May 2019.
-
Communication and Code Dependency Effects on Software Code Quality: An Empirical Analysis of Herbsleb Hypothesis
Authors:
Suvodeep Majumder,
Joymallya Chakraborty,
Amritanshu Agrawal,
Tim Menzies
Abstract:
Prior literature has suggested that in many projects 80% or more of the contributions are made by a small group of around 20% of the development team. Most prior studies deprecate a reliance on such a small inner group of "heroes", arguing that it causes bottlenecks in development and communication. Despite this, such projects are very common in open source. So what exactly is the impact of "heroes" on code quality?
Herbsleb argues that if code is strongly connected yet its developers are not, then that code will be buggy. To test the Herbsleb hypothesis, we develop and apply two metrics of (a) "social-ness" and (b) "hero-ness" that measure (a) how much one developer comments on the issues of another; and (b) how much one developer changes another developer's code (and "heroes" are those that change the most code, all around the system). In a result endorsing the Herbsleb hypothesis, in over 1000 open source projects, we find that "social-ness" is a statistically stronger indicator of code quality (number of bugs) than "hero-ness".
Hence we say that debates over the merits of "hero-ness" are subtly misguided. Our results suggest that the real benefit of these so-called "heroes" is not so much the code they generate but the pattern of communication required when the interaction between a large community of programmers passes through a small group of centralized developers. To say that another way: to build better code, build better communication flows between core developers and the rest.
In order to allow other researchers to confirm/improve/refute our results, all our scripts and data are available, on-line at https://github.com/Anonymous633671/A-Comparison-on-Communication-and-Code-Dependency-Effects-on-Software-Code-Quality.
Submitted 21 March, 2022; v1 submitted 22 April, 2019;
originally announced April 2019.
-
Assessing Developer Beliefs: A Reply to "Perceptions, Expectations, and Challenges in Defect Prediction"
Authors:
Shrikanth N. C.,
Tim Menzies
Abstract:
It can be insightful to extend qualitative studies with a secondary quantitative analysis (where the former suggests insightful questions that the latter can answer). Documenting developer beliefs should be the start, not the end, of Software Engineering research. Once prevalent beliefs are found, they should be checked against real-world data. For example, this paper finds several notable discrepancies between empirical evidence and the developer beliefs documented in Wan et al.'s recent TSE paper "Perceptions, expectations, and challenges in defect prediction". By reporting these discrepancies we can stop developers (a) wasting time on inconsequential matters or (b) ignoring important effects. For the future, we would encourage more "extension studies" of prior qualitative results with quantitative empirical evidence.
Submitted 11 April, 2019;
originally announced April 2019.
-
Replication Can Improve Prior Results: A GitHub Study of Pull Request Acceptance
Authors:
Di Chen,
Kathyrn Stolee,
Tim Menzies
Abstract:
Crowdsourcing and data mining can be used to effectively reduce the effort associated with the partial replication and enhancement of qualitative studies.
For example, in a primary study, other researchers explored factors influencing the fate of GitHub pull requests using an extensive qualitative analysis of 20 pull requests. Guided by their findings, we mapped some of their qualitative insights onto quantitative questions. To determine how well their findings generalize, we collected much more data (170 additional pull requests from 142 GitHub projects). Using crowdsourcing, that data was augmented with subjective qualitative human opinions about how pull requests extended the original issue. The crowd's answers were then combined with quantitative features and, using data mining, used to build a predictor for whether code would be merged. That predictor was far more accurate than one built from the primary study's qualitative factors (F1 = 90% vs 68%), illustrating the value of a mixed-methods approach and replication to improve prior results.
To test the generality of this approach, the next step in future work is to conduct other studies that extend qualitative studies with crowdsourcing and data mining.
Submitted 8 February, 2019;
originally announced February 2019.
-
How to "DODGE" Complex Software Analytics?
Authors:
Amritanshu Agrawal,
Wei Fu,
Di Chen,
Xipeng Shen,
Tim Menzies
Abstract:
Machine learning techniques applied to software engineering tasks can be improved by hyperparameter optimization, i.e., automatic tools that find good settings for a learner's control parameters.
We show that such hyperparameter optimization can be unnecessarily slow, particularly when the optimizers waste time exploring "redundant tunings", i.e., pairs of tunings which lead to indistinguishable results. By ignoring redundant tunings, DODGE, a tuning tool, runs orders of magnitude faster, while also generating learners with more accurate predictions than seen in prior state-of-the-art approaches.
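A minimal Python sketch of the "ignore redundant tunings" idea follows; the option table, the scoring function, and the epsilon value are illustrative assumptions rather than DODGE's actual design.

import random

def dodge_like(options, score, budget=30, epsilon=0.2, seed=0):
    # options: dict name -> list of values; score(picked_options) -> float (bigger is better).
    random.seed(seed)
    seen, deprecated, best = [], set(), (float("-inf"), None)
    for _ in range(budget):
        pick = {k: random.choice([v for v in vs if (k, v) not in deprecated] or vs)
                for k, vs in options.items()}
        s = score(pick)
        if any(abs(s - old) < epsilon for old in seen):   # result is indistinguishable
            deprecated.update(pick.items())               # stop sampling these options
        else:
            seen.append(s)
        if s > best[0]:
            best = (s, pick)
    return best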
Submitted 1 December, 2019; v1 submitted 5 February, 2019;
originally announced February 2019.
-
Better Software Analytics via "DUO": Data Mining Algorithms Using/Used-by Optimizers
Authors:
Amritanshu Agrawal,
Tim Menzies,
Leandro L. Minku,
Markus Wagner,
Zhe Yu
Abstract:
This paper claims that a new field of empirical software engineering research and practice is emerging: data mining using/used-by optimizers for empirical studies, or DUO. For example, data miners can generate models that are explored by optimizers. Also, optimizers can advise how to best adjust the control parameters of a data miner. This combined approach acts like an agent leaning over the shoulder of an analyst, advising "ask this question next" or "ignore that problem, it is not relevant to your goals". Further, those agents can help us build "better" predictive models, where "better" can be either greater predictive accuracy or faster modeling time (which, in turn, enables the exploration of a wider range of options). We also caution that the era of papers that just use data miners is coming to an end. Results obtained from an unoptimized data miner can be quickly refuted, just by applying an optimizer to produce a different (and better performing) model. Our conclusion, hence, is that for software analytics it is possible, useful and necessary to combine data mining and optimization using DUO.
Submitted 13 December, 2019; v1 submitted 4 December, 2018;
originally announced December 2018.
-
Total Recall, Language Processing, and Software Engineering
Authors:
Zhe Yu,
Tim Menzies
Abstract:
A broad class of software engineering problems can be generalized as the "total recall problem". This short paper claims that identifying and exploring total recall language processing problems in software engineering is an important task with wide applicability.
To make that case, we show that, by applying and adapting state-of-the-art active learning and text mining, solutions to the total recall problem can help solve two important software engineering tasks: (a) supporting large literature reviews and (b) identifying software security vulnerabilities. Furthermore, we conjecture that (c) test case prioritization and (d) static warning identification can also be categorized as total recall problems.
The widespread applicability of "total recall" to software engineering suggests that there exists some underlying framework that encompasses not just natural language processing, but a wide range of important software engineering tasks.
Submitted 31 August, 2018;
originally announced September 2018.
-
Better Metrics for Ranking SE Researchers
Authors:
George Mathew,
Tim Menzies
Abstract:
This paper studies how SE researchers are ranked using a variety of metrics and data from 35,406 authors of 35,391 papers from 34 top SE venues in the period 1992-2016. Based on that analysis, we deprecate the widely used "h-index", favoring instead an alternate Weighted PageRank (PR_W) metric that is somewhat analogous to the PageRank (PR) metric developed at Google. Unlike the h-index, PR_W rewards not just citation counts but also how often authors collaborate. Using PR_W, we offer a ranking of the top 20 SE authors in the last decade.
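For illustration, a citation-and-collaboration ranking in this spirit can be sketched in a few lines with networkx; the personalization-by-citations scheme below is an assumption, not necessarily the paper's exact PR_W formula.

import networkx as nx

def rank_authors(coauthor_pairs, citations):
    # coauthor_pairs: iterable of (author_a, author_b); citations: dict author -> citation count.
    g = nx.Graph()
    for a, b in coauthor_pairs:
        w = g[a][b]["weight"] + 1 if g.has_edge(a, b) else 1
        g.add_edge(a, b, weight=w)                  # edge weight = number of joint papers
    bias = {n: citations.get(n, 0) + 1 for n in g}  # +1 keeps the personalization vector nonzero
    return nx.pagerank(g, weight="weight", personalization=bias)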
Submitted 29 May, 2018;
originally announced May 2018.
-
Crowdtesting : When is The Party Over?
Authors:
Junjie Wang,
Ye Yang,
Zhe Yu,
Tim Menzies,
Qing Wang
Abstract:
Trade-offs such as "how much testing is enough" are critical yet challenging project decisions in software engineering. Most existing approaches adopt risk-driven or value-based analysis to prioritize test cases and minimize test runs. However, none of these is applicable to the emerging crowd testing paradigm where task requesters typically have no control over online crowdworkers' dynamic behavior and uncertain performance.
In current practice, deciding when to close a crowdtesting task is largely done by guesswork due to a lack of decision support. This paper intends to fill this gap by introducing automated decision support for monitoring and determining the appropriate time to close crowdtesting tasks.
First, this paper investigates the necessity and feasibility of close prediction of crowdtesting tasks based on an industrial dataset. Then, it designs 8 methods for close prediction based on various models, including bug trend, bug arrival, and capture-recapture models. Finally, the evaluation is conducted on 218 crowdtesting tasks from one of the largest crowdtesting platforms in China, and the results show that a median of 91% of bugs can be detected while saving 49% of the cost.
Submitted 8 May, 2018;
originally announced May 2018.
-
Cutting Away the Confusion From Crowdtesting
Authors:
Junjie Wang,
Mingyang Li,
Song Wang,
Tim Menzies,
Qing Wang
Abstract:
Crowdtesting is especially effective when it comes to feedback on GUI systems, or subjective opinions about features. Despite this, we find crowdtesting reports are highly replicated, i.e., 82% of them are replicates of others. Hence, automatically detecting replicate reports could help reduce triaging efforts. Most existing approaches mainly adopt textual information for replicate detection and suffer from low accuracy because of the expression gap. Our observation of real industrial crowdtesting data found that crowdtesting reports of GUI systems are often accompanied by images, i.e., screenshots of the app. We assume the screenshot to be valuable for replicate crowdtesting report detection because it reflects the real scenario of the failure and is not affected by the variety of natural languages.
In this work, we propose a replicate detection approach, TSDetector, which combines information from the screenshots and the textual descriptions to detect replicate crowdtesting reports. We extract four types of features to characterize the screenshots and the textual descriptions, and design an algorithm to detect replicates based on four similarity scores derived from the four different features respectively. We investigate the effectiveness and advantage of TSDetector on 15 commercial projects with 4,172 reports from one of China's largest crowdtesting platforms. Results show that TSDetector significantly outperforms existing state-of-the-art approaches. In addition, we also evaluate its usefulness using real-world case studies. The feedback from real-world testers demonstrates its practical value.
Submitted 13 November, 2018; v1 submitted 7 May, 2018;
originally announced May 2018.
-
Effective Automated Decision Support for Managing Crowdtesting
Authors:
Junjie Wang,
Ye Yang,
Rahul Krishna,
Tim Menzies,
Qing Wang
Abstract:
Crowdtesting has grown to be an effective alternative to traditional testing, especially for mobile apps. However, crowdtesting is hard to manage in nature. Given the complexity of mobile applications and the unpredictability of the distributed, parallel crowdtesting process, it is difficult to estimate (a) the remaining number of bugs as yet undetected or (b) the required cost to find those bugs. Experience-based decisions may result in an ineffective crowdtesting process.
This paper aims at exploring automated decision support to effectively manage the crowdtesting process. The proposed ISENSE applies an incremental sampling technique to process crowdtesting reports arriving in chronological order, organizes them into fixed-size groups as dynamic inputs, and predicts two test completion indicators in an incremental manner. The two indicators are: 1) the total number of bugs, predicted with a Capture-ReCapture (CRC) model, and 2) the required test cost for achieving certain test objectives, predicted with an AutoRegressive Integrated Moving Average (ARIMA) model. We assess ISENSE using 46,434 reports of 218 crowdtesting tasks from one of the largest crowdtesting platforms in China. Its effectiveness is demonstrated through two applications for automating crowdtesting management, i.e., automation of the task-closing decision, and semi-automation of task-closing trade-off analysis. The results show that decision automation using ISENSE will provide managers with greater opportunities to achieve cost-effectiveness gains in crowdtesting. Specifically, a median of 100% of bugs can be detected with 30% of the cost saved, based on the automated close prediction.
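The capture-recapture idea behind such total-bug estimates can be sketched with the simple Lincoln-Petersen form (ISENSE's actual CRC model may differ): if two independent samples of bug reports overlap heavily, few bugs remain unseen.

def lincoln_petersen(first_capture, second_capture):
    # first_capture, second_capture: ids of distinct bugs seen in two sampling periods.
    first, second = set(first_capture), set(second_capture)
    overlap = len(first & second)
    if overlap == 0:
        return None                                 # cannot estimate without recaptured bugs
    return len(first) * len(second) / overlap       # estimated total number of distinct bugs

# e.g. 40 bugs in week one, 35 in week two, 28 seen in both -> about 50 bugs in total
print(lincoln_petersen(range(40), range(12, 47)))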
Submitted 7 May, 2018;
originally announced May 2018.
-
Hyperparameter Optimization for Effort Estimation
Authors:
Tianpei Xia,
Rahul Krishna,
Jianfeng Chen,
George Mathew,
Xipeng Shen,
Tim Menzies
Abstract:
Software analytics has been widely used in software engineering for many tasks such as generating effort estimates for software projects. One of the "black arts" of software analytics is tuning the parameters controlling a data mining algorithm. Such hyperparameter optimization has been widely studied in other software analytics domains (e.g. defect prediction and text mining) but, so far, has not been extensively explored for effort estimation. Accordingly, this paper seeks simple, automatic, effective and fast methods for finding good tunings for automatic software effort estimation.
We introduce a hyperparameter optimization architecture called OIL (Optimized Inductive Learning). We test OIL on a wide range of hyperparameter optimizers using data from 945 software projects. After tuning, large improvements in effort estimation accuracy were observed (measured in terms of standardized accuracy).
From those results, we recommend using regression trees (CART) tuned by differential evolution, combined with a default analogy-based estimator. This particular combination of learner and optimizer often achieves in a few hours what other optimizers need days to weeks of CPU time to accomplish.
An important part of this analysis is its reproducibility and refutability. All our scripts and data are on-line. It is hoped that this paper will prompt and enable much more research on better methods to tune software effort estimators.
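A minimal sketch of tuning CART with differential evolution is shown below; the tuned parameters, their ranges, and the cross-validated mean-absolute-error objective are illustrative assumptions, not OIL's exact configuration.

import numpy as np
from scipy.optimize import differential_evolution
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def tune_cart(X, y, seed=1):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)

    def loss(params):                               # cross-validated mean absolute error
        depth, leaf = int(round(params[0])), int(round(params[1]))
        tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=leaf, random_state=seed)
        return -cross_val_score(tree, X, y, cv=3, scoring="neg_mean_absolute_error").mean()

    result = differential_evolution(loss, bounds=[(1, 12), (1, 20)],
                                    maxiter=10, popsize=8, seed=seed)
    return result.x, result.fun                     # best (max_depth, min_samples_leaf) and its error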
Submitted 31 January, 2019; v1 submitted 27 April, 2018;
originally announced May 2018.
-
Can You Explain That, Better? Comprehensible Text Analytics for SE Applications
Authors:
Amritanshu Agrawal,
Huy Tu,
Tim Menzies
Abstract:
Text mining methods are used for a wide range of Software Engineering (SE) tasks. The biggest challenge of text mining is high dimensional data, i.e., a corpus of documents can contain $10^4$ to $10^6$ unique words. To address this complexity, some very convoluted text mining methods have been applied. Is that complexity necessary? Are there simpler ways to quickly generate models that perform as well as the more convoluted methods and also be human-readable?
To answer these questions, we explore a combination of LDA (Latent Dirichlet Allocation) and FFTs (Fast and Frugal Trees) to classify NASA software bug reports from six different projects. Designed using principles from psychological science, FFTs return very small models that are human-comprehensible. When compared to a commonly used text mining method and a recent state-of-the-art system (a search-based SE method that automatically tunes the control parameters of LDA), these FFT models are very small (a binary tree of depth $d = 4$ that references only 4 topics) and hence easy to understand. They were also faster to generate and produced similar or better severity predictions.
Hence we conclude that, at least for the datasets explored here, convoluted text mining models can be deprecated in favor of simpler methods such as LDA+FFTs. At the very least, we recommend LDA+FFTs (a) when humans need to read, understand, and audit a model or (b) as an initial baseline method for SE researchers exploring text artifacts from software projects.
Submitted 27 April, 2018;
originally announced April 2018.
-
Why Software Effort Estimation Needs SBSE
Authors:
Tianpei Xia,
Jianfeng Chen,
George Mathew,
Xipeng Shen,
Tim Menzies
Abstract:
Industrial practitioners now face a bewildering array of possible configurations for effort estimation. How to select the best one for a particular dataset?
This paper introduces OIL (short for optimized learning), a novel configuration tool for effort estimation based on differential evolution. When tested on 945 software projects, OIL significantly improved effort estimations after exploring only a few dozen configurations. Further, OIL's results are far better than two methods in widespread use: estimation-via-analogy and a recent state-of-the-art baseline published at TOSEM'15 by Whigham et al. Given that the computational cost of this approach is so low, and the observed improvements are so large, we conclude that SBSE should be a standard component of software effort estimation.
Submitted 2 April, 2018;
originally announced April 2018.
-
Improving Vulnerability Inspection Efficiency Using Active Learning
Authors:
Zhe Yu,
Christopher Theisen,
Laurie Williams,
Tim Menzies
Abstract:
Software engineers can find vulnerabilities with less effort if they are directed towards code that might contain more vulnerabilities. HARMLESS is an incremental support vector machine tool that builds a vulnerability prediction model from the source code inspected to date, then suggests what source code files should be inspected next. In this way, HARMLESS can reduce the time and effort required to achieve some desired level of recall for finding vulnerabilities. The tool also provides feedback on when to stop (at that desired level of recall) while, at the same time, correcting human errors by double-checking suspicious files.
This paper evaluates HARMLESS on Mozilla Firefox vulnerability data. HARMLESS found 80, 90, 95, 99% of the vulnerabilities by inspecting 10, 16, 20, 34% of the source code files. When targeting 90, 95, 99% recall, HARMLESS could stop after inspecting 23, 30, 47% of the source code files. Even when human reviewers fail to identify half of the vulnerabilities (50% false negative rate), HARMLESS could detect 96% of the missing vulnerabilities by double-checking half of the inspected files.
Our results serve to highlight the very steep cost of protecting software from vulnerabilities (in our case study that cost is, for example, the human effort of inspecting 28,750$\times$20% = 5,750 source code files to identify 95% of the vulnerabilities). While this result could benefit mission-critical projects where human resources are available for inspecting thousands of source code files, the research challenge for future work is how to further reduce that cost. The conclusion of this paper discusses various ways that goal might be achieved.
Submitted 26 October, 2019; v1 submitted 17 March, 2018;
originally announced March 2018.
-
Micky: A Cheaper Alternative for Selecting Cloud Instances
Authors:
Chin-Jung Hsu,
Vivek Nair,
Tim Menzies,
Vincent Freeh
Abstract:
Most cloud computing optimizers explore and improve one workload at a time. When optimizing many workloads, the single-optimizer approach can be prohibitively expensive. Accordingly, we examine a "collective optimizer" that concurrently explores and improves a set of workloads, significantly reducing the measurement costs. Our large-scale empirical study shows that there is often a single cloud configuration which is surprisingly near-optimal for most workloads. Consequently, we create a collective optimizer, MICKY, that reformulates the task of finding the near-optimal cloud configuration as a multi-armed bandit problem. MICKY efficiently balances exploration (of new cloud configurations) and exploitation (of known good cloud configurations). Our experiments show that MICKY can achieve, on average, an 8.6-fold reduction in measurement cost compared to the state-of-the-art method while finding near-optimal solutions.
Hence we propose MICKY as the basis of a practical collective optimization method for finding good cloud configurations (based on various constraints such as budget and tolerance to near-optimal configurations).
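For illustration, the multi-armed-bandit framing can be sketched with a simple epsilon-greedy policy (MICKY's own policy may differ): each cloud configuration is an arm, each measurement of a workload on that configuration is a pull, and the reward is the negative running cost.

import random

def epsilon_greedy(configs, measure, pulls=50, epsilon=0.1, seed=0):
    # configs: hashable candidate configurations (e.g. instance-type names);
    # measure(c) -> reward for one measurement on configuration c (e.g. -cost).
    random.seed(seed)
    totals = {c: 0.0 for c in configs}
    counts = {c: 0 for c in configs}
    mean = lambda c: totals[c] / counts[c] if counts[c] else float("-inf")
    for _ in range(pulls):
        explore = random.random() < epsilon or not any(counts.values())
        c = random.choice(configs) if explore else max(configs, key=mean)
        totals[c] += measure(c)
        counts[c] += 1
    return max(configs, key=mean)                   # configuration with best observed mean reward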
Submitted 15 March, 2018;
originally announced March 2018.
-
Bad Smells in Software Analytics Papers
Authors:
Tim Menzies,
Martin Shepperd
Abstract:
CONTEXT: There has been a rapid growth in the use of data analytics to underpin evidence-based software engineering. However, the combination of complex techniques, diverse reporting standards, and poorly understood underlying phenomena is causing some concern as to the reliability of studies.
OBJECTIVE: Our goal is to provide guidance for producers and consumers of software analytics studies (computational experiments and correlation studies).
METHOD: We propose using "bad smells", i.e., surface indications of deeper problems (a notion popular in the agile software community), and consider how they may manifest in software analytics studies.
RESULTS: We list 12 "bad smells" in software analytics papers (and show their impact by examples).
CONCLUSIONS: We believe the metaphor of bad smells is a useful device. Therefore we encourage more debate on what contributes to the validity of software analytics studies (so we expect our list will mature over time).
Submitted 12 April, 2019; v1 submitted 14 March, 2018;
originally announced March 2018.
-
Applications of Psychological Science for Actionable Analytics
Authors:
Di Chen,
Wei Fu,
Rahul Krishna,
Tim Menzies
Abstract:
Actionable analytics are those that humans can understand and operationalize. What kind of data mining models generate such actionable analytics? According to psychological scientists, humans understand models that most match their own internal models, which they characterize as lists of "heuristics" (i.e., lists of very succinct rules). One such heuristic rule generator is the Fast-and-Frugal Tree (FFT) preferred by psychological scientists. Despite their successful use in many applied domains, FFTs have not been applied in software analytics. Accordingly, this paper assesses FFTs for software analytics.
We find that FFTs are remarkably effective. Their models are very succinct (5 lines or less describing a binary decision tree). These succinct models outperform state-of-the-art defect prediction algorithms defined by Ghotra et al. at ICSE'15. Also, when we restrict training data to operational attributes (i.e., those attributes that are frequently changed by developers), FFTs perform much better than standard learners.
Our conclusions are two-fold. Firstly, there is much that the software analytics community could learn from psychological science. Secondly, proponents of complex methods should always baseline those methods against simpler alternatives. For example, FFTs could be used as a standard baseline learner against which other software analytics tools are compared.
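A minimal sketch of building and applying such a fast-and-frugal tree is given below; the per-level median split and the purity-based exit choice are illustrative assumptions rather than the paper's exact construction.

import numpy as np

def build_fft(X, y, depth=4):
    # X: numeric attributes; y: 0/1 labels. Each level exits on one attribute threshold.
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=int)
    levels, keep = [], np.ones(len(y), dtype=bool)
    for _ in range(depth):
        best = None
        for col in range(X.shape[1]):
            thr = np.median(X[keep, col])
            hit = keep & (X[:, col] >= thr)
            if hit.sum() == 0:
                continue
            for exit_label in (0, 1):
                purity = (y[hit] == exit_label).mean()     # how pure would this exit be?
                if best is None or purity > best[0]:
                    best = (purity, col, thr, exit_label)
        if best is None:
            break
        _, col, thr, exit_label = best
        levels.append((col, thr, exit_label))
        keep &= X[:, col] < thr                            # the rest falls to the next level
        if not keep.any():
            break
    default = int(y[keep].mean() >= 0.5) if keep.any() else 1 - levels[-1][2]
    return levels, default

def fft_predict(levels, default, row):
    for col, thr, exit_label in levels:
        if row[col] >= thr:
            return exit_label
    return default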
Submitted 13 March, 2018;
originally announced March 2018.
-
Building Better Quality Predictors Using "$ε$-Dominance"
Authors:
Wei Fu,
Tim Menzies,
Di Chen,
Amritanshu Agrawal
Abstract:
Despite extensive research, many methods in software quality prediction still exhibit some degree of uncertainty in their results. Rather than treating this as a problem, this paper asks if this uncertainty is a resource that can simplify software quality prediction.
For example, Deb's principle of $ε$-dominance states that if there exists some $ε$ value below which it is useless or impossible to distinguish results, then it is superfluous to explore anything less than $ε$. We say that for "large $ε$ problems", the results space of learning effectively contains just a few regions. If many learners are then applied to such large $ε$ problems, they would exhibit a "many roads lead to Rome" property; i.e., many different software quality prediction methods would generate a small set of very similar results.
This paper explores DART, an algorithm especially selected to succeed for large $ε$ software quality prediction problems. DART is remarkably simple yet, on experimentation, it dramatically outperforms three sets of state-of-the-art defect prediction methods.
The success of DART for defect prediction begs the question: how many other domains in software quality prediction can also be radically simplified? This will be a fruitful direction for future work.
Submitted 12 March, 2018;
originally announced March 2018.
-
Transfer Learning with Bellwethers to find Good Configurations
Authors:
Vivek Nair,
Rahul Krishna,
Tim Menzies,
Pooyan Jamshidi
Abstract:
As software systems grow in complexity, the space of possible configurations grows exponentially. Within this increasing complexity, developers, maintainers, and users cannot keep track of the interactions between all the various configuration options. Finding the optimally performing configuration of a software system for a given setting is challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, collecting enough data on enough sample configurations can be very expensive since each such sample requires configuring, compiling and executing the entire system against a complex test suite. The central insight of this paper is that choosing a suitable source (a.k.a. "bellwether") to learn from, plus a simple transfer learning scheme, will often outperform much more complex transfer learning methods. Using this insight, this paper proposes BEETLE, a novel bellwether-based transfer learning scheme, which can identify a suitable source and use it to find near-optimal configurations of a software system. BEETLE significantly reduces the cost (in terms of the number of measurements of sample configurations) to build performance models. We evaluate our approach with 61 scenarios based on 5 software systems and demonstrate that BEETLE is beneficial in all cases. This approach offers a new high-water mark in configuring software systems. Specifically, BEETLE can find configurations that are as good as or better than those found by anything else while requiring only 1/7th of the evaluations needed by the state-of-the-art.
Submitted 14 March, 2018; v1 submitted 10 March, 2018;
originally announced March 2018.
-
Scout: An Experienced Guide to Find the Best Cloud Configuration
Authors:
Chin-Jung Hsu,
Vivek Nair,
Tim Menzies,
Vincent W. Freeh
Abstract:
Finding the right cloud configuration for workloads is an essential step to ensure good performance and contain running costs. A poor choice of cloud configuration decreases application performance and increases running cost significantly. While Bayesian Optimization is effective and applicable to any workloads, it is fragile because performance and workload are hard to model (to predict).
In this paper, we propose a novel method, SCOUT. The central insight of SCOUT is that using prior measurements, even those for different workloads, improves search performance and reduces search cost. At its core, SCOUT extracts search hints (inferences of resource requirements) from low-level performance metrics. Such hints enable SCOUT to navigate through the search space more efficiently---only the spotlighted region will be searched.
We evaluate SCOUT with 107 workloads on Apache Hadoop and Spark. The experimental results demonstrate that our approach finds better cloud configurations with a lower search cost than state-of-the-art methods.
Based on this work, we conclude that (i) low-level performance information is necessary for finding the right cloud configuration in an effective, efficient and reliable way, and (ii) a search method can be guided by historical data, thereby reducing cost and improving performance.
Submitted 3 March, 2018;
originally announced March 2018.
-
500+ Times Faster Than Deep Learning (A Case Study Exploring Faster Methods for Text Mining StackOverflow)
Authors:
Suvodeep Majumder,
Nikhila Balaji,
Katie Brey,
Wei Fu,
Tim Menzies
Abstract:
Deep learning methods are useful for high-dimensional data and are becoming widely used in many areas of software engineering. Deep learners utilize extensive computational power and can take a long time to train, making it difficult to widely validate, repeat, and improve their results. Further, they are not the best solution in all domains. For example, recent results show that for finding related Stack Overflow posts, a tuned SVM performs similarly to a deep learner, but is significantly faster to train. This paper extends that recent result by clustering the dataset, then tuning learners within each cluster. This approach is over 500 times faster than deep learning (and over 900 times faster if we use all the cores on a standard laptop computer). Significantly, this faster approach generates classifiers nearly as good (within 2% F1 Score) as the much slower deep learning method. Hence we recommend this faster method since it is much easier to reproduce and utilizes far fewer CPU resources. More generally, we recommend that before researchers release research results, they compare their supposedly sophisticated methods against simpler alternatives (e.g., applying simpler learners to build local models).
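A minimal sketch of the "cluster, then tune a simple learner per cluster" recipe follows; the TF-IDF features, the cluster count, and the tiny C grid are illustrative assumptions, and each cluster is assumed to contain enough examples of each class for cross-validation.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def local_models(posts, labels, n_clusters=8, seed=0):
    # posts: list of text documents; labels: list of class labels, one per post.
    vec = TfidfVectorizer(max_features=5000)
    X = vec.fit_transform(posts)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    models = {}
    for c in set(clusters):
        mask = clusters == c
        grid = GridSearchCV(LinearSVC(), {"C": [0.1, 1, 10]}, cv=3)    # tune within the cluster
        models[c] = grid.fit(X[mask], [l for l, m in zip(labels, mask) if m])
    return vec, clusters, models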
Submitted 14 February, 2018;
originally announced February 2018.
-
Data-Driven Search-based Software Engineering
Authors:
Vivek Nair,
Amritanshu Agrawal,
Jianfeng Chen,
Wei Fu,
George Mathew,
Tim Menzies,
Leandro Minku,
Markus Wagner,
Zhe Yu
Abstract:
This paper introduces Data-Driven Search-based Software Engineering (DSE), which combines insights from Mining Software Repositories (MSR) and Search-based Software Engineering (SBSE). While MSR formulates software engineering problems as data mining problems, SBSE reformulates SE problems as optimization problems and uses meta-heuristic algorithms to solve them. Both MSR and SBSE share the common goal of providing insights to improve software engineering. The algorithms used in these two areas also have intrinsic relationships. We, therefore, argue that combining these two fields is useful for situations (a) which require learning from a large data source or (b) when optimizers need to know the lay of the land to find better solutions, faster.
This paper aims to answer the following three questions: (1) What are the various topics addressed by DSE? (2) What types of data are used by the researchers in this area? (3) What research approaches do researchers use? The paper also sets out to act as a practical guide for developing new DSE techniques and to serve as a teaching resource.
This paper also presents a resource (tiny.cc/data-se) for exploring DSE. The resource contains 89 artifacts which are related to DSE, divided into 13 groups such as requirements engineering, software product lines, and software processes. All the materials in this repository have been used in recent software engineering papers; i.e., for all this material, there exist baseline results against which researchers can comparatively assess their new ideas.
Submitted 16 March, 2018; v1 submitted 30 January, 2018;
originally announced January 2018.
-
Finding Faster Configurations using FLASH
Authors:
Vivek Nair,
Zhe Yu,
Tim Menzies,
Norbert Siegmund,
Sven Apel
Abstract:
Finding good configurations for a software system is often challenging since the number of configuration options can be large. Software engineers often make poor choices about configuration or, even worse, they usually use a sub-optimal configuration in production, which leads to inadequate performance. To assist engineers in finding the (near) optimal configuration, this paper introduces FLASH, a sequential model-based method, which sequentially explores the configuration space by reflecting on the configurations evaluated so far to determine the next best configuration to explore. FLASH scales up to software systems that defeat the prior state of the art model-based methods in this area. FLASH runs much faster than existing methods and can solve both single-objective and multi-objective optimization problems. The central insight of this paper is to use the prior knowledge (gained from prior runs) to choose the next promising configuration. This strategy reduces the effort (i.e., number of measurements) required to find the (near) optimal configuration. We evaluate FLASH using 30 scenarios based on 7 software systems to demonstrate that FLASH saves effort in 100% and 80% of cases in single-objective and multi-objective problems respectively by up to several orders of magnitude compared to the state of the art techniques.
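A compact sketch of a FLASH-style sequential model-based search: fit a cheap surrogate on the configurations measured so far, use it to pick the next configuration to benchmark, and repeat under a small budget. The candidate table and synthetic "benchmark" function are assumptions for illustration, not the paper's subject systems.

```python
# Minimal single-objective sketch of sequential model-based configuration search
# with a CART surrogate, in the spirit of FLASH (runtime is being minimized).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(7)
configs = rng.randint(0, 8, size=(500, 5)).astype(float)  # candidate configuration table

def benchmark(cfg):                                        # stand-in for a real measurement
    return (cfg[0] - 3) ** 2 + 0.5 * cfg[1] + rng.normal(0, 0.1)

measured = {int(i): benchmark(configs[i])                  # small random warm-up sample
            for i in rng.choice(len(configs), 10, replace=False)}
while len(measured) < 30:                                  # total measurement budget
    X = configs[list(measured)]
    y = np.array(list(measured.values()))
    surrogate = DecisionTreeRegressor(random_state=1).fit(X, y)   # cheap, inaccurate CART
    rest = [i for i in range(len(configs)) if i not in measured]
    pick = rest[int(np.argmin(surrogate.predict(configs[rest])))] # next most promising
    measured[pick] = benchmark(configs[pick])

best = min(measured, key=measured.get)
print("best config:", configs[best], "runtime:", round(measured[best], 2))
```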
Submitted 1 September, 2018; v1 submitted 7 January, 2018;
originally announced January 2018.
-
Low-Level Augmented Bayesian Optimization for Finding the Best Cloud VM
Authors:
Chin-Jung Hsu,
Vivek Nair,
Vincent W. Freeh,
Tim Menzies
Abstract:
With the advent of big data applications, which tend to have longer execution times, choosing the right cloud VM to run these applications has significant performance as well as economic implications. For example, in our large-scale empirical study of 107 different workloads on three popular big data systems, we found that a wrong choice can lead to a 20 times slowdown or an increase in cost by 10 times.
Bayesian optimization is a technique for optimizing expensive (black-box) functions. Previous attempts have only used instance-level information (such as # of cores, memory size), which is not sufficient to represent the search space. In this work, we discover that this may lead to the fragility problem---it either incurs a high search cost or finds only a sub-optimal solution. The central insight of this paper is to use low-level performance information to augment the process of Bayesian Optimization. Our novel low-level augmented Bayesian Optimization is rarely worse than current practices and often performs much better (in 46 of 107 cases). Further, it significantly reduces the search cost in nearly half of our case studies.
Based on this work, we conclude that it is often insufficient to use general-purpose off-the-shelf methods for configuring cloud instances without augmenting those methods with essential systems knowledge such as CPU utilization, working memory size and I/O wait time.
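One loose way to realize the "augment the search with low-level metrics" idea is sketched below. It is our assumption, not necessarily the authors' exact design, that each candidate VM can be described by instance-level features plus low-level metrics (CPU utilization, I/O wait, etc.) taken from historical runs of similar workloads, so that the augmented description is available before the target workload is executed on that VM.

```python
# Loose sketch: a Bayesian-optimization-style loop whose feature vectors are
# instance-level features augmented with (assumed historical) low-level metrics.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(3)
instance = rng.rand(40, 2)          # e.g. normalized (#cores, memory) per VM type
lowlevel = rng.rand(40, 3)          # e.g. historical CPU util., I/O wait, mem pressure
X = np.hstack([instance, lowlevel]) # augmented description of each candidate VM

def run_workload(i):                # stand-in for renting the VM and timing the job
    return 5 - 3 * instance[i, 0] + 2 * lowlevel[i, 1] + rng.normal(0, 0.05)

seen = {int(i): run_workload(i) for i in rng.choice(40, 4, replace=False)}
while len(seen) < 12:               # small measurement budget
    gp = GaussianProcessRegressor(normalize_y=True).fit(X[list(seen)],
                                                        list(seen.values()))
    rest = [i for i in range(40) if i not in seen]
    mu, sd = gp.predict(X[rest], return_std=True)
    pick = rest[int(np.argmin(mu - sd))]   # lower-confidence-bound acquisition
    seen[pick] = run_workload(pick)

best = min(seen, key=seen.get)
print("best VM index:", best, "measured time:", round(seen[best], 2))
```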
Submitted 28 December, 2017;
originally announced December 2017.
-
We Don't Need Another Hero? The Impact of "Heroes" on Software Development
Authors:
Amritanshu Agrawal,
Akond Rahman,
Rahul Krishna,
Alexander Sobran,
Tim Menzies
Abstract:
A software project has "Hero Developers" when 80% of contributions are delivered by 20% of the developers. Are such heroes a good idea? Are too many heroes bad for software quality? Is it better to have more/fewer heroes for different kinds of projects? To answer these questions, we studied 661 open source software (OSS) projects from Public GitHub and 171 projects from an Enterprise GitHub.
We find that hero projects are very common. In fact, as projects grow in size, nearly all projects become hero projects. These findings motivated us to look more closely at the effects of heroes on software development. Analysis shows that the frequency with which issues and bugs are closed is not significantly affected by the presence of heroes or by project type (Public or Enterprise). Similarly, the time needed to resolve an issue/bug/enhancement is not affected by heroes or project type. This is a surprising result since, before looking at the data, we expected that increasing the number of heroes on a project would slow down how fast that project reacts to change. However, we do find a statistically significant association between heroes, project types, and enhancement resolution rates. Heroes do not affect enhancement resolution rates in Public projects. However, in Enterprise projects, more heroes increase the rate at which projects complete enhancements.
In summary, our empirical results call for a revision of a long-held truism in software engineering. Software heroes are far more common and valuable than suggested by the literature, particularly for medium to large Enterprise developments. Organizations should reflect on better ways to find and retain more of these heroes.
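As a small illustration of the definition used above, the sketch below checks whether the top 20% of contributors account for at least 80% of commits. The commit log is synthetic; a real analysis would use `git log` data from the studied projects.

```python
# Toy "hero project" check: do 20% of the developers make >= 80% of the commits?
from collections import Counter
import numpy as np

rng = np.random.RandomState(0)
authors = [f"dev{i}" for i in range(50)]
# skewed contribution pattern: a few developers do most of the work (made-up data)
commits = rng.choice(authors, size=5000, p=rng.dirichlet(np.ones(50) * 0.2))

counts = sorted(Counter(commits).values(), reverse=True)
top20 = max(1, int(0.2 * len(counts)))
share = sum(counts[:top20]) / sum(counts)
print(f"top 20% of devs made {share:.0%} of commits -> hero project: {share >= 0.8}")
```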
Submitted 20 February, 2018; v1 submitted 24 October, 2017;
originally announced October 2017.
-
What is the Connection Between Issues, Bugs, and Enhancements? (Lessons Learned from 800+ Software Projects)
Authors:
Rahul Krishna,
Amritanshu Agrawal,
Akond Rahman,
Alexander Sobran,
Tim Menzies
Abstract:
Agile teams juggle multiple tasks so professionals are often assigned to multiple projects, especially in service organizations that monitor and maintain a large suite of software for a large user base. If we could predict changes in project conditions, then managers could better adjust the staff allocated to those projects. This paper builds such a predictor using data from 832 open source and proprietary applications. Using a time series analysis of the last 4 months of issues, we can forecast how many bug reports and enhancement requests will be generated next month. The forecasts made in this way only require a frequency count of the issue reports (and do not require a historical record of bugs found in the project). That is, this kind of predictive model is very easy to deploy within a project. We hence strongly recommend this method for forecasting future issues, enhancements, and bugs in a project.
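The forecasting step described above only needs recent monthly counts. A minimal sketch (with made-up counts, and a plain linear extrapolation standing in for the paper's time series model) looks like this:

```python
# Sketch: forecast next month's issue count from the last 4 monthly counts.
import numpy as np

monthly_issues = [42, 51, 47, 58, 63, 60, 71, 69]    # hypothetical project history
window = monthly_issues[-4:]                          # "last 4 months" of counts
slope, intercept = np.polyfit(range(4), window, deg=1)
forecast = slope * 4 + intercept                      # extrapolate one month ahead
print("expected issues next month:", round(forecast, 1))
```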
Submitted 5 September, 2018; v1 submitted 24 October, 2017;
originally announced October 2017.
-
RIOT: a Stochastic-based Method for Workflow Scheduling in the Cloud
Authors:
Jianfeng Chen,
Tim Menzies
Abstract:
Cloud computing provides engineers and scientists a place to run complex computing tasks. Finding a workflow's deployment configuration in a cloud environment is not easy. Traditional workflow scheduling algorithms were based on some heuristics, e.g. reliability greedy, cost greedy, cost-time balancing, etc., or, more recently, on meta-heuristic methods such as genetic algorithms. These methods are very slow and not suitable for rescheduling in the dynamic cloud environment. This paper introduces RIOT (Randomized Instance Order Types), a stochastic-based method for workflow scheduling. RIOT groups the tasks in the workflow into virtual machines via a probability model and then uses an effective surrogate-based method to assess a large number of potential schedules. Experiments on dozens of case studies showed that RIOT executes tens of times faster than traditional methods while generating results comparable to other methods.
Submitted 22 April, 2018; v1 submitted 27 August, 2017;
originally announced August 2017.
-
Learning Actionable Analytics from Multiple Software Projects
Authors:
Rahul Krishna,
Tim Menzies
Abstract:
The current generation of software analytics tools are mostly prediction algorithms (e.g., support vector machines, naive bayes, logistic regression). While prediction is useful, after prediction comes planning about what actions to take in order to improve quality. This research seeks methods that generate demonstrably useful guidance on "what to do" within the context of a specific software project. Specifically, we propose XTREE (for within-project planning) and BELLTREE (for cross-project planning) to generate plans that can improve software quality. Each such plan has the property that, if followed, it reduces the expected number of future defect reports. To find this expected number, planning was first applied to data from release x. Next, we looked for changes in release x+1 that conformed to our plans. This procedure was applied using a range of planners from the literature, as well as XTREE. In 10 open-source JAVA systems, several hundred defects were reduced in sections of the code that conformed to XTREE's plans. Further, when compared to other planners, XTREE's plans were found to be easier to implement (since they were shorter) and more effective at reducing the expected number of defects.
Submitted 24 January, 2020; v1 submitted 17 August, 2017;
originally announced August 2017.
-
FAST$^2$: an Intelligent Assistant for Finding Relevant Papers
Authors:
Zhe Yu,
Tim Menzies
Abstract:
Literature reviews are essential for any researcher trying to keep up to date with the burgeoning software engineering literature. FAST$^2$ is a novel tool for reducing the effort required for conducting literature reviews by assisting the researchers to find the next promising paper to read (among a set of unread papers). This paper describes FAST$^2$ and tests it on four large software engineering literature reviews conducted by Wahono (2015), Hall (2012), Radjenović (2013) and Kitchenham (2017). We find that FAST$^2$ is a fast and robust tool that assists researchers in finding relevant SE papers and can compensate for the errors made by humans during the review process. The effectiveness of FAST$^2$ can be attributed to three key innovations: (1) a novel way of applying external domain knowledge (a simple two or three keyword search) to guide the initial selection of papers---which helps to find relevant research papers faster with less variance; (2) an estimator of the number of remaining relevant papers yet to be found---which in practical settings can be used to decide if the reviewing process needs to be terminated; (3) a novel self-correcting classification algorithm---which automatically corrects itself in cases where the researcher wrongly classifies a paper.
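The core of tools in this family is an incremental "suggest the next paper to read" loop. The sketch below is a heavily simplified stand-in (tiny placeholder corpus, logistic regression, certainty sampling), not the FAST$^2$ implementation itself.

```python
# Rough active-reading loop: retrain on the papers labelled so far, then ask the
# reviewer to read the unread paper the model thinks is most likely relevant.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

papers = ["defect prediction with static code metrics",
          "active learning for systematic literature reviews",
          "gardening tips for the summer",
          "transfer learning for software defect models",
          "recipes for sourdough bread"]
labels = {0: 1, 2: 0}                      # indices already reviewed (1 = relevant)

X = TfidfVectorizer().fit_transform(papers)
while len(labels) < len(papers):
    model = LogisticRegression().fit(X[list(labels)], [labels[i] for i in labels])
    unread = [i for i in range(len(papers)) if i not in labels]
    scores = model.predict_proba(X[unread])[:, 1]      # P(relevant)
    nxt = unread[int(np.argmax(scores))]               # certainty sampling
    print("please review:", papers[nxt])
    # stand-in for the human's verdict:
    labels[nxt] = 1 if "learning" in papers[nxt] or "defect" in papers[nxt] else 0
```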
Submitted 11 November, 2018; v1 submitted 15 May, 2017;
originally announced May 2017.
-
FLASH: A Faster Optimizer for SBSE Tasks
Authors:
Vivek Nair,
Zhe Yu,
Tim Menzies
Abstract:
Most problems in search-based software engineering involve balancing conflicting objectives. Prior approaches to this task have required a large number of evaluations---making them very slow to execute and very hard to comprehend. To solve these problems, this paper introduces FLASH, a decision tree based optimizer that incrementally grows one decision tree per objective. These trees are then used to select the next best sample. This paper compares FLASH to state-of-the-art algorithms from search-based SE and machine learning. This comparison uses multiple SBSE case studies for release planning, configuration control, process modeling, and sprint planning for agile development. FLASH was found to be the fastest optimizer (sometimes requiring less than 1% of the evaluations used by evolutionary algorithms). Also, measured in terms of model size, FLASH's reasoning was far more succinct and comprehensible. Further, measured in terms of finding effective optimizations, FLASH's recommendations were highly competitive with other approaches. Finally, FLASH scaled to more complex models since it always terminated (while state-of-the-art algorithms did not).
Submitted 18 May, 2017; v1 submitted 14 May, 2017;
originally announced May 2017.
-
Is "Better Data" Better than "Better Data Miners"? (On the Benefits of Tuning SMOTE for Defect Prediction)
Authors:
Amritanshu Agrawal,
Tim Menzies
Abstract:
We report and fix an important systematic error in prior studies that ranked classifiers for software analytics. Those studies did not (a) assess classifiers on multiple criteria and did not (b) study how variations in the data affect the results. Hence, this paper applies (a) multi-criteria tests while (b) fixing the weaker regions of the training data (using SMOTUNED, which is a self-tuning version of SMOTE). This approach leads to dramatically large improvements in software defect prediction. When applied in a 5*5 cross-validation study for 3,681 JAVA classes (containing over a million lines of code) from open source systems, SMOTUNED increased AUC and recall by 60% and 20% respectively. These improvements are independent of the classifier used to predict quality. The same pattern of improvement was observed when SMOTE and SMOTUNED were compared against the most recent class imbalance technique. In conclusion, for software analytic tasks like defect prediction, (1) data pre-processing can be more important than classifier choice, (2) ranking studies are incomplete without such pre-processing, and (3) SMOTUNED is a promising candidate for pre-processing.
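A hedged sketch of the SMOTUNED idea: let differential evolution choose SMOTE's parameters (here just the number of neighbours and the oversampling ratio) so that a downstream learner's recall improves on held-out data. The dataset, parameter ranges, and learner are illustrative choices, not the paper's setup.

```python
# Sketch: tune SMOTE's parameters with differential evolution (scipy + imblearn).
import numpy as np
from imblearn.over_sampling import SMOTE
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, weights=[0.9], random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=2)

def loss(params):                                   # DE minimizes, so negate recall
    k, ratio = int(round(params[0])), float(params[1])
    Xs, ys = SMOTE(k_neighbors=k, sampling_strategy=ratio,
                   random_state=2).fit_resample(X_tr, y_tr)
    model = DecisionTreeClassifier(random_state=2).fit(Xs, ys)
    return -recall_score(y_val, model.predict(X_val))

result = differential_evolution(loss, bounds=[(1, 10), (0.5, 1.0)],
                                maxiter=10, popsize=8, seed=2)
print("tuned k, ratio:", int(round(result.x[0])), round(result.x[1], 2),
      "validation recall:", round(-result.fun, 3))
```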
Submitted 20 February, 2018; v1 submitted 10 May, 2017;
originally announced May 2017.
-
Bellwethers: A Baseline Method For Transfer Learning
Authors:
Rahul Krishna,
Tim Menzies
Abstract:
Software analytics builds quality prediction models for software projects. Experience shows that (a) the more projects studied, the more varied are the conclusions; and (b) project managers lose faith in the results of software analytics if those results keep changing. To reduce this conclusion instability, we propose the use of "bellwethers": given N projects from a community the bellwether is the project whose data yields the best predictions on all others. The bellwethers offer a way to mitigate conclusion instability because conclusions about a community are stable as long as this bellwether continues as the best oracle. Bellwethers are also simple to discover (just wrap a for-loop around standard data miners). When compared to other transfer learning methods (TCA+, transfer Naive Bayes, value cognitive boosting), using just the bellwether data to construct a simple transfer learner yields comparable predictions. Further, bellwethers appear in many SE tasks such as defect prediction, effort estimation, and bad smell detection. We hence recommend using bellwethers as a baseline method for transfer learning against which future work should be compared
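The abstract's remark that bellwethers are "simple to discover (just wrap a for-loop around standard data miners)" translates almost directly into code. The sketch below uses synthetic per-project data and a logistic regression as the stand-in data miner.

```python
# Bellwether search: for each project, train on it alone, test on all others,
# and keep the project with the best median cross-project score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

projects = {f"proj{i}": make_classification(n_samples=400, random_state=i)
            for i in range(5)}                      # stand-in project datasets

scores = {}
for name, (Xs, ys) in projects.items():
    model = LogisticRegression(max_iter=1000).fit(Xs, ys)
    others = [f1_score(yt, model.predict(Xt))
              for other, (Xt, yt) in projects.items() if other != name]
    scores[name] = np.median(others)

bellwether = max(scores, key=scores.get)
print("bellwether:", bellwether, "median cross-project F1:", round(scores[bellwether], 3))
```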
Submitted 21 January, 2018; v1 submitted 17 March, 2017;
originally announced March 2017.
-
Easy over Hard: A Case Study on Deep Learning
Authors:
Wei Fu,
Tim Menzies
Abstract:
While deep learning is an exciting new technique, the benefits of this method need to be assessed with respect to its computational cost. This is particularly important for deep learning since these learners need hours (to weeks) to train the model. Such long training time limits the ability of (a)~a researcher to test the stability of their conclusion via repeated runs with different random seeds; and (b)~other researchers to repeat, improve, or even refute that original work.
For example, recently, deep learning was used to find which questions in the Stack Overflow programmer discussion forum can be linked together. That deep learning system took 14 hours to execute. We show here that applying a very simple optimizer called DE to fine-tune an SVM can achieve similar (and sometimes better) results. The DE approach terminated in 10 minutes; i.e., 84 times faster than the deep learning method.
We offer these results as a cautionary tale to the software analytics community and suggest that not every new innovation should be applied without critical analysis. If researchers deploy some new and expensive process, that work should be baselined against some simpler and faster alternatives.
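To make the "very simple optimizer" point concrete, here is a small, self-contained differential evolution loop tuning an RBF SVM's C and gamma. The dataset, population size, and generation count are placeholders; the paper's actual text-mining pipeline and budgets differ.

```python
# Hand-rolled differential evolution (mutation F=0.5, crossover CR=0.7) tuning
# an SVM's C and gamma, scored by 3-fold cross-validated F1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=5)
rng = np.random.RandomState(5)

def fitness(p):                                    # p = (log10 C, log10 gamma)
    clf = SVC(C=10 ** p[0], gamma=10 ** p[1])
    return cross_val_score(clf, X, y, cv=3, scoring="f1").mean()

lo, hi = np.array([-2.0, -4.0]), np.array([2.0, 0.0])
pop = lo + rng.rand(10, 2) * (hi - lo)             # 10 candidate settings
scores = np.array([fitness(p) for p in pop])
for _ in range(10):                                # a handful of DE generations
    for i in range(len(pop)):
        a, b, c = pop[rng.choice(len(pop), 3, replace=False)]
        trial = np.clip(a + 0.5 * (b - c), lo, hi)        # mutation
        trial = np.where(rng.rand(2) < 0.7, trial, pop[i])  # crossover
        s = fitness(trial)
        if s > scores[i]:                                  # greedy selection
            pop[i], scores[i] = trial, s

best = pop[int(np.argmax(scores))]
print("best C, gamma:", round(10 ** best[0], 3), round(10 ** best[1], 5),
      "F1:", round(scores.max(), 3))
```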
Submitted 24 June, 2017; v1 submitted 28 February, 2017;
originally announced March 2017.
-
Revisiting Unsupervised Learning for Defect Prediction
Authors:
Wei Fu,
Tim Menzies
Abstract:
Collecting quality data from software projects can be time-consuming and expensive. Hence, some researchers explore "unsupervised" approaches to quality prediction that do not require labelled data. An alternative technique is to use "supervised" approaches that learn models from project data labelled with, say, "defective" or "not-defective". Most researchers use these supervised models since, it is argued, they can exploit more knowledge of the projects.
At FSE'16, Yang et al. reported startling results where unsupervised defect predictors outperformed supervised predictors for effort-aware just-in-time defect prediction. If confirmed, these results would lead to a dramatic simplification of a seemingly complex task (data mining) that is widely explored in the software engineering literature.
This paper repeats and refutes those results as follows. (1) There is much variability in the efficacy of the Yang et al. predictors, so even with their approach, some supervised data is required to prune weaker predictors away. (2) Their findings were grouped across $N$ projects. When we repeat their analysis on a project-by-project basis, supervised predictors are seen to work better.
Even though this paper rejects the specific conclusions of Yang et al., we still endorse their general goal. In our experiments, supervised predictors did not perform outstandingly better than unsupervised ones for effort-aware just-in-time defect prediction. Hence, there may indeed be some combination of unsupervised learners that achieves performance comparable to supervised ones. We therefore encourage others to work in this promising area.
Submitted 24 June, 2017; v1 submitted 28 February, 2017;
originally announced March 2017.
-
Replicating and Scaling up Qualitative Analysis using Crowdsourcing: A Github-based Case Study
Authors:
Di Chen,
Kathryn T. Stolee,
Tim Menzies
Abstract:
Due to the difficulties in replicating and scaling up qualitative studies, such studies are rarely verified. Accordingly, in this paper, we leverage the advantages of crowdsourcing (low costs, fast speed, scalable workforce) to replicate and scale-up one state-of-the-art qualitative study. That qualitative study explored 20 GitHub pull requests to learn factors that influence the fate of pull requests with respect to approval and merging.
As a secondary study, using crowdsourcing at a cost of $200, we studied 250 pull requests from 142 GitHub projects. The prior qualitative findings were mapped into questions for crowd workers. Their answers were converted into binary features to build a predictor of whether code would be merged, which achieved a median F1 score of 68%. For the same large group of pull requests, a predictor built with additional features defined by prior quantitative results achieved a median F1 score of 90%.
Based on this case study, we conclude that there is much benefit in combining different kinds of research methods. While qualitative insights are very useful for finding novel insights, they can be hard to scale or replicate. That said, they can guide and define the goals of scalable secondary studies that use (e.g.) crowdsourcing+data mining. On the other hand, while data mining methods are reproducible and scalable to large data sets, their results may be spectacularly wrong since they lack contextual information. That said, they can be used to test the stability and external validity of the insights gained from a qualitative analysis.
Submitted 1 March, 2017; v1 submitted 27 February, 2017;
originally announced February 2017.
-
Better Predictors for Issue Lifetime
Authors:
Mitch Rees-Jones,
Matthew Martin,
Tim Menzies
Abstract:
Predicting issue lifetime can help software developers, managers, and stakeholders effectively prioritize work, allocate development resources, and better understand project timelines. Progress has been made on this prediction problem, but prior work has reported low precision and high false alarms. The latest results also use complex models such as random forests that detract from their readability.
We solve both issues by using small, readable decision trees (under 20 lines long) and correlation feature selection to predict issue lifetime, achieving high precision and low false alarms (medians of 71% and 13% respectively). We also address the problem of high class imbalance within issue datasets - when local data fails to train a good model, we show that cross-project data can be used in place of the local data. In fact, cross-project data works so well that we argue it should be the default approach for learning predictors for issue lifetime.
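The recipe above (correlation-based feature selection plus a small, readable tree) can be sketched in a few lines. The data is synthetic, the depth limit and feature count are illustrative, and the simple Pearson-correlation filter below is a stand-in for full CFS.

```python
# Sketch: keep the features most correlated with the label, then fit a tiny tree
# whose rules can be printed and read.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=25, n_informative=5,
                           random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)

corr = np.array([abs(np.corrcoef(X_tr[:, j], y_tr)[0, 1])
                 for j in range(X_tr.shape[1])])
keep = np.argsort(corr)[-5:]                      # top-5 |Pearson r| features

tree = DecisionTreeClassifier(max_depth=3, random_state=6).fit(X_tr[:, keep], y_tr)
print(export_text(tree, feature_names=[f"f{j}" for j in keep]))
print("holdout accuracy:", round(tree.score(X_te[:, keep], y_te), 3))
```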
Submitted 4 April, 2017; v1 submitted 24 February, 2017;
originally announced February 2017.
-
Using Bad Learners to find Good Configurations
Authors:
Vivek Nair,
Tim Menzies,
Norbert Siegmund,
Sven Apel
Abstract:
Finding the optimally performing configuration of a software system for a given setting is often challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, building an accurate performance model can be very expensive (and is often infeasible in practice). The central insight of this paper is that exact performance values (e.g. the response time of a software system) are not required to rank configurations and to identify the optimal one. As shown by our experiments, models that are cheap to learn but inaccurate (with respect to the difference between actual and predicted performance) can still be used to rank configurations and hence find the optimal configuration. This novel \emph{rank-based approach} allows us to significantly reduce the cost (in terms of the number of measurements of sample configurations) as well as the time required to build models. We evaluate our approach with 21 scenarios based on 9 software systems and demonstrate that our approach is beneficial in 16 scenarios; for the remaining 5 scenarios, an accurate model can be built by using very few samples anyway, without the need for a rank-based approach.
Submitted 28 June, 2017; v1 submitted 19 February, 2017;
originally announced February 2017.
-
"SHORT"er Reasoning About Larger Requirements Models
Authors:
George Mathew,
Tim Menzies,
Neil A. Ernst,
John Klein
Abstract:
When Requirements Engineering (RE) models are unreasonably complex, they cannot support efficient decision making. SHORT is a tool to simplify that reasoning by exploiting the "key" decisions within RE models. These "keys" have the property that once values are assigned to them, it is very fast to reason over the remaining decisions. Using these "keys", reasoning about RE models can be greatly SHORTened by focusing stakeholder discussion on just these key decisions.
This paper evaluates the SHORT tool on eight complex RE models. We find that the keys are typically only 12% of all decisions. Since they are so few in number, keys can be used to reason faster about models. For example, using keys, we can optimize over those models (to achieve the most goals at least cost) two to three orders of magnitude faster than standard methods. Better yet, finding those keys is not difficult: SHORT runs in low-order polynomial time and terminates in a few minutes for the largest models.
Submitted 22 August, 2017; v1 submitted 17 February, 2017;
originally announced February 2017.
-
Faster Discovery of Faster System Configurations with Spectral Learning
Authors:
Vivek Nair,
Tim Menzies,
Norbert Siegmund,
Sven Apel
Abstract:
Despite the huge spread and economic importance of configurable software systems, there is unsatisfactory support for utilizing the full potential of these systems with respect to finding performance-optimal configurations. Prior work on predicting the performance of software configurations suffered from either (a) requiring far too many sample configurations or (b) large variances in their predictions. Both these problems can be avoided using the WHAT spectral learner. WHAT's innovation is the use of the spectrum (eigenvalues) of the distance matrix between the configurations of a configurable software system to perform dimensionality reduction. Within that reduced configuration space, many closely associated configurations can be studied by executing only a few sample configurations. For the subject systems studied here, a few dozen samples yield accurate and stable predictors---less than 10% prediction error, with a standard deviation of less than 2%. When compared to the state of the art, WHAT (a) requires 2 to 10 times fewer samples to achieve similar prediction accuracies, and (b) its predictions are more stable (i.e., have lower standard deviation). Furthermore, we demonstrate that predictive models generated by WHAT can be used by optimizers to discover system configurations that closely approach the optimal performance.
Submitted 3 August, 2017; v1 submitted 27 January, 2017;
originally announced January 2017.
-
Beyond Evolutionary Algorithms for Search-based Software Engineering
Authors:
Jianfeng Chen,
Vivek Nair,
Tim Menzies
Abstract:
Context: Evolutionary algorithms typically require a large number of evaluations (of candidate solutions) to converge, which can be very slow and expensive. Objective: To solve search-based software engineering (SE) problems using fewer evaluations than evolutionary methods. Method: Instead of mutating a small population, we build a very large initial population which is then culled using a recursive bi-clustering chop approach. We evaluate this approach on multiple SE models, unconstrained as well as constrained, and compare its performance with standard evolutionary algorithms. Results: Using just a few evaluations (under 100), we can obtain results comparable to state-of-the-art evolutionary algorithms. Conclusion: Just because something works, and is in widespread use, does not necessarily mean that there is no value in seeking methods to improve that method. Before undertaking search-based SE optimization tasks using traditional EAs, it is recommended to try other techniques, like those explored here, to obtain the same results with fewer evaluations.
Submitted 17 September, 2017; v1 submitted 27 January, 2017;
originally announced January 2017.
-
Impacts of Bad ESP (Early Size Predictions) on Software Effort Estimation
Authors:
George Mathew,
Tim Menzies,
Jairus Hihn
Abstract:
Context: Early size predictions (ESP) can lead to errors in effort predictions for software projects. This problem is particularly acute in parametric effort models that give extra weight to size factors (for example, the COCOMO model assumes that effort is exponentially proportional to project size). Objective: To test if effort estimates are crippled by bad ESP. Method: Document inaccuracies in early size estimates. Use those error sizes to determine the implications of those inaccuracies via a Monte Carlo perturbation analysis of effort models and an analysis of the equations used in those effort models. Results: While many projects have errors in ESP of up to +/- 100%, those errors add very little to the overall effort estimate error. Specifically, we find no statistically significant difference in the estimation errors seen after increasing ESP errors from 0 to +/- 100%. An analysis of effort estimation models explains why this is so: the net additional impact of ESP error is relatively small compared to the other sources of error associated with estimation models. Conclusion: ESP errors affect effort estimates by a relatively minor amount. As soon as a model uses a size estimate and other factors to predict project effort, then ESP errors are not crippling to the process of estimation.
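A toy version of the Monte Carlo perturbation analysis mentioned above is sketched below, in the spirit of the paper's method rather than its actual data or calibration: a COCOMO-like formula (effort = a * KLOC^b) is fed size estimates perturbed by up to +/-100%, alongside a separate noise term standing in for all other sources of estimation error.

```python
# Monte Carlo sketch: perturb early size estimates and observe the effect on
# effort-estimate error relative to other error sources. Values are illustrative.
import numpy as np

rng = np.random.RandomState(8)
a, b = 2.94, 1.10                                 # COCOMO-II-style coefficients
true_kloc = rng.uniform(10, 500, size=1000)
true_effort = a * true_kloc ** b
other_error = rng.lognormal(0, 0.5, size=1000)    # stand-in for non-size error sources

for size_error in (0.0, 0.5, 1.0):                # 0%, 50%, 100% ESP error
    noisy_kloc = np.clip(true_kloc * (1 + rng.uniform(-size_error, size_error, 1000)),
                         1, None)
    est_effort = a * noisy_kloc ** b * other_error
    mre = np.abs(est_effort - true_effort) / true_effort
    print(f"ESP error +/-{size_error:.0%} -> median effort MRE {np.median(mre):.0%}")
```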
Submitted 19 February, 2018; v1 submitted 9 December, 2016;
originally announced December 2016.
-
Finding Better Active Learners for Faster Literature Reviews
Authors:
Zhe Yu,
Nicholas A. Kraft,
Tim Menzies
Abstract:
Literature reviews can be time-consuming and tedious to complete. By cataloging and refactoring three state-of-the-art active learning techniques from evidence-based medicine and legal electronic discovery, this paper finds and implements FASTREAD, a faster technique for studying a large corpus of documents. This paper assesses FASTREAD using datasets generated from existing SE literature reviews (Hall, Wahono, Radjenović, Kitchenham et al.). Compared to manual methods, FASTREAD lets researchers find 95% of the relevant studies after reviewing an order of magnitude fewer papers. Compared to other state-of-the-art automatic methods, FASTREAD reviews 20-50% fewer studies while finding the same number of relevant primary studies in a systematic literature review.
Submitted 2 February, 2018; v1 submitted 9 December, 2016;
originally announced December 2016.
-
Negative Results for Software Effort Estimation
Authors:
Tim Menzies,
Ye Yang,
George Mathew,
Barry Boehm,
Jairus Hihn
Abstract:
Context: More than half the literature on software effort estimation (SEE) focuses on comparisons of new estimation methods. Surprisingly, there are no studies comparing the latest state-of-the-art methods with decades-old approaches. Objective: To check if new SEE methods generate better estimates than older methods. Method: Firstly, collect effort estimation methods ranging from "classical" COCOMO (parametric estimation over a pre-determined set of attributes) to "modern" (reasoning via analogy using spectral-based clustering plus instance and feature selection, and a recent "baseline method" proposed in ACM Transactions on Software Engineering). Secondly, catalog the list of objections that led to the development of post-COCOMO estimation methods. Thirdly, characterize each of those objections as a comparison between newer and older estimation methods. Fourthly, using four COCOMO-style data sets (from 1991, 2000, 2005, 2010), run those comparison experiments. Fifthly, compare the performance of the different estimators using a Scott-Knott procedure with (i) the A12 effect size to rule out "small" differences and (ii) a 99% confidence bootstrap procedure to check for statistically different groupings of treatments. Results: The major negative result of this paper is that, for the COCOMO data sets, nothing we studied did any better than Boehm's original procedure. Conclusions: When COCOMO-style attributes are available, we strongly recommend (i) using that data and (ii) using COCOMO to generate predictions. We say this since the experiments of this paper show that, at least for effort estimation, how data is collected is more important than what learner is applied to that data.
Submitted 29 September, 2016; v1 submitted 18 September, 2016;
originally announced September 2016.
-
Are Delayed Issues Harder to Resolve? Revisiting Cost-to-Fix of Defects throughout the Lifecycle
Authors:
Tim Menzies,
William Nichols,
Forrest Shull,
Lucas Layman
Abstract:
Many practitioners and academics believe in a delayed issue effect (DIE); i.e. the longer an issue lingers in the system, the more effort it requires to resolve. This belief is often used to justify major investments in new development processes that promise to retire more issues sooner.
This paper tests for the delayed issue effect in 171 software projects conducted around the world in the period from 2006--2014. To the best of our knowledge, this is the largest study yet published on this effect. We found no evidence for the delayed issue effect; i.e. the effort to resolve issues in a later phase was not consistently or substantially greater than when issues were resolved soon after their introduction.
This paper documents the above study and explores reasons for this mismatch between this common rule of thumb and empirical data. In summary, DIE is not some constant across all projects. Rather, DIE might be a historical relic that occurs intermittently only in certain kinds of projects. This is a significant result since it predicts that new development processes that promise to retire more issues faster will not have a guaranteed return on investment (depending on the context where applied), and that a long-held truth in software engineering should not be considered a global truism.
Submitted 15 September, 2016;
originally announced September 2016.
-
Less is More: Minimizing Code Reorganization using XTREE
Authors:
Rahul Krishna,
Tim Menzies,
Lucas Layman
Abstract:
Context: Developers use bad code smells to guide code reorganization. Yet developers, text books, tools, and researchers disagree on which bad smells are important. Objective: To evaluate the likelihood that a code reorganization to address bad code smells will yield improvement in the defect-proneness of the code. Method: We introduce XTREE, a tool that analyzes a historical log of defects seen previously in the code and generates a set of useful code changes. Any bad smell that requires changes outside of that set can be deprioritized (since there is no historical evidence that the bad smell causes any problems). Evaluation: We evaluate XTREE's recommendations for bad smell improvement against recommendations from previous work (Shatnawi, Alves, and Borges) using multiple data sets of code metrics and defect counts. Results: Code modules that are changed in response to XTREE's recommendations contain significantly fewer defects than recommendations from previous studies. Further, XTREE endorses changes to very few code metrics, and the bad smell recommendations (learned from previous studies) are not universal to all software projects. Conclusion: Before undertaking a code reorganization based on a bad smell report, use a tool like XTREE to check and ignore any such operations that are useless; i.e. ones which lack evidence in the historical record that it is useful to make that change. Note that this use case applies to both manual code reorganizations proposed by developers as well as those conducted by automatic methods. This recommendation assumes that there is an historical record. If none exists, then the results of this paper could be used as a guide.
Submitted 14 May, 2017; v1 submitted 12 September, 2016;
originally announced September 2016.
-
Why is Differential Evolution Better than Grid Search for Tuning Defect Predictors?
Authors:
Wei Fu,
Vivek Nair,
Tim Menzies
Abstract:
Context: One of the black arts of data mining is learning the magic parameters which control the learners. In software analytics, at least for defect prediction, several methods, like grid search and differential evolution (DE), have been proposed to learn these parameters; they have been shown to improve the performance scores of learners.
Objective: We want to evaluate which method can find better parameters in terms of performance score and runtime cost.
Methods: This paper compares grid search to differential evolution, which is an evolutionary algorithm that makes extensive use of stochastic jumps around the search space.
Results: We find that the seemingly complete approach of grid search does no better, and sometimes worse, than the stochastic search. When repeated 20 times to check for conclusion validity, DE was over 210 times faster than grid search to tune Random Forests on 17 testing data sets, as measured by F-Measure.
Conclusions: These results are puzzling: why is a quick partial search just as effective as a much slower and more extensive search? To answer that question, we turned to the theoretical optimization literature. Bergstra and Bengio conjecture that grid search is not more effective than more randomized searchers if the underlying search space is inherently low dimensional. This is significant since recent results show that defect prediction exhibits very low intrinsic dimensionality---an observation that explains why a fast method like DE may work as well as a seemingly more thorough grid search. This suggests, as a future research direction, that it might be possible to peek at data sets before doing any optimization in order to match the optimization algorithm to the problem at hand.
Submitted 10 March, 2017; v1 submitted 8 September, 2016;
originally announced September 2016.
-
Tuning for Software Analytics: is it Really Necessary?
Authors:
Wei Fu,
Tim Menzies,
Xipeng Shen
Abstract:
Context: Data miners have been widely used in software engineering to, say, generate defect predictors from static code measures. Such static code defect predictors perform well compared to manual methods, and they are easy to use and useful. But one of the "black arts" of data mining is setting the tunings that control the miner. Objective: We seek a simple, automatic, and very effective method for finding those tunings. Method: For each experiment with different data sets (from open source JAVA systems), we ran differential evolution as an optimizer to explore the tuning space (as a first step), then tested the tunings using hold-out data. Results: Contrary to our prior expectations, we found these tunings were remarkably simple: it only required tens, not thousands, of attempts to obtain very good results. For example, when learning software defect predictors, this method can quickly find tunings that alter detection precision from 0% to 60%. Conclusion: Since (1) the improvements are so large, and (2) the tuning is so simple, we need to change standard methods in software analytics. At least for defect prediction, it is no longer enough to just run a data miner and present the result without conducting a tuning optimization study. The implication for other kinds of analytics is now an open and pressing issue.
Submitted 6 September, 2016;
originally announced September 2016.
-
A deep learning model for estimating story points
Authors:
Morakot Choetkiertikul,
Hoa Khanh Dam,
Truyen Tran,
Trang Pham,
Aditya Ghose,
Tim Menzies
Abstract:
Although there has been substantial research in software analytics for effort estimation in traditional software projects, little work has been done for estimation in agile projects, especially estimating user stories or issues. Story points are the most common unit of measure used for estimating the effort involved in implementing a user story or resolving an issue. In this paper, we offer for the \emph{first} time a comprehensive dataset for story points-based estimation that contains 23,313 issues from 16 open source projects. We also propose a prediction model for estimating story points based on a novel combination of two powerful deep learning architectures: long short-term memory and recurrent highway network. Our prediction system is \emph{end-to-end} trainable from raw input data to prediction outcomes without any manual feature engineering. An empirical evaluation demonstrates that our approach consistently outperforms three common effort estimation baselines and two alternatives in both Mean Absolute Error and the Standardized Accuracy.
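The general shape of a text-sequence regressor for story points (tokenized issue text fed through an embedding, a recurrent layer, and a scalar output) can be sketched as below. This is a much-simplified stand-in, not the authors' long short-term memory plus recurrent highway architecture; the vocabulary size, sequence length, and data are dummy values.

```python
# Simplified sketch of an end-to-end text -> story-point regressor.
import numpy as np
import tensorflow as tf

vocab, seq_len = 5000, 100
issues = np.random.randint(1, vocab, size=(256, seq_len))   # fake tokenized issue text
points = np.random.randint(1, 13, size=(256,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab, 64),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(16, activation="relu"),   # stand-in for the highway layers
    tf.keras.layers.Dense(1),                       # predicted story points
])
model.compile(optimizer="adam", loss="mae")          # paper reports Mean Absolute Error
model.fit(issues, points, epochs=2, batch_size=32, verbose=0)
print("MAE on (fake) training data:", round(float(model.evaluate(issues, points, verbose=0)), 2))
```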
Submitted 6 September, 2016; v1 submitted 2 September, 2016;
originally announced September 2016.
-
What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)
Authors:
Amritanshu Agrawal,
Wei Fu,
Tim Menzies
Abstract:
Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeler is Latent Dirichlet allocation (LDA). When run on different datasets, LDA suffers from "order effects", i.e. different topics are generated if the order of training data is shuffled. Such order effects introduce a systematic error for any study. This error can lead to misleading results; specifically, inaccurate topic descriptions and a reduction in the efficacy of text mining classification results. Objective: To provide a method in which distributions generated by LDA are more stable and can be used for further analysis. Method: We use LDADE, a search-based software engineering tool that tunes LDA's parameters using DE (Differential Evolution). LDADE is evaluated on data from a programmer information exchange site (Stackoverflow), title and abstract text of thousands of Software Engineering (SE) papers, and software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark), across different platforms (Linux, Macintosh) and for different kinds of LDAs (VEM, or using Gibbs sampling). Results were scored via topic stability and text mining classification accuracy. Results: In all treatments: (i) standard LDA exhibits very large topic instability; (ii) LDADE's tunings dramatically reduce cluster instability; (iii) LDADE also leads to improved performance for supervised as well as unsupervised learning. Conclusion: Due to topic instability, using standard LDA with its "off-the-shelf" settings should now be deprecated. Also, in future, we should require SE papers that use LDA to test and (if needed) mitigate LDA topic instability. Finally, LDADE is a candidate technology for effectively and efficiently reducing that instability.
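The "order effect" that LDADE targets can be demonstrated in a few lines: run LDA on the same documents presented in two different orders and compare the topics' top words. The toy corpus, the online learning setting, and the overlap score below are illustrative choices, not the paper's experimental setup.

```python
# Sketch: measure LDA topic instability across two shufflings of the same corpus.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["null pointer exception in java code",
        "regression testing of java modules",
        "neural network training is slow",
        "deep learning model overfits data",
        "unit testing and code coverage",
        "tuning learning rate of network"] * 20

vec = CountVectorizer()
X_all = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

def top_words(order_seed, n_topics=2):
    order = np.random.RandomState(order_seed).permutation(X_all.shape[0])
    lda = LatentDirichletAllocation(n_components=n_topics, learning_method="online",
                                    random_state=0).fit(X_all[order])
    return [set(vocab[np.argsort(t)[-5:]]) for t in lda.components_]

a, b = top_words(1), top_words(2)
overlap = [max(len(t & u) / 5 for u in b) for t in a]   # best-match top-word overlap
print("topic stability across orderings:", [round(o, 2) for o in overlap])
```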
Submitted 20 February, 2018; v1 submitted 29 August, 2016;
originally announced August 2016.
-
Finding Trends in Software Research
Authors:
George Mathew,
Amritanshu Agrawal,
Tim Menzies
Abstract:
This paper explores the structure of research papers in software engineering. Using text mining, we study 35,391 software engineering (SE) papers from 34 leading SE venues over the last 25 years. These venues were divided, nearly evenly, between conferences and journals. An important aspect of this analysis is that it is fully automated and repeatable. To achieve that automation, we used a stable topic modeling technique called LDADE that fully automates parameter tuning in LDA. Using LDADE, we mine 11 topics that represent much of the structure of contemporary SE. The 11 topics presented here should not be "set in stone" as the only topics worthy of study in SE. Rather, our goal is to report that (a) text mining methods can detect large-scale trends within our community; (b) those topics change with time; so (c) it is important to have automatic agents that can update our understanding of our community whenever new data arrives.
Submitted 2 October, 2018; v1 submitted 29 August, 2016;
originally announced August 2016.
-
"Sampling"' as a Baseline Optimizer for Search-based Software Engineering
Authors:
Jianfeng Chen,
Vivek Nair,
Rahul Krishna,
Tim Menzies
Abstract:
Increasingly, Software Engineering (SE) researchers use search-based optimization techniques to solve SE problems with multiple conflicting objectives. These techniques often apply CPU-intensive evolutionary algorithms to explore generations of mutations to a population of candidate solutions. An alternative approach, proposed in this paper, is to start with a very large population and sample down to just the better solutions. We call this method "SWAY", short for "the sampling way". SWAY is very simple to implement and, in studies with various software engineering models, this sampling approach was found to be competitive with corresponding state-of-the-art evolutionary algorithms while requiring far lower computational cost. Considering the simplicity and effectiveness of SWAY, we therefore propose this approach as a baseline method for search-based software engineering models, especially for models that are very slow to execute.
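A minimal single-objective sketch of the "start large, sample down" idea: recursively split the population by projecting it onto the line between two distant members, evaluate only those two representatives, and recurse into the better half. The objective function, population size, and stopping threshold are arbitrary stand-ins, and the real SWAY handles multiple objectives.

```python
# Sketch of recursive sample-down search over a very large random population.
import numpy as np

rng = np.random.RandomState(9)

def objective(x):                        # pretend model evaluation (to minimize)
    return np.sum((x - 0.7) ** 2)

def sway(pop, enough=8):
    if len(pop) <= enough:
        return min(pop, key=objective)
    east = pop[np.argmax(np.linalg.norm(pop - pop[0], axis=1))]   # two distant members
    west = pop[np.argmax(np.linalg.norm(pop - east, axis=1))]
    proj = (pop - west) @ (east - west)                           # FastMap-style projection
    order = np.argsort(proj)
    west_half, east_half = pop[order[:len(pop) // 2]], pop[order[len(pop) // 2:]]
    # evaluate only the two representatives, then recurse into the better half
    better = east_half if objective(east) < objective(west) else west_half
    return sway(better, enough)

population = rng.rand(10000, 5)          # very large initial population
best = sway(population)
print("selected candidate:", np.round(best, 2), "score:", round(objective(best), 3))
```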
Submitted 5 January, 2018; v1 submitted 26 August, 2016;
originally announced August 2016.