Research
I am genuinely interested in any topic that has to do with systems and software, and combinations thereof. Specifically, my research interests are the following:
- Software engineering
- Software analytics
- Machine learning
- Systems software
Software Engineering - Software Analytics
Current work
I am exploring ways to apply machine learning to software engineering problems. During my sabbatical at Facebook, I worked on type prediction for Python (TypeWriter); we extended this work into Type4Py (see also the related ManyTypes4Py dataset), which also includes a VS Code plugin. We also implemented Codefill, an autocompletion system for Python that beats the state of the art, based on the concept of parallel representations of syntax and code sequences.
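To give a flavour of the parallel-representation idea behind Codefill, here is a minimal, illustrative sketch (not the actual Codefill implementation; the helper function and its filtering choices are my own): Python's standard `tokenize` module is used to derive two aligned sequences from a code fragment, the token values and their syntactic kinds, which a completion model could then consume side by side.

```python
# Illustrative sketch only: build "parallel" sequences of token values and token kinds
# for a Python snippet, using the standard library tokenizer.
import io
import tokenize

def parallel_representations(source: str):
    """Return two aligned lists: token values and their syntactic kinds."""
    values, kinds = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip purely structural tokens to keep the example focused.
        if tok.type in (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        values.append(tok.string)
        kinds.append(tokenize.tok_name[tok.type])
    return values, kinds

values, kinds = parallel_representations("total = price * quantity")
print(list(zip(values, kinds)))
# [('total', 'NAME'), ('=', 'OP'), ('price', 'NAME'), ('*', 'OP'), ('quantity', 'NAME')]
```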
At Facebook, I also applied time-series machine learning to the problem of unsupervised crash categorization (KabOOM).
We are working to improve dependency management. Our initial experiments with the Rust ecosystem are very promising. I was the original primary instigator of the H2020 FASTEN project, which aims to make dependency management better by making package managers more intelligent. Using fine-grained dependency management, we are also able to perform safer dependency updates.
This line of work powers our software supply chain security startup, Endor Labs.
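To illustrate why resolving dependencies at the level of functions and call graphs enables safer updates, here is a minimal, self-contained sketch; it is not the FASTEN or Endor Labs tooling, and the package and function names are invented. Given a call graph that spans package boundaries, a vulnerability advisory naming a single function can be checked for reachability from the application, instead of flagging every client of the affected package.

```python
# Illustrative sketch only: reachability of an advisory's function over a
# hypothetical cross-package call graph (caller -> list of callees).
from collections import deque

CALL_GRAPH = {
    "app.main":           ["libfoo.parse", "libfoo.render"],
    "libfoo.parse":       ["libbar.decode"],
    "libfoo.render":      [],
    "libbar.decode":      [],
    "libbar.unsafe_eval": [],  # the function named by the (made-up) advisory
}

def reachable(entry: str, target: str, graph: dict) -> bool:
    """Breadth-first search: can `target` be reached from `entry`?"""
    seen, queue = {entry}, deque([entry])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False

# The application depends on libbar but never reaches the vulnerable function,
# so the update can be scheduled calmly instead of being treated as an emergency.
print(reachable("app.main", "libbar.unsafe_eval", CALL_GRAPH))  # False
```

In practice, such call graphs are extracted from the packages themselves rather than written by hand, but the reachability question stays the same.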
In the area of software process optimization, I co-led, and currently help out with, the Fintech lab's software analytics track. We are exploring how to make project planning more predictable using machine learning. In our initial work, we investigated the productivity effects of releasing software in smaller iterations. In follow-up work, we built models to characterize and predict delays in agile development. We also found that adding team-related features to task duration prediction models improves them by 30%.
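The shape of the team-features comparison can be sketched roughly as follows (this is not the study's model or data; the features, the synthetic data generation and the regressor are placeholders of mine): fit the same regressor on task-level features alone and on task-level plus team-level features, then compare the prediction error.

```python
# Illustrative sketch only, on synthetic data: does adding team-level features
# reduce the error of a task-duration prediction model?
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
task_size = rng.integers(1, 13, n)        # e.g. story points (made-up feature)
team_size = rng.integers(2, 10, n)        # team-level feature (made-up)
team_churn = rng.random(n)                # fraction of new team members (made-up)
# Synthetic ground truth: duration depends on the task *and* on team context.
duration = task_size * (1.0 + 0.15 * team_size + 2.0 * team_churn) + rng.normal(0, 1, n)

X_task = task_size.reshape(-1, 1)
X_full = np.column_stack([task_size, team_size, team_churn])
Xt_tr, Xt_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
    X_task, X_full, duration, random_state=0)

for label, X_tr, X_te in [("task features only", Xt_tr, Xt_te),
                          ("task + team features", Xf_tr, Xf_te)]:
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    print(label, round(mean_absolute_error(y_te, model.predict(X_te)), 2))
```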
I am exploring how pull-based distributed software development works, both quantitatively (ICSE 2014) and qualitatively (ICSE 2015; ICSE 2016, SIGSOFT best paper award). The dataset I developed as part of the quantitative investigation won the best dataset award at MSR 2014. Using the findings of the qualitative work, I have also co-proposed a service to help developers prioritize pull request handling and a service to match job advertisements to developer profiles. We also investigated whether geographical and nationality biases exist among OSS developers. More than 1,000 other papers have collectively cited or used our results; we consolidated the follow-up findings in a massive-scale replication (TSE) of the ICSE 2014 paper.
Expansions of this line of work have found their way into industry. Nudge and ConE are systems deployed at Microsoft (and elsewhere) that help developers accelerate the pull request code review process and learn which other pull requests may conflict with theirs.
Furthermore, we have synthesized all related work into a preliminary theory of software change.
Past work
From 2011 until 2021, I actively developed and maintained a collection of tools for obtaining and analysing data from GitHub, through the GHTorrent project. The project has been awarded the best data project award at MSR 2013, has received the foundational contribution award at MSR 2018, and has been selected as the official dataset of the MSR 2014 mining challenge. At the time of writing (Jun 2018), 250 papers (40% of all GitHub-related papers, according to one source) were based on it, more than 450 researchers were using its data, GitHub included it in its 2014 data challenge, and Microsoft chose it as a data source for monitoring their OSS projects. Together with colleagues, I have documented the promises and pitfalls of doing research with GHTorrent (MSR14, EMSE).
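As a hypothetical illustration of the kind of relational query GHTorrent was built to support: the real dumps target MySQL and a far richer schema, so the tiny in-memory SQLite tables below are invented stand-ins that only serve to make the sketch self-contained.

```python
# Hypothetical sketch: the table layout and sample rows below are invented
# stand-ins, not the GHTorrent schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, language TEXT);
    CREATE TABLE pull_requests (id INTEGER PRIMARY KEY, base_repo_id INTEGER);

    INSERT INTO projects VALUES (1, 'rails', 'Ruby'), (2, 'flask', 'Python');
    INSERT INTO pull_requests (base_repo_id) VALUES (1), (1), (1), (2);
    """
)

# How many pull requests target projects written in each language?
rows = conn.execute(
    """
    SELECT p.language, COUNT(*) AS num_prs
    FROM pull_requests pr JOIN projects p ON p.id = pr.base_repo_id
    GROUP BY p.language ORDER BY num_prs DESC
    """
).fetchall()

for language, num_prs in rows:
    print(f"{language}: {num_prs}")  # Ruby: 3, then Python: 1
```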
The follow-up CodeFeedr project created a platform for real-time mining of software analytics. We describe our vision here and present a set of case studies we worked on here.
I worked on analysing Continuous Integration logs. For that, we co-created the TravisTorrent project to disseminate Travis build results in a way similar to GHTorrent. TravisTorrent was the dataset for the 2017 MSR mining challenge (attracting a record number of submissions). Our results (MSR 2017) indicated significant (order-of-magnitude) differences in testing habits across programming languages, including build breakage rates and the number of tests run.
In the context of the TestRoots project, I did research on how developers use automated testing. Working mostly with Moritz Beller, I co-implemented a data collection and analysis pipeline for the WatchDog framework and a similar pipeline for Travis CI data. With those tools, we have shown (ICSE-NIER15, ICSE-SERIP16, FSE15) that developers do not test as much as they think they do and do not follow TDD approaches.
I led the design and development of Alitheia Core, a high performance software analytics platform that works with data from software repositories. I used the platform to develop models for the evaluation of developer contribution, to investigate the evolution of software security issues and to process and share data from more than 700 open source software repositories.
I also proposed a platform for analysing the quality of OSS projects and a corresponding hierarchical metrics-based software quality model for evaluating both the process and the product quality. I have contributed to a comprehensive survey of the literature in OSS research.
Together with Diomidis Spinellis, I worked on editing the Beautiful Architecture book. You can find more information on the publisher's book web site. Beautiful Architecture has also been translated into Chinese (架构之美), Japanese (ビューティフルアーキテクチャ) and Russian (Идеальная архитектура). If you buy it, you're also helping a good cause, as the royalties are donated to Médecins Sans Frontières.
Systems Software and Security
In a previous life, I ported JikesRVM, a JVM written mostly in Java, to run on top of bare hardware, without support from an operating system. Later, I contributed patches to the JikesRVM project to enable support for OpenSolaris and, in the context of the Google Summer of Code 2008 program, to compile and run with OpenJDK. I also proposed an architecture for making Java's I/O subsystem faster by replacing the operating system with a hypervisor, thus relieving it of context switches and unnecessary data copies; an almost identical approach has been independently developed into the JRockit Virtual Edition product.
On the systems performance front, I have studied the performance of then-popular dynamic content generation technologies for the Apache web server (my very first paper, which received a best paper award), and I investigated ways of optimally configuring the garbage collection algorithms of the two then-prevailing JVM implementations. I have also implemented a library that parses the output of the DTrace provider for Java in order to reveal obscure problems in the co-operation between the JVM and the operating system.
On the software security front, I co-developed a mechanism for identifying and preventing cross-site scripting attacks (missing reference), I investigated the evolution of software security issues, an effort that was later scaled up to analyse the entire Maven repository, and I analysed the security challenges and requirements of (then) popular component software middleware.
Grants
I have written the following research proposals, which led to funding either for me or for the institution I was working for at the time. I have also authored and co-authored several other proposals, which were obviously not as lucky :-)
- FASTEN
- Fine-grained analysis for dependency management. €3,950,000. Host: TU Delft. Partners: AUEB, U Milano, SIG, Endocode, OW2. Funding agency: European Commission. 2019
- CodeFeedr
- Next-gen software analytics. NWO Big Software. €440,000. Host: TU Delft. Partners: SIG. Funding agency: NWO. 2015
- Pourquoi
- Pull request quality services. STW Take-off phase 1. €40,000. Host: TU Delft. Funding agency: STW. 2014
- SEFUNC
- Software Engineering Properties of Functionally Enabled Languages. Marie Curie Intra European Fellowship grant. Duration 16 months. Total budget €130,000. Host: TU Delft. Funding agency: Research Executive Agency (European Commission). 2012
- STEREO
- Software Engineering Research Platform. Collaborative research project. 3 partners. Duration 36 months. Total budget: €600,000. Host: Athens University of Economics and Business. Funding agency: Greek Secretariat of Research and Technology. 2010
- CallGraphRank
- Pagerank on software graphs. Basic research grant. 2 researchers. Duration 12 months. Total budget: €10,000. Funding agency: Athens University of Economics and Business. 2010
- SQO-OSS
- Software Quality Observatory for Open Source Software. Collaborative research project. 6 partners. Duration 24 months. Total budget €1,638,000. Host: Athens University of Economics and Business. Funding agency: European Commission. 2006
Projects
A list of research projects I have been actively involved with, along with a description of my role in them, in reverse chronological order:
- GHTorrent
- GHTorrent created an infrastructure to continuously mine and share data from the GitHub social coding platform (2011 – 2021). I was the lead developer and maintainer of the project. Code: GHTorrent.
- CodeFeedr
- The project created a streaming software analytics infrastructure for real-time mining of software project repositories (2016 – 2020). I was the PI and lead software designer. Code can be found here.
- TestRoots
- The project investigated how developers use testing, with the ultimate goal of learning from past behaviours to drive up testing quality and efficiency.
- Passive
- The project investigated trust establishment frameworks in cloud computing and shared virtual machine environments. As part of the project management team, I oversaw and contributed to the mechanism design and software development.
- SQO-OSS
- The project developed a novel platform for software quality assessment. I was in charge of preparing the proposal, submitting it to the European Commission and handling the subsequent negotiations. As the project's manager, I oversaw its development. As a member of the RTD team, I was the platform's chief designer and contributed the largest number of lines in its code base. After the end of the project, I maintained and expanded its source code base. The project was the basis for my doctoral dissertation.
- PENED
- The PENED series of projects were Greek state-funded joint research projects for PhD students. Along with Stephanos Androutsellis-Theotokis and Kostas Stroggylos, we investigated novel ways of doing secure business-to-business transactions, both at the architectural and at the implementation level. As a researcher, I oversaw the architectural design and optimised the developed platform.
- PRAXIS
- The PRAXIS project looked into methods for performing automated document exchange in business-to-business and business-to-state scenarios. As a research assistant, I designed document schemas and a couple of exchange processes.
- Jamaica
- The project investigated chip multiprocessors and the accompanying parallel software. As part of my MSc research, I ported the JikesRVM Java virtual machine to run on top of bare hardware.
Research software and data
- In our work (missing reference), we used both the Alitheia Core dataset and a new, shorter dataset. Find more details on how to replicate the case studies reported in this paper here.
- Alitheia Core is a platform that enables researchers in the field of software engineering to automate experiments and distribute the processing load over a cluster of machines. I maintain the project on its own site. The project was the centerpiece of my PhD work and has also been presented on its own in this work. A dump of the Alitheia Core database, as it was at the end of my PhD, can be found on this page. The data schema has changed a lot since then, so if you are interested in a more recent version of the data, please contact me directly.
- The Java DTrace Toolkit (JDT) is a series of Perl scripts that use the JVM DTrace providers on OpenSolaris to provide deep insights into how Java software executes and how it interacts with the operating system. More…
Scripts and Hacks
- Whereami is a Perl script, along with a J2ME application, that allows a user to record their tracks and allows other users to view the current status, speed and altitude of the data-sharing user on a Google map. This hack was done back in December 2007, long before Google released its similarly functioning, but far more popular, Latitude service. You can find it here.
- Thesis-o-meter: Almost all computer science PhD students implement a tool to monitor their thesis writing progress at some point. I am sharing mine here to help avoid re-inventing the wheel (even though I am pretty sure this particular wheel will keep being re-invented forever; procrastination is much more enjoyable than writing).