Hoopla!

Hello, world! My name is Michael R. Crusoe; I’m a Research Software Engineer in C. Titus Brown’s laboratory for Data Intensive Biology at the University of California, Davis, in the Department of Population Health and Reproduction. This blog will be my primary communication outlet for my work with the Common Workflow Language (CWL) project.

Background

The CWL began in July 2014 in Boston, MA, USA at the codefest preceding the Bioinformatics Open Source Conference (BOSC). John Chilton, from the Galaxy Team, approached Nebojša Tijanić of Seven Bridges Genomics (SBG) about workflow interoperability. They were quickly joined by Peter Amstutz from Curoverse, Hervé Ménager from the Pasteur Institute, and me.

As a bioinformatics tool developer, I knew the burden of trying to maintain deep integration with all the various bioinformatics processing platforms out there. I had complained to John Chilton several times before about how unrealistic it was to expect scientific tool authors to describe their tools to each system separately. When I saw the group forming, I jumped at the chance to promote a portable way to describe how to run command-line tools.

While I take no credit for it, I am very proud that the working group that grew around the CWL released its second draft at this year’s BOSC, and it was my pleasure to present the team’s work at the conference.

Funding

Now that the group has established itself and we are attracting even more projects that want to work with us, it is time for the group to mature and grow. At BOSC this year I was vocal and open about the fact that there was no specific funding for the CWL as a project. To my delight, leadership from both Curoverse and Seven Bridges Genomics reached out to see what could be done. I am happy to announce that Curoverse has written John Chilton, Carrie Ganote (Indiana University, Bloomington), and me into their recently submitted Phase II SBIR proposal. On a more immediate basis, SBG has given UC Davis an unrestricted gift to support me working full time on the project for the next six months.

Working with Brandi Davis-Dusenbery, Scientific Program Manager at SBG, I developed both a set of six-month goals for the project and a longer-term vision:

Six-Month Plan

  1. Continuation of the working group meetings held online every 2-3 weeks.
  2. CWL workshop at the Festival of Genomics, California on November 3rd.
  3. CWL codefest in Santa Clara on November 4th.
  4. I’ll be meeting with the Agave Platform team in November to see about harmonizing our efforts.
  5. One to three additional meetups, workshops, or hackathons.
  6. A baseline sustainability report on the project will be written and used to further prioritize plans. This report will follow in spirit the Criteria-based assessment of the khmer suite. In addition, specific community engagement metrics will be reported.

Group goals:

  1. A system of project governance is agreed upon and set up through a community-centered process.
     • A US charitable foundation is set up and it owns assets on behalf of the community as directed by the new shared governance structure.
     • This foundation is able to provide direct support to contributors from donated funds and/or funds raised through competitive grant applications.
  2. Version 1 of the specification is released.
     • A process for defining extensions is agreed upon.
  3. At least 3 implementations of CWL are available and input from groups implementing the spec is incorporated into the specification as extensions.
  4. A manuscript or position paper describing CWL is formalized and ready for submission to a peer-reviewed journal.
  5. A robust validator is available.
  6. User-facing guides on tool and workflow description are grown via several collaborative tech events (codefests, hackathons, site visits, workshops) where community members teach/work with newcomers on some aspect of the CWL. These events include dedicated time and space to act on lessons learned by updating documentation, managing the issue tracker, and working on the specification and implementations.
  7. A simple dashboard is built that lists the number of tools (workflows) described using the CWL on the CWL main page. Other dashboards show how these tool descriptions’ tests (in the form of workflows) are passing on the various implementations. The implementations themselves are also regularly tested using a conformance suite.

Long Term Goals

  1. CWL workflows are the preferred way to fulfill the now-ubiquitous journal and funder requirements for computational reproducibility and reusability in the life sciences and beyond. (Supplying a CWL-compliant version of a workflow fulfills journal requirements for reproducibility and fast-tracks publication.)
  2. Scientific tool authors ship CWL tool definitions as a matter of course. Most of them are made automatically when the authors use standard libraries. For some tools the community provides the definitions. Regardless of the source, the CWL tool definitions are shipped alongside each tool when installed using the major end-user package managers (Debian, RedHat, Conda, LinuxBrew).
  3. Workflow authors/users have a variety of tools to create, edit, and visualize workflows, both standalone and as part of complete workflow systems.
  4. What constitutes a valid workflow/tool description is non-controversial. Automated checkers and community support are widely available.
  5. Certifications of compliance are issued to conformant platforms in an automated fashion.
  6. The CWL Project has a perfect score using the Software Sustainability Institute’s criteria-based software evaluation.
  7. The regular participants in the CWL project and leadership are diverse both by demographics (with global demographics as the target) and by skillset / field / role.