Model and Data Sharing Working Group - 2014 MSM

Working Group leads: Peter Hunter, Herbert Sauro, Jim Bassingthwaighte, Roger Mark, George Moody

Goals and Objectives:

The specific goals of this Working Group are to:

  • Develop and promote
    • modeling standards
    • software for authoring, visualisation and simulation of models
    • repositories of models and experimental data to facilitate model reproducibility, model sharing, and model enhancement
  • Foster the pairing of models with data. Models accompanied by data that provide evidence of validity will serve as milestones of scientific progress.
  • Archive data and models in forms convenient to potential users, and develop archives that will have permanence.

The group will provide an interface to international efforts on model sharing by the SBML consortium, the European VPH community, the IUPS Physiome community, the NeuroML project and the synthetic biology data exchange group.

Recognizing that the interpretation of DATA and the integration of knowledge about real biological systems is the prime purpose of the IMAG working groups and the consortium as a whole, this working group is also dedicated to encouraging the development, use, and maintenance of repositories of experimental data together with the associated models that illuminate the data. In particular, the goal is to facilitate the free exchange of data and models. The philosophical view is that good data are the basic treasures of science; such treasures should be made freely available.

Participants:

This list is incomplete. Please join in.

  1. Bassingthwaighte, Jim, U.Washington, jbb2@uw.edu
  2. Barhak, Jacob, jacob.barhak@gmail.com
  3. Beard, Daniel A, Med Coll Wisc, dbeard@mcw.edu
  4. Christini, David J, Weill Cornell, dchristi@med.cornell.edu
  5. Hucka, Michael, CalTech, mhucka@caltech.edu
  6. Hunt, Anthony, UCSF, a.hunt@ucsf.edu
  7. Hunter, Peter, U. of Auckland NZ, p.hunter@auckland.ac.nz
  8. Mark, Roger, MIT, rgmark@MIT.EDU
  9. Marmarelis,Vasilis, USC, marmarelis@hotmail.com
  10. McCulloch, Andrew, UC San Diego, amcculloch@eng.ucsd.edu
  11. Moody, George, MIT, george@mit.edu
  12. Ortega, Jason M, Lawrence Livermore Nat Lab, ortega17@llnl.gov
  13. Peng, Grace, NIBIB/NIH (IMAG Chair), penggr@mail.nih.gov
  14. Sauro, Herbert, U. Washington, Seattle, hsauro@uw.edu
  15. Smith, Lucian P, CalTech, lpsmith@spod-central.org

MSM Meetings

2013 Meeting

Slides for breakout session


Current Discussions

Presentations

Thursday January 24, 2013 3pm EST

A status update on COMBINE standardization activities, with a focus on SBML. Michael Hucka, Caltech
SLIDES

A vast number of modeling and simulation software tools are available today for research in computational systems biology. This wealth of resources is a boon to researchers, but it also presents interoperability problems. Different software tools for systems biology are implemented in different programming languages, run on different operating systems, express models using different mathematical frameworks, provide different analysis methods, present different user interfaces, and support different data formats. Despite working with different tools, researchers want to disseminate their work widely, as well as reuse and extend the models of other researchers. They do not want to hardcode their models as software programs, nor assume everyone uses the same computing environment; they need common exchange formats for representing their models in such a way that a variety of software systems can read and write them.

There exist a number of standardization efforts today with the goal of developing and evolving exchange formats for computational systems biology; they differ along dimensions such as domain of specialization and medium of communication. Many of these efforts are engaged in COMBINE (the COmputational Modeling in BIology NEtwork; http://co.mbine.org), an organization whose main goal is to help coordinate community standardization activities. In this presentation, I will summarize the goals of the core standards represented in COMBINE, and provide details about recent developments in certain ones with probable relevance to multiscale modeling, particularly SBML (Systems Biology Markup Language), as well as SED-ML (Simulation Experiment Description Markup Language) and SBGN (the Systems Biology Graphical Notation).
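
As a small illustration of what tool-neutral exchange buys in practice, here is a minimal sketch of loading an SBML file from Python with the python-libsbml bindings; the file name is a placeholder for any SBML file exported by an SBML-aware tool.

```python
# Minimal sketch: reading an SBML model with python-libsbml (one of the
# libraries maintained by the SBML team). 'model.xml' is a placeholder.
import libsbml

document = libsbml.readSBML("model.xml")
if document.getNumErrors() > 0:
    document.printErrors()  # report parse/consistency problems
else:
    model = document.getModel()
    print("Species:  ", model.getNumSpecies())
    print("Reactions:", model.getNumReactions())
    # The same document can be handed unchanged to any other SBML-aware tool.
    for i in range(model.getNumSpecies()):
        s = model.getSpecies(i)
        print(" ", s.getId(), s.getInitialConcentration())
```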


Friday March 8, 2013 2pm EST


Modular Modeling: Standards and Tools. Lucian Smith, Caltech
SLIDES

Model exchange and re-use have been greatly enhanced over the last decade by the emergence of standard model exchange languages such as SBML and CellML. Model design and re-use become even more tractable with modularity: larger, more complex models can be built using well-understood smaller models. CellML has long been modular, and SBML now has a modularity 'package' which allows modular model construction. We will present an overview of the capabilities of the modeling standards and tools that facilitate modular modeling, including the Antimony language (http://antimony.sf.net).
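
To make the modular construction concrete, here is a hedged sketch in the Antimony language mentioned above, driven from its Python bindings. The module names, species, and kinetics are invented purely for illustration.

```python
# Hedged sketch of modular model composition in Antimony, via its Python
# bindings. Module names ('unit', 'composite') and kinetics are invented.
import antimony

SOURCE = """
model unit(S1, S2)
  J0: S1 -> S2; k1*S1;
  k1 = 0.3;
end

model composite()
  species X, Y, Z;
  A: unit(X, Y);   // first copy of the small model transforms X into Y
  B: unit(Y, Z);   // second copy consumes A's product Y
  X = 10; Y = 0; Z = 0;
end
"""

if antimony.loadAntimonyString(SOURCE) < 0:
    raise RuntimeError(antimony.getLastError())

# Flatten the modular description to plain SBML for exchange with other tools.
print(antimony.getSBMLString("composite"))
```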






Monday, September 9, 2013 at 2pm EDT: The 3D Virtual Cell Project – Towards Sustainable Scientific Progress

Host: Herbert Sauro

Philip Bourne, PhD, Professor of Pharmacology and Associate Vice Chancellor for Innovation, University of California San Diego

Abstract: I would argue that the process of scientific discovery is inefficient and therefore not cost effective, and hence only able to exploit a fraction of what needs to be explored. In a small way, we are in the early stages of a project that tries to address these shortcomings through community enablement. The 3D Virtual Cell Project seeks to establish the community and infrastructure to accelerate interdisciplinary science through the in silico modeling and simulation of the action of a living cell. Building community is not easy and requires building trust and providing reward in a shared space; efficient modeling requires new software approaches. The end result should be the accurate prediction of cellular function under a variety of environmental conditions, delivered through new modes of dissemination and learning. I will discuss progress to date and hope to engage you in a discussion of the broader issues.

 


Other Discussions

Model Sharing Myths

1. Models are easily reproducible

This is a common myth: one takes a model, enters it into a piece of software, and expects to get the same answer as published in the literature. Not necessarily true. Different tools implement numerical analysis methods differently, random number generators may be biased, and different time steps or scales may be used, all resulting in different simulation outcomes.
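
A small, self-contained illustration of the point, assuming nothing about any particular modeling tool: the same ODE system integrated under two different solver tolerances can yield visibly different trajectories, so "same model" does not imply "same output".

```python
# The same model, two solver settings, two answers: why solver configuration
# must be archived along with the model. The Brusselator is used here purely
# as a stand-in oscillatory system.
from scipy.integrate import solve_ivp

def brusselator(t, y, a=1.0, b=3.0):
    x, v = y
    return [a + x * x * v - (b + 1) * x, b * x - x * x * v]

loose = solve_ivp(brusselator, (0, 20), [1.0, 1.0], rtol=1e-2, atol=1e-4)
tight = solve_ivp(brusselator, (0, 20), [1.0, 1.0], rtol=1e-10, atol=1e-12)

print("x(20), loose tolerance:", loose.y[0, -1])
print("x(20), tight tolerance:", tight.y[0, -1])
# For oscillatory or stiff systems the two endpoints typically disagree.
```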

2. Extracting working models from the literature is trivial

This is a myth perpetuated by those who have never tried to extract a model from a published paper. Experience from the BioModels database project at the EBI shows that at least 9 out of 10 of all models curated by the EBI cannot be made to work from the published paper alone. This represents a huge waste of resources, since models must be painstakingly recreated. One would assume this is especially true for multiscale models, which are generally more complex than subcellular models; however, no data are available on the recreation of multiscale models from the literature.

Developing systematic methods to share models is one way to make the above myths come true. Standards such as SBML or CellML can be used to unambiguously describe a model, but we currently have no way to unambiguously specify how to reproduce the results of simulating a model. It is as if an experimentalist had no easy way to reproduce a published experiment.

Standards, Reproducibility and Model Repositories

On Tue, Sep 10, 2013 at 12:58 AM, George Moody <george@mit.edu> wrote:

Welcome to the group, Jacob! Unfortunately, neither Roger Mark nor I will be able to attend next month's MSM meeting, so we look forward to meeting you and introducing ourselves at a later time.

Roger and I joined the MSM consortium several years ago at Grace's invitation, and together with Jim Bassingthwaighte we established the MSM's data sharing working group, which merged with the model sharing working group last year. Roger and I represent PhysioNet (http://physionet.org/), an NIH-sponsored web resource we established in 1999 that provides large archives of open-access physiologic signals and time series for research; open-source software for exploration and analysis of these data, and (to a lesser extent) simulation and modeling of physiologic systems; and web services for researchers interested in collecting, preserving, characterizing, and sharing physiologic signals, time series, and related clinical data securely with limited groups of colleagues for limited periods, and eventually as open-access data.

Our goals in the MSM context are to connect modelers needing data to test their models with clinical researchers and experimentalists seeking predictive or explanatory models to account for their observations, and to encourage sharing, linkage, and reuse of open data, open models, and related open-source analytical software.
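
As a hedged illustration of how such archives can be used programmatically, the sketch below reads an open-access PhysioNet record with the wfdb Python package, which postdates this exchange; the record and database names are real PhysioNet identifiers.

```python
# Hedged sketch: pulling an open-access record from PhysioNet with the wfdb
# Python package (a later companion to the resource described above; the
# package and its pn_dir argument postdate this 2013 exchange).
import wfdb

# Read the first 10 seconds (3600 samples at 360 Hz) of MIT-BIH record 100
# directly from PhysioNet's servers.
record = wfdb.rdrecord("100", pn_dir="mitdb", sampto=3600)

print("Signals:", record.sig_name)       # e.g. ['MLII', 'V5']
print("Sampling frequency:", record.fs)  # 360 Hz for this database
print("Shape:", record.p_signal.shape)   # (3600, 2) array in physical units
```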

On Thu, Sep 12, 2013 at 4:21 AM, Jacob Barhak <jacob.barhak@gmail.com> wrote:

PhysioNet is an impressive tool for sharing data. However, it is limited in several aspects:

1. Limited to certain types of data - physiologic signals - no clinical trial data, as a counterexample

2. Limited as a model/software repository - no version control, no auto-installation like an app on a smartphone - although the web application to show the data is great

3. There is no apparent buildup and aggregation of information. For example, there is no preparation for ensemble models, and although there are challenges posted, I did not see the next step of assembling the algorithms together to improve results.

Also, there are other web sites that post medical challenges/repositories such as:

http://www.heritagehealthprize.com/c/hhp

http://www.grand-challenge.org/index.php/Main_Page

http://senselab.med.yale.edu/ModelDB/default.asp

I suggest members contribute to this list so we can figure out what is out there already.

Is there a way to organize all these and aggregate/share this data and software? I suggest we start a list of elements needed to support data/model sharing. Here are several points I suggest:

1. Version control - models as well as data change from time to time, and this information should be retrievable and accessible

2. Web interface with search capabilities - this means it would be possible to run the model through the web interface

3. Association of model and data - each model should have an example of data associated with it - each model will have to have sample data for the benefit of the user

4. Proper Model Categorization - each model will be associated with keywords and be easily categorized, searched, and accessed

5. A privilege system in case sharing is restricted to certain groups/users

6. Easy to use Model installer - preferably supporting multiple platforms in the same system and HPC in case of larger models.

7. Supports big data - including transfer / storing delta of large datasets among users.

I hope others find this list useful and will contribute their own characteristics of an ideal data/model sharing system.
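
One way to make the checklist above concrete is a minimal per-model metadata record that a sharing repository might keep. This is only a sketch: all field names and example values are invented for illustration and do not come from any existing repository schema.

```python
# Minimal sketch of a repository metadata record covering several points of
# the list above. Field names and example values are invented.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ModelRecord:
    name: str
    version: str                       # point 1: every revision retrievable
    keywords: List[str] = field(default_factory=list)       # point 4: categorization
    sample_data: Optional[str] = None  # point 3: paired example data (path/URL)
    access_groups: List[str] = field(default_factory=list)  # point 5: privileges
    run_url: Optional[str] = None      # point 2: web endpoint to run the model

record = ModelRecord(
    name="baroreflex-ode",
    version="1.2.0",
    keywords=["cardiovascular", "ODE", "multiscale"],
    sample_data="https://example.org/data/baroreflex-sample.csv",
)
print(record)
```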


Regarding model/data sharing myth #2, "Extracting working models from the literature is trivial": the author of this myth is right that it is not trivial, yet it is still possible and worthwhile, as it allows true exploration of phenomena from different perspectives.

On Thu, Sep 19, 2013 at 12:11 PM, Jacob Barhak <jacob.barhak@gmail.com> wrote:

There are recent important developments taking place in the open source community that are highly relevant to this discussion. Here is a link to discussion started by Travis Oliphant this week that discusses not only the sharing aspect - it also discusses the testing and review aspects:

https://groups.google.com/forum/m/#!topic/numfocus/pcxrXX89KT4

I strongly recommend that this discussion be included in the model sharing group - after all, models are essentially software.

And another voice within the open source community calls for recognition of open source software as a publication for academic review and promotion purposes. Some are working towards assigning a DOI for open source software.

To summarize those two elements, I would like to add points 9 and 10 to the list below, which I asked to upload to the wiki:

9. A model repository should have a review model and a testing suite for uploaded models.

10. A model repository should have a DOI for the model and links to other relevant DOIs.

On Sep 19, 2013, at 4:49 PM, hsauro <hsauro@u.washington.edu> wrote:

I had a quick look at the discussions. In subcellular modeling (i.e. reaction networks), models are not software but specifications in XML (e.g. SBML) which are then converted to software. This makes the model software-agnostic, i.e. we can run the model in Matlab, C, Python, etc. It also allows us to heavily annotate the model with biological information in a computer-readable form. If there are existing standards for models, I encourage people to use those. If there isn't a standard, then you've no choice but to archive the model as the executable software itself.

As for R, I don't think I know anyone who uses R to do subcellular modeling, though Python is starting to become of interest, at least as a way to control simulations. My group is moving towards Python for specifying simulations, though models must and should be specified through an official standard like SBML. I would, however, be interested to learn more about what was discussed.

As for repositories, we have to think about who maintains them and the hardware in the long term. I am thinking of just piggybacking on BioModels, which has long-term funding from the EBI.
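
A sketch of the software-agnostic workflow Herbert describes: the model lives as standard SBML, and Python merely drives the simulation. This assumes the libroadrunner package as the simulator; any other SBML-aware tool could be substituted, and the file name is a placeholder.

```python
# Hedged sketch: the model lives as tool-neutral SBML; Python only runs it.
# Assumes the libroadrunner package; 'model.xml' is a placeholder SBML file.
import roadrunner

rr = roadrunner.RoadRunner("model.xml")  # load the SBML description
result = rr.simulate(0, 100, 500)        # start, end, number of output points
print(result[:5])                        # columns: time, then each species

# The SBML itself is untouched, so the same file can move on to Matlab, C,
# or any other SBML-aware tool mentioned in this thread.
print(rr.getSBML()[:120])
```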

On Thu, Sep 19, 2013 at 6:31 PM, Jacob Barhak <jacob.barhak@gmail.com> wrote:

Good point regarding agnostic implementations for models. It is a good idea - yet more long term, since models are already implemented in many languages and platforms. If we want to share those, we should not force the developers to change language again now. If applicable, it is in the very far future.

If you look at the discussion I linked to, you will find that the issue is not only for Python or R, and there are solutions today that can archive multiple models from multiple languages. Yet since you mentioned Python - it seems to be a good glue language - I myself recommend it for many uses. Yet again, people model in many different languages for different reasons.

And as for repositories - our architectures and development models change constantly, as do funding models and prices. You will find new opportunities today within the cloud, for example. Storage is very cheap and abundant, so repositories are not hard to store and expose - the issue becomes maintenance and support. Today there are good tools to simplify those tasks - we can explore this further on the wiki.

On Mon, Sep 23, 2013 at 11:39 AM, Hunt, C. Anthony <a.hunt@ucsf.edu> wrote:

I agree wholeheartedly that the M&S community needs standards for the specification, instantiation, and description of models, as well as standards for experimentation, sensitivity analyses, parameter estimation, etc. Without these standards (or good practices), the seemingly ad hoc nature of M&S limits its efficacy in the study of biological modeling, much less biomedical modeling and pharma. Beyond the mere need to use standards is the seeming plethora of different standards and the conflict within the community surrounding any given standard (e.g. the recent response within the SBML community to Sorger's PySB). Rather than merely using and supporting a given standard for whatever peculiarly specific domain one's model might fit within, it is also important to understand the relationships between standards: the extent to which they are commensurable and can (and cannot) be unified. Further, it is important to be open to, and to welcome, disruptive innovation (see *), because we are engaged in research, not product development.

The problem then becomes one of which standard is appropriate given the particular research at hand, as well as the research focus of one's lab and members, and the overarching objectives of MSM and the M&S community as a whole. The choice of a standard (or the choice not to use a standard) should be entirely driven by these objectives. And where multiple standards are involved, or various cross-standard or non-standard-compliant models must be integrated, effort must be made to clarify the role played by the standards as well as the research objectives with which the use of those standards might conflict.

  • If the battles for the OSI (Open Systems Interconnection) standards had been won, we would not have today's internet.

See http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6565559&tag=1

On Mon, Sep 23, 2013 at 1:14 PM, hsauro <hsauro@u.washington.edu> wrote:

I agree that different standards have different strengths and roles. For example, SBML was designed specifically to represent subcellular processes but is not at all good at representing multicellular systems, and I would never use it for that. As with computer programming languages, use the language that best suits the application. Having said that, however, we don't actually have many modeling standards; we have nothing, for example, for representing multicellular systems in a community-agreed exchangeable format. Currently we have SBML, CellML and NeuroML as the most used standards; each has strengths and weaknesses. One mustn't confuse these, however, with proprietary or local efforts such as Matlab, MML, Jarnac script, VCML, PySB, etc. These are specific to their tools and are not exchangeable other than via one of the three MLs. As for unification of standards, this is difficult because sometimes the philosophy is different. For example, it is quite difficult to interchange CellML and SBML because one is biology-based and the other math-based. Also, translating languages such as Matlab to one of the standard MLs is very difficult because there is no annotation in a Matlab script, so it isn't possible to easily identify the biological parts of the model, or any part of the model for that matter.

When I started working in the standards area in the late 90s, it wasn't possible for me to move a biochemical model from one tool to another because each tool had its own format. This made life a little difficult and meant I was tied to a particular tool. As a result we formulated SBML, and now I can easily take a biochemical model from one tool and load it into another. Try doing that with multiscale models; it's currently impossible.

One significant advantage of having a standard exchange format is that if the author of a tool stops development and the application can no longer be used (e.g. it only works on Windows 3.1 :)), I've still got the SBML file, which I can take to a more modern tool that is supported. So we no longer have orphaned models, which used to happen all the time.

On Mon, Sep 23, 2013 at 7:19 PM, Jacob Barhak <jacob.barhak@gmail.com> wrote:

I wanted to draw your attention and add the following link with regards to the list of repositories.

https://simtk.org/xml/index.xml

SimTk already allows many of the elements I noted. I am CCing Joy as a contact person for this repository.

As for the discussion with Herbert and Tony: yes, forcing uniformity may have ill effects, and we do want to allow variation in models and freedom of exploration. Nevertheless, we have to be able to communicate amongst models at some level. If we cannot connect our models because they are too different, then we will not be able to explain complex multiscale phenomena.

And repositories have to be modernized and kept up to date with technology - this is what I am suggesting. See GitHub for example - it has almost become a standard for open source software sharing - why not follow with model sharing?

On Mon, Sep 23, 2013 at 7:35 PM, Herbert M Sauro <hsauro@u.washington.edu> wrote:

I checked SimTK last week and I actually couldn't find any models. Either they are not there or they are buried deep in the web site somewhere. Have you seen BioModels? This is an example of a good model repository. It is mainly confined to subcellular models, but that is mainly because the standards are most developed for subcellular models. They curate, annotate and test models. They have pretty good search capabilities and web services (which we use) for programmatic access. Of course there is always room for improvement.

You mention GitHub, something we use for our software. It might be appropriate for model hosting, but unless you're a computer scientist it is difficult to use. Our users are straight biologists who don't have the time to learn the intricacies of revision control systems. We are, however, in the process of building a UI on top of a revision control system to isolate users from the complexity.

On Mon, Sep 23, 2013 at 9:50 PM, Jacob Barhak <jacob.barhak@gmail.com> wrote:

BioModels is interesting. It has some coding for models, and it stores files such as Octave scripts. I liked the fact that model parameters are explicitly defined - very important. And the repository uses Subversion as a version control system. Yet it seems to be centered around the SBML language, so it is channeled to certain uses and probably certain audiences. I know I can upload a disease model to SimTK pretty easily - I am not sure I can jump through the hoops of converting my code to SBML. From that perspective I assume BioModels may be too specialized. Yet again, an impressive piece of work. We should have it on the list of repositories for sure, to learn from it.

Tony had a good point - we should allow some freedom - BioModels may be too regulated to fit many computational models.

By contrast, GitHub is actually a good example of being easy to use and agnostic in many aspects.

1. It is not sensitive to language - it will try to detect your language to help you - yet there is no restriction here.

2. It has version control practically invisible to the user - you can just upload your files and not even care that there is version control happening in the background. You can even change a text file with your equations through the web editor and your model is made accessible in its latest version to anyone. Or you can be very picky and handle multiple branches of a coding operation handled by many people.

3. You can store data or code. For example SciPy stores the conference proceedings on GitHub - in 2012 the conference proceedings were assembled by accumulating the contributions of all participants to one repository. In a similar way you can just share XML data through this repository.

4. There is sharing support that is agnostic to type of project such as wiki, issue tracker and many other features. Using this technology you can support any type of computational model I can think of.

What I am trying to convey is this kind of technology is not confined to programmers or biologists. Such technology is rich enough and easy enough to use to support multiple fields and this is what we want from sharing data/model/code.

I am surprised that you report that biologists have issues with GitHub - can you be more specific? What are the issues you are trying to resolve?

By the way, version control is very important on the list for repositories - at least considering what the CPMS committee computational science team has drafted so far. Here is a link to a draft of our initial conclusions that will be discussed further at the MSM meeting:

http://wiki.simtk.org/cpms/Ten_Simple_Rules_of_Credible_Practice/Mathematical_and_Computational_Sciences_Team

If we are able to combine the needs of different disciplines into a sharing repository, then we can make information flow so much more easily.

I am CCing the leads of the CPMS committee Lealem and Ahmet since there is overlap in discussion and interests here. The groups should be aware of each other before the meeting.

Herbert, you seem from our communications to be the most active lead of the Model and Data Sharing group. Do you have any special plans or agenda items you wish to raise at the MSM meeting?

On Tue, Sep 24, 2013 at 12:51 AM, Herbert M Sauro <hsauro@u.washington.edu> wrote:

For biochemical reaction models, the only formats that can be exchanged on a *community wide basis* are SBML and CellML, but particularly SBML, because it was designed specifically for biochemical models. I should point out that there are between 200 and 250 tools that can read and write SBML. As for other kinds of models, I don't think there are any standard formats, so people have no choice but to represent the model in the programming language they actually used to run the model. Remember, however, that SBML should only be used to store biochemical reaction models; if your disease model is something else, then it is not advisable to convert it into SBML.

Not sure I would call BioModels specialized, since there are 1000s of published biochemical models; it's a fairly large community, and through BioModels and the CellML repository I have access to a large number of them in a language-agnostic form. Imagine if they were all stored in Octave: I could only run the models in Octave or Matlab. However, there are more important reasons for using something like SBML. These include:

1. We try to store as much biology as possible in SBML; the final math model is not stored. This allows us to retarget the model to different programming languages and, most important of all, completely different kinds of math implementations, e.g. stochastic, ODE, or Boolean. If you store the model in its final programming language, one cannot do any of these.

2. Using SBML means the models can be heavily annotated, providing information on what assumptions were made, what decisions were made (e.g. in choosing parameters), and what the various symbols mean, by using controlled vocabularies and ontologies, storing a model as a composition of models (i.e. submodels), etc. One couldn't do any of these easily if the storage medium were a programming language such as Matlab or Java. To give an unusual application: we took all the models in BioModels and automatically deconstructed them into their constituent parts; we now have a repository of computational parts to build new models with.
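
As a hedged sketch of point 2, here is what machine-readable annotation looks like through python-libsbml's controlled-vocabulary (MIRIAM-style) terms. The species id and the ChEBI resource are chosen purely for illustration.

```python
# Hedged sketch: attaching a MIRIAM-style controlled-vocabulary term to a
# species with python-libsbml. 'model.xml' and the species id are placeholders;
# CHEBI:15422 is ChEBI's identifier for ATP.
import libsbml

document = libsbml.readSBML("model.xml")
species = document.getModel().getSpecies("ATP")

term = libsbml.CVTerm()
term.setQualifierType(libsbml.BIOLOGICAL_QUALIFIER)
term.setBiologicalQualifierType(libsbml.BQB_IS)
term.addResource("http://identifiers.org/chebi/CHEBI:15422")

species.setMetaId("meta_ATP")  # CV terms attach to a metaid
species.addCVTerm(term)

# Any downstream tool can now recover what the symbol 'ATP' actually denotes.
for i in range(species.getNumCVTerms()):
    print(species.getCVTerm(i).getResourceURI(0))
```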

Reusability is a critical aspect of any modeling format. Such reusability should be easy, and shouldn't require one to download, say, two models written in different computer languages and then spend a month or two reverse engineering the code in order to merge them.

I feel the advantage one gains from using a community-wide standard outweighs the disadvantages, but it is true there are some disadvantages. I am interested to learn, however, what freedoms are lost in using something like SBML. If there are significant barriers to using open standards, then these should be addressed. I want to emphasize again, because I think there is some confusion in the multiscale community: SBML should *only* be used for exchanging biochemical reaction models and, in my opinion, nothing else. If you have a model that involves more than one cell, I wouldn't recommend SBML (note that the insides of each cell could be represented using SBML).

Unfortunately we have nothing equivalent for exchanging multicellular models, models of tumors, vascular systems, development, etc. I don't know about mechanical systems, but electrical systems I assume could be dealt with by NeuroML, and purely mathematical models by CellML. Pharmacokinetic models can be represented by SBML, although I think there is a project in Europe to develop a specific pharmacokinetic exchange standard.

As for GitHub being easy to use, I guess it depends on what interface you use. If you're using the command line, then a biologist will not touch it, for obvious reasons. One could use TortoiseGit, but even that has 15 options in the popup menu, and the language is alien to most non-computer specialists. The front page for raw GitHub projects is also, I feel, quite intimidating to a non-computer specialist. I do have some specific ideas for a version tracking system, which we can talk about at the meeting.

I don't have any specific plans yet for the meeting, but I think these discussions suggest many possible topics we could talk about, and I'd be happy to make these topics the focus. MSM meetings in the past haven't generally covered these areas, so it would be very appropriate to cover them.


On Tue, Sep 24, 2013 at 6:57 AM, Jacob Barhak <jacob.barhak@gmail.com> wrote:

This is a good explanation. Please allow me to resolve the issue and claim that SBML is just another programming language - specialized for certain tasks, with great advantages in a certain domain.

What I was trying to do is generalize - what if there were one repository for all biomedical models, a one-stop shop for all kinds of models we can share and combine? This is what I was trying to characterize. And what if this repository could initially populate itself by pulling information from all the other repositories we know and offer them from a central location? I was thinking of something like clinicaltrials.gov for clinical trials.

As for GitHub - I see your point - on the fun side, I can see how the Octocat mascot can be intimidating to a biologist. ;)

Yet seriously - you should try the Windows interface or the web interface - they are pretty easy - it just takes a while to get used to, as with any system.

Yes, it seems we have many discussion points - and I know at least Tony and I will be there. If we know the group, we can perhaps schedule a time that will be convenient for most. I know I will have scheduling issues since I want to attend several workgroups - I guess that is the issue for most.

So who plans to attend?

On Tue, Sep 24, 2013 at 10:41 AM, Herbert M Sauro <hsauro@u.washington.edu> wrote:

What you suggest would be great to have; I totally agree. At the moment it is difficult to do this, because anything other than subcellular models would currently have to be encoded in an ad hoc manner, probably in the computer programming language that the modeler used (perhaps even just binary; some modelers do that), which could be anything. I agree this would be better than nothing, but it wouldn't be a long-term solution: we wouldn't have reusability, parts extraction, annotation, search, etc. One has to bear in mind that the submitted models won't be like software libraries, which are documented, have a defined API, and are therefore designed for reusability and interrogation. The model repository would most likely be populated with chunks of code, usually complete applications, with little documentation on their internal structure and of course no API - so basically unreusable.

I have tried on a number of occasions to kick-start an effort to formulate a project to start thinking about how we can share higher-level models effectively, but I've never got any traction. I think the idea of model exchange at levels higher than subcellular is still not yet accepted as either a doable or even desirable effort to undertake. However, meetings like the one next week are places where perhaps the case can be made.

I have no doubt that GitHub is easy to use; my students use it regularly, but I still use svn on Google Code or SourceForge. The reason is that I don't have the time to learn a new version control model. This is what you'll be up against with other researchers. None of these things are intrinsically difficult, but people only have so much time in the day to learn the subtleties of things like GitHub. As I said, we're working on a simple graphical layer that will hide most of the complexity of model version control for biologists. I know the CellML people have looked at this as well, but I think they expose the raw operation of the version control system to users (Peter will be able to answer that).

On Tue, Sep 27, 2013 at 9:06 AM, Ahmet Erdemir <erdemira@ccf.org> wrote:

I thought I would put my two cents as an avid Simtk.org user.

There are indeed models in Simtk.org, but someone may find it difficult to identify where they are. In a sense, Simtk.org provides a catalog of modeling and simulation projects as individualized web sites including many different components (forums, wikis, source code repositories, and downloads, i.e. potential for feature creep). The projects may be active or inactive, may or may not have a downloadable model, and may or may not be associated with a publication (or a review mechanism). There is not much structure within a given project other than the web-based framework for development and dissemination; it is left to individual project administrators. When you are interested in a certain biological system or a specific tool, you need to first find the project and hope there are downloadable software/data/models associated with it. I believe Herbert noted previously that m&s activities in some disciplines result in rather heterogeneous and unstructured information without much documentation. My work in biomechanics, and in particular m&s using finite element analysis, is an example. One way that I dealt with sharing of clumsy models and data relied on Simtk.org. Essentially:

1) zip all raw and processed data files, any file that defines the model, any scripts that are helpful to run simulations and post-process results, and the results themselves (but not necessarily the simulation software);

2) upload the zip file to the downloads section of Simtk.org;

3) refer to the download site in a paper that will be submitted for publication (and hope you reported things well in the paper).

Of course, when cataloged, i.e. through Biositemaps, through the IMAG/MSM index of models, and by Google, the chances to access and reuse these increase.

Is this workflow ideal? Not necessarily, neither for the developer nor for the user. The developer does not have any guidance on how to effectively share the data, models, etc. The user does not necessarily have an easy way to reproduce and/or reuse the model, due to the lack of detailed documentation and heterogeneous information that may be hard to navigate.
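
Step 1 of the workflow above is easy to script. Below is a minimal sketch using Python's standard zipfile module; all file names are placeholders for whatever a given finite-element study actually produces.

```python
# Minimal sketch of bundling a model, its data, run scripts, and results into
# one archive for upload. File names are placeholders.
import zipfile
from pathlib import Path

BUNDLE = [
    "model/knee.feb",             # the model definition file
    "data/raw_measurements.csv",
    "scripts/run_simulation.py",
    "scripts/postprocess.py",
    "results/stress_field.csv",
]

with zipfile.ZipFile("model_bundle.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in BUNDLE:
        if Path(name).exists():
            zf.write(name)        # preserve the relative layout inside the zip
        else:
            print("missing (listed in the paper but not archived?):", name)
```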

With reference to a brief mention of biositemaps:

Sat, 28 Sep 2013 15:39:53 -0700 (PDT) Jacob Barhak <jacob.barhak@gmail.com> wrote:

Hi Joy,


Are you affiliated with this biositemaps project as well?

I went to the site, and it seems there is a need for a technical revamping of that site. Yet if you choose the distributed-repositories approach, with a single search-engine model, the information already stored there is valuable.

And Joy, can you perhaps join only to the model sharing workgroup breakout session via Skype or google chat? If so we will have to schedule in advance. I know they have technical support for this to happen on the conference site.

And looking at remarks by Ahmet and Herbert, it seems we need to address the issue of how different disciplines look at a repository and how they use it. This is an important issue that is not technical.

For example Herbert seems to make a distinction between models and software. In my mind: Computational Model = Software !!!

Therefore a model repository should have many similarities to a software repository.

The model can be mostly data in an XML file, yet it has to run on a computer if it is not a trivial model.

I hope Joy can join the discussion.

Jacob

On Sat, Sep 28, 2013 at 11:55 PM, hsauro <hsauro@u.washington.edu> wrote:

In the community I work with, an executable model in the form of software is most definitely not the model. We make a distinction between the definition of a model and its instantiation as a runtime construct. There are a number of reasons for this:

1. Annotation can be much richer, this means we keep all the biology that went into the model.

2. Models can be built by combining other models in a formal way.

3. There are no 'my programming language is better than yours' wars.

4. We can target the model to whatever executable endpoint we want; this means that in 10 years, when Google invents yet another new computer language, it will not be difficult to generate new code from the model description.

5. Most importantly, we are not confined to the original mathematical approach that was used, but can retarget the model to new analyses and even a completely different mathematical approach.


I am sure there are other reasons, but those are off the top of my head. Of course it depends what you mean by software; I use it to mean something that describes an algorithm. SBML models don't describe algorithms; they describe the biology.

Herbert

Sat, 28 Sep 2013 22:59:23 Jacob Barhak <jacob.barhak@gmail.com> wrote:

Hi Herbert,

Your arguments describe good practices, yet from the view of a computer scientist you have just described software written in a Domain Specific Language (DSL). Here is a Wikipedia page that gives this definition: http://en.wikipedia.org/wiki/Domain-specific_language

The biomedical modeling community should not limit itself to a specific computer language - especially if we wish to combine information from multiple model types, as Multi-Scale Modeling suggests. DSLs and specific formats to store data are beneficial tools - yet a good repository should allow a variety of formats useful to many user types - otherwise the repository will be limited in scope and ability.

Fortunately, software repository technology has advanced to support storing all these types of information. And there are even newer advances I did not touch yet that will allow much easier user interactions and eliminate many complexities in storing software and data.

There is a recent push towards Reproducible Science. A good repository should hold the software that allows reproducing the results the scientist reached using a certain model. We should think in that direction, and keeping software in mind seems an essential step.

I hope you find my observations acceptable.

Jacob

Sun, 29 Sep 2013 12:03:59 -0700 "Hunt, C. Anthony" <a.hunt@ucsf.edu> wrote:

Herbert, you mention a "difference in philosophy." Maybe the differences relate more to focus, perspective, and model use cases than to philosophy of science.

With that in mind, please clarify this statement: "a SBML model describes the biological system."

My understanding is different. If my understanding is flawed, please correct me.

I thought that an SBML model is a representation of a computational model describing idealized (hypothetical) processes given a set of prespecified predicates. The predicates consist typically of data (and thus particular experiments), assumptions, abstractions, assertions about the biology, and simplifications.

The modeler asserts that if such an idealized process could be made concrete and real, then the mathematical description would stand as an accurate representation of the identified details during operation.

Additional work would be required to establish confidence in various quantitative mappings from the instantiated model to measures taken on specific biological referents.

-Tony-


Sunday, September 29, 2013 8:46 AM Herbert M Sauro <hsauro@u.washington.edu> wrote:

I should emphasize again: an SBML model does not describe an algorithm; it describes the biological system. Some parts of the SBML description are algorithms, e.g. the rate laws, but as a whole an SBML model is declarative and describes biology. It requires other software to interpret the biology, decide what the math model will be, and convert it into a runnable piece of software.

Like you I looked up wikipedia and found this statement on software:

[Software] is any set of machine-readable instructions that directs a computer's processor to perform specific operations.

If software is defined this way, then SBML models are not software. As indicated in my last email, by describing the biology rather than the executable model itself, we can do many more things than we could if the model were raw executable code. I would agree that SBML is a DSL but, unlike most DSLs, SBML doesn't describe algorithms.

I'm not sure if we'll have the IMAG meeting, but if not it might be worth having a Skype or some other virtual meeting to discuss these points. They are important because they represent a difference in philosophy.

Reproducible science is also at the heart of the subcellular modeling community, and we've been trying to think about how best to achieve it (see sedml.org for one possible approach).

Herbert
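
For readers unfamiliar with SED-ML: it records how a simulation experiment was run, not just what the model is. Below is a hedged sketch using phrasedml, a textual shorthand for SED-ML that appeared after this exchange; the model file name and species symbol are placeholders.

```python
# Hedged sketch: describing a simulation experiment (not a model) so it can
# be archived and re-run. Uses the phrasedml package, a textual front end to
# SED-ML that postdates this 2013 exchange; 'model.xml' is a placeholder.
import phrasedml

EXPERIMENT = """
model1 = model "model.xml"
sim1 = simulate uniform(0, 100, 1000)
task1 = run sim1 on model1
plot task1.time vs task1.S1
"""

sedml = phrasedml.convertString(EXPERIMENT)
if sedml is None:
    raise RuntimeError(phrasedml.getLastError())
print(sedml[:300])  # archivable SED-ML XML recording the experiment setup
```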

Sun, 29 Sep 2013 17:27:58 -0700 "Hunt, C. Anthony" <a.hunt@ucsf.edu> wrote:


Herbert, I have seven quick observations on SBML and standards in general, and an inference. My comments add to some points made earlier by Jacob.

1) It's been asserted that SBML is inappropriate for rule-based modeling (very closely related to agent-based modeling).

2) SBML-Multi (http://sbml.org/Community/Wiki/SBML_Level_3_Proposals/Multistate_and_Multicomponent_Species_Proposal/Introduction ) is an attempt at a standard for describing and specifying such systems.

3) There are more established alternatives like Kappa (http://www.kappalanguage.org/ ).

4) Both SBML-Multi and Kappa are abstractions, languages for describing and prescribing.

5) Peter Sorger's group has pointed out (Lopez 2013) that such abstractions can constrain the modeler's ability to create a model and achieve new research objectives (new use cases). Constraints can be good or bad, depending on what you are trying to do! … your use cases.

6) PySB is not the only effort in the above area. Our own and various other efforts (e.g. Extended DEVS, which deals directly with multi-attributes & multiple scales) focus on making computational models more concrete. And little b (http://www.littleb.org/ ) is another notable attempt to strike a middle ground between abstraction and implementation.

7) All 4 of these (little b, Kappa, PySB, and SBML-Multi) are standards that can be used to facilitate sharing, curation, and composition of (primarily) molecular systems biology models. Each has strengths and weaknesses. A serious MSM effort would evaluate and choose each standard based on the objectives of the effort (and other use cases), rather than on a compulsion to use standards or an accidental adoption of any one of them.

[Lopez et al 2013] "Programming biological models in Python using PySB" http://www.nature.com/msb/journal/v9/n1/full/msb20131.html?WT.ec_id=MSB-v9/n1

Each of the above efforts was motivated by new or additional model use cases. Something new was demanded of the simulation model.

Given our research context and observing similar developments in other domains, we can expect constant pressure on earlier standards intended for a smaller set of use cases.

We follow well-established best practices used by software and systems engineers (adjusted to our domain). Our use case documents specify the following: the context for the envisioned model (domain experts' research questions); whether its use is one-off or it is intended for reuse; who the model might serve and who will use it (who will conduct and evaluate in silico experiments); what will be expected from it near- and longer-term; what the current, referent wet-lab experiments are (what the biological aspects/phenomena of interest are); what the validation targets are; how similarities will be measured; how non-biomimetic features will be identified; etc.

We often have multiple referent wet-lab experiments and an expectation that future simulation results will be able to map to some specific feature of those experiments. Each of those experiments may thus become a current or future use case. A referent wet-lab experiment is a model use case.

We need all of that information before focusing our thinking on (model) requirements. Some requirements may be beyond current technology. In that case we focus on the achievable and may add a new use case: our near-term models must be evolvable.

Given requirements, we can think about specifications.

Given specifications, we have enough information to focus on selecting (and excluding) modeling methods and models of computation (MoCs).

Regards -Tony-


On Sun, Sep 29, 2013 at 8:11 PM, Herbert M Sauro <hsauro@u.washington.edu> wrote:

On 9/29/2013 5:23 PM, Hunt, C. Anthony wrote (quoted in full above; trimmed here):

> 1) It's been asserted that SBML is inappropriate for rule-based modeling (very closely related to agent-based modeling).

Agreed.

> 2) SBML-Multi is an attempt at a standard for describing and specifying such systems.

Agreed.

> 3) There are more established alternatives like Kappa.

Agreed.

> 4) Both SBML-Multi and Kappa are abstractions, languages for describing and prescribing.

Agreed; all models are abstractions.

> 5) Peter Sorger's group has pointed out (Lopez 2013) that such abstractions can constrain the modeler's ability to create a model and achieve new research objectives (new use cases). [...]

That is true. Rule-based models, I feel, are still in an exploratory stage. It is true that SBML looks at biology in a particular way - the textbook biochemistry way, if you like. New thinking such as rule-based modeling should of course not be constrained by standards, given its experimental status, and should be able to explore its domain without any limitations. Eventually I imagine that the rule-based community will reach consensus and develop a common exchange format that will allow rule-based models to be freely exchanged between different rule-based tools, from which other advantages will flow.

> 6) PySB is not the only effort in the above area. Our own and various other efforts (e.g. Extended DEVS, which deals directly with multi-attributes & multiple scales) focus on making computational models more concrete. [...]

I understand. You have to remember that SBML was formulated before any of these alternatives were even thought of, and SBML is very helpful for what it was designed for. PS: I'm not familiar with Extended DEVS, and a search on Google didn't come up with anything, at least on the first page.

> 7) All 4 of these (little b, Kappa, PySB, and SBML-Multi) are standards that can be used to facilitate sharing, curation, and composition of (primarily) molecular systems biology models. [...]

Our definition of a standard, both in the SBML community and the synthetic bio community, is one where two tools must be able to exchange the agreed standard. If we didn't do it this way, we'd get lots of standards proposed but none actually used. So a proposal only becomes a standard when it is used in practice.

> Each of the above efforts was motivated by new or additional model use cases. Something new was demanded of the simulation model.

That's fine; one shouldn't be using community standards for bleeding-edge research, since the standard will by definition be behind the curve, although one can inform the standards community of what is going on. This actually has implications for multiscale modeling: is the field even ready to formalize a community-agreed way to describe models? Perhaps it isn't. My suspicion is that it isn't quite ready.

> Given our research context and observing similar developments in other domains, we can expect constant pressure on earlier standards intended for a smaller set of use cases.

Agreed.

> We follow well-established best practices used by software and systems engineers (adjusted to our domain). [...]

This is an excellent list. I suspect however (but could be wrong) that you might be the only group doing this. Worth advertising to the rest of the world, I think.

> Given specifications, we have enough information to focus on selecting (and excluding) modeling methods and models of computation (MoCs).

What you suggest seems to be the right way to think about it, but given human frailties I think it might be a hard sell.

Herbert

On 9/29/2013 4:08 PM, James B. Bassingthwaighte wrote:

Herb, the definitions and the process you describe demonstrate that SBML is NOT a vehicle for reproducible research. Even if SBML is good enough to convey an idea of the reaction, it fails at the fundamental processes of modeling: mathematical definition, verification, and validation testing (which demands evaluation against data). If you agree with this, you will see that MIRIAM is also subminimal.

The lack of mathematical definition, e.g. pinpointing the exact math or mechanism for a reaction, is what bothers me the most. Is there an enzyme or other catalyst? What are the ionic conditions and the other influences of temperature, pressure, pH, etc.? Next, verification requires defining the solvers (for reproducibility), as well as checking the implementation against the math.

There is no perfect system, but shouldn't we be pushing for reproducibility at least? I am sorry to be a nuisance about this, but we have to develop a plan to do better. Some of what I regard as the principles are outlined in the attached presentation that I gave as a keynote Friday at the BMES meeting. I'd be delighted to receive criticisms of it, as it needs to be taken further for publication. (For those of you not attending, this is now a meeting of about 7000, many (most?) of whom see modeling as a substantial ingredient in their armamentarium.)

Jim

Sun, 29 Sep 2013 16:42:56 -0700 Herbert M Sauro <hsauro@u.washington.edu> wrote

Jim, you're absolutely right: SBML is limited at the moment. But I wish someone would come up with something better. It is easy to criticize, but it's more difficult to come up with a viable *community wide* solution, and at the moment no one has. I am open to all new ideas. I am a reproducibility convert - have been for many years - hence the efforts in SED-ML. SED-ML is also not perfect, but yet again no one has come up with anything better.

"The lack of mathematical definition, e.g. pin-pointing the exact math or mechanism for a reaction, is what bothers me the most. Is there an enzyme or other catalyst"

The above is solved by using the Systems Biology Ontology (SBO); that is what it was designed for. SBML uses it, and I think CellML may also. As for "what are the ionic etc conditions and the other influences of temperature, pressure, pH, etc?", you might have a better case. Obviously one can add one's own annotations to SBML or CellML (or whatever language you happen to use) to indicate this kind of information, but that is a bit ad hoc. I am not sure myself how one would specify these in a community-agreed manner. Perhaps one could use the OPB (Ontology of Physics for Biology), but it is something that certainly needs thinking about.

I urge the multiscale community to propose concrete solutions to the problems that are perceived in the current crop of standards and efforts. I don't have the solutions; it is a tough one to crack, especially for multiscale modeling. I would like to say, however, that we have come a long way in 12 years. Twelve years ago there was nothing to speak of in terms of community-agreed formats.

Herbert
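
To make the SBO suggestion concrete, here is a hedged python-libsbml sketch: tagging a rate law with an SBO term pins down its mathematical form in a machine-readable way. The file and reaction id are placeholders; SBO:0000028 is the SBO term for an irreversible Michaelis-Menten rate law.

```python
# Hedged sketch: using a Systems Biology Ontology (SBO) term to pin down the
# mathematical form of a rate law, via python-libsbml. 'model.xml' and 'R1'
# are placeholders; SBO:0000028 denotes irreversible Michaelis-Menten kinetics.
import libsbml

document = libsbml.readSBML("model.xml")
reaction = document.getModel().getReaction("R1")

law = reaction.getKineticLaw()
law.setSBOTerm(28)             # attach SBO:0000028 to the kinetic law
print(law.getSBOTermID())      # -> 'SBO:0000028'

libsbml.writeSBML(document, "model_annotated.xml")  # archive the tagged model
```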

On Sun, Sep 29, 2013 at 7:23 PM, Hunt, C. Anthony <a.hunt@ucsf.edu> wrote:

Herbert, I have seven quick observations on SBML and standards in general, and an inference. My comments add to some points made earlier by Jacob.

1) It's been asserted that SBML is inappropriate for rule-based modeling (very closely related to agent-based modeling).

2) SBML-Multi (http://sbml.org/Community/Wiki/SBML_Level_3_Proposals/Multistate_and_Multicomponent_Species_Proposal/Introduction ) is an attempt at a standard for describing and specifying such systems.

3) There are more established alternatives like Kappa (http://www.kappalanguage.org/ ).

4) Both SBML-Multi and Kappa are abstractions, languages for describing and prescribing.

5) Peter Sorger's group has pointed out (Lopez 2013) that such abstractions can constrain the modeler's ability to create a model and achieve new research objectives (new use cases). Constraints can be good or bad, depending on what you are trying to do!  ????? your use cases.

6) PySB is not the only effort in the above area. Our own and various other efforts (e.g. Extended DEVS, which deals directly with multi-attributes & multiple scales) focus on making computational models more concrete. And little b (http://www.littleb.org/ ) is another notable attempt to strike a middle ground between abstraction and implementation.

7) All 4 of these (little b, Kappa, PySB, and SBML-Multi) are standards that can be used to facilitate sharing, curation, and composition of (primarily) molecular systems biology models. Each has strengths and weaknesses. A serious MSM effort would evaluate and choose each standard based on the objectives of the effort (and other use cases), rather than on a compulsion to use standards or an accidental adoption of any one of them.

[Lopez et al 2013] "Programming biological models in Python using PySB" http://www.nature.com/msb/journal/v9/n1/full/msb20131.html?WT.ec_id=MSB-v9/n1

Each of the above efforts was motivated by new or additional model use cases. Something new was demanded of the simulation model.

Given our research context and observing similar developments in other domains, we can expect constant pressure on earlier standards intended for a smaller set of use cases.

We follow well-established best practices used by software and systems engineers (adjusted to our domain). Our use case documents specify the following: the context for the envisioned model (domain experts' research questions); is its use one-off or is it intended for reuse; who the model might serve, who will use it (who will conduct and evaluate in silico experiments), what will be expected from it near- and longer-term; what are the current, referent wet-lab experiments (what are the biological aspects/phenomena of interest), what are the validation targets, how will similarities be measured; how will non-biomimetic features be identified, etc., etc.
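For concreteness, that checklist could be captured as a simple fill-in record. The sketch below is illustrative only; the field names merely restate the list above and are not any published or community-agreed format.

    # Hypothetical use-case record; the fields restate the checklist
    # above and are not a published format.
    use_case = {
        "research_questions": "...",                # domain experts' context
        "one_off_or_intended_for_reuse": "reuse",
        "who_the_model_serves": "...",
        "who_will_use_it": "...",                   # who conducts/evaluates in silico experiments
        "near_and_longer_term_expectations": "...",
        "referent_wet_lab_experiments": ["..."],    # biological phenomena of interest
        "validation_targets": ["..."],
        "similarity_measures": "...",
        "non_biomimetic_feature_checks": "...",
    }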

We often have multiple, referent wet-lab experiments and an expectation that future simulation results will be able to map to some specific feature of those experiments. Each of those experiments may thus become a current or future use case. A referent wet-lab experiment is a model use case.

We need all of that information before focusing our thinking on (model) requirements. Some requirements may be beyond current technology. In that case we focus on the achievable and may add a new use case: our near-term models must be evolvable.

Given requirements, we can think about specifications.

Given specifications, we have enough information to focus on selecting (and excluding) modeling methods and models of computation (MoCs).


On Sun, Sep 29, 2013 at 8:11 PM, hsauro <hsauro@u.washington.edu> wrote:


Quotes from 9/29/2013 5:23 PM, Hunt, C. Anthony, are marked with >>>

>>> Herbert, I have seven quick observations on SBML and standards in general, and an inference. My comments add to some points made earlier by Jacob.

>>> 1) It's been asserted that SBML is inappropriate for rule-based modeling (very closely related to agent-based modeling).

Agreed.

>>> 2) SBML-Multi (http://sbml.org/Community/Wiki/SBML_Level_3_Proposals/Multistate_and_Multicomponent_Species_Proposal/Introduction ) is an attempt at a standard for describing and specifying such systems.

Agreed

>>> 3) There are more established alternatives like Kappa (http://www.kappalanguage.org/ ).

Agreed

>>> 4) Both SBML-Multi and Kappa are abstractions, languages for describing and prescribing.

Agreed, all models are abstractions.

>>> 5) Peter Sorger's group has pointed out (Lopez 2013) that such abstractions can constrain the modeler's ability to create a model and achieve new research objectives (new use cases). Constraints can be good or bad, depending on what you are trying to do and on your use cases.

That is true. Rule-based models, I feel, are still in an exploratory stage. It is true that SBML looks at biology in a particular way, the textbook biochemistry way if you like. New thinking such as rule-based modeling should not of course be constrained by standards, given their experimental status, and should be able to explore their domain without any limitations. Eventually I imagine that the rule-based community will reach consensus and develop a common exchange format that will allow rule-based models to be freely exchanged between different rule-based tools, from which other advantages will flow.

>>> 6) PySB is not the only effort in the above area. Our own and various other efforts (e.g. Extended DEVS, which deals directly with multi-attributes & multiple scales) focus on making computational models more concrete. And little b (http://www.littleb.org/ ) is another notable attempt to strike a middle ground between abstraction and implementation.

I understand. You have to remember that SBML was formulated before any of these alternatives were even thought of and SBML is very helpful for what it was designed for. PS I'm not familiar with Extended DEVS, and a search on Google didn't come up with anything, at least on the first page.

>>> 7) All 4 of these (little b, Kappa, PySB, and SBML-Multi) are standards that can be used to facilitate sharing, curation, and composition of (primarily) molecular systems biology models. Each has strengths and weaknesses. A serious MSM effort would evaluate and choose each standard based on the objectives of the effort (and other use cases), rather than on a compulsion to use standards or an accidental adoption of any one of them.

Our definition of a standard, both in the SBML community and the synthetic biology community, is that at least two tools must be able to exchange models using the agreed format. If we didn't do it this way, we'd get lots of standards proposed but none actually used. So a proposal only becomes a standard when it is used in practice.

>>> [Lopez et al 2013] "Programming biological models in Python using PySB" http://www.nature.com/msb/journal/v9/n1/full/msb20131.html?WT.ec_id=MSB-v9/n1

>>> Each of the above efforts was motivated by new or additional model use cases. Something new was demanded of the simulation model.

That's fine, one shouldn't be using community standards for bleeding edge research since the standard will by definition be behind the curve, although one can inform the standards community of what is going on. This actually has implications for multiscale modeling, that is, is the field even ready to formalize a community-agreed way to describe models? Perhaps it isn't. My suspicion is that it isn't quite ready.

>>> Given our research context and observing similar developments in other domains, we can expect constant pressure on earlier standards intended for a smaller set of use cases.

Agreed

Quote from 9/29/2013 5:23 PM, Hunt, C. Anthony: "We follow well-established best practices used by software and systems engineers (adjusted to our domain). Our use case documents specify the following: the context for the envisioned model (domain experts' research questions); is its use one-off or is it intended for reuse; who the model might serve, who will use it (who will conduct and evaluate in silico experiments), what will be expected from it near- and longer-term; what are the current, referent wet-lab experiments (what are the biological aspects/phenomena of interest), what are the validation targets, how will similarities be measured; how will non-biomimetic features be identified, etc., etc." This is an excellent list. I suspect however (but could be wrong) that you might be the only group doing this. Worth advertising to the rest of the world, I think.

Quote from 9/29/2013 5:23 PM, Hunt, C. Anthony: " We often have multiple, referent wet-lab experiments and an expectation that future simulation results will be able to map to some specific feature of those experiments. Each of those experiments may thus become a current or future use case. A referent wet-lab experiment is a model use case. We need all of that information before focusing our thinking on (model) requirements. Some requirements may be beyond current technology. In that case we focus on the achievable and may add a new use case: our near-term models must be evolvable. Given requirements, we can think about specifications. Given specifications, we have enough information to focus on selecting (and excluding) modeling methods and models of computation (MoCs). "

What you suggest seems to be the right way to think about it but given human frailties I think it might be a hard sell.

Herbert

On Sun, Sep 29, 2013 at 9:19 PM, hsauro <hsauro@u.washington.edu> wrote:

I meant to forward a proposed agenda from Jacob which I think is a good starting point for the meeting:

"Here are possible points for the agenda:
- Recent advances and leading ideas in model description
- Assembling a list of existing model repositories and reviewing their characteristics
- Figuring out features of the next generation model repository
- How to encourage interaction amongst models/modelers of different fields/disciplines?"

On Sun, Sep 29, 2013 at 9:43 PM, hsauro <hsauro@u.washington.edu> wrote:

I've added some basic slides on the wiki at

http://www.imagwiki.nibib.nih.gov/sites/default/files/2020-07/WGImag2013.ppt


On Sun, Sep 29, 2013 at 10:02 PM, Jacob Barhak <jacob.barhak@gmail.com> wrote:

Hi Herbert,

If your model describes computations such as mathematical expressions or control flow rules, then it can be viewed as software. The alternative extreme to software would be data.

To try to resolve this, let me first give an extreme bad claim. One could claim that a binary executable file encoding machine language is data rather than code/software, just because it stores data that the operating system channels in order to run. You can immediately see that this example does not make human sense, since a binary executable is exactly what humans see as software.

The other extreme bad claim would be that a text file such as this one is software, just because a certain character in the ASCII text causes a certain computation in the CPU to display a certain pixel on the screen rather than another. The claim is bad since we consider a text file to be data rather than software. So we can always stretch definitions until we lose perspective, and we should be careful.

Back to your definition of a model. You use a certain encoding; this encoding is your modeling language. I would claim that if your language stores mathematical expressions or control flow information, including if statements and repetition such as a for loop, then you are encoding software using a Domain Specific Language (DSL). If, however, you are encoding free text, chemical notations, geometry, or similar information that does not describe computation, then you are encoding data. Of course there is a grey area in between software code and data, yet to my understanding a computational model is more software than data, just because it describes computation.
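To make the software/data distinction concrete, consider a toy example (the one-line mini-language below is invented for illustration, not any real format): the "model" is a string that looks like data, but because it encodes a rate expression it prescribes a computation, which is the sense in which it is software.

    # Toy model in an invented mini-language: it looks like data, but the
    # embedded rate expression makes it a program in a tiny DSL.
    model_text = "S1 -> S2 ; rate = 0.3 * S1"

    def simulate(text, s1=10.0, s2=0.0, dt=0.01, steps=1000):
        expr = text.split("rate =")[1].strip()           # extract the rate law
        for _ in range(steps):
            # the "data" executes (eval is used only for illustration)
            rate = eval(expr, {}, {"S1": s1, "S2": s2})
            s1 -= rate * dt
            s2 += rate * dt
        return s1, s2

    print(simulate(model_text))   # roughly (0.5, 9.5) at t = 10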

Note that I myself use an intermediate language to represent my disease models, yet I would still claim that my models are software encoded using a DSL.

Fortunately repositories do not care much if the file stored is data or software code - a repository stores files and either detects the type or lets the user define the type of data stored by defining location/context or file name extension. So what we need to do as a team is figure out how a repository should represent certain file types so that more users will be satisfied. For example, in your case a repository should recognize a file with SBML extension as SBML code and have a nice interface that will let you access this code easily the way you want to. In my model case it should recognize a certain pattern within a file and know it is a disease model.
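A toy sketch of that detection step (the sniffing patterns below are illustrative; a real repository would be far more careful):

    # Illustrative file-type sniffing of the kind described above.
    def classify(filename, first_bytes):
        if filename.endswith(".xml") and b"<sbml" in first_bytes:
            return "SBML model"
        if filename.endswith(".sedml") or b"sed-ml" in first_bytes.lower():
            return "SEDML simulation description"
        return "unclassified data"

    print(classify("model.xml", b'<?xml version="1.0"?><sbml level="3">'))   # SBML model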

However, reproducibility is a key point that we must address as a team. Let me give an example. You claim that SBML code is disjoint from the implementation of the interpreter/compiler/handling software. Well, in that case you may have a reproducibility issue for your model unless your specifications of the software are very strict. How do you guarantee that the same model executed twice in different environments will produce the same results?

I must tell you that even advanced computer languages such as Python have issues that can cause the same code to run differently. So there should be much caution in coding and understanding computations beyond the encoding of the model.
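Two concrete instances of the kind of Python-level nondeterminism alluded to here (illustrative additions, not drawn from the original email):

    # 1) Floating-point addition is not associative, so any change in
    #    accumulation order can change a result:
    a, b, c = 1e16, 1.0, 1.0
    print((a + b) + c == a + (b + c))   # False

    # 2) Python randomizes string hashing per process (PYTHONHASHSEED),
    #    so the iteration order of a set of strings -- and hence the
    #    order of the additions below -- can differ between two runs of
    #    the same script:
    values = {"k1": 1e16, "k2": 1.0, "k3": -1e16}
    print(sum(values[k] for k in set(values)))   # prints 0.0 or 1.0 depending on the run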

Fortunately there are solutions to cope with reproducibility and I move that we think about reproducibility as part of a repository. This is much more important than distinguishing between encoding types of information - computer science has resolved these issues while reproducible science is still an issue in focus.

I hope you find reproducibility important enough to load on the agenda.

On Sun, Sep 29, 2013 at 11:58 PM, hsauro <hsauro@u.washington.edu> wrote:

I hope we can meet up in DC to talk more about this because email is a terrible medium for this kind of thing. For now I'll hold off on the software/data discussion, other than to say that if someone describes a diagram of a pathway to me using free text, chemical notations, geometry, or similar information, then to me it is a model, not data. If someone measures the reaction rate of an enzyme using a colorimetric method and hands me a number, then that is data to me; but if someone describes the Briggs-Haldane mechanism of the same enzyme in a diagram, then that to me is a model. I see an SBML model more like the latter, because it describes a hypothesized pathway with hypothesized rate laws etc. If DC doesn't happen I wouldn't mind doing a Skype or some kind of network call with you.

The main point I wanted to mention was your statement:

"However, reproducibility is a key point that we must address as a team. Let me give an example. You claim that SBML code is disjoint from the implementation of the interpreter/compiler/handling software. Well, in that case you may have a reproducibility issue for your model unless your specifications of the software are very strict. How do you guarantee that the same model executed twice in different environments will produce the same results?"

In the subcellular modeling community I believe we differentiate between reproduction and replication (I think this is similar across other fields?)

Replication is running the original source code that the modeler provided, preferably on a machine and os that is a similar if not identical to the original (not always possible of course).

Reproduction is when we try to recreate the same reported simulation but possibly using different software and different hardware.

Replication is nice but I'm not sure it's terribly useful. Reproduction is much more stringent. I would claim that if one can't reproduce a simulation without exactly the same source code and hardware, then there is a problem. In the SBML/CellML world reproducibility is defined using a corresponding SEDML file. This file contains all the information you need to run the simulation. For this purpose an entire ontology of numerical methods was created (http://biomodels.net/kisao/) and is used to annotate the SEDML file. For example, one might have used a fourth-order Runge-Kutta integrator with a given step size to run a simulation; this information is specified in the SEDML file.

If an *algorithm* is very sensitive to the computer hardware and software, then I feel the simulation isn't strictly reproducible: I can replicate it but not reproduce it. Certain ODE models are naturally sensitive (e.g. chaotic models), and for these there is a problem; the same goes for stochastic models. In these cases it might be more sensible to just record aggregate summaries of the simulation, e.g. the Lyapunov exponents for chaotic simulations and means and variances for stochastic simulations. These are particular issues that are still up for grabs in biology but might have been solved in other disciplines. One suggestion I've heard for replication is to store an entire virtual machine; however, again, this doesn't guarantee reproducibility.
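As a sketch of what such a simulation description pins down, consider the following. The dictionary layout is invented for illustration (real SEDML is XML), and the KiSAO identifier shown should be treated as indicative rather than authoritative.

    # Invented stand-in for a SEDML simulation description: the point is
    # that the algorithm and its settings are recorded, not just results.
    sim_spec = {
        "model": "mymodel.xml",          # hypothetical model file
        "algorithm": "KISAO:0000032",    # an explicit 4th-order Runge-Kutta term (illustrative)
        "step_size": 0.1,
        "duration": 50.0,
    }

    def rk4_step(f, y, t, h):
        # One classical 4th-order Runge-Kutta step for dy/dt = f(t, y).
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + h / 2, y + h / 2 * k2)
        k4 = f(t + h, y + h * k3)
        return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

    # Anyone handed sim_spec can rerun the same numerics on the same model:
    f = lambda t, y: -0.5 * y            # stand-in for the model's ODE right-hand side
    y, t, h = 1.0, 0.0, sim_spec["step_size"]
    while t < sim_spec["duration"] - 1e-9:
        y = rk4_step(f, y, t, h)
        t += h
    print(y)   # an independent rerun from the same spec should match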

Until tomorrow. :)

Herbert

On Mon, Sep 30, 2013 at 7:28 AM, Jacob Barhak <jacob.barhak@gmail.com> wrote:

Hi Herbert,

First, from your discussions it seems that SBML is software - more like a library or a compiled object. I am not sure of the details since I am not a user, yet from the examples you provided it looks similar to what is used to define computer languages - it reminds me of old exercises I did with Lex and Yacc long ago. I may be mistaken - it just seems awfully close. In any case, the discussion now is on reproducibility.

Yes. Storing virtual machines is an important step that should be adopted by modern repositories. A blank-slate virtual machine (VM) can be used as a base snapshot, and developers would then publish the derivative image with their software installed and ready to run. This is one solution I was hinting at before.

Yet there are still issues to address there. For example, random state information should be provided by the modeler to replicate a certain result, as well as additional information that is specific to the model - and probably the version of the VM hosting environment.
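A minimal sketch of such a provenance record (the field names are illustrative, not a proposed standard):

    import json
    import platform
    import random
    import sys

    # Pin the random state and record the environment alongside the
    # results; rerunning with the same seed on the same software stack
    # should replicate the numbers.
    seed = 20130930
    random.seed(seed)
    results = [random.gauss(0.0, 1.0) for _ in range(3)]

    manifest = {
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "results": results,
    }
    print(json.dumps(manifest, indent=2))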

The thing is, our computational models are not worth much unless we can replicate them - even for debugging purposes.

Nevertheless, even with publishing a VM there are still challenges in reproducibility. For example, in a High Performance Computing (HPC) environment, publishing a single VM may not be sufficient.

Again, we should start thinking more about reproducibility in all its aspects when we discuss repositories.

On Mon, Sep 30, 2013 at 10:46 AM, Hunt, C. Anthony <a.hunt@ucsf.edu> wrote:


On 9/29/13 7:19 PM, "hsauro" <hsauro@u.washington.edu> wrote:

>I meant to forward a proposed agenda from Jacob which I think is a good
>starting point for the meeting:
>
>"Here are possible points for the agenda:
>- Recent advances and leading ideas in model description

I anticipate that such a discussion will be most productive if participants are encouraged to first read Michael Weisberg's recent book, "Simulation and Similarity: Using Models to Understand the World":


http://global.oup.com/academic/product/simulation-and-similarity-9780199933662?cc=us&lang=en


>- Assembling a list of existing model repositories and reviewing their characteristics
>- Figuring out features of the next generation model repository


Based strictly on our own limited efforts to reuse simulation models, "repository features" were important, but were not among the limiting issues. Almost all of the models that we explored were simply not designed for reuse. They were built for particular uses by a particular person or group within a particular context. After the fact, they were presented to the "community" as if they could be reused for a somewhat different purpose in a somewhat different context.

I'm not criticizing those models. Simulation models can achieve their purposes without having to be designed for reuse by others. I maintain that it should be a best practice to answer the following two questions, "for the record," at the start of any MSM modeling project; the questions should be revisited at intervals thereafter.

Is the envisioned MSM simulation model (or any of its components) intended for reuse by others? If yes, within what reuse constraints? The answers become use cases, which will impact other use cases.


On Mon, Sep 30, 2013 at 12:51 PM, Joy P. Ku <joyku@stanford.edu> wrote:

Hi Jacob,

I was part of the team that helped develop the biositemaps RDF format, but I didn't have any input into the Resource Discovery System site. However, I can certainly pass along any feedback about the site.

Good idea about trying to join the workgroup breakout session remotely. Unfortunately, I am not free until partway through the meeting. I would be very interested in any notes or recordings of the session, though. If these are not being posted publicly, I would appreciate it if anyone who attends and has notes would share them with me.

On Mon, Sep 30, 2013 at 12:56 PM, Hunt, C. Anthony <a.hunt@ucsf.edu> wrote:


>>> >>> 5) Peter Sorger's group has pointed out (Lopez 2013) that such abstractions can constrain the modeler's ability to create a model and achieve new research objectives (new use cases). Constraints can be good or bad, depending on what you are trying to do and on your use cases.

>>> That is true. Rule-based models, I feel, are still in an exploratory stage. It is true that SBML looks at biology in a particular way, the textbook biochemistry way if you like. New thinking such as rule-based modeling should not of course be constrained by standards, given their experimental status, and should be able to explore their domain without any limitations. Eventually I imagine that the rule-based community will reach consensus and develop a common exchange format that will allow rule-based models to be freely exchanged between different rule-based tools, from which other advantages will flow.

That's a strange statement. Please expand. von Neumann would not have agreed. Carl Hewitt among many others would not agree. I think your ODEs are rules stated in a particular formalism? Correct?

>>> It is true that SBML looks at biology in a particular way, the textbook biochemistry way if you like. New thinking such as rule-based modeling should not of course be constrained by standards, given their experimental status

That too is a strange statement. What experimental status? Wet-lab research thrives on encouraging invention and use of new methods. For biomedically-focused MSMs to become an essential feature of the biomedical research landscape, we need a somewhat similar environment. New methods are needed to discover and validate mechanisms that provide deeper, more actionable explanations of phenomena important in healthcare.

>>> and should be able to explore their domain without any limitations. Eventually I imagine that the rule-based community will reach consensus and develop a common exchange format that will allow rule-based models to be freely exchanged between different rule-based tools, from which other advantages will flow.

They exist, are being improved, and have been in use in other domains for years; see http://en.wikipedia.org/wiki/DEVS as a starting point. Various Extended DEVS are used by all DoD Defense Modeling and Simulation Offices (http://www.msco.mil), because independent reproducibility (as defined by Jacob) is often a requirement.

For specific biologically-focused examples, see the work from Adelinde Uhrmacher's group: http://wwwmosi.informatik.uni-rostock.de/Plone/en/group/en_leye/pubs

>>> >>> 6) PySB is not the only effort in the above area. Our own and various other efforts (e.g. Extended DEVS, which deals directly with multi-attributes & multiple scales) focus on making computational models more concrete. And little b (http://www.littleb.org/ ) is another notable attempt to strike a middle ground between abstraction and implementation.

>>> I understand. You have to remember that SBML was formulated before any of these alternatives were even thought of and SBML is very helpful for what it was designed for. PS I'm not familiar with Extended DEVS, and a search on Google didn't come up with anything, at least on the first page.

>>> >>> 7) All 4 of these (little b, Kappa, PySB, and SBML-Multi) are standards that can be used to facilitate sharing, curation, and composition of (primarily) molecular systems biology models. Each has strengths and weaknesses. A serious MSM effort would evaluate and choose each standard based on the objectives of the effort (and other use cases), rather than on a compulsion to use standards or an accidental adoption of any one of them.

>>> Our definition of a standard, both in the SBML community and the synthetic biology community, is that at least two tools must be able to exchange models using the agreed format. If we didn't do it this way, we'd get lots of standards proposed but none actually used. So a proposal only becomes a standard when it is used in practice.

>>> >>> [Lopez et al 2013] "Programming biological models in Python using PySB" http://www.nature.com/msb/journal/v9/n1/full/msb20131.html?WT.ec_id=MSB-v9/n1

>>> >>> Each of the above efforts was motivated by new or additional model use cases. Something new was demanded of the simulation model.

>>> That's fine, one shouldn't be using community standards for bleeding edge research since the standard will by definition be behind the curve, although one can inform the standards community of what is going on. This actually has implications for multiscale modeling, that is, is the field even ready to formalize a community-agreed way to describe models? Perhaps it isn't. My suspicion is that it isn't quite ready.

The larger biomedical research community sees members of the MSM consortium as engaged in bleeding-edge MSM research.

>>> This actually has implications for multiscale modeling, that is, is the field even ready to formalize a community-agreed way to describe models? Perhaps it isn't. My suspicion is that it isn't quite ready.

"What sharable evidence is supporting that suspicion?"

Every MS biomedical simulation model that I've encountered can be described and specified in one or another version of DEVS. However, a DEVS specification requires work and should only be undertaken for a credible MSM that is expected (and designed) to have a long lifetime.

I believe that, in the context of an R&D focus, the MSM community should reject any effort to seek and formalize a specific, community-agreed upon way to describe models. Rather, I would promote clarity, attention to detail, and adherence to best practices.

Why? Given a DEVS specification of any MSM and a week to discuss future issues of interest to wet-lab researchers doing the referent research, I can add a new use case to the current list that is beyond the scope of the current DEVS extensions. That should not stop us from revising that MSM to meet the envisioned new needs.

Complex wet-lab experiments are executed following the assembly of many components (some living), each of which has deeply established credibility under stated constraints. The components are constantly improving; new ones are introduced weekly. How they are used may or may not lead to useful experiments.

Simulation experiments (and the models used) will eventually follow a similar course. That can happen when complicated MSMs are assembled from many independent (somewhat autonomous) biomimetic modules, where the credibility of each module is documented. Some MSM researchers may even network modules in strange ways (as kids do with LittleBits: http://littlebits.com).

>>> We follow well-established best practices used by software and systems engineers (adjusted to our domain). Our use case documents specify the following: the context for the envisioned model (domain experts' research questions); is its use one-off or is it intended for reuse; who the model might serve, who will use it (who will conduct and evaluate in silico experiments), what will be expected from it near- and longer-term; what are the current, referent wet-lab experiments (what are the biological aspects/phenomena of interest), what are the validation targets, how will similarities be measured; how will non-biomimetic features be identified, etc., etc.

>>> This is an excellent list. I suspect however (but could be wrong) that you might be the only group doing this. Worth advertising to the rest of the world I think.


I have colleagues who shun such work. All future MS simulation models will be engineered devices. We simply try to follow well-established engineering best practices.

>>> We often have multiple, referent wet-lab experiments and an expectation that future simulation results will be able to map to some specific feature of those experiments. Each of those experiments may thus become a current or future use case. A referent wet-lab experiment is a model use case.

>>> We need all of that information before focusing our thinking on (model) requirements. Some requirements may be beyond current technology. In that case we focus on the achievable and may add a new use case: our near-term models must be evolvable.

>>> Given requirements, we can think about specifications.

>>> Given specifications, we have enough information to focus on selecting (and excluding) modeling methods and models of computation (MoCs).

>>> What you suggest seems to be the right way to think about it but given human frailties I think it might be a hard sell.

Yes, it will be a hard sell to anyone who prefers to "do their own thing" at the expense of best practices. Regular use of GLPs requires training that will exclude some individuals with incompatible human frailties.

The above tasks are not time-consuming (typically a few hours) if initiated at the start of an M&S project.

Regards

-Tony-


On Mon, Sep 30, 2013 at 1:49 PM, hsauro <hsauro@u.washington.edu> wrote:

>>> That is true. Rule-based models, I feel, are still in an exploratory stage.

That's a strange statement. Please expand. von Neumann would not have agreed. Carl Hewitt among many others would not agree. I think your ODEs are rules stated in a particular formalism? Correct?

The rule-based modeling I know of has only been around since 2003, and people are still exploring its implications, which I think are considerable. Traditional ODE modeling of cellular networks has been with us since the 1940s. I think there are still a number of outstanding issues with the rule-based approach, in particular what kinetic rate constants one assigns to the thousands of reactions that the rules generate. The current default is to give every rate constant the same value; is that good or bad? How does one even analyse such large networks? All our current analytical and numerical tools are not developed for such large systems. Maybe other fields have experience with such large systems; I'm thinking here of the atmospheric chemistry people. Plus we probably shouldn't be modeling such networks using ODEs, given the huge number of states and relatively small number of molecules. We're only now coming to terms with how to model these systems efficiently, i.e. using network-free approaches. Finally, how does one even attempt to validate such systems? I can't easily imagine how one would. The number of states can run into the 10K or more. I think a lot more research needs to be done in the area, hence "experimental".
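A back-of-the-envelope illustration of that combinatorial explosion (the site count below is arbitrary):

    from itertools import product

    # A protein with n independent modification sites has 2**n internal
    # states; a single rule written over those sites therefore implies
    # that many distinct species, and far more reactions.
    n_sites = 14
    states = list(product((0, 1), repeat=n_sites))
    print(len(states))   # 16384 species from one rule's worth of sites
    # Network-free simulators sidestep this by never enumerating the
    # full network, tracking only the molecules actually present.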

I think, however, the rule-based field is very interesting because it gives a completely new perspective on protein networks. I fully support this research because it breaks new ground and makes us think differently about how protein networks might operate.

Maybe I misread you, but I am referring to the Blinov/Hlavacek/Faeder rule-based modeling; this is what that SBML extension package is all about. Are you referring to something else? If so, I apologize.

"That too is a strange statement. What experimental status? "

See my statement above; not so strange. I'm all for new methods, but when it comes to standardization we don't want to be too early, otherwise it can stifle the emerging field.

"What sharable evidence is supporting that suspicion? "

Reports from proposal review panels and discussions with MSM members. If you think it is not too early, I would very much be interested in any proposal you have. My criterion for deciding whether to help a community develop a standard is whether there is a clear need voiced by the community itself. In reaction modeling the need was the ability to share models across different software platforms; modelers were asking for it. In synthetic biology the need was the ability to reproduce engineered designs; again, people were frustrated at not being able to take a published design and make it work in their own lab. Both efforts have been quite successful. I would probably not attempt to start formulating a standard unless the practitioners in the field felt there was a need for something that they could benefit from. In the multiscale community I've not sensed such a need for any standards except from a very few practitioners, i.e. about three. There is a group at UW which is thinking about standards for multiscale modeling, but they are getting pushback all the time from the community, so it's a bit of a struggle.

Later in the email you even support this assertion: "I believe that, in the context of an R&D focus, the MSM community should reject any effort to seek and formalize a specific, community-agreed upon way to describe models." I think I agree with you.

If not standard exchange formats, then the community might want to come up with what it thinks it needs. Maybe a simple repository of models in the form of source code or VMs will be sufficient for its needs. I hope you don't think I'm pushing any agenda; if the multiscale community doesn't want a standard formalism, then that's fine.

If we have the workshop this week, I am looking forward very much to your talk; the reason I asked you to present is that your perspective is important to hear. If we don't, maybe you could do a webinar?

Herbert

Tutorials

Sharing Data via PhysioNet: A tutorial introduction for the MSM community to PhysioNetWorks. --MoodyG 12:26, 30 October 2012 (EDT)

Past Presentations

Tuesday October 24, 2012 IMAG Meeting

Discussion topic: Progress in the last 12 months in the standards community

SLIDES

Thursday January 19, 2012 12-1pm EDT - Working Group Leads will moderate a discussion

TITLE: Project files: Reproducible Scientific Units.
ABSTRACT -

In this session we would address the essential close relationships amongst hypotheses (models), data, model verification (for mathematical accuracy), validity testing of models against data, parameter estimation and confidence ranges, and the open sourcing of models, data, and data analyses for widespread dissemination. The JSim project file, by including data storage, the platform for model design, the technologies for the modeling analysis of data, and the displays of results, is an exemplary unit for the exchange of reproducible science.

SLIDES

Wednesday October 19, 2011 4-5pm ET

Discussion topic: Model reproducibility and component/parts repositories

SLIDES

Friday February 25, 2011 3-4pm EDT - Working Group Leads will moderate a discussion

TITLE: Data Sharing within the MSM Community - Is it Desirable? Feasible?
ABSTRACT - Discussion Guide

--MoodyG 16:27, 23 January 2013 (EST)
