June 19, 2020
By Jonathan Scialpi, Solutions Engineer at R3
We often hear about innovative ways that machine learning models are leveraged in applications. What we almost never hear about is how painful the experience of the data scientists who create these models can be. In this article (which assumes a basic understanding of Corda), I will walk through those pain points and show how a Corda-based application can address them.
Before we dive into the issues, let’s briefly cover some common terms and the overall workflow to produce a machine learning model. Since machine learning is a rather vast subject with many specializations, let’s focus specifically on text classification. The goal of text classification is simply to categorize a piece of text. A few common use cases for text classification are sentiment analysis, spam detectors, and intent detection.
Teaching a machine how to classify text is analogous to how you would teach a child how to detect objects that are colored “red” and those that are “not red”. You would compile a set of photos which contain red items like an apple or a fire truck and another set of items that are not red like a banana or a tree. Then you would simply show each picture to the child and categorize it aloud.
To teach a machine how to classify with accuracy, data scientists compile a data set (known as a “corpus”) with various sample texts (called “utterances”) along with their respective classification category (“labels”). This corpus will be fed to a classification algorithm which will tokenize the corpus and produce a model. This model is what the machine references when trying to make predictions.
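To make the corpus/utterance/label terminology concrete, here is a toy sketch in pure Python (not part of DCM): a minimal bag-of-words Naive Bayes classifier trained on a tiny invented corpus. The utterances and labels are made up for illustration.

```python
from collections import Counter, defaultdict
import math

# A toy corpus: each utterance is paired with its label.
corpus = [
    ("I forgot my password", "reset_password"),
    ("please reset my password", "reset_password"),
    ("I cannot log into my account", "reset_password"),
    ("what are your branch opening hours", "negative"),
    ("how do I order a new card", "negative"),
]

def tokenize(text):
    return text.lower().split()

def train(corpus):
    """Count token frequencies per label -- a minimal Naive Bayes 'model'."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for utterance, label in corpus:
        label_counts[label] += 1
        for tok in tokenize(utterance):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, token_counts, vocab

def predict(model, text):
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihoods with add-one smoothing
        score = math.log(label_counts[label] / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(corpus)
print(predict(model, "reset my password please"))  # reset_password
```

Real pipelines use far larger corpora and proper tokenization, but the shape is the same: utterances plus labels go in, a model that can predict labels comes out.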
There are four main ways to acquire training data, and each has its own issues:
These sites are great for getting started in machine learning, but most of the time they will not have the data you are looking for, since every classification model is unique to its use case. For example, say you are building an intent detection application for a mobile banking app that determines whether someone is asking to reset their password or not (the "not" label is commonly referred to as the "negative" category). You go to Kaggle, download a data set, train your model, and test it, only to see a 60% accuracy score: utterances such as "I can't get into my USA banking mobile app!" fail because the downloaded data set contained generalized utterances such as "I lost my Excel document containing all of my credentials…". Mismatches like this can cause models to overfit or underfit.
If you're [un]lucky (depending on how you look at it), the client you are creating a classification model for will tell you they have heaps of data you can use to train the model, only for you to discover that the "heaps of data" are really just unstructured exports of help-desk incident tickets from people trying to access their accounts. This, of course, means you must manually sift through these logs. You could write a regex-based script to speed up the process, but you will eventually still have to sift and structure the data according to your training needs.
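Such a regex-based sifting script might look like the sketch below. The ticket format, field layout, and patterns are all invented for illustration; a real export would need its own parsing rules.

```python
import re

# Hypothetical raw help-desk export lines (format invented for illustration).
raw_tickets = [
    "TICKET-1042 | 2020-05-01 | user says: I can't get into my account!!",
    "TICKET-1043 | 2020-05-01 | printer on floor 3 is out of toner",
    "TICKET-1044 | 2020-05-02 | user says: please reset my password",
]

# Pull out the free-text portion and keep only access-related tickets.
ACCESS = re.compile(r"(password|log ?in|get into|account)", re.IGNORECASE)
TEXT = re.compile(r"user says:\s*(.+)$")

def sift(lines):
    utterances = []
    for line in lines:
        m = TEXT.search(line)
        if m and ACCESS.search(m.group(1)):
            # A label still has to be assigned by a human reviewer.
            utterances.append(m.group(1).strip())
    return utterances

print(sift(raw_tickets))
```

Even with a script like this, the output is only candidate utterances; labeling and structuring them for training remains manual work.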
If you want something done right, do it yourself… right? Besides the fact that you would probably rather spend your time on something other than writing utterance after utterance, writing the corpus yourself can produce a biased data set in which you have imposed conscious or unconscious preferences that may go undetected until your app hits production.
You can use sites like Mechanical Turk to offer a five-cent bounty for every utterance related to an intent you are classifying. For a low price, this relieves you of the burden of typing up your own utterances, but there is still no guarantee that people will follow your instructions perfectly, and you will be stuck manually reviewing each submitted utterance.
Besides underfitting, overfitting, structure, bias, and time lost in review, data can also become corrupted. One human error in reviewing a pull request on a corpus can have disastrous effects on a model's performance. Even online training solutions, where models learn from live or batched data, are vulnerable: a bug in data ingestion or filtering can render a model (and the application built on it) useless.
Based on my experience with the above issues, I believe the crowdsourcing approach is the most practical. It gives the best chance of avoiding bias while also relieving the burden of creating a corpus from scratch. However, the problems of underfitting/overfitting, manual review, and manual corpus maintenance still exist. If only there were an automated way to ensure crowdsourced contributions follow my format and to accept them only if they have a positive effect on the model's performance…
Introducing the Decentralized Corpus Manager! DCM is an application I built to facilitate data crowdsourcing for machine learning models using Corda. Beyond crowdsourcing, Corda contracts provide a way to tackle the three remaining problems: underfitting/overfitting, manual review, and manual corpus maintenance.
Step 1 — The user issuing the corpus (known as the “owner”) can call the IssueCorpus REST API to create a CorpusState.
Step 2 — The API invokes the IssueCorpusFlow, which builds the transaction containing the new CorpusState.
Step 3 — The flow makes an external POST request to the Flask server which consumes the corpus. The utterances and their respective labels are then used to build a model and are measured against a test set.
Step 4 — The results of the test set are passed back to the IssueCorpusFlow as a JSON response to be included as the classificationReport attribute on the CorpusState.
Step 5 — Lastly, all participants sign and verify the new state and their ledgers are updated with the new CorpusState.
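The shape of the data carried through steps 1-5 can be sketched as follows. This is an illustrative Python analogue of the state, not the actual CorDapp code, and every field name other than the classificationReport attribute mentioned above is an assumption.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the on-ledger state described in steps 1-5.
@dataclass
class CorpusState:
    owner: str                 # the party who issued the corpus
    participants: list         # parties whose ledgers record the state
    corpus: list               # (utterance, label) pairs
    classification_report: dict = field(default_factory=dict)

state = CorpusState(
    owner="PartyA",
    participants=["PartyA", "PartyB"],
    corpus=[("reset my password", "reset_password")],
)
# In step 4 the Flask server's test results are attached to the state.
state.classification_report = {"accuracy": 0.91}
print(state.classification_report["accuracy"])  # 0.91
```

The key point is that the model's measured performance travels with the corpus itself, so every signer in step 5 sees both the data and its effect.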
Steps 1–5 for the corpus update are similar to those of the issuance. The major difference is the command used and its associated requirements in the smart contract.
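The update rule this implies (accept a contribution only if it improves the model) can be sketched as below. This is illustrative Python, not the actual Corda contract code, and the state layout is an assumption.

```python
# Hypothetical sketch of the update command's contract requirement:
# a contribution is only accepted if model accuracy improves.
def verify_update(input_state, output_state):
    old = input_state["classification_report"]["accuracy"]
    new = output_state["classification_report"]["accuracy"]
    delta = new - old
    if delta <= 0:
        raise ValueError("contribution rejected: accuracy did not improve")
    return delta

old_state = {"classification_report": {"accuracy": 0.80}}
new_state = {"classification_report": {"accuracy": 0.83}}
print(round(verify_update(old_state, new_state), 2))  # 0.03
```

Because this check lives in the contract, every signing participant enforces it; a degrading contribution simply cannot produce a valid transaction.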
Now that I have shown that blockchain can be used to solve the issues that come with producing machine learning models, let's examine why Corda is the right blockchain platform for this use case by comparing it with Microsoft's Decentralized & Collaborative AI on Blockchain framework (which I will refer to as DeCA). DeCA is built on the Ethereum platform to host and train publicly available machine learning models.
While both DCM and DeCA store data and metadata related to the model in progress on-chain, it's important to distinguish how the term "on-chain" affects cost. Because DCM uses Corda, that data can be stored at low cost in a cloud database such as Azure Database for PostgreSQL. DeCA, on the other hand, stores its data on the Ethereum blockchain, which is extremely expensive.
As demonstrated by the table above, a 1GB write to the Ethereum blockchain could cost millions of dollars, making DeCA unusable at scale. To work around the storage cost issue, DeCA only allows contributors to propose one data sample at a time to minimize write fees: "As of July 2019, it costs about USD0.25 to update the model on Ethereum". The authors further explain that this fee could be removed for contributors through reimbursement.
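To see roughly where "millions" comes from, here is a back-of-the-envelope calculation. The 20,000 gas per new 32-byte storage slot comes from the Ethereum protocol; the gas price and ETH/USD rate below are rough assumptions for the 2019-2020 period, so treat the result as an order of magnitude, not a quote.

```python
# Back-of-the-envelope cost of writing 1 GB to Ethereum contract storage.
GAS_PER_SLOT = 20_000      # SSTORE of a new 32-byte slot (protocol constant)
SLOT_BYTES = 32
GAS_PRICE_GWEI = 20        # assumed gas price
ETH_USD = 200              # assumed 2019-2020 exchange rate

slots = (1024 ** 3) // SLOT_BYTES          # 32-byte slots in 1 GB
gas = slots * GAS_PER_SLOT
eth = gas * GAS_PRICE_GWEI * 1e-9          # gwei -> ETH
usd = eth * ETH_USD
print(f"~{usd / 1e6:.1f} million USD")     # ~2.7 million USD
```

Even with generous assumptions, bulk data storage on Ethereum sits orders of magnitude above a cloud database bill.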
There are four drawbacks with this approach:
Due to the nature of the gossip protocol, every transaction that is made through DeCA will be seen throughout the network. This includes the proposed corpus contributions, the corpus itself (say goodbye to your “secret sauce”), and any rewards/punishments from its incentive mechanism.
Transactions through DCM, by contrast, are sent as point-to-point messages, shared only on a need-to-know basis (lazy propagation). Subjective transactions open the door to modifications of how DCM currently works:
These two modifications are especially interesting since the idea of multiple states doesn't exist on Ethereum, whose blockchain represents one global state. The only way for DeCA to keep the model private while still accepting contributions would be to store the model off-chain. In contrast, the privacy modifications for DCM would still store both the contribution and the model on-chain!
Although lazy propagation allows for private transactions, the obvious trade-off compared to the gossip protocol is decentralization. Being built on Ethereum, DeCA by default sends every transaction to every node on the network, allowing any node to participate. Although DCM doesn't include every node on the network in every transaction by default, my current implementation doesn't prevent any node from becoming a contributor: any participant included within a corpus state can add any other participant on the network. Furthermore, if you had a list of every participant on the network, you could technically include each of them on a corpus state issuance.
Thus, participation in DCM is by default semi-decentralized with the potential to be fully decentralized. Contributions are still decentralized between those that are included on the state since any of those participants can propose state changes.
To ensure participation in a crowdsourcing application, building an incentive mechanism is vital. Although my first goal was to prove this functionality was possible on Corda, I built DCM knowing that an incentive mechanism would probably be the next task in the pipeline. With the delta variable mentioned earlier (in the Corpus Contribution Flow section), there is a way to tell whether a user's contribution had a positive or negative effect on the model's performance, and to measure by how much it improved or degraded. Using this measurement, a payout scheme can be implemented, such as DeCA's staking approach or something more rudimentary: for each positive delta point, the owner of the model pays out X amount of the DCM tokens that were used to fund the crowdsourcing project.
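Such a rudimentary payout scheme could be sketched as follows. The token rate and the definition of a "delta point" (one percentage point of accuracy improvement) are invented here for illustration; DCM does not yet implement this.

```python
# Hypothetical token payout: X tokens per positive delta point.
TOKENS_PER_POINT = 10

def payout(old_accuracy, new_accuracy):
    # One "delta point" = one percentage point of accuracy improvement.
    delta_points = round((new_accuracy - old_accuracy) * 100)
    return max(delta_points, 0) * TOKENS_PER_POINT

print(payout(0.80, 0.83))  # 30
print(payout(0.80, 0.78))  # 0 -- degradations earn nothing
```

Because the delta is computed by the contract-gated flow rather than self-reported, the payout is tied to a measured, signed-off improvement.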
DCM currently stores the corpus that is used to produce a model and its respective metadata, but does not actually store the model object. I don't think this would be difficult to accomplish, and it could add value to the CorDapp since it allows for more flexibility. For example, you might have a corpus state containing data you wish to keep private, but would like to sell subscriptions to other participants willing to pay a fee to use your model's classifications.
The storage of a corpus and its respective model on-chain opens the door for the sharing of these data sets for the common good or to make a profit.
DCM currently uses a FlowExternalOperation to make a POST request to the Flask server endpoint, which responds with the classification report. Instead, I'd like to examine Corda Oracles to see how their usage could benefit the CorDapp. An immediate benefit is that an Oracle is a permissioned node on your network, so you can trust the response you receive from the classification endpoint when you see the Oracle's signature on the transaction.
Thirty per cent of the training data submitted is randomly selected and used to test the classification model's performance and produce a classification report. Because the selection is random, the smaller the data set, the greater the chance that new data added to the corpus ends up in the testing set rather than the training set. The Flask API should be altered to draw the test set only from data outside of the contribution set, guaranteeing that new contributions are used to train the model rather than test it.
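The proposed fix can be sketched like this: draw the random 30% test split only from the pre-existing corpus, so that new contributions always land in the training set. All names and data here are invented for illustration.

```python
import random

def split(existing, contributions, test_fraction=0.3, seed=42):
    """Hold the test set out of the pre-existing corpus only;
    new contributions are guaranteed to be trained on, never tested on."""
    rng = random.Random(seed)
    shuffled = existing[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    train_set = shuffled[n_test:] + contributions
    return train_set, test_set

existing = [(f"utterance {i}", "label") for i in range(10)]
contributions = [("new utterance", "label")]
train_set, test_set = split(existing, contributions)
print(len(train_set), len(test_set))  # 8 3
```

This keeps the delta measurement meaningful: a contribution is always evaluated by how it changes the model, not by whether it happened to be quizzed on itself.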
The power of crowdsourcing is manifested in the development of machine learning models. Whether it's for classification, regression, or clustering, machine learning models are only as good as the data that supports them. Blockchain technology protects that data with a distributed smart contract which programmatically ensures that crowdsourced data is structured, high-quality, immutable, and as public or private as you need it to be.
As demonstrated through DCM, Corda is the best platform for the job. Microsoft's DeCA framework focused on "publicly accessible" models because the gossip protocol isn't built for privacy, while the Ethereum platform can only offer anonymity, hashed data, or off-chain solutions. On Corda, however, you can instil your own participation rules, and transactions are subjective by default. Without any sort of privacy, there would be no way to secure your machine learning assets on-chain, which would leave your application usable only for non-profit or educational purposes.
Furthermore, storing data on-chain using platforms like Ethereum is extremely expensive and not a great idea for a use case that relies heavily on data sets which could be huge. Machine learning applications on these platforms simply will not scale. Corda states and transactions, on the other hand, can be stored on cloud-hosted databases for a very low cost.
Thanks for reading and feel free to reach out if you are interested in contributing to DCM.