How much training data do you need?

The quality and amount of training data is often the single most dominant factor that determines the performance of a model. Once you have the training data angle covered, the rest usually follows. But exactly how much training data do you need? The correct answer is: it depends. It depends on the task you are trying to perform, the performance you want to achieve, the input features you have, the noise in the training data, the noise in your extracted features, the complexity of your model and so on. So the way to find out how all these variables interact is to train your model on varying amounts of training data and plot learning curves. But this requires you to already have a decent amount of training data to construct interesting plots. What do you do when you are just starting out? Or when you suspect you have too little training data and want to estimate how big a problem you have?

So instead of the dead accurate “correct” answer to the problem, how about an estimate, a practical rule of thumb? One way out is to take an empirical approach as follows. First, automatically generate a lot of logistic regression problems. For each generated problem, study the relationship between the amount of training data and the performance of the trained models. Observing this relationship over a range of problems, generalize to a simple rule.

Here is the code to generate a range of logistic regression problems and study the effect of varying the amount of training data. The code is based on TensorFlow. Running the code doesn’t require any special software or hardware (TensorFlow is open sourced by Google), and I was able to run the entire experiment on my laptop. Upon running, the code spits out the graph below.

The x-axis is the ratio of the number of training samples to the number of model parameters. The y-axis is the f-score of the trained model. The curves in different colors correspond to models that differ in the number of parameters. For example, the red curve, which corresponds to a model with 128 parameters, indicates how the f-score changes as one varies the number of training samples to 128 x 1, 128 x 2 and so on.

The first observation is that the f-score curves don’t change as the number of parameters scales. This is expected given that the models are linear, and it’s good to see that some hidden non-linearity doesn’t creep in. Of course, larger models need more training data, but for a given ratio of the number of training samples to the number of model parameters you get the same performance. The second observation is that when the ratio of training samples to model parameters is 10:1, the f-score lands in the vicinity of 0.85, which we take as the definition of a well performing model. This leads us to the rule of 10: the amount of training data you need for a well performing model is 10x the number of parameters in the model.
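The experiment described above can be approximated in plain Python. The following is a minimal sketch, not the TensorFlow script the article refers to: the Gaussian features, the batch gradient descent settings, the test set size and all function names are assumptions made for illustration, and the f-scores it prints will not necessarily match the 0.85 figure reported above.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample(true_w, n, rng):
    """Draw n (features, label) pairs; labels follow the true logistic model."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        p = sigmoid(sum(w * xi for w, xi in zip(true_w, x)))
        data.append((x, 1 if rng.random() < p else 0))
    return data


def train(data, n_params, epochs=200, lr=0.1):
    """Fit logistic regression weights with plain batch gradient descent."""
    w = [0.0] * n_params
    for _ in range(epochs):
        grad = [0.0] * n_params
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def f_score(w, data):
    """F1 = 2*TP / (2*TP + FP + FN), predicting 1 when the linear score is positive."""
    tp = fp = fn = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def run_experiment(n_params=8, ratios=(1, 2, 5, 10, 20), seed=0):
    """Train on ratio * n_params samples for each ratio; score on a fixed test set."""
    rng = random.Random(seed)
    true_w = [rng.gauss(0.0, 1.0) for _ in range(n_params)]  # hypothetical problem
    test_set = sample(true_w, 500, rng)
    results = []
    for ratio in ratios:
        train_set = sample(true_w, n_params * ratio, rng)
        results.append((ratio, f_score(train(train_set, n_params), test_set)))
    return results


if __name__ == "__main__":
    for ratio, score in run_experiment():
        print(f"samples:parameters = {ratio}:1 -> f-score {score:.2f}")
```

Repeating `run_experiment` for several values of `n_params` and plotting f-score against the ratio gives one curve per model size, which is the shape of plot discussed above.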

The rule of 10 turns the problem of estimating the amount of training data required into the problem of knowing the number of parameters in the model, so it deserves some discussion. For linear models such as logistic regression, the number of parameters equals the number of input features, since the model assigns one parameter to each feature. However, there could be some complications:

  • Your features may be sparse, so counting the number of features may not be straightforward.
  • Due to regularization and feature selection techniques a lot of features may be discarded, so the real feature count is much smaller than the number of raw features that are input to the model.

One way to tackle the issue is to observe that you don’t really need labeled data to get an estimate of the number of features; even unlabeled examples are sufficient for that purpose. For example, given a large corpus of text, you can generate histograms of word frequencies to understand your feature space before beginning to label the data for training. Given the histogram, you can discard the words in the long tail to get an estimate of the real feature count, which, applying the rule of 10, then gives an estimate of the amount of training data you need.
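As a rough illustration of that estimate, here is a minimal sketch assuming bag-of-words features, whitespace tokenization and a frequency cutoff of 2 for the long tail. The cutoff, the toy corpus and the function names are all hypothetical choices, not something prescribed by the experiment.

```python
from collections import Counter


def estimate_feature_count(documents, min_count=2):
    """Count distinct words that occur at least min_count times in the corpus."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    # Words below the cutoff are treated as the long tail and discarded.
    return sum(1 for c in counts.values() if c >= min_count)


def rule_of_10(n_params, ratio=10):
    """Training samples suggested by the rule of 10 for n_params parameters."""
    return ratio * n_params


# Unlabeled toy corpus; in practice this would be a large body of text.
corpus = [
    "the model needs more training data",
    "more data beats a cleverer model",
    "the rule of ten is a rule of thumb",
]
n_features = estimate_feature_count(corpus, min_count=2)
print(n_features, rule_of_10(n_features))  # prints "7 70"
```

On a real corpus the cutoff matters a great deal, so it is worth looking at the full histogram before picking one.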

Neural networks pose a different set of problems from linear models like logistic regression. To get the number of parameters in a neural network you need to:

  • Count the number of parameters used in the embedding layer if your input is sparse (see the TensorFlow tutorial on word embeddings for example).
  • Count the number of edges in your network.

The problem is that the relationship between the parameters in a neural network is no longer linear, so the empirical study we did based on logistic regression doesn’t really apply anymore. In such cases you can treat the rule of 10 as a lower bound on the amount of training data needed.
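The counting described above can be sketched as follows for a simple feed-forward network. This assumes an embedding layer whose output (of width `embedding_dim`) feeds a stack of fully connected layers; the function names and the example sizes are hypothetical, and bias terms are left out by default since the text counts only edges.

```python
def count_parameters(vocab_size, embedding_dim, layer_sizes, include_biases=False):
    """Embedding parameters plus one parameter per edge between dense layers.

    layer_sizes lists the widths of the dense layers after the embedding,
    for example [64, 1] for one hidden layer and a single output unit.
    """
    embedding = vocab_size * embedding_dim
    edges = 0
    biases = 0
    prev_width = embedding_dim
    for width in layer_sizes:
        edges += prev_width * width  # fully connected: every input to every unit
        biases += width
        prev_width = width
    total = embedding + edges
    return total + biases if include_biases else total


def min_training_samples(n_params, ratio=10):
    """Rule of 10, read as a lower bound for non-linear models."""
    return ratio * n_params


n_params = count_parameters(vocab_size=10_000, embedding_dim=32, layer_sizes=[64, 1])
print(n_params, min_training_samples(n_params))  # prints "322112 3221120"
```

Note how the embedding layer dominates the count in this example, which is why sparse inputs deserve the separate bullet above.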

Despite the complications above, in my experience the rule of 10 seems to work across a wide range of problems, including shallow neural nets. However, when in doubt, plug your own model and assumptions into the TensorFlow code and run the simulation to study its effects. Please feel free to share if you gain any insight in the process.
