When starting work on a machine learning problem, it is generally good practice to begin with a very simple algorithm that you can implement quickly. The recommended approach:
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Error analysis: Manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you can spot any systematic trend in the types of examples it gets wrong.
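The learning-curve step above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a prescribed method: the synthetic 1-D data, the toy nearest-centroid classifier, and all function names (`centroid_fit`, `learning_curve`, `make_data`) are hypothetical stand-ins for your own model and data.

```python
import random

def centroid_fit(xs, ys):
    """Toy 1-D nearest-centroid classifier: store the mean of each class."""
    means = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(pts) / len(pts)
    return means

def centroid_predict(means, x):
    # Predict the class whose mean is closest to x.
    return min(means, key=lambda label: abs(x - means[label]))

def error_rate(means, xs, ys):
    wrong = sum(centroid_predict(means, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)

def learning_curve(train, cv, sizes):
    """Return (m, training error, CV error) for each training-set size m."""
    cv_xs, cv_ys = zip(*cv)
    curve = []
    for m in sizes:
        xs, ys = zip(*train[:m])  # train on the first m examples only
        model = centroid_fit(xs, ys)
        curve.append((m, error_rate(model, xs, ys), error_rate(model, cv_xs, cv_ys)))
    return curve

def make_data(n, rng):
    # Synthetic data (assumption): class 0 ~ N(0, 1), class 1 ~ N(2, 1).
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

rng = random.Random(0)
train = make_data(200, rng)
cv = make_data(100, rng)
curve = learning_curve(train, cv, sizes=[10, 50, 100, 200])
for m, train_err, cv_err in curve:
    print(f"m={m:4d}  train error={train_err:.3f}  cv error={cv_err:.3f}")
```

Reading the result: if the training and cross-validation errors are both high and close together, more data is unlikely to help. A large gap between a low training error and a high cross-validation error suggests more data may help.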
Error analysis means manually examining the cross-validation examples that your algorithm got wrong, to see if you can work out why. For example, with a spam classifier:
- You build a spam classifier and evaluate it on a cross-validation set of 500 examples; the error rate is high, with 100 examples misclassified.
- Manually look at these 100 errors and categorize them by email type and by the features that might have helped classify them.
- Suppose you find that the most common types of misclassified spam are pharmacy emails and phishing emails.
- Then try new features that may help classify these emails correctly, e.g.:
  - Deliberate misspellings
  - Unusual email routing
  - Unusual punctuation
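The categorization in the example above amounts to a simple tally. The sketch below assumes you have already labeled each misclassified email by hand; the record layout (`category`, `traits`) and the six sample records are purely illustrative, not real data.

```python
from collections import Counter

# Hypothetical hand-labeled records for misclassified CV emails (illustrative only).
misclassified = [
    {"id": 1, "category": "pharmacy", "traits": ["deliberate misspelling"]},
    {"id": 2, "category": "phishing", "traits": ["unusual email routing"]},
    {"id": 3, "category": "pharmacy", "traits": ["deliberate misspelling", "unusual punctuation"]},
    {"id": 4, "category": "other", "traits": []},
    {"id": 5, "category": "phishing", "traits": ["unusual punctuation"]},
    {"id": 6, "category": "pharmacy", "traits": []},
]

def tally_errors(errors):
    """Count misclassified examples per email category and per observed trait."""
    by_category = Counter(e["category"] for e in errors)
    by_trait = Counter(t for e in errors for t in e["traits"])
    return by_category, by_trait

by_category, by_trait = tally_errors(misclassified)
for name, count in by_category.most_common():
    print(f"{name}: {count}")
for name, count in by_trait.most_common():
    print(f"{name}: {count}")
```

The largest buckets indicate where a new feature is most likely to pay off; a trait that appears in only a handful of errors is probably not worth the engineering effort.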
The importance of numerical evaluation
- Having a single numerical error metric (e.g. cross-validation error) lets you run quick A/B-style comparisons: with vs. without stemming, distinguishing upper/lower case or not, etc.
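A with/without-stemming comparison can be sketched as follows. Everything here is an assumption made for illustration: `crude_stem` is a deliberately rough suffix stripper (not a real stemmer such as Porter's), the classifier simply flags words seen only in spam, and the eight emails are toy data.

```python
def crude_stem(word):
    """Very rough suffix stripper, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text, stem=False, lowercase=True):
    if lowercase:
        text = text.lower()
    tokens = text.split()
    return [crude_stem(t) for t in tokens] if stem else tokens

def train_vocab(train, **opts):
    """Collect words seen in spam but never in non-spam training emails."""
    spam_words, ham_words = set(), set()
    for text, is_spam in train:
        (spam_words if is_spam else ham_words).update(tokenize(text, **opts))
    return spam_words - ham_words

def cv_error(train, cv, **opts):
    """Fraction of CV emails misclassified under the given preprocessing options."""
    vocab = train_vocab(train, **opts)
    wrong = 0
    for text, is_spam in cv:
        predicted = any(tok in vocab for tok in tokenize(text, **opts))
        wrong += predicted != is_spam
    return wrong / len(cv)

# Toy data (hypothetical): (email text, is_spam)
train = [
    ("discounts on pills", True),
    ("claim your prize", True),
    ("meeting at noon", False),
    ("lunch on friday", False),
]
cv = [
    ("discount pill offer", True),
    ("claiming prizes now", True),
    ("friday meeting notes", False),
    ("noon lunch", False),
]

err_plain = cv_error(train, cv, stem=False)
err_stem = cv_error(train, cv, stem=True)
print(f"without stemming: {err_plain:.2f}, with stemming: {err_stem:.2f}")
```

The same `cv_error` call with `lowercase=False` gives the case-sensitive variant, so each design choice reduces to comparing two numbers rather than debating by intuition.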