Writing an AI To Predict Ramen Quality:
You know what's great? Ramen! But you know what's less great? Soykaf ramen! So... what if we made an AI that could predict the quality of ramen before we buy it? That'd be pretty schway, and I have nothing else to do this afternoon, so it's happening.

Finding The Data:
Alright, like all AI projects, we need to start by finding a dataset; luckily, we've got one from here. This dataset's got all sorts of information relating to the 5-star rating of the ramen dish in the final column.

      Review #           Brand  ...  Stars  Top Ten
0         2580       New Touch  ...   3.75      NaN
1         2579        Just Way  ...   1.00      NaN
2         2578          Nissin  ...   2.25      NaN
3         2577         Wei Lih  ...   2.75      NaN
4         2576  Ching's Secret  ...   3.75      NaN
...        ...             ...  ...    ...      ...
2575         5           Vifon  ...   3.50      NaN
2576         4         Wai Wai  ...   1.00      NaN
2577         3         Wai Wai  ...   2.00      NaN
2578         2         Wai Wai  ...   2.00      NaN
2579         1        Westbrae  ...   0.50      NaN

Isn't that nifty! But we really don't need all this soykaf; in fact, some of it's just unusable. The "Top Ten" category only applies to ten of the entries, so that's not very useful, and the "Variety" category was unique for almost every entry, so let's drop those two from our table. We also don't need the "Review #"; it's not really relevant.

So what we're left with is the Brand, Style, Country of Origin, and of course the actual ratings themselves.
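If you want to follow along, here's a minimal sketch of that cleanup step. I'm assuming the dataset is a CSV called ramen-ratings.csv with the column names shown in the printout above; your filename may differ.

import pandas as pd

# Load the ratings (filename is whatever you saved the dataset as)
df = pd.read_csv("ramen-ratings.csv")

# Drop the columns we decided are useless
df = df.drop(columns=["Top Ten", "Variety", "Review #"])

# Some ratings may not be numeric, so coerce them and toss whatever fails
df["Stars"] = pd.to_numeric(df["Stars"], errors="coerce")
df = df.dropna()

print(df.head())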

Pre-Processing The Data:
So our data is very stringy, as you can see: except for the actual ratings, all of it is given to us as a bunch of names and tags. There are two common ways to go about vectorizing data like this, Ordinal Encoding and One Hot Encoding.

In Ordinal Encoding we assign each of the recurring strings its own number, and then replace all instances of that string with that number. So, for example, these vectors:

["Nissin", "Japan"]
["Sapporo Ichiban", "Japan"]
["Mr Noodles", "USA"]
["Sapporo Ichiban", "USA"]

would become:

[0, 0]
[1, 0]
[2, 1]
[1, 1]

This method might seem like the straightforward and easy way of doing things, but it actually creates a significant problem in most cases of regression. See, our regression model only sees the numerical tokens, and numerically 1 is between 0 and 2. So to our model, Sapporo Ichiban is a value between Nissin and Mr Noodles. If Mr Noodles always gets a score of 2 and Nissin always gets a score of 4, Sapporo Ichiban might be seen as getting a score between those two values. The only real way to fix that would be to add another turn in our graph, which requires raising its degree (for each outlying ramen company!). Now that might be doable, but it really doesn't address the underlying issue: we're lying to the model when we put Sapporo Ichiban between those other two brands. Instead, we need to use something called One Hot Encoding.

In One Hot Encoding, each possible value gets its own slot in the input vector, so instead of the first two codeblocks we'd have vectors that look like this:

[ Nissin, Sapporo Ichiban, Mr Noodles, Japan, USA ]

[ 1, 0, 0, 1, 0 ]
[ 0, 1, 0, 1, 0 ]
[ 0, 0, 1, 0, 1 ]
[ 0, 1, 0, 0, 1 ]

So in each vector we have a series of sub-vectors (I'm calling them veclets; I just made that up, I don't really know what they're called), and each veclet is either a 1 or a 0 representing whether or not the ramen is of that brand (or country). So if it's a Sapporo Ichiban package, the veclet representing Sapporo Ichiban would be 1, and the ones representing the other brands would be 0. This sort of encoding is very common in classification problems.
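Same toy data, one-hot this time. A quick sketch with sklearn's OneHotEncoder (heads up: the sparse_output flag was called sparse on sklearn versions before 1.2):

from sklearn.preprocessing import OneHotEncoder

X = [["Nissin", "Japan"],
     ["Sapporo Ichiban", "Japan"],
     ["Mr Noodles", "USA"],
     ["Sapporo Ichiban", "USA"]]

enc = OneHotEncoder(sparse_output=False)
onehot = enc.fit_transform(X)
print(enc.get_feature_names_out())  # which slot is which brand/country
print(onehot)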

Training The Network Itself:
So we're just going to use the sklearn module for Support Vector Regression. We're going to use a 75% training data, 25% testing data split, and we're just gonna shove it into svm.SVR().fit(). No need to pick a number of epochs here; SVR isn't trained in epochs, .fit() just runs its solver once until it converges.
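Here's roughly what that looks like end to end. This is a sketch, not gospel: it assumes df is the cleaned-up table from earlier, and that the feature columns are named Brand, Style, and Country.

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# One-hot the string columns; Stars is the target
enc = OneHotEncoder(handle_unknown="ignore")  # unseen brands at predict time become all-zeros
X = enc.fit_transform(df[["Brand", "Style", "Country"]])
y = df["Stars"]

# 75% train / 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = svm.SVR()
model.fit(X_train, y_train)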

Obviously you can use any number of regression models for this problem, and you can put way more effort into them than I did, but I wanted to do this in a day and I didn't want to wait for a massive neural network to train.

Demo?:
Alright I know y'all are too lazy to download and run things yourself, so here's a demo of the program:


[Embedded demo]

Profit?:
Well, I trained it up and ran it over the test data. The average difference (just regular subtraction, not mean-squared or anything funky; the fancy name for this is mean absolute error) between reality and the prediction was ~0.58 stars, which is pretty good considering the limited dataset and the possible lack of a correlation. I might come back to this and do a more elaborate job some time later (because I'm certain I could crank out better predictions), but for now you can download the content below.
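For the curious, that number comes out of something like this. Again a sketch, assuming model, enc, X_test, and y_test from the training snippet above; the pickle filenames are just placeholders for the downloads below.

import pickle
from sklearn.metrics import mean_absolute_error

preds = model.predict(X_test)
print(f"Average miss: {mean_absolute_error(y_test, preds):.2f} stars")

# Save the model and encoder so they can be shipped as pickle files
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("encoder.pkl", "wb") as f:
    pickle.dump(enc, f)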

Links and Downloads:
>> Project Folder<<
>> Just The Pickle File (Model) <<
>> Just The Pickle File (Encoder) <<