Rails 4 Nested Select Query Using ActiveRecord

Just skip to the last part if you are in a rush 😉

Last year I upgraded a legacy Rails website from version 3.2 to 4.2. One daunting task was upgrading the ActiveRecord RubyGem, since its API changed drastically in the new release. Luckily there was a RubyGem built specifically to preserve the functionality of the deprecated API, at least for now. To prepare for the inevitable, I want to write down some of the changes made recently for future reference, so I don’t have to search through the documentation and Stack Overflow again.

What’s the problem?

We have a method that returns a report that aggregates some data. Seven years ago, when this application was built, there wasn’t much data, so this was fine. Now the report won’t load, for an obvious reason: the query that calculates col2 performs terribly at the current data volume. Therefore, we have to optimize it.

def self.get_summary(params)
  params[:fields] = [
    "column1 as col1",
    "GROUP_CONCAT(DISTINCT(column2) ORDER BY column2 SEPARATOR ',') as col2",
    "sum(column3) as col3"]
  args = parse_params(params)
  self.find(:all, :select => args[:select],
            :conditions => args[:conditions],
            :group => 'col1',
            :from => "#{self.table_name}")
end
Rails 3 query, ActiveRecord style

The above Rails code translates to the following SQL query:

select
    column1 as col1,
    GROUP_CONCAT(DISTINCT(column2) ORDER BY column2 SEPARATOR ',') as col2,
    sum(column3) as col3
from table1
where lastUpdatedDate between '20XX-01-XX' and '20XX-02-XX'
AND (column3 <> 0)
GROUP BY col1;
Rails 3 query in SQL

What’s the solution?

We decided to use a nested query: group by both col1 and column2 first, then select from that result:

select
    col1,
    GROUP_CONCAT(DISTINCT(column2) ORDER BY column2 SEPARATOR ',') as col2,
    sum(col3)
from (
    select
        column1 as col1,
        column2,
        sum(column3) as col3
    from table1
    where lastUpdatedDate between '20XX-01-XX' and '20XX-02-XX'
    AND (column3 <> 0)
    GROUP BY col1, column2
) a
GROUP BY col1;
Optimized query

Works like a charm, except that I had trouble translating it back into Rails code since I hadn’t worked with this Rails codebase for almost a year. I did some research but couldn’t find any straight answer to my question, so I switched to trial and error: composing different API combinations and comparing the logged query with the desired one.
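To replay that trial-and-error loop, one handy trick (standard ActiveRecord API, not something from the original workflow) is to echo every generated query to standard output:

require "logger"

# Log every SQL statement ActiveRecord generates, so each
# API combination can be compared against the desired query.
ActiveRecord::Base.logger = Logger.new(STDOUT)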

It turned out that one page in the documentation has everything you need, but you have to be creative in combining the pieces.

Here’s how

  • find(:all) with :select can be replaced by the new .select() API (a side-by-side sketch follows this list).
  • :conditions can be replaced by the .where() API.
  • :group can be replaced by the .group() API.
  • :from can be replaced by the .from() API.
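Put together, the mapping looks roughly like this (a minimal sketch; Model and the column names here are placeholders, not identifiers from the actual app):

# Rails 3 style (works in Rails 4 only through the deprecated-finders gem):
#   Model.find(:all, :select     => "column1 as col1, sum(column3) as col3",
#                    :conditions => ["column3 <> ?", 0],
#                    :group      => "col1",
#                    :from       => Model.table_name)
#
# Rails 4 chainable equivalent:
Model.select("column1 as col1, sum(column3) as col3")
     .where("column3 <> ?", 0)
     .group("col1")
     .from(Model.table_name)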

The nested part took me some time to figure out. It turned out it’s just the .from() API.

def self.get_summary(params)
  # Nested inner query fields.
  params[:fields] = [
    "column1 as col1",
    "column2",
    "sum(column3) as col3"]
  args = parse_params(params)
  nested_query_result = self.select(args[:select])
                            .where(args[:conditions])
                            .group('col1, column2')
  # Outer query select columns.
  params[:fields] = [
    "col1",
    "GROUP_CONCAT(column2 ORDER BY column2 SEPARATOR ',') as col2",
    "sum(col3)"]
  args = parse_params(params)
  results = self.select(args[:select])
                .from(nested_query_result)
                .group('col1')
  return results
end

In summary, just build the inner query with a normal select and save the resulting relation into a variable (nested_query_result above). Then pass that relation to the outer query’s .from() API, which turns it into a subquery.
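To sanity-check the translation without hitting the database, you can print the SQL that ActiveRecord generates and compare it against the hand-written query (a sketch; Report stands in for the actual model class):

inner = Report.select("column1 as col1, column2, sum(column3) as col3")
              .where("column3 <> 0")
              .group("col1, column2")

outer = Report.select("col1, " \
                      "GROUP_CONCAT(column2 ORDER BY column2 SEPARATOR ',') as col2, " \
                      "sum(col3)")
              .from(inner)
              .group("col1")

# Relations are lazy, so nothing has run yet; .to_sql returns the
# generated SQL string without executing the query.
puts outer.to_sql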

Deep Learning Note 4 — Mismatched Training and Dev Set

Deep learning hungers for a lot of training data, so it is tempting to put whatever data you can find into the training set. There are some best practices for dealing with situations where the training and dev/test distributions differ from each other. This post is also published on Steemit.

Training and testing on different distributions


Imagine that you have 10,000 mobile uploads, which are the images you actually care about. That’s too few for your algorithm to learn from. You can find 200,000 high-resolution images from the web for training, but they differ from the production environment.

Continue reading

Deep Learning Note 4 — Error Analysis

Following up on the last post in the Machine Learning Strategy series, Deep Learning Note 3 — Comparing to Human Level Performance, the next two posts cover error analysis and how to handle mismatched training and dev/test sets. This post is also published on Steemit.

Carrying out error analysis

Look at dev examples to evaluate ideas

Using the cat classifier as an example, say you have 90% accuracy and 10% error. Your team looked at some of the errors and found dog images misclassified as cats. Should you spend time making the classifier do better on dogs? It depends: you need to figure out whether it’s worthwhile to do so.


If we check 100 mislabeled dev set examples and find that only 5% of them are dogs, then the ceiling on the improvement from working on dog pictures is 5% of the current error: even fixing every dog mistake only takes the error from 10% down to 9.5%. In this case, it may not be worth your time. However, if it’s 50%, then you’ve found something that could potentially cut the error rate in half, from 10% to 5%.

Continue reading

ElasticSearch Fielddata Configuration Tuning Part 1

Observations

We have an ElasticSearch cluster that runs fine most of the time. However, our users can occasionally submit heavy queries that, by accident, add a lot of load to the cluster. These typically include aggregations (on scripted fields and/or analyzed fields) over a large period of time, without any filters to narrow down the search scope. While these heavy queries execute, the cluster usually shows high CPU utilization. Other users may see the cluster not responding to their queries, or error messages popping up in Kibana about shard failures. This post is also published on Steemit.

Continue reading

Deep Learning Note 3 — Comparing to Human Level Performance

Following up on the last post in the Machine Learning Strategy series, Deep Learning Note 3 — Machine Learning Strategy of Setting up the Learning Target, this post covers the last three points on how to work on a machine learning project and accelerate the project iteration. Topics covered in this post are marked in bold in the Learning Objectives. This post is also published on Steemit.

Learning Objectives

  • Understand why Machine Learning strategy is important
  • Apply satisficing and optimizing metrics to set up your goal for ML projects
  • Choose a correct train/dev/test split of your dataset
  • Understand how to define human-level performance
  • Use human-level performance to define your key priorities in ML projects
  • Take the correct ML Strategic decision based on observations of performances and dataset

Continue reading

Deep Learning Note 3 — Machine Learning Strategy of Setting up the Learning Target

This week’s course, Machine Learning Strategy, mainly focuses on how to work on a machine learning project and accelerate the project iteration. Since this course covers quite a lot of small topics, I’ll break my notes down into several shorter posts. Topics covered in this post are marked in bold in the Learning Objectives. This post is also published on Steemit.

Previous Notes:

  • Deep Learning Note — Neural Networks and Deep Learning
  • Deep Learning Note 2 — Part 1 Regularization
  • Deep Learning Note 2 — Part 2 Optimization

Learning Objectives

  • Understand why Machine Learning strategy is important
  • Apply satisficing and optimizing metrics to set up your goal for ML projects
  • Choose a correct train/dev/test split of your dataset
  • Understand how to define human-level performance
  • Use human-level performance to define your key priorities in ML projects
  • Take the correct ML Strategic decision based on observations of performances and dataset

Continue reading

Deep Learning Note 2 — Part 2 Optimization

In the last post we talked about regularization methods to improve deep neural networks. This week I will summarize another two topics from the course Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization: Setting up your optimization problem and Optimization Algorithms. This post is also published on Steemit under the name @steelwings.

Setting up your optimization problem

Assuming that we have defined our cost function, it’s time to jump right into optimizing it, right? Well, not so fast. There are a few things to consider first.

Continue reading

Deep Learning Note 2 — Part 1 Regularization

This week I started the second course, Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization, in the Coursera Deep Learning track. The notes in this post focus on the content of Setting up your machine learning application and Regularizing the network. This post is also published on Steemit under the username @steelwings.

Fact

Applied ML is a highly iterative process: Ideas -> Code -> Experiment, and repeat. Accept it and deal with it, but we certainly should make this process easier. I guess that’s what the machine learning platforms out there are built for.

Before tuning the parameters

One question we have to answer is: what should we tune the parameters on? If you come from a machine learning background, you probably already know that the data should be split into 3 parts.

Continue reading

Deep Learning Note — Neural Networks and Deep Learning

I recently signed up for the Deep Learning Specialization on Coursera and have just completed the first course, Neural Networks and Deep Learning. Although four weeks of study are recommended, with some background in Machine Learning and the help of 1.5x play speed, finishing it in one week is also achievable. In this post I want to summarize some of the takeaways for myself and hope it also helps whoever is reading it. If you are familiar with implementing a neural network from scratch, you can skip to the last section for the tips and best practices mentioned in the course. Note that this post is also posted on Medium and Steemit under the username @steelwings.

Scale of the problem


  • On a small training set, a neural network may not have a big advantage. If you can come up with good features, you can still achieve better results with traditional machine learning algorithms than with a neural network.
  • However, as the amount of data grows, traditional learning algorithms tend to hit a plateau while the neural network’s performance keeps increasing. The larger the neural network, the larger the increase.
  • As the neural network grows, it takes more resources and time to compute. Many innovations in neural network algorithms were initially driven by performance requirements.

Continue reading

Keras Tutorial Notes

I’ve recently finished the first pass of CS231N Convolutional Neural Networks for Visual Recognition. Now it’s time to try out a library and get my hands dirty. Keras seems to be an easy-to-use high-level library that wraps three different backend engines: TensorFlow, CNTK and Theano. Just perfect for a beginner in Deep Learning.

The tutorial I picked is the one on the MNIST dataset. I’m adding some notes along the way to refresh my memory of what I’ve learned, as well as some links so that I can quickly find the references in CS231N in the future.

Continue reading