MongoDB, mongoid, elasticsearch and tire issue

Posted: June 28th, 2011 | Filed under: mongodb, ruby

I just set up a quick app with MongoDB, mongoid, elasticsearch and the tire gem. The system seems to work nicely, but there is an issue you should be aware of. Since tire uses the ‘after_save’ callback to insert data into elasticsearch, indexing is pretty automatic, which for the most part is nice.

The issue occurs when you have a unique index in mongodb but don’t have a ‘validates_uniqueness_of’ in your model. With this combination and two similar records, mongo will reject the new record, but it will still insert into elasticsearch giving you a record in elasticsearch that doesn’t exist in mongodb. This is potentially a nasty bug.

The quick fix is to make sure that any unique index in your database also includes the ‘validates_uniqueness_of’ validator in your model.


Letting multiple users modify sets of records from a larger group

Posted: April 22nd, 2011 | Filed under: ruby, web

The title really isn’t great but it does describe an issue I recently ran into. I’m working on a web application that lets a user set whether an attached term is correct or incorrect. While this can be done on a record by record basis, it becomes quite tedious. I set up a system that would show a listing of all of the terms along with a snippet of the text where the term was annotated. This worked well as I had searching, sorting and pagination, and users could always move to page two or three if they wanted. This also worked when I had 1000 records. Once I bumped the dataset up to 300k records, pagination in MySQL became pretty lethargic.

I decided to remove the pagination and show the user the first set of records available. It then dawned on me that once you had multiple people working on the same page, they would all see the same records. This isn’t very efficient when you have a lot of stuff to get done. (On a side note, this same problem existed in the paginated version, but when you have a link to move to the next page, it doesn’t crop up as much.)

So now I needed to figure out a way to determine who was seeing what records, make sure that someone else didn’t see those same records at the same time, and make sure that if nothing was done to the records, they would return to the general pool in a reasonable amount of time.

Enter Redis.

I like redis. I’ve used it on a few other projects and felt that it would be a good fit since it has some nice set manipulation commands and the ability to expire a key after a specified amount of time. I started thinking about how to remove records after the user processed them and such, but I ran into some issues with possible AJAX search lag and with the chance that some records I didn’t really care about might be removed from the general population. Also, what do you do if the user only wanted to update a subset of those records? By modifying the key, I’d need to reset the expiration timer.

After writing a bit of code, I figured out that if I deleted the current user’s key before I made a union of all the users’ records, I wouldn’t have to worry about orphan records and unmodified records. Every time the user saw a list of records, only the currently visible set would be removed from the general population.

Here’s a quick overview of what the script does.
2. Add the current user id to the set of curator ids
3. Remove the key that holds the current user’s reserved records
4. Get all the curator ids
5. Generate keys for all the curator reserved records
6. Get the union of all of the reserved records

I have to make a special note here. Redis expects arguments in the form SUNION "key1" "key2" "key3". When you have an array in ruby, generally it will look something like ["key1", "key2", "key3"]. The * (splat) operator splits out the array into a group of keys that works for the Redis command.

7-10. Find the first 15 currently available records and store the ids in the current user’s set in Redis.
11. Add a five minute expiration to the current user’s key.
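
The steps above can be sketched roughly as follows. This is not the original script: the key names, the reserve_records method and the candidate_ids argument (standing in for the database query) are all made up for illustration. The redis argument is assumed to be a client from the redis gem, which provides sadd, del, smembers, sunion and expire. The step numbers in the comments refer to the overview, not to lines of this sketch.

```ruby
# Hypothetical sketch of the reservation steps. Key names and method name
# are assumptions, not taken from the original script.
RESERVATION_TTL = 5 * 60 # five minutes, in seconds
PAGE_SIZE = 15

def reserve_records(redis, user_id, candidate_ids)
  own_key = "curator:#{user_id}:reserved"

  redis.sadd("curators", user_id)           # step 2: remember this curator
  redis.del(own_key)                        # step 3: release the user's old reservation
  curator_ids = redis.smembers("curators")  # step 4
  keys = curator_ids.map { |id| "curator:#{id}:reserved" } # step 5

  # step 6: the splat turns ["k1", "k2"] into separate arguments for SUNION
  reserved = redis.sunion(*keys).map(&:to_s)

  # steps 7-10: take the first records nobody else currently holds
  available = candidate_ids.map(&:to_s).reject { |id| reserved.include?(id) }.first(PAGE_SIZE)
  available.each { |id| redis.sadd(own_key, id) }

  redis.expire(own_key, RESERVATION_TTL)    # step 11
  available
end
```

Because the current user's key is deleted before the union is taken, reloading the page hands the same user the same records again instead of stranding them until the key expires.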

It was interesting to me after thinking about it for a while that the solution ended up being so simple.


Dreamhost and Rails 3 solution?

Posted: October 25th, 2010 | Filed under: ruby, web

Use *linode.com

Email sent to Dreamhost back in August.

I’m going to make a request for this for you. Please note, it’s a significant bump from what we have right now, Rack 1.1.0, so only our rails expert can make the call, and check to make sure it won’t break anything. She’s on vacation this week, but I’ll send her a follow up email, so she can get back with you on this as soon as she’s back.

This was the response I got from Dreamhost after waiting for two months for an answer. (After I had to ask “hey, what’s up?”)

Thanks for writing. We have to be careful with the upgrade as we have a high number of dependency packages that need to go through the upgrade. It’s scheduled and being worked on, and the current estimate for the full upgrade to 3.0.1 is within a month of two. I’ll get back to you again when there is an update, but feel free to contact me if you have any more questions.

*Link includes my referral code


adding rails log rotation to dreamhost

Posted: May 21st, 2010 | Filed under: ruby, web

While Dreamhost may rotate the apache logs for you, there is nothing automatic to rotate the rails production logs. This may not be an issue since you have “unlimited” disk space, but it’s a good idea anyway.

You need to install logrotate since it doesn’t exist by default on the server and then place it in a location where it can be run. You also need to create the configuration file and status files. Once that is set up, you can install a cron job via the Dreamhost panel.

Now it should be rotating your application logs nightly. You can add more sites to the conf file as needed.


I beat Dreamhost. How to really get rails 3, bundler and dreamhost working.

Posted: May 17th, 2010 | Filed under: ruby, web

Please check out my earlier post so you can get the whole git/capistrano setup working as well.

So I’ve been trying to get rails 3 running with bundler on Dreamhost for a while. I’ve had a few posts on here about how to do it. In the end, they didn’t work completely. I’ve sent requests into Dreamhost to upgrade rubygems to 1.3.6 so the newest versions of bundler would work. I was told ‘No’ but go to this wiki site.

I wasn’t too happy about that since the top of the page has a big warning ‘DON’T DO THIS’. If you’re here looking at this, you probably have the tech know-how to do this and realize the world won’t end when you do.

So here are the step-by-step directions for how I got it working.
You can check out the original wiki instructions at http://wiki.dreamhost.com/RubyGems.

Add this to your .bashrc
This will set up your shell to use local gems installed in your .gems directory and set up the path to check there first, and opt/bin as well. Next we need to install a newer version of rubygems.

This will get rubygems installed and make sure you run your version before dreamhost’s. You then install bundler and rake. These are the only two gems I install in the system as I prefer to have all of my apps use their own gem versions. Putting everything into system is a big mess and a dependency nightmare on updates.

Next we need to make sure your app is set up to use the bundler gem. You need to modify your environment.rb and boot.rb.

Make sure you change the application name to your name. Really you’re only adding the single line for the GEM_PATH

I was using the suggestions in the wiki but I eventually figured out that Dreamhost wasn’t properly picking up my gems even with the changed GEM_PATH. Adding the ‘Gem.clear_paths’ to boot.rb allowed the gems to be seen. This is what finally cracked the problem.
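
For readers who only have the description to go on, here is a rough sketch of the two changes. It is not the original code. The .gems location follows the .bashrc description above, while the system gem directory is an assumption and will differ per server. The two pieces are shown in a single snippet for illustration; in the post the GEM_PATH line goes in environment.rb and Gem.clear_paths goes in boot.rb.

```ruby
require 'rubygems'

# environment.rb part (sketch): list the local gem directory first.
# SYSTEM_GEMS is an assumed path; check `gem env` on your own server.
LOCAL_GEMS  = File.expand_path('~/.gems')
SYSTEM_GEMS = '/usr/lib/ruby/gems/1.8'
ENV['GEM_PATH'] = [LOCAL_GEMS, SYSTEM_GEMS].join(File::PATH_SEPARATOR)

# boot.rb part (sketch): drop RubyGems' cached search paths so they are
# recomputed from the environment the next time they are needed.
Gem.clear_paths
```

After this runs, Gem.path should list the local .gems directory, which is a quick way to check from a console whether the change took effect.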

Hopefully this works for you.


dreamhost, please upgrade rubygems

Posted: April 30th, 2010 | Filed under: ruby, web

I’ve posted before about trying to use rails 3 with bundler and dreamhost. This is now not possible. The shared servers for dreamhost use rubygems 1.3.5. In order to use the current version of bundler you need rubygems 1.3.6.

If you are a dreamhost user, please make sure to vote up the install rubygems 1.3.6 suggestion from the panel (panel.dreamhost.com). For a company that has been good about supporting ruby and rails, this is something that needs to get fixed ASAP.


mongodb, mongoid, will_paginate and sorting

Posted: April 6th, 2010 | Filed under: mongodb, ruby, web

Had a bit of an issue with will_paginate and mongoid. I couldn’t find an example of how to sort the pagination query and paginating without a defined sort order defeats the purpose.

paginate(:page => page, :per_page => size, :sort => [['ontology_term_id', :desc], ['_id', :asc]])

Instead of using “order” or “order_by” you can just use “sort” with an array of arrays.


distributed job processing options for GMiner

Posted: March 16th, 2010 | Filed under: ruby, web

I’ve been working on a project that requires the processing of a series of jobs. I had originally written my own system for doing this because I wanted to know more about how they work. After a time, I decided to modify it, and found that I had broken it. Instead of trying to fix it, I decided to see if there was anything out there that someone else had done that would work better for me.

Resque

My first attempt was to use resque. It worked, but as I started to scale things up, I ran into some issues that I didn’t like.
It polled the DB a lot. While it was in memory, it was a lot of “do I have a message?” checks which seemed messy.
It wasn’t fast. There was a lot of overhead. Things “felt” slow as they were running.
It was memory based. Redis will store data to the disk, but it’s meant as an in memory system which gives it the speed.

What I did like is that it worked. The jobs finished and there was a nice web interface to see what was going on.

Cloud Crowd

I had looked at Cloud Crowd before and it seemed interesting. I like the web dashboard, but it was also one of the biggest problems with Cloud Crowd. According to the authors, it was created to handle a small number of very expensive jobs. I have no doubt based on my experience that it would excel in that environment. My problem consists of a very large number of small fast jobs. Cloud Crowd ground to a halt pretty quickly. The dashboard was taking too much time to render, which began to multiply the render time, and eventually it needed to be turned off.
The other big issue I ran into was with how the workers processed their jobs. It wouldn’t start another job until all of the other jobs it had created finished. If you have a scheduling job that launches 10 processing jobs, the system gets stuck waiting for the 10 processing jobs to finish before it can start another scheduling job. Again, it works very well. Everything gets done and you get a result string but it was slow.

RabbitMQ

I decided to figure out what was wrong with my system since it was working at one point. I’m using RabbitMQ as a message broker to pass the jobs back and forth between daemons running on linux machines. I believe my issue was caused by using a topic exchange with a key per worker. I was running into issues where some processors were picking up messages from the topic that were not assigned to their key. Once I realized this was happening I decided to go back to a queue per worker. I wanted to get away from that since originally I had been creating multiple queues in rabbit that never disappeared. I changed the queues to be exclusive. Exclusive means that only one client (processor) can read from that queue. It also makes the queue self-delete when the consumer disappears.
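
As a rough illustration of the queue-per-worker idea, here is a minimal sketch. It assumes the bunny gem's channel API, which may not be the client library used in the linked repositories; the worker_queue method and the queue name prefix are made up.

```ruby
# `channel` is assumed to be a Bunny::Channel created from a started
# Bunny connection. The queue naming scheme is hypothetical.
def worker_queue(channel, worker_id)
  # :exclusive => true ties the queue to the connection that declared it;
  # the broker deletes the queue when that connection closes.
  channel.queue("gminer.worker.#{worker_id}", :exclusive => true)
end
```

Each worker declares its own queue on startup, so a crashed worker does not leave a stale queue behind on the broker.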

I’m attaching the code for my system below. I’ll post more about how each of the parts works later. I hope to add a bit more control into the system, but as of now it’s pretty self healing and very fast.

http://github.com/mcwbbc/gminer_scheduler
http://github.com/mcwbbc/gminer_node
http://github.com/mcwbbc/gminer_processor
http://github.com/mcwbbc/gminer_databaser


properly rendering 404 errors from inside a rails application

Posted: March 2nd, 2010 | Filed under: ruby, web

I just migrated a site that had a bunch of links that have been in the search engines for a while. Oddly, it seems that the only things hitting those links are the crawlers themselves. I needed a way to invalidate those links, since I couldn’t create a proper redirect because of changing IDs.

/records/show/12345 used to be valid, but has been replaced with the RESTful version /records/00123. The ID is now also meaningful instead of a MySQL generated id.

My first attempt was to just redirect to the 404 page.
begin
  record = Record.find(params[:id])
rescue ActiveRecord::RecordNotFound
  redirect_to("/404.html")
end

But as I watched the logs, I noticed that this really wasn’t right since it was still returning a 302 (redirect) and then a 200 (OK) code for those links. The crawlers were getting the instruction that they should just display the 404 page for those links. That might seem OK, but really I wanted them to get the 404 immediately and remove the page from their databases.

begin
  record = Record.find(params[:id])
rescue ActiveRecord::RecordNotFound
  # Serve the static 404 page with a real 404 status instead of redirecting
  render(:file => "#{RAILS_ROOT}/public/404.html", :layout => false, :status => 404)
end

By rendering the 404.html directly and including the 404 status code, it should help to fix the situation.


shared host or small vps

Posted: February 24th, 2010 | Filed under: ruby, web

I started looking into moving off of dreamhost because I’ve had some issues with responsiveness on my applications. For $20 a year, I could put up with it. Now that I’m paying $100, it’s a bit more annoying since there are other options out there at that price point.

I’m considering slicehost.com, linode.com and webfaction.com.

I guess the other big reason is that I want to play with MongoDB and each of these gives me that option.