How I solved a personal problem with Marines.mil

This post is not in keeping with the normal topic of this blog, but I wanted to write it down so that I could share my experience of having a problem and finding a solution.

The Problem –

http://www.marines.mil/news/messages/Pages/maradmins.aspx?pid=frontpage_maradmins

The above link is the search engine for the primary way we distribute administrative information in the Marine Corps. If you try it out (if it is up), you will see a couple of problems with the UI, but the primary problem is the way the results are sorted. The information in this data set is very heavily weighted toward age: many items duplicate earlier information, and a document is only relevant if it is the most recent version. The site ranks results by how closely the text matches the query, which usually returns nothing relevant to the document you are actually trying to find. This happens because a huge chunk of the vocabulary in the documents is similar regardless of topic or title. On top of these issues, the entire site is very slow. It is hard for me to explain why this is broken to someone who does not completely understand the context of the data, but trust me that it does not return relevant results.

The Solution –

The first problem I had was not having any access to the data itself. Knowing that this specific problem was beyond my skill level, I reached out to a friend who provided me with the right tools to scrape the information I needed from Marines.mil and save it in a MySQL database. We used some basic libraries to scrape the information I needed into XML files indexed with SOLR. I then pushed the entire repository to a MySQL database, and I run the same process on the RSS feed daily for updates. This is not a perfect solution, but I had to work with the skills and guidance I could get and keep the process simple enough to maintain myself.
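
For anyone curious what a daily RSS update like this can look like, here is a minimal sketch in Python. It is not the code my friend and I actually used. The sample feed, the table name `messages`, the column names, and the helper functions are all made up for illustration, and it uses SQLite from the standard library as a stand-in for MySQL so that it runs anywhere.

```python
import sqlite3
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; the real feed's items and fields may differ.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example message feed</title>
  <item>
    <title>EXAMPLE MESSAGE ONE</title>
    <link>http://example.invalid/messages/1</link>
    <pubDate>Mon, 02 Jan 2012 00:00:00 GMT</pubDate>
    <description>Body text of the first example message.</description>
  </item>
  <item>
    <title>EXAMPLE MESSAGE TWO</title>
    <link>http://example.invalid/messages/2</link>
    <pubDate>Tue, 03 Jan 2012 00:00:00 GMT</pubDate>
    <description>Body text of the second example message.</description>
  </item>
</channel></rss>"""


def create_schema(conn):
    """Create the (assumed) table; the link doubles as a unique key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT, body TEXT)"
    )


def parse_items(rss_text):
    """Turn RSS 2.0 text into a list of plain dictionaries."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        published = parsedate_to_datetime(item.findtext("pubDate", default=""))
        items.append({
            "link": item.findtext("link", default="").strip(),
            "title": item.findtext("title", default="").strip(),
            # ISO dates (YYYY-MM-DD) sort correctly as plain text.
            "published": published.date().isoformat(),
            "body": item.findtext("description", default="").strip(),
        })
    return items


def store_items(conn, items):
    """Insert or overwrite by link, so re-running the job adds no duplicates."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO messages (link, title, published, body) "
            "VALUES (:link, :title, :published, :body)",
            items,
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    store_items(conn, parse_items(SAMPLE_RSS))
    print(conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0])
```

Keying each row on its link is what makes the job safe to run every day: an item that is already stored is simply overwritten instead of being added a second time.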

After I had my database populated, I began looking for the best way to quickly serve up results without using a lot of bandwidth. This requirement is due to the unbelievably bad network bandwidth that the entire Navy and Marine Corps Intranet (NMCI) provides to the end user. It is not uncommon for a user to have a 25-50 Kb connection due to hugely over-engineered solutions to simple network issues. I decided to use JavaScript with some AJAX to avoid transferring anything other than text, even if there is a lot of it. I found some tutorials on JS SQL interactions and AJAX events and built out a small search engine (an ugly one, according to the same friend who helped with parsing). Instead of a complicated ranking of titles and bodies for the documents in the database, I provided the user with two separate search boxes and simply let them pick. There is nothing wrong with letting the user decide how they want to interact with the data.
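
The two-search-box idea boils down to one query per box. Below is a rough sketch of that idea in Python, again with SQLite standing in for MySQL. The `messages` table, its columns, the sample rows, and the `search` function are hypothetical, not the site's real schema or code. Sorting newest first is also my assumption here; it fits the problem described above, where only the most recent version of a document matters.

```python
import sqlite3

# Only these columns can be searched: one per search box.
SEARCHABLE_FIELDS = ("title", "body")


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages VALUES (?, ?, ?, ?)",
        [
            ("http://example.invalid/1", "EXAMPLE LEAVE POLICY",
             "2011-06-15", "Original guidance on leave."),
            ("http://example.invalid/2", "EXAMPLE LEAVE POLICY UPDATE",
             "2012-03-01", "Supersedes earlier leave guidance."),
            ("http://example.invalid/3", "EXAMPLE UNIFORM CHANGE",
             "2012-01-10", "Uniform regulations updated."),
        ],
    )
    conn.commit()
    return conn


def search(conn, field, term, limit=20):
    """Substring search on one column, newest documents first."""
    if field not in SEARCHABLE_FIELDS:
        raise ValueError("field must be 'title' or 'body'")
    # Escape LIKE wildcards so the user's text is matched literally.
    escaped = (term.replace("\\", "\\\\")
                   .replace("%", "\\%")
                   .replace("_", "\\_"))
    # The column name comes from the whitelist above, never from the user,
    # so it is safe to place in the SQL text; the term is a bound parameter.
    sql = (
        f"SELECT title, published, link FROM messages "
        f"WHERE {field} LIKE ? ESCAPE '\\' "
        f"ORDER BY published DESC LIMIT ?"
    )
    return conn.execute(sql, (f"%{escaped}%", limit)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in search(conn, "title", "leave"):
        print(row)
```

There is no relevance score anywhere in this sketch. The user picks the column by choosing a box, and the date does the ranking, which keeps each response down to a short list of text rows.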

Anyways, the result after a week of organic word spreading is 100-200 unique users a day and growing. I am not yet listed on the first page of Google for MARADMIN or MARADMIN SEARCH, but I have climbed quickly to the second page and should be on the first page in 2-3 days. I will never make a dollar off of this site, but being able to save my fellow Marines a little bit of frustration and time, along with the couple of emails of thanks, is plenty of motivation to keep learning and creating.

SSgt Frank Phillips

http://maradmin.killfoot.com

https://100.24.72.144