This post is not in keeping with the normal topic of this blog, but I wanted to write it down so that I could share my experience of having a problem and finding a solution.
The problem –
The above link is the search engine for the primary way we distribute administrative information in the Marine Corps. If you try it out (if it is up), you will see a couple of problems with the UI, but the primary problem is the way its results are sorted. The value of information in this data set depends heavily on its age: many items carry duplicate information and are only relevant if they are the most recent version of that document. The site ranks results by how closely they match the search text, which usually returns nothing relevant to the document you are actually trying to find. This happens because a huge chunk of the vocabulary in the documents is similar regardless of the topic or title. On top of these issues, the entire site is very slow. It is hard for me to explain why this is broken to someone who does not completely understand the context of the data, but trust me that it does not return relevant results.
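To make the sorting problem a little more concrete, here is a minimal Python sketch of the behavior I would want instead: filter on the title, then put the newest document first. The records and the function name `search_newest_first` are made up for illustration and are not real data or the code behind any actual site.

```python
from datetime import date

# Hypothetical records: (title, publication date). Not real data.
docs = [
    ("Sample policy update", date(2010, 3, 1)),
    ("Sample policy update", date(2012, 6, 15)),
    ("Unrelated sample notice", date(2011, 1, 10)),
]

def search_newest_first(docs, query):
    """Keep documents whose title contains every query word, newest first."""
    words = query.lower().split()
    hits = [d for d in docs if all(w in d[0].lower() for w in words)]
    # Sort on the date so the most recent version of a document leads.
    return sorted(hits, key=lambda d: d[1], reverse=True)

results = search_newest_first(docs, "policy update")
```

With two documents sharing the same title, the newer one comes back first, which is the one a reader of this kind of data almost always wants.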
The Solution –
The first problem I had was not having any access to the data itself. Knowing that this specific problem was beyond my skill level, I reached out to a friend who provided me with the right tools to scrape the information I needed from Marines.mil and save it in a MySQL database. We used some basic libraries to scrape the information into XML files indexed with Solr. I then pushed the entire repository to a MySQL database, and I now run the same process on the RSS feed daily for updates. This is not a perfect solution, but I had to work with the skills and guidance I could get and keep the process simple enough to maintain myself.
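For anyone curious what a daily RSS update step can look like, here is a rough Python sketch. It is not the code we actually used: it swaps MySQL for the standard library's `sqlite3` so it runs on its own, the feed content is a made-up sample, and the table and column names (`messages`, `link`, `title`, `published`) are assumptions for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def parse_rss(xml_text):
    """Return (link, title, ISO date) tuples for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        title = item.findtext("title", default="").strip()
        pub = item.findtext("pubDate")
        # RSS 2.0 dates use the RFC 822 style, which this helper parses.
        published = parsedate_to_datetime(pub).isoformat() if pub else None
        rows.append((link, title, published))
    return rows

def store(conn, rows):
    """Insert new rows; the link is the primary key, so reruns skip duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT)"
    )
    with conn:  # commits on success
        conn.executemany("INSERT OR IGNORE INTO messages VALUES (?, ?, ?)", rows)

# Made-up sample feed standing in for the real one.
SAMPLE = """<rss version="2.0"><channel>
<item><title>Sample message</title><link>https://example.com/1</link>
<pubDate>Mon, 02 Jan 2012 10:00:00 +0000</pubDate></item>
</channel></rss>"""

conn = sqlite3.connect(":memory:")
store(conn, parse_rss(SAMPLE))
store(conn, parse_rss(SAMPLE))  # a second daily run adds nothing new
count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```

Keying on the link keeps the job safe to rerun every day, which matters when the goal is a process simple enough for one person to maintain.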
Anyway, the result after a week of organic word of mouth is 100-200 unique users a day and growing. I am not yet listed on the first page of Google for MARADMIN or MARADMIN SEARCH, but the site has climbed quickly to the second page and should reach the first page in 2-3 days. I will never make a dollar off of this site, but being able to save my fellow Marines a little bit of frustration and time, along with the couple of emails of thanks, is plenty of motivation to keep learning and creating.
SSgt Frank Phillips