Tuesday, March 8, 2016

T-SQL Tuesday #76, text searching the hard way


T-SQL Tuesday #76 invite


Text searches.  

Possibly one of the least fun things to do in SQL.

Back in the days before Full-Text Search in SQL, I was tasked with replacing the search engine at ABC Radio News, this search engine helped them research any broadcast they had ever done.  The Fox Pro application they had in place was so slow, and unsupported, they needed something better.

Some of the requirements was either exact match, wild card searches, anywhere searches (i.e. the words typed into the search bar, should find any records that had the words anywhere in them).  Because they had gotten so used to the search being slow, they also had an ability to narrow down the search results with further key words.

Starting work on this, I first convinced them not to implement the search of the search function, and let me implement this the way I want, convinced I could make this fast enough without the use of such trickery.

Looking into what was needed, I quickly realized there was a LOT of data, guess 50+ years of news broadcasts will do this.  Consider this was in the early 2000s, some innovation was needed from anything I had coded before.  Obviously LIKE searches was out of the question, full text search was not available.  So what to do?

Basically I decided to break down each broadcast to words into a separate table, the entire application fit in 2 tables: Story and Words.

The Story table had everything in it, broadcast date, tape location, tape number, tape type, and the text of the broadcast itself.  The Words table had 3 columns:  StoryID, Word and WordCount.

The parsing of the Story table took a while, with old school TEXT data type, no easy way to do this.  I would then insert the results into the Words table, with the count showing how many times that word appeared in that one story (filtered out articles like the, a and an).

On the search itself, if "Anywhere search" was selected, it simply found any article that had all those words in the Words table, then returned the records from the Story table.  If "exact match" was selected it would use the Words table to find candidate records, then do a like search on the text column in the Story table that had the entire story.  Pretty simple.  And it was quick!  We could search all of their history in sub seconds.

When I first demonstrated this to a few users, it was immediately accepted and one of our front end developers made a quick application to sit as a front end for it.  The search capability was now so good some of them where upset since in the past they had used the time while waiting to go get coffee etc.  Win in my book.