If it's about full-text search, then Sphinx is my go-to choice: small & native, super-fast, does only one thing — builds text indices and looks through them. Has a SQL-like query language, allows additional (non-searchable) data in the index, supports fuzzy search and suggestions. Worth a shot.
I have used it on a small-scale website search (indexing around 3500 pages of content, 1000+ word articles) and it handled that fine.
A friend of mine gave it a go on his boxing statistics website, and he said it was slow and unusable at the scale he was running it at: he would start seeing performance issues past about 250,000 records. I can't say for sure whether it was the way he implemented it or whether the server wasn't up to the task, but he switched to Elasticsearch and had no issues at all.
For larger scale search I've always used Sphinx (http://sphinxsearch.com) and, to be honest, I wouldn't use anything else for a client's website (where budget allows) as I haven't found anything that can match its speed.
I personally decided to outsource complex queries like this to Sphinx: http://sphinxsearch.com/
Basically I use filters to do a query such as "Find all files with region = USA or region = Europe and genre = Fighting". If I had to express this in MySQL I'd need some complex joins.
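To illustrate, here's a rough SphinxQL sketch of that kind of filter query; the index and attribute names are made up, and it assumes region and genre are stored as integer attributes alongside the text index:

    SELECT id
    FROM files_index
    WHERE region_id IN (1, 2)
      AND genre_id = 7
    LIMIT 0, 20;

Here 1, 2 and 7 stand in for USA, Europe and Fighting; you'd add a MATCH('...') clause to the WHERE when there's also a keyword to search on.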
The down side is of course having to run and maintain Sphinx, but it's just such a powerful addition to my stack that I found it well worth it.
Somebody mentioned Solr and Elasticsearch. I'll add Sphinx: http://sphinxsearch.com/
To the people that have used all of them: do you know which ones can solve the problem of matching "pink" but not "pink floyd" without hardcoding it?
I have heard Lucene's name come up many times when this question is asked, but usually in a programming / library context.
Are you looking for a fully functional product to index them or a programmatic way to access them?
Wikipedia's full text search page has a list of software, including Lucene. Apache's Solr running on top of Tomcat appears to offer what you want?
Sphinx may also be a good bet?
Elasticsearch and Apache Solr are both built on top of Apache Lucene (all in Java), but Elasticsearch has more mindshare recently. Be aware that any commercial solution you buy might well be built on these anyway.
I'm also fond of SphinxSearch from way back. However, it seems like there might have been a license change or a new open-core direction going on there recently as there's no apparent access to the source code for the new 3.x releases, so tread cautiously.
However, being C++, Sphinx is lighter-weight and more straightforward than the typical Java project. Even better would be a pure C engine, but I'm not aware of one currently.
I'd have a HashTags table, an Images table, and then an ImageHashTags join table. Otherwise searching for hashtags is going to be really slow if you just have them stored as a long string on the Images table.
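A minimal MySQL sketch of that layout, assuming an existing Images table with an id primary key (names and types here are just placeholders):

    CREATE TABLE HashTags (
        id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        tag  VARCHAR(100) NOT NULL UNIQUE
    );

    CREATE TABLE ImageHashTags (
        image_id   INT UNSIGNED NOT NULL,  -- references Images.id
        hashtag_id INT UNSIGNED NOT NULL,  -- references HashTags.id
        PRIMARY KEY (image_id, hashtag_id),
        KEY idx_hashtag (hashtag_id)       -- makes "find images by tag" fast
    );

    -- All images tagged #sunset:
    SELECT i.*
    FROM Images i
    JOIN ImageHashTags iht ON iht.image_id = i.id
    JOIN HashTags h ON h.id = iht.hashtag_id
    WHERE h.tag = 'sunset';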
Look at using something like Sphinx for searching, I use it at work, and it's pretty good.
Well, I started writing some code but found a free solution online. Let me know if you need help running it.
http://sphinxsearch.com/docs/current.html#installing-windows
Use this regular expression to find emails: "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"
I'm assuming you or the person that needs this is running a Windows machine. If you're using, or can use, Linux, it'd be much simpler.
Sorry I wasn't so clear: Sphinx can be used with the most popular RDBMSs (Postgres included). I was referring to the latest trending NoSQL databases; you can use other data stores, but it's messy and probably not worth your time.
No, Sphinx creates its own index files (I'm not sure what structure they have or whether they use a third-party DB) which are placed on the file system; it doesn't use MySQL for this.
Returning large data sets is not an issue for Sphinx in my experience, but I've never done it at a scale of 100,000 results so can't comment on that. There will (obviously) be a point where it degrades the speed of the results drastically.
Sphinx uses an SQL-like syntax (SphinxQL) to interact with it, so pagination is a breeze by setting a limit on the end of the query.
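For example, a hypothetical SphinxQL query for page 3 at 25 results per page would look something like this (the index name is invented):

    SELECT id
    FROM articles_index
    WHERE MATCH('some search terms')
    LIMIT 50, 25;

One thing to keep in mind: if I remember right, Sphinx caps the result set with a max_matches setting (1000 by default), so paginating very deep means raising it via OPTION max_matches on the query or in the config.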
I would check the documentation to see if the query language can handle your specific use case, but I'm not sure Sphinx is designed for handling and linking objects in that manner; it's purely a full-text search engine: http://sphinxsearch.com/docs/current.html#sphinxql-reference
pls
do you know how slow they would be if they used just MySQL, considering their size?
like you need to lrn 2 sphinx and stuff
it's all very complicated
A good idea would be to extend this or add a mode to do advanced search. Obviously too complex for a first version, and maybe more suited to a website or such. I'm not sure what kind of backend you're using for this, but something like Sphinx search gives you really fast, flexible search over MySQL, with stuff like automatic stemming (so an added "s" at the end still matches), etc. It gets really complex; look at http://sphinxsearch.com/docs/current.html#conf-morphology.
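As a hedged illustration: with something like morphology = stem_en enabled on the index, you can use SphinxQL's CALL KEYWORDS to see how terms get normalized (the index name below is invented):

    CALL KEYWORDS('searching searches searched', 'articles_index');

Each keyword comes back alongside its normalized form (all three should collapse to roughly 'search'), which is what makes the "added s at the end" kind of matching work without any extra query logic.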
I don't have data except mine regarding the number of notebooks with pages but ... I do have quite a few.
Anyway, this is trivial to test:
docker pull macbre/sphinxsearch:latest
Another tangential solution might be to use something like Sphinx. It's extremely powerful for these types of queries and can use MySQL as a real-time data source. http://sphinxsearch.com/about/sphinx/
A lot of databases offer a "LIKE" clause which supports wildcards. So, if a user searched for "supports wildcards" on this very text, the query would look like "SELECT * FROM whatever WHERE field LIKE '%supports wildcards%'".
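Spelled out as a runnable sketch (the whatever/field names are just the generic ones from the sentence above):

    -- The leading % means the database can't use a regular index on field,
    -- so this is a full table scan; fine for small tables, slow for big ones.
    SELECT *
    FROM whatever
    WHERE field LIKE '%supports wildcards%';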
If you want to get more complicated than this, then you can look into an open-source search engine like Sphinx Search.
Sphinx Search could be a good match for your use case. I've used Elasticsearch to replace MySQL/Postgres/Oracle's full-text search on several occasions, but in every case the SQL database remained the primary datastore. The minimum fields needed to power the enhanced search functionality are copied over, either online through a queue or during a batch.
Debian includes packaged RT.
As for search - I do not know the current situation, but back in 2005 we ran a home-made extension for RT using the Sphinx indexer (http://sphinxsearch.com).
My guess is that you should be spending 10% or less time on explicit documentation (not necessarily including business communications that might nevertheless be used as background and rationale docs). Some ways to automate and simplify:
script logs a terminal session and makes for rough and verbose documentation by itself, or can be quickly edited and cleaned up into great documentation. When you're in a documentation deficit, it can take some major effort to catch up. But after that point you shouldn't be spending a huge amount of time on this.
There are a number of full-text search systems. In the past I've found MySQL's integral full-text search to be quite poor; other RDBMSs should be better. I've had very good success with SphinxSearch, which is open source and written in C++, so it's generally quite fast and low-footprint compared to typical Java-based alternatives. Presumably you would be doing more integration and figuring things out than with some other stacks where existing HOWTOs might be published.
> I assume if there is something out there its build on Solr or Elasticsearch.
Should be just a matter of using a PDF library to parse the content and then indexing it in one of those or in SphinxSearch. I wonder why there isn't already open code for this?
Building custom engines is rarely worthwhile given engines like Lucene (used by e.g. Elasticsearch, which gives you a more polished experience: "just" chuck all your documents, encoded as JSON, into Elasticsearch and you get a ton of functionality "for free") or Sphinx.
There's still plenty to do to tweak ranking when you don't have pagerank, but these engines have decent starting points and a ton of stuff you can tweak.
If you really, truly need indexing, then I would look at Sphinx Search first (not the same as Python's Sphinx-doc).
Anecdotally, MySQL full-text search used to be so bad that all of our MediaWikis had to get Sphinx Search plugins implemented before the users were satisfied. The difference was night and day.
http://sphinxsearch.com/docs/latest/rt-caveats.html
"In case of a damaged binlog, recovery will stop on the first damaged transaction, even though it's technically possible to keep looking further for subsequent undamaged transactions, and recover those. This mid-file damage case (due to flaky HDD/CDD/tape?) is supposed to be extremely rare, though."
Not even the developers of Sphinx know exactly what the problem is, but at least we know now that it IS an internal Sphinx bug specific to RT indices.
Sounds to me like some manual recovery code could be written in the event of such a failure, as a fallback should Sphinx fail.
Also sounds like RT indices are just plain bad period :D
A lot of really good answers in here, but I suggest that you post the database you are using behind the medoo framework.
You might look into sphinx if you are specifically doing search. http://sphinxsearch.com/
Sphinx is great. You can query for results using an SQL-like dialect, which is one of my favorite features... it integrates very well with MySQL.
For excluding "pink floyd" you can rely on attribute filtering, assuming your search has some attributes to filter on... so, for example, if "pink" had styles "r & b or pop" and "pink floyd" had "classic rock", you could use those attributes to refine the search.
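A rough SphinxQL sketch of that idea; the index name, attribute name and ID values below are all hypothetical:

    SELECT id
    FROM music_index
    WHERE MATCH('pink')
      AND style_id IN (3, 4)
    LIMIT 0, 20;

where 3 and 4 stand in for the "r & b" and "pop" style IDs, so anything tagged "classic rock" never makes it into the results even though it matches the keyword.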
Check out Sphinx Search. It installs on your server and is used for fast searches. It can talk directly to a MySQL database (it does so by default) and offers in-depth indexing, filtering, and other search tools. The documentation on the site is a little hard to follow, at least for me, but I was able to find a very useful book (PDF) that got me started.
Learn Redis, use the phpredis extension, and write the whole thing yourself to do exactly what you need. It will be super fast and highly extendable, and you'll understand what is going on. I started with SphinxSearch (http://sphinxsearch.com/); you could use that first to get an idea of how 'key->value dictionary tables' speed up search, and of common ideas like stopwords/wordforms/etc. Sphinx comes with a PHP API.
>Searching through the database is resource-intensive
Which is why you implement something like sphinxsearch...
If someone wanted to build their own search engine to crawl and index reddit, had a server, some disk space and some bandwidth, I think Constellio might be a good option (it's essentially a free Google Search Appliance). Its problem is that it's Java based so it could get quite resource hungry with increasing load...
Apparently Reddit is using Amazon CloudSearch to power its current search functionality.
I would agree except for the database part. If you only need to find which files have the desired text, you can just save the filename(s) that match in a results array. Just use the same buffer for the read-in text for each iteration. Not all databases are terribly efficient at searching for text.
Of course, you asked for "the most efficient way"... and that would be to install an actual search engine, like HTDIG or Sphinx. These applications take your corpus (the 25K HTML docs) and generate an index of words and the documents they appear in, and then searching becomes much more efficient.