Getting Started with Apache Solr

What is Solr?

Put simply, Solr is a search engine. Like a relational database system (MySQL, for example), it can store textual, numeric, spatial or binary data and allows quick search and retrieval. Here’s a list of the equivalent concepts between a database system and Solr:

Database System    Solr
Table              Collection
Row                Document
Column             Field

Solr is well suited to searching, filtering and faceting across full-text fields and other field types, and it gives you flexible control over how retrieved results are ranked by relevance to the query. A major difference between Solr and database systems is that Solr is not well suited to join operations across multiple collections, whereas joins across tables are very common in database systems. Database practitioners recommend normalizing data; in Solr, it is recommended to keep your data as de-normalized as possible. Solr also offers many other features, such as searching across multiple fields at once, spell correction, highlighting, grouping, streaming functions and robust scaling. We shall explore these in subsequent posts.

Running Solr

For this tutorial, let's use Docker to start Solr 8.5.2, create a collection, index a few documents and run some search queries. This article assumes no prior knowledge of Docker, only that Docker is already installed. To install Docker, visit https://get.docker.com.

docker run -it -p 8983:8983 -p 9983:9983 solr:8.5.2 /opt/solr/bin/solr -c -f

Output:

2020-06-20 14:19:20.985 INFO  (main) [   ] o.e.j.u.log Logging initialized @965ms to org.eclipse.jetty.util.log.Slf4jLog
2020-06-20 14:19:21.152 INFO  (main) [   ] o.e.j.s.Server jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 363d5f2df3a8a28de40604320230664b9c793c16; jvm 11.0.7+10
2020-06-20 14:19:21.213 INFO  (main) [   ] o.e.j.d.p.ScanningAppProvider Deployment monitor [file:///opt/solr-8.5.2/server/contexts/] at interval 0
2020-06-20 14:19:21.456 INFO  (main) [   ] o.e.j.w.StandardDescriptorProcessor NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
2020-06-20 14:19:21.466 INFO  (main) [   ] o.e.j.s.session DefaultSessionIdManager workerName=node0
2020-06-20 14:19:21.466 INFO  (main) [   ] o.e.j.s.session No SessionScavenger set, using defaults
2020-06-20 14:19:21.469 INFO  (main) [   ] o.e.j.s.session node0 Scavenging every 600000ms
2020-06-20 14:19:21.537 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter Using logger factory org.apache.logging.slf4j.Log4jLoggerFactory
2020-06-20 14:19:21.541 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter  ___      _       Welcome to Apache Solr™ version 8.5.2
2020-06-20 14:19:21.542 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _   Starting in cloud mode on port 8983
2020-06-20 14:19:21.542 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /opt/solr
2020-06-20 14:19:21.542 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|    Start time: 2020-06-20T14:19:21.542368Z
2020-06-20 14:19:21.580 INFO  (main) [   ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /var/solr/data
2020-06-20 14:19:21.587 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading container configuration from /var/solr/data/solr.xml
...
2020-06-20 14:19:22.547 INFO  (main) [   ] o.a.s.c.SolrZkServer STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9983
2020-06-20 14:19:22.547 WARN  (main) [   ] o.a.s.c.SolrZkServer Embedded Zookeeper is not recommended in production environments. See Reference Guide for details.
2020-06-20 14:19:23.049 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper client=localhost:9983
2020-06-20 14:19:23.094 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for client to connect to ZooKeeper
2020-06-20 14:19:23.117 INFO  (zkConnectionManagerCallback-7-thread-1) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2020-06-20 14:19:23.117 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
...
2020-06-20 14:19:23.940 INFO  (main) [   ] o.e.j.s.AbstractConnector Started ServerConnector@56ace400{HTTP/1.1,[http/1.1, h2c]}{0.0.0.0:8983}
2020-06-20 14:19:23.941 INFO  (main) [   ] o.e.j.s.Server Started @3924ms

At this point, Solr has started and is ready to go (point your browser to http://localhost:8983/solr to view Solr’s Admin UI). Before we proceed, though, let us understand the various parts of the Docker command used to start Solr. The base command is “run”, which instantiates a Docker container within which Solr runs.

- The flags -it instruct Docker to run the container interactively (as opposed to running it in the background) and to allocate a pseudo-TTY for it.

- The flag -p publishes a port from within the container and maps it to a port on the host computer (where Docker is running). Since 8983 is the default Solr port, it is published this way so that we can interact with Solr. The port 9983 in this example is the ZooKeeper port, to which other Solr containers can later connect to form a Solr cluster of multiple nodes.

- solr:8.5.2 names the image and version to start. In this case, Solr’s official Docker image is pulled from the central Docker registry (called Docker Hub) and used to start the Solr container.

- Here, /opt/solr/bin/solr -c -f is the command that starts Solr once the container is up. Inside the container, Solr is installed in the /opt/solr directory and the bin/solr script is used to start it. The -c parameter instructs the script to start Solr in “cloud” mode (also called “SolrCloud” mode), meaning that Solr starts as part of a distributed cluster. In cloud mode, an embedded instance of ZooKeeper, used for cluster coordination, is started alongside the Solr process; other Solr nodes can join this SolrCloud cluster by connecting to this ZooKeeper instance. The -f parameter instructs the script to start Solr in the foreground, so that the Docker container keeps running and the logs are displayed.
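
If you would rather keep your terminal free, the same container can be run detached. The following is a minimal sketch, not part of the original walkthrough: the container name solr-demo and the helper function names are hypothetical, and the readiness check assumes Solr's system info endpoint at /solr/admin/info/system answers once startup completes.

```shell
# Sketch: run the same Solr container in the background instead of interactively.
# "solr-demo" is a hypothetical container name chosen for this example.
SOLR_CONTAINER="${SOLR_CONTAINER:-solr-demo}"
SOLR_URL="${SOLR_URL:-http://localhost:8983}"

start_solr() {
  # -d detaches the container (replacing -it); --name gives it a stable handle.
  docker run -d --name "$SOLR_CONTAINER" \
    -p 8983:8983 -p 9983:9983 \
    solr:8.5.2 /opt/solr/bin/solr -c -f
}

wait_for_solr() {
  # Poll Solr until it answers, giving up after 30 attempts (about a minute).
  attempt=0
  while [ "$attempt" -lt 30 ]; do
    if curl -sf "$SOLR_URL/solr/admin/info/system" > /dev/null; then
      echo "Solr is up"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep 2
  done
  echo "Solr did not respond in time" >&2
  return 1
}

stop_solr() {
  # Stop the container, then remove it so the name can be reused.
  docker stop "$SOLR_CONTAINER" && docker rm "$SOLR_CONTAINER"
}
```

After sourcing these functions, start_solr followed by wait_for_solr brings Solr up; docker logs -f solr-demo shows the same log output as the interactive run.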

Interacting with Solr: A books collection

From a separate terminal, issue the following commands:

Create a collection:

curl -X POST \
  -H 'Content-Type: application/json' \
  http://localhost:8983/api/collections \
  -d '{
  "create": {
    "name": "books",
    "numShards": 1
  }
}'
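
To confirm that the collection was created, one option is to ask Solr for its list of collections. This is a sketch under the assumption that a GET on the same /api/collections endpoint returns the existing collection names, as the v2 API documentation describes; the function names are hypothetical.

```shell
# Sketch: check that a collection exists after creating it.
SOLR_URL="${SOLR_URL:-http://localhost:8983}"

list_collections() {
  # GET on the endpoint used for creation; assumed to return the collection list as JSON.
  curl -s "$SOLR_URL/api/collections"
}

collection_exists() {
  # $1 = collection name.
  # Naive text match for the quoted name in the JSON response;
  # a JSON-aware tool would be more robust.
  list_collections | grep -q "\"$1\""
}
```

For example, collection_exists books && echo "books is ready" prints the message only when the name appears in the response.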

Indexing documents into the collection:

Indexing one document at a time:

curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"id":"1", "title":"Hitchhikers Guide to the Galaxy", "author":"Douglas Adams"}' \
  "http://localhost:8983/api/collections/books/update?commit=true"
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"id":"2", "title":"My Family and Other Animals", "author":"Gerald Durrell"}' \
  "http://localhost:8983/api/collections/books/update?commit=true"
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"id":"3", "title":"1984", "author":"George Orwell"}' \
  "http://localhost:8983/api/collections/books/update?commit=true"
curl -X POST \
  -H 'Content-Type: application/json' \
  -d '{"id":"4", "title":"Lucene in Action", "author":"Erik Hatcher"}' \
  "http://localhost:8983/api/collections/books/update?commit=true"
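
Inline JSON gets unwieldy as documents grow, so the request body can also be read from a file. The sketch below sends a file to the same update endpoint; the function name index_file and the file name books.json are hypothetical, and the file is assumed to hold JSON in the same shape as the request bodies shown in this section.

```shell
# Sketch: index documents stored in a JSON file.
SOLR_URL="${SOLR_URL:-http://localhost:8983}"

index_file() {
  # $1 = collection name, $2 = path to a JSON file of documents.
  # --data-binary @file sends the file as-is (plain -d would strip newlines).
  curl -s -X POST \
    -H 'Content-Type: application/json' \
    --data-binary "@$2" \
    "$SOLR_URL/api/collections/$1/update?commit=true"
}
```

Usage would look like index_file books books.json, with the file kept under version control alongside the rest of your setup scripts.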

Or, batch indexing:

curl -X POST \
  -H 'Content-Type: application/json' \
  -d '[
    {"id":"1", "title":"Hitchhikers Guide to the Galaxy", "author":"Douglas Adams"},
    {"id":"2", "title":"My Family and Other Animals", "author":"Gerald Durrell"},
    {"id":"3", "title":"1984", "author":"George Orwell"},
    {"id":"4", "title":"Lucene in Action", "author":"Erik Hatcher"}
  ]' \
  "http://localhost:8983/api/collections/books/update?commit=true"

Search queries:

Get all Solr documents (books):

curl "http://localhost:8983/api/collections/books/select?q=*:*"

Search for books titled “lucene”:

curl "http://localhost:8983/api/collections/books/select?q=title:lucene"

Search for books by “orwell”:

curl "http://localhost:8983/api/collections/books/select?q=author:orwell"

Search for books by “douglas adams”:

curl 'http://localhost:8983/api/collections/books/select?q=author:"douglas+adams"'
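
The phrase query above encodes the space by hand as a plus sign. As a sketch of an alternative, curl can do the URL encoding itself through its -G and --data-urlencode options; the function name solr_query is hypothetical.

```shell
# Sketch: let curl URL-encode the query instead of encoding it by hand.
SOLR_URL="${SOLR_URL:-http://localhost:8983}"

solr_query() {
  # $1 = collection name, $2 = query in Solr syntax, e.g. author:"douglas adams"
  # -G sends the --data-urlencode value as a URL query string on a GET request,
  # so spaces and quotes in the query are percent-encoded by curl.
  curl -s -G "$SOLR_URL/api/collections/$1/select" \
    --data-urlencode "q=$2"
}
```

With this in place, solr_query books 'author:"douglas adams"' runs the same phrase search, and solr_query books 'title:lucene' repeats the earlier title search.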

Conclusion

This was just a quick 5-minute introduction. There are various nuances associated with collection creation (such as sharding and replication), indexing (schema management etc.) and querying (different query parsers etc.). Refer to the official reference guide for more details.