Extractive search on company documents

How to use recent advances in AI to retrieve the skills and knowledge scattered across company documents.

Discover how our semantic search project leverages the latest in AI to streamline information management and knowledge retrieval across your company's documents. Unlock efficiency and insight with our cutting-edge extractive search models.

Objectives of a project with an extractive model

Retrieve a set of documents linked to a given query, with the most relevant paragraphs (contexts) highlighted.

Our project aims to significantly enhance the efficiency and accuracy of document search by retrieving the most relevant paragraphs for specific queries. By focusing on precise, actionable insights, we aim to facilitate better decision-making and improve operational productivity.

How does an extractive model work?

Extractive models are frequently used in tasks such as question answering, where the goal is to identify the precise answer within a vast corpus of documents.

Our extractive model employs BERT technology to efficiently sift through documents, pinpointing exact text segments relevant to your queries. This approach simplifies information retrieval, making it faster and more accurate.

Extractive Phase:

A collection of corporate documents is broken down into smaller passages (contexts).

These contexts are vectorized, and their vectors are stored in a database.

When a search is conducted, the query is vectorized too.

The result consists of the contexts closest to the query, together with the documents that contain them.
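The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `embed` function is a toy word-count vector standing in for a BERT embedding, the in-memory list stands in for the vector database, and all function names, file names, and sample texts are assumptions made for the example.

```python
import math
import re
from collections import Counter

def split_into_contexts(text):
    """Break a document into paragraph-sized contexts (split on blank lines)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def embed(text):
    """Toy stand-in for a BERT embedding: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(documents):
    """Return (document name, context, vector) rows; stands in for the vector database."""
    index = []
    for name, text in documents.items():
        for context in split_into_contexts(text):
            index.append((name, context, embed(context)))
    return index

def search(index, query, top_k=3):
    """Vectorize the query and return the top_k closest contexts with their documents."""
    q = embed(query)
    scored = [(cosine_similarity(q, vec), name, context) for name, context, vec in index]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]

# Hypothetical sample documents, for illustration only.
documents = {
    "hr_policy.txt": "Employees accrue vacation days monthly.\n\nSick leave requires a medical certificate.",
    "it_guide.txt": "Reset your password from the login page.\n\nVPN access requires a hardware token.",
}
index = build_index(documents)
results = search(index, "how do I reset my password", top_k=2)
for score, name, context in results:
    print(f"{score:.2f}  {name}: {context}")
```

With a real embedding model, the query and the contexts would not need to share exact words to be considered close; the toy vector used here only matches on shared vocabulary.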
 

What is a vector database and what is its purpose?

A vector database stores contexts, which are transformed into vectors.

This vector transformation is used to find contexts most similar to a given query. 

Faiss (Facebook AI Similarity Search) is a very fast open-source library for vector similarity search. Milvus is an open-source vector database.

Within the database, we can associate specific contexts with a specific user, ensuring confidentiality. The documents remain within the company, as does the vector database.
 

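Vector search is conducted with nearest-neighbor algorithms: the contexts returned are those whose vectors lie at the smallest distance (for example, Euclidean or cosine distance) from the query vector.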
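As a rough illustration of distance-based search, the sketch below performs an exhaustive nearest-neighbor lookup using Euclidean distance. The three-dimensional vectors and the names `stored` and `nearest_neighbors` are invented for the example; real embeddings have hundreds of dimensions, and a vector database typically replaces this linear scan with an optimized index.

```python
import math

def nearest_neighbors(vectors, query, k=2):
    """Exhaustive (brute-force) nearest-neighbor search by Euclidean distance."""
    distances = [(math.dist(vec, query), key) for key, vec in vectors.items()]
    distances.sort()  # smallest distance first
    return distances[:k]

# Hypothetical stored context vectors, for illustration only.
stored = {
    "context_a": [0.9, 0.1, 0.0],
    "context_b": [0.0, 1.0, 0.2],
    "context_c": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]
hits = nearest_neighbors(stored, query_vector, k=2)
for distance, key in hits:
    print(f"{key}: distance {distance:.3f}")
```

A linear scan like this compares the query against every stored vector, which is fine for small collections but becomes slow as the corpus grows; that cost is the main reason dedicated similarity-search tools are used.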

Technology

Using exclusively open-source technology allows for operations on local machines, ensuring the utmost security for the company. Nothing is stored in the cloud!

Faiss and Milvus are both designed for high-speed similarity search, so the most relevant document contexts are quickly accessible. This choice reflects our commitment to open-source, highly efficient technologies.

For vectorization (embedding) and extraction, we use open-source pre-trained BERT models.

The user interface is developed as a web application in Java Spring.
 

Project steps for the implementation of Semantic Search

Our project unfolds in carefully planned stages, from document selection and preparation to vectorization, ensuring every phase contributes to a robust semantic search system.

1. Select the documents on which to perform the extractive search.

2. Create a folder to store the company documents to be searched.

3. Clean and convert the documents into text.

4. Vectorize the documents.
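The folder and cleaning steps might look like the following sketch. It assumes the documents have already been converted to plain .txt files (converting PDF or office formats would need an additional tool that is not shown), and the cleaning rules, function names, and sample file are illustrative assumptions rather than the project's actual pipeline.

```python
import re
import tempfile
from pathlib import Path

def clean_text(raw):
    """Normalize extracted text before it is split into contexts and vectorized."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # heuristic: rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line between paragraphs
    return text.strip()

def load_folder(folder):
    """Read every .txt file in the folder and return {file name: cleaned text}."""
    return {
        path.name: clean_text(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.txt"))
    }

# Demonstration with a temporary folder and a hypothetical, messy sample file.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "memo.txt").write_text(
        "Quarterly   re-\nport\n\n\n\nBudget  approved. ", encoding="utf-8"
    )
    corpus = load_folder(folder)

print(corpus)
```

Keeping blank lines between paragraphs matters here because they are a natural boundary for splitting each document into contexts before vectorization.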
 

Now connect the vector database to the Java application and start the search.

The Java Spring web application

To access the vector database, we developed a Java application with user authentication and authorization.

Our Java Spring web application is designed with user experience in mind, facilitating seamless access to document searches with robust security measures to protect your data. We prioritize user-friendly design and data privacy.

Each user can only access their own documents. Each user can access up to N projects, with each project corresponding to a unique set of documents. If necessary, user groups can be created to share specific document sets among members.
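The access rules described above can be modeled roughly as follows. The sketch is written in Python for brevity and is not the Java application's code; the `Project` class, the field names, and the sample users and groups are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Project:
    name: str
    owner: str
    documents: set = field(default_factory=set)
    shared_with_groups: set = field(default_factory=set)

def visible_documents(user, user_groups, projects):
    """Return the documents a user may search: own projects plus group-shared ones."""
    allowed = set()
    for project in projects:
        if project.owner == user or project.shared_with_groups & user_groups:
            allowed |= project.documents
    return allowed

def filter_hits(hits, allowed_documents):
    """Drop search hits whose source document the user is not allowed to see."""
    return [hit for hit in hits if hit["document"] in allowed_documents]

# Hypothetical projects, users, and search hits, for illustration only.
projects = [
    Project("hr", owner="alice", documents={"hr_policy.txt"}),
    Project("it", owner="bob", documents={"it_guide.txt"}, shared_with_groups={"engineering"}),
]
hits = [
    {"document": "hr_policy.txt", "score": 0.8},
    {"document": "it_guide.txt", "score": 0.6},
]
alice_hits = filter_hits(hits, visible_documents("alice", set(), projects))
carol_hits = filter_hits(hits, visible_documents("carol", {"engineering"}, projects))
print(alice_hits)
print(carol_hits)
```

In this sketch the filter runs after the search. Restricting the search itself to the user's own contexts is an alternative that avoids ever computing scores for documents the user cannot see.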

The Java application allows vector searches on the vector database.

The result is a list of relevant contexts, displayed in descending order of proximity to the query.
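One possible way to arrange such results for display is to sort the contexts by score and group them under the document they come from. The tuple layout, the scores, and the sample texts below are assumptions for the example, not output from the actual application.

```python
from collections import defaultdict

def group_by_document(hits):
    """Group (score, document, context) hits by document, best match first."""
    grouped = defaultdict(list)
    for score, document, context in sorted(hits, key=lambda h: h[0], reverse=True):
        grouped[document].append((score, context))
    # Dicts keep insertion order, so documents appear in order of their best context.
    return dict(grouped)

# Hypothetical search hits, for illustration only.
hits = [
    (0.41, "it_guide.txt", "VPN access requires a hardware token."),
    (0.87, "it_guide.txt", "Reset your password from the login page."),
    (0.63, "hr_policy.txt", "New hires receive credentials on day one."),
]
ranked = group_by_document(hits)
for document, contexts in ranked.items():
    print(document)
    for score, context in contexts:
        print(f"  {score:.2f}  {context}")
```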

Some screenshots of the web app

Software and Infrastructure Component Interactions

Workstation/Server IT Infrastructure

Integrating Docker with Apache, MySQL, Flask, and Milvus brings a comprehensive suite of advantages, further enhancing development and deployment workflows. Docker's containerization guarantees environment consistency, facilitating seamless integration of Apache's web serving, MySQL's database management, Flask's flexible web framework, and Milvus's powerful vector database for AI applications.

This combination not only simplifies scalability and ensures isolation, minimizing dependency conflicts, but also extends the system's capabilities into efficient AI-powered search functionalities. Additionally, Docker's ecosystem, including tools like Docker Compose, streamlines the management of complex, service-oriented applications. This robust setup provides developers with a highly efficient, scalable, and isolated environment, ideal for developing advanced web and AI-enabled applications.

Interested in our semantic search initiatives? Contact us to dive deeper into our project or to discuss how we can work together.

Ivo Durisch, MSc in Engineering
Senior Engineer
ETH Zurich (Swiss Federal Institute of Technology)
Zurich, Switzerland

Davide Lupi, MSc in Engineering
Senior Engineer
SUPSI (University of Applied Sciences and Arts of Southern Switzerland)
Lugano, Switzerland

info@lineasoft.ch