• Author: 老汪软件技巧
  • Published: 2024-10-09 17:09

Author: David Pilato, from Elastic

LangChain4j (LangChain for Java) supports Elasticsearch as an embedding store. Learn how to use it to build a RAG application in plain Java.

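In the previous post, we discovered what LangChain4j is.

This blog post covers how to create embeddings, store them in Elasticsearch, and search for similar vectors.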

Create embeddings

To create embeddings, we first need to define the EmbeddingModel to use. For example, we can use the same mistral model we used in the previous post. It runs with ollama:

EmbeddingModel model = OllamaEmbeddingModel.builder()
  .baseUrl(ollama.getEndpoint())
  .modelName(MODEL_NAME)
  .build();

A model can generate vectors from text. Here we can check the number of dimensions the model generates:

Logger.info("Embedding model has {} dimensions.", model.dimension());
// This gives: Embedding model has 4096 dimensions.
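
To put that dimension count in perspective, here is a rough back-of-the-envelope sketch of how much raw space such vectors take. It assumes each dimension is stored as a 4-byte float and ignores any index or object overhead. VectorFootprint and estimateBytes are hypothetical names used only for illustration.

```java
// Rough estimate of the raw size of dense vectors.
// Assumption: each dimension is a 4-byte float; index and object overhead are ignored.
class VectorFootprint {

    // Raw bytes needed for vectorCount vectors of the given dimension
    static long estimateBytes(long vectorCount, int dimensions) {
        return vectorCount * dimensions * Float.BYTES;
    }

    public static void main(String[] args) {
        int dimensions = 4096; // the value reported by model.dimension() above
        System.out.println("1 vector: " + estimateBytes(1, dimensions) + " bytes");
        System.out.println("1,000,000 vectors: " + estimateBytes(1_000_000, dimensions) + " bytes");
    }
}
```

Under these assumptions a single vector takes 16384 bytes, and one million of them take about 16.4 GB before any overhead.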

To generate a vector from a text, we can use:

Response<Embedding> response = model.embed("A text here");

Or, if we also want to provide metadata so that we can later filter on things like text, price, or release date, we can use Metadata.from(). For example, here we add the game name as a metadata field:

TextSegment game1 = TextSegment.from("""
    The game starts off with the main character Guybrush Threepwood stating "I want to be a pirate!"
    To do so, he must prove himself to three old pirate captains. During the perilous pirate trials,
    he meets the beautiful governor Elaine Marley, with whom he falls in love, unaware that the ghost pirate
    LeChuck also has his eyes on her. When Elaine is kidnapped, Guybrush procures crew and ship to track
    LeChuck down, defeat him and rescue his love.
""", Metadata.from("gameName", "The Secret of Monkey Island"));
Response<Embedding> response1 = model.embed(game1);

TextSegment game2 = TextSegment.from("""
    Out Run is a pseudo-3D driving video game in which the player controls a Ferrari Testarossa
    convertible from a third-person rear perspective. The camera is placed near the ground, simulating
    a Ferrari driver's position and limiting the player's view into the distance. The road curves,
    crests, and dips, which increases the challenge by obscuring upcoming obstacles such as traffic
    that the player must avoid. The object of the game is to reach the finish line against a timer.
    The game world is divided into multiple stages that each end in a checkpoint, and reaching the end
    of a stage provides more time. Near the end of each stage, the track forks to give the player a
    choice of routes leading to five final destinations. The destinations represent different
    difficulty levels and each conclude with their own ending scene, among them the Ferrari breaking
    down or being presented a trophy.
""", Metadata.from("gameName", "Out Run"));
Response<Embedding> response2 = model.embed(game2);

If you want to run this code, check out the Step5EmbedddingsTest.java class.
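
Metadata is what makes filtering possible later on. The following is a minimal, library-free sketch of that idea: it models a segment as text plus a map of metadata and filters on one key. Segment and filterBy are hypothetical stand-ins, not LangChain4j types, and the segment texts are shortened paraphrases of the two game descriptions.

```java
import java.util.List;
import java.util.Map;

class MetadataFilterSketch {

    // Hypothetical stand-in for a text segment: some text plus key/value metadata
    record Segment(String text, Map<String, String> metadata) {}

    // Keep only the segments whose metadata has the expected value for the given key
    static List<Segment> filterBy(List<Segment> segments, String key, String expected) {
        return segments.stream()
            .filter(s -> expected.equals(s.metadata().get(key)))
            .toList();
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
            new Segment("Guybrush Threepwood wants to be a pirate.",
                Map.of("gameName", "The Secret of Monkey Island")),
            new Segment("The player drives a Ferrari Testarossa against a timer.",
                Map.of("gameName", "Out Run")));

        // Only the segment tagged with gameName = "Out Run" is printed
        filterBy(segments, "gameName", "Out Run")
            .forEach(s -> System.out.println(s.text()));
    }
}
```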

Add Elasticsearch to store our vectors

LangChain4j provides an in-memory embedding store. This is useful for running simple tests:

EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
embeddingStore.add(response1.content(), game1);
embeddingStore.add(response2.content(), game2);

But obviously, this does not work for larger datasets, because this datastore keeps everything in memory and our servers do not have infinite memory. So we can store our embeddings in Elasticsearch instead, which is by definition "elastic" and can scale up and out with your data. To do that, let's add Elasticsearch to our project:

<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-elasticsearch</artifactId>
  <version>${langchain4j.version}</version>
</dependency>

<dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>elasticsearch</artifactId>
  <version>1.20.1</version>
  <scope>test</scope>
</dependency>

As you noticed, we also added the Elasticsearch TestContainers module to the project, so we can start an Elasticsearch instance from our tests:

// Create the elasticsearch container
ElasticsearchContainer container =
  new ElasticsearchContainer("docker.elastic.co/elasticsearch/elasticsearch:8.15.0")
    .withPassword("changeme");

// Start the container. This step might take some time...
container.start();

// As we don't want to make our TestContainers code more complex than
// needed, we will use login / password for authentication.
// But note that you can also use API keys which is preferred.
final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("elastic", "changeme"));

// Create a low level Rest client which connects to the elasticsearch container.
RestClient client = RestClient.builder(HttpHost.create("https://" + container.getHttpHostAddress()))
  .setHttpClientConfigCallback(httpClientBuilder -> {
    httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
    httpClientBuilder.setSSLContext(container.createSslContextFromCa());
    return httpClientBuilder;
  })
  .build();

// Check the cluster is running
client.performRequest(new Request("GET", "/"));

To use Elasticsearch as the embedding store, you "just" have to switch from the LangChain4j in-memory datastore to the Elasticsearch datastore:

EmbeddingStore<TextSegment> embeddingStore =
  ElasticsearchEmbeddingStore.builder()
    .restClient(client)
    .build();
embeddingStore.add(response1.content(), game1);
embeddingStore.add(response2.content(), game2);

This stores your vectors in Elasticsearch in the default index. You can also change the index name to something more meaningful:

EmbeddingStore<TextSegment> embeddingStore =
  ElasticsearchEmbeddingStore.builder()
    .indexName("games")
    .restClient(client)
    .build();
embeddingStore.add(response1.content(), game1);
embeddingStore.add(response2.content(), game2);

If you want to run this code, check out the Step6ElasticsearchEmbedddingsTest.java class.

Search for similar vectors


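To search for similar vectors, we first need to transform our question into a vector representation, using the same model we used before. We have already done that, so it is not hard to do it again. Note that we do not need metadata in this case:

String question = "I want to pilot a car";
Embedding questionAsVector = model.embed(question).content();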

We can build a search request with this representation of the question and ask the embedding store to find the top vectors:

EmbeddingSearchResult<TextSegment> result = embeddingStore.search(
  EmbeddingSearchRequest.builder()
    .queryEmbedding(questionAsVector)
    .build());

We can now iterate over the results and print some information, such as the game name coming from the metadata and the score:

result.matches().forEach(m -> Logger.info("{} - score [{}]",
  m.embedded().metadata().getString("gameName"), m.score()));

As expected, the first result is "Out Run":

Out Run - score [0.86672974]
The Secret of Monkey Island - score [0.85569763]

If you want to run this code, check out the corresponding class.
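
Conceptually, a similarity search scores every stored vector against the question vector and sorts by score. The sketch below does this by brute force with cosine similarity on tiny, made-up 3-dimensional vectors. It only illustrates the idea and is not how LangChain4j or Elasticsearch implement the search. Match, cosine, and rank are hypothetical names.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BruteForceSearchSketch {

    // Hypothetical result type: a label and its similarity to the query
    record Match(String name, double score) {}

    // Cosine similarity: dot(a, b) / (|a| * |b|)
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Score every stored vector against the query, highest score first
    static List<Match> rank(float[] query, Map<String, float[]> stored) {
        return stored.entrySet().stream()
            .map(e -> new Match(e.getKey(), cosine(query, e.getValue())))
            .sorted(Comparator.comparingDouble(Match::score).reversed())
            .toList();
    }

    public static void main(String[] args) {
        // Made-up toy vectors, not real embeddings
        Map<String, float[]> stored = new LinkedHashMap<>();
        stored.put("close", new float[] {1f, 0.9f, 0f});
        stored.put("far", new float[] {0f, 0.1f, 1f});

        rank(new float[] {1f, 1f, 0f}, stored)
            .forEach(m -> System.out.println(m.name() + " - score [" + m.score() + "]"));
    }
}
```

A linear scan like this costs one comparison per stored vector, which is the cost that approximate search structures are designed to avoid on large datasets.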

Behind the scenes

By default, the Elasticsearch embedding store uses an approximate kNN query behind the scenes.

POST games/_search
{
  "query" : {
    "knn": {
      "field": "vector",
      "query_vector": [-0.019137882, /* ... */, -0.0148779955]
    }
  }
}

However, this can be changed by providing the embedding store with a configuration (ElasticsearchConfigurationScript) other than the default one (ElasticsearchConfigurationKnn):

EmbeddingStore<TextSegment> embeddingStore =
  ElasticsearchEmbeddingStore.builder()
    .configuration(ElasticsearchConfigurationScript.builder().build())
    .indexName("games")
    .restClient(client)
    .build();

The ElasticsearchConfigurationScript implementation runs a script_score query behind the scenes, using the cosineSimilarity function.

Basically, when calling:

EmbeddingSearchResult<TextSegment> result = embeddingStore.search(
  EmbeddingSearchRequest.builder()
    .queryEmbedding(questionAsVector)
    .build());

it now runs:

POST games/_search
{
  "query": {
    "script_score": {
      "script": {
        "source": "(cosineSimilarity(params.query_vector, 'vector') + 1.0) / 2",
        "params": {
          "query_vector": [-0.019137882, /* ... */, -0.0148779955]
        }
      }
    }
  }
}

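In this case, the result does not change in terms of order, but the score is adjusted, because the cosineSimilarity call does not use any approximation and instead computes the cosine for each matching vector:

Out Run - score [0.871952]
The Secret of Monkey Island - score [0.86380446]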

If you want to run this code, check out the corresponding class.
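
The script expression above maps the cosine, which ranges from -1 to 1, to a score between 0 and 1. A small sketch of that mapping shows why it cannot reorder results: the function is strictly increasing. toScore is a hypothetical helper that mirrors the script expression, and the input cosine values are made up.

```java
class ScriptScoreSketch {

    // Mirrors the script expression: (cosineSimilarity + 1.0) / 2
    static double toScore(double cosine) {
        return (cosine + 1.0) / 2;
    }

    public static void main(String[] args) {
        // Made-up cosine values covering the whole [-1, 1] range
        double[] cosines = {-1.0, 0.0, 0.5, 1.0};
        for (double c : cosines) {
            System.out.println("cosine " + c + " -> score " + toScore(c));
        }
        // prints scores 0.0, 0.5, 0.75 and 1.0: still in ascending order
    }
}
```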

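Conclusion

We have covered how easily you can generate embeddings from your text, and how you can store and search for the nearest neighbours in Elasticsearch using two different approaches: the default approximate kNN query and the exact script_score query.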

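The next step will be to build a complete RAG application based on what we learned here.

Ready to try this out on your own? Start a free trial.

Elasticsearch has integrations for tools from LangChain, Cohere and more. Join our advanced semantic search webinar to build your next GenAI app!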

Original article: LangChain4j with Elasticsearch as the embedding store (Search Labs)