[LlamaIndex] Indexing

2024. 1. 22. 16:08
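Once your data is loaded, you have a list of Document objects (or a list of Nodes). Now it is time to build an Index over them so you can start querying. An Index is a data structure composed of Document objects, designed to be queried by an LLM. LlamaIndex offers several index types.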

○ Vector Store Index

Vector Store Index is the index type you will encounter most often. It takes your documents and splits them into nodes. It then creates a vector embedding of the text of every node, so that the nodes are ready to be queried by an LLM.

# Embedding

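Vector embeddings are central to how LLM applications work. A vector embedding, often just called an embedding, is a numerical representation of the semantics, or meaning, of a piece of text. Two pieces of text with similar meanings have mathematically similar embeddings, even if the actual wording is quite different. This mathematical relationship enables semantic search: when a user supplies a search query, LlamaIndex can find text related to the meaning of the query rather than relying on simple keyword matching. This is a large part of how RAG works, and of how LLMs work in general.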
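
To make "mathematically similar" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors are hand-made, hypothetical values chosen for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (hypothetical values, not model output).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, invoice))  # close to 0: unrelated meaning
```

A score near 1 means the two vectors point in almost the same direction, which is how "similar meaning" shows up numerically.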

# Vector Store Index embeds your documents

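Vector Store Index uses an API from your LLM to turn all of your text into embeddings; this is what is meant by "embedding your text". If you have a lot of text, generating the embeddings can take a long time, because it involves many round-trip API calls.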
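
The cost of those round trips depends on how many texts are sent per request. The sketch below illustrates the arithmetic with a fake client that only counts calls. `FakeEmbeddingClient` and `batched` are hypothetical helpers written for this example, not part of LlamaIndex (whose embedding classes expose their own batch-size setting, `embed_batch_size` in the versions I am aware of).

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class FakeEmbeddingClient:
    """Stand-in for a remote embedding API; counts round trips, makes no network calls."""

    def __init__(self):
        self.calls = 0

    def embed(self, texts):
        self.calls += 1
        # Hypothetical 1-dimensional "embedding": just the text length.
        return [[float(len(text))] for text in texts]


chunks = [f"chunk {i}" for i in range(95)]
client = FakeEmbeddingClient()

embeddings = []
for batch in batched(chunks, batch_size=10):
    embeddings.extend(client.embed(batch))

# 95 chunks in batches of 10 need 10 round trips instead of 95.
print(len(embeddings), client.calls)
```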

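To search your embeddings, the query itself is first converted into a vector embedding. VectorStoreIndex then performs a mathematical operation that ranks all the embeddings by how semantically similar they are to the query.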

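# Top K Retrieval

Once ranking is complete, VectorStoreIndex returns the most similar embeddings as their corresponding chunks of text. The number of embeddings it returns is called k, so the parameter that controls how many embeddings to return is called top_k. For this reason, this whole type of search is often referred to as "top-k semantic retrieval". Top-k retrieval is the simplest form of querying a vector index; more complex and subtle strategies are covered in the querying section.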
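
The ranking-then-truncating step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how VectorStoreIndex is implemented: the store is an in-memory list, the embeddings are hand-made hypothetical vectors, and cosine similarity stands in for whatever similarity measure the real store uses.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_embedding, store, k):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = ((cosine_similarity(query_embedding, emb), text) for text, emb in store)
    return heapq.nlargest(k, scored)


# (text, embedding) pairs with hypothetical toy embeddings.
store = [
    ("Cats are small domesticated felines.", [0.9, 0.1, 0.0]),
    ("Kittens are young cats.", [0.8, 0.2, 0.1]),
    ("Invoices are due within 30 days.", [0.0, 0.1, 0.9]),
    ("Payment terms are net 30.", [0.1, 0.0, 0.9]),
]

query = [0.85, 0.15, 0.05]  # pretend embedding of "Tell me about cats"
results = top_k(query, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

In LlamaIndex itself, the corresponding setting is usually passed as `similarity_top_k`, for example `index.as_query_engine(similarity_top_k=2)`; check the documentation for your installed version.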

# Using Vector Store Index

To use Vector Store Index, pass it the list of documents you created during the loading stage.

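    # Import path for llama_index 0.9.x; from 0.10 onward the class lives in llama_index.core
    from llama_index import VectorStoreIndex

    # `documents` is the list produced in the loading stage
    index = VectorStoreIndex.from_documents(documents)

    # Example of indexing nodes directly
    index = VectorStoreIndex(nodes)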

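When you use from_documents, your documents are split into chunks and parsed into Node objects, which are lightweight abstractions over text strings that keep track of metadata and relationships. By default, VectorStoreIndex stores everything in memory.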
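
To illustrate what "a text chunk that tracks metadata and relationships" can look like, here is a minimal sketch. `ToyNode` and `split_into_nodes` are hypothetical names invented for this example; LlamaIndex's real Node classes and splitters are richer and split on sentence and token boundaries rather than on a fixed word count.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: a text chunk plus metadata and links."""
    node_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def split_into_nodes(doc_id, text, metadata, chunk_size):
    """Split `text` into chunks of at most `chunk_size` words and link neighbours."""
    words = text.split()
    nodes = []
    for i in range(0, len(words), chunk_size):
        nodes.append(ToyNode(
            node_id=f"{doc_id}-{len(nodes)}",
            text=" ".join(words[i:i + chunk_size]),
            metadata=dict(metadata),  # each node gets a copy of the document metadata
        ))
    # Record the previous/next relationship between adjacent chunks.
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.next_id, nxt.prev_id = nxt.node_id, prev.node_id
    return nodes


toy_nodes = split_into_nodes(
    "doc1",
    "one two three four five six seven eight nine ten",
    {"file_name": "example.txt"},
    chunk_size=4,
)
for node in toy_nodes:
    print(node.node_id, repr(node.text), node.prev_id, node.next_id)
```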

# Using the ingestion pipeline

If you want more control over how your documents are indexed, the ingestion pipeline is the recommended approach. It lets you customize the chunking, metadata, and embedding of the nodes.

    from llama_index import Document
    from llama_index.embeddings import OpenAIEmbedding
    from llama_index.text_splitter import SentenceSplitter
    from llama_index.extractors import TitleExtractor
    from llama_index.ingestion import IngestionPipeline, IngestionCache

    # create the pipeline with transformations
    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=25, chunk_overlap=0),
            TitleExtractor(),
            OpenAIEmbedding(),
        ]
    )

    # run the pipeline
    nodes = pipeline.run(documents=[Document.example()])
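
Conceptually, a pipeline of this kind applies its transformations one after another, each taking a list of nodes and returning a new list. The sketch below shows that idea with plain functions and dictionaries. All names in it (`split_sentences`, `add_word_count`, `run_pipeline`) are hypothetical, and it makes no claim about how IngestionPipeline is implemented internally.

```python
def split_sentences(nodes):
    """Split each node's text on periods, producing one node per sentence."""
    out = []
    for node in nodes:
        for sentence in node["text"].split("."):
            sentence = sentence.strip()
            if sentence:
                out.append({"text": sentence, "metadata": dict(node["metadata"])})
    return out


def add_word_count(nodes):
    """Attach a simple piece of metadata to every node."""
    for node in nodes:
        node["metadata"]["word_count"] = len(node["text"].split())
    return nodes


def run_pipeline(documents, transformations):
    """Apply each transformation in order; each takes and returns a list of nodes."""
    nodes = documents
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


docs = [{
    "text": "LlamaIndex builds indexes. Indexes answer queries.",
    "metadata": {"source": "demo"},
}]
pipeline_nodes = run_pipeline(docs, [split_sentences, add_word_count])
for node in pipeline_nodes:
    print(node)
```

Order matters in this design: the splitter runs first so that the later step computes its metadata per chunk rather than per document.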

# Specifying the vector store

LlamaIndex supports dozens of vector stores. You choose which one to use by passing in a StorageContext, on which you in turn set the vector_store argument, as in this example using Pinecone.

    import pinecone
    from llama_index import VectorStoreIndex, SimpleDirectoryReader, StorageContext
    from llama_index.vector_stores import PineconeVectorStore

    # init pinecone
    pinecone.init(api_key="<api_key>", environment="<environment>")
    pinecone.create_index(
        "quickstart", dimension=1536, metric="euclidean", pod_type="p1"
    )

    # construct vector store and customize storage context
    storage_context = StorageContext.from_defaults(
        vector_store=PineconeVectorStore(pinecone.Index("quickstart"))
    )

    # Load documents and build index
    documents = SimpleDirectoryReader(
        "../../examples/data/paul_graham"
    ).load_data()
    index = VectorStoreIndex.from_documents(
        documents, storage_context=storage_context
    )

○ Metadata Extraction

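In many cases, a single chunk of text lacks the context needed to distinguish it clearly from other, similar chunks. To address this, an LLM can be used to extract contextual information relevant to the document, which helps both the retriever and the language model tell similar-looking chunks apart.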

First, define the metadata extractors. Then apply them together with the node parser as transformations, so that the extra metadata is added to each node.

    from llama_index.node_parser import SentenceSplitter
    from llama_index.extractors import (
        SummaryExtractor,
        QuestionsAnsweredExtractor,
        TitleExtractor,
        KeywordExtractor,
        EntityExtractor,
    )

    transformations = [
        SentenceSplitter(),
        TitleExtractor(nodes=5),
        QuestionsAnsweredExtractor(questions=3),
        SummaryExtractor(summaries=["prev", "self"]),
        KeywordExtractor(keywords=10),
        EntityExtractor(prediction_threshold=0.5),
    ]

    # Apply the transformations with an ingestion pipeline
    from llama_index.ingestion import IngestionPipeline

    pipeline = IngestionPipeline(transformations=transformations)
    nodes = pipeline.run(documents=documents)

Running the code above extracts metadata like the following.

    {
        'page_label': '2',
        'file_name': '10k-132.pdf',
        'document_title': 'Uber Technologies, Inc. 2019 Annual Report: Revolutionizing Mobility and Logistics Across 69 Countries and 111 Million MAPCs with $65 Billion in Gross Bookings',
        'questions_this_excerpt_can_answer': '\n\n1. How many countries does Uber Technologies, Inc. operate in?\n2. What is the total number of MAPCs served by Uber Technologies, Inc.?\n3. How much gross bookings did Uber Technologies, Inc. generate in 2019?',
        'prev_section_summary': "\n\nThe 2019 Annual Report provides an overview of the key topics and entities that have been important to the organization over the past year. These include financial performance, operational highlights, customer satisfaction, employee engagement, and sustainability initiatives. It also provides an overview of the organization's strategic objectives and goals for the upcoming year.",
        'section_summary': '\nThis section discusses a global tech platform that serves multiple multi-trillion dollar markets with products leveraging core technology and infrastructure. It enables consumers and drivers to tap a button and get a ride or work. The platform has revolutionized personal mobility with ridesharing and is now leveraging its platform to redefine the massive meal delivery and logistics industries. The foundation of the platform is its massive network, leading technology, operational excellence, and product expertise.',
        'excerpt_keywords': '\nRidesharing, Mobility, Meal Delivery, Logistics, Network, Technology, Operational Excellence, Product Expertise, Point A, Point B',
    }
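
One way extracted metadata can help disambiguate chunks is by being placed alongside the chunk text when that text is embedded or shown to the LLM. The following is a minimal sketch of that idea only. `text_for_embedding` is a hypothetical helper, the chunk text is invented, the keyword list is shortened from the output above, and the formatting is an assumption rather than LlamaIndex's actual template.

```python
def text_for_embedding(chunk_text, metadata, keys):
    """Prefix the chunk with selected `key: value` metadata lines."""
    header = "\n".join(f"{key}: {metadata[key]}" for key in keys if key in metadata)
    return f"{header}\n\n{chunk_text}" if header else chunk_text


metadata = {
    "file_name": "10k-132.pdf",
    "page_label": "2",
    "excerpt_keywords": "Ridesharing, Mobility, Meal Delivery",  # shortened for the example
}
chunk = "Our platform connects riders and drivers."  # hypothetical chunk text

enriched = text_for_embedding(chunk, metadata, keys=["file_name", "excerpt_keywords"])
print(enriched)
```

Two chunks with nearly identical wording but different `file_name` or keyword values would then produce different inputs, which is what lets the retriever separate them.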

○ Other Indexing Methods

Besides Vector Store Index, LlamaIndex lets you build several other kinds of index.

It supports Summary Index, Knowledge Graph Index, SQL Index, and others. The full list of supported index types is available at the following link.

https://docs.llamaindex.ai/en/stable/module_guides/indexing/modules.html

ABOUT ME

AI for Everyone
