Loader

Before you can start indexing your documents, you need to load them into memory.

SimpleDirectoryReader

LlamaIndex.TS supports easy loading of files from folders using the SimpleDirectoryReader class.

It is a simple reader that reads all files from a directory and its subdirectories.

import { SimpleDirectoryReader } from "llamaindex/readers/SimpleDirectoryReader";
// or
// import { SimpleDirectoryReader } from 'llamaindex'

const reader = new SimpleDirectoryReader();
const documents = await reader.loadData("../data");

documents.forEach((doc) => {
  console.log(`document (${doc.id_}):`, doc.getText());
});

Currently, it supports reading .txt, .pdf, .csv, .md, .docx, .htm, .html, .jpg, .jpeg, .png and .gif files, but support for other file types is planned.

You can override the default reader for all file types, including unsupported ones, with the overrideReader option. You can also override the default reader for specific file types, or add support for additional file types, with the fileExtToReader option. Finally, you can provide a defaultReader as a fallback for files with unsupported extensions; by default, the fallback is TextFileReader.

SimpleDirectoryReader supports up to 9 concurrent requests. Use the numWorkers option to set the number of concurrent requests. By default, numWorkers is 1, i.e. files are loaded sequentially.
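To make the effect of numWorkers concrete, here is a minimal, self-contained sketch of bounded concurrency. It is not the library's implementation: mapWithWorkers is a hypothetical helper, and rejecting values outside 1 to 9 is an assumption based on the limit stated above.

```typescript
// Hypothetical helper, not part of LlamaIndex.TS: runs `task` over `items`
// with at most `numWorkers` tasks in flight at any time.
async function mapWithWorkers<T, R>(
  items: T[],
  numWorkers: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  // Assumption: mirror the documented 1..9 range by rejecting other values.
  if (!Number.isInteger(numWorkers) || numWorkers < 1 || numWorkers > 9) {
    throw new Error("numWorkers must be an integer between 1 and 9");
  }

  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each worker repeatedly claims the next index until none are left.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await task(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(numWorkers, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results; // same order as `items`
}
```

With numWorkers set to 1 the loop degenerates to sequential processing, which matches the default described above; larger values let several files be in flight at once while results keep the input order.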

import type { Document, Metadata } from "llamaindex";
import { FileReader } from "llamaindex";
import {
  FILE_EXT_TO_READER,
  SimpleDirectoryReader,
} from "llamaindex/readers/SimpleDirectoryReader";
import { TextFileReader } from "llamaindex/readers/TextFileReader";

class ZipReader extends FileReader {
  loadDataAsContent(fileContent: Buffer): Promise<Document<Metadata>[]> {
    throw new Error("Implement me");
  }
}

const reader = new SimpleDirectoryReader();
const documents = await reader.loadData({
  directoryPath: "../data",
  defaultReader: new TextFileReader(),
  fileExtToReader: {
    ...FILE_EXT_TO_READER,
    zip: new ZipReader(),
  },
});

documents.forEach((doc) => {
  console.log(`document (${doc.id_}):`, doc.getText());
});
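The way the three reader options interact can be sketched as a small lookup. This is an illustration of the behavior described above, not the library's code: the Reader and ReaderOptions types and the pickReader and extensionOf functions are hypothetical, and the precedence shown (overrideReader first, then fileExtToReader, then defaultReader) is an assumption drawn from that description.

```typescript
// Hypothetical, simplified types; the real library's interfaces may differ.
interface Reader {
  name: string;
}

interface ReaderOptions {
  overrideReader?: Reader;
  fileExtToReader?: Record<string, Reader>;
  defaultReader?: Reader;
}

// Returns the lower-cased extension without the dot, or "" if there is none.
function extensionOf(fileName: string): string {
  const base = fileName.slice(fileName.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot <= 0 ? "" : base.slice(dot + 1).toLowerCase();
}

// Assumed precedence: overrideReader wins for every file, then the
// per-extension map, then the fallback defaultReader.
function pickReader(
  fileName: string,
  options: ReaderOptions,
): Reader | undefined {
  if (options.overrideReader) {
    return options.overrideReader;
  }
  const ext = extensionOf(fileName);
  return options.fileExtToReader?.[ext] ?? options.defaultReader;
}
```

Under these assumptions, a file such as "archive.zip" resolves to the reader registered under the zip key, while a file with an unregistered extension falls through to defaultReader.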

LlamaParse

LlamaParse is an API created by LlamaIndex to parse files efficiently. For example, it is good at converting PDF tables into markdown.

To use it, first log in and get an API key from https://cloud.llamaindex.ai. Pass the key as the apiKey parameter or store it in the environment variable LLAMA_CLOUD_API_KEY.
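The two ways of supplying the key can be sketched as a small lookup. The resolveApiKey helper below is hypothetical and only illustrates the idea; that an explicitly passed key takes precedence over the environment variable is an assumption, not something this page states.

```typescript
// Hypothetical helper for illustration; not part of the LlamaIndex.TS API.
// `env` is passed in explicitly so the function stays easy to test.
function resolveApiKey(
  explicitKey: string | undefined,
  env: Record<string, string | undefined>,
): string {
  // Assumption: an explicitly passed key wins over the environment variable.
  const key = explicitKey ?? env["LLAMA_CLOUD_API_KEY"];
  if (!key) {
    throw new Error(
      "No API key: pass apiKey or set LLAMA_CLOUD_API_KEY in the environment",
    );
  }
  return key;
}
```

In Node.js, the second argument would typically be process.env. Failing early with a clear message is a design choice of this sketch, so that a missing key surfaces before any file is uploaded.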

Then, you can use the LlamaParseReader class to load local files and convert them into parsed documents that can be used by LlamaIndex. See LlamaParseReader.ts for a list of supported file types:

import { LlamaParseReader, VectorStoreIndex } from "llamaindex";

async function main() {
  // Load PDF using LlamaParse
  const reader = new LlamaParseReader({ resultType: "markdown" });
  const documents = await reader.loadData("../data/TOS.pdf");

  // Split text and create embeddings. Store them in a VectorStoreIndex
  const index = await VectorStoreIndex.fromDocuments(documents);

  // Query the index
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: "What is the license grant in the TOS?",
  });

  // Output response
  console.log(response.toString());
}

main().catch(console.error);

Additional options can be set with the LlamaParseReader constructor:

  • resultType can be set to markdown, text or json. Defaults to text.
  • language primarily helps with OCR recognition. Defaults to en. See ../readers/type.ts for a list of supported languages.
  • parsingInstruction can help with complicated document structures. See this LlamaIndex Blog Post for an example.
  • skipDiagonalText can be set to true to ignore diagonal text.
  • invalidateCache can be set to true to ignore the LlamaCloud cache. All documents are kept in the cache for 48 hours after the job completes, to avoid processing the same document twice. This is useful for testing, e.g. when re-parsing the same document with a different parsingInstruction.
  • gpt4oMode can be set to true to use GPT-4o to extract content.
  • gpt4oApiKey sets the GPT-4o API key. Optional. Using your own API key lowers the cost of parsing; your OpenAI account will be charged. Can also be set in the environment variable LLAMA_CLOUD_GPT4O_API_KEY.
  • numWorkers is set in SimpleDirectoryReader, as in the Python version. Defaults to 1.
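The documented defaults can be summarized as a typed options object. This is a sketch only: ParseOptions, DEFAULTS and withDefaults are hypothetical names, just a subset of the options is modeled, and the false defaults for the boolean flags are an assumption inferred from the "set to true" wording above.

```typescript
// Hypothetical subset of the options above, for illustration only.
type ResultType = "markdown" | "text" | "json";

interface ParseOptions {
  resultType: ResultType;
  language: string;
  skipDiagonalText: boolean;
  invalidateCache: boolean;
  gpt4oMode: boolean;
}

const DEFAULTS: ParseOptions = {
  resultType: "text", // documented default
  language: "en", // documented default
  skipDiagonalText: false, // assumed default
  invalidateCache: false, // assumed default
  gpt4oMode: false, // assumed default
};

// Merges caller-supplied options over the defaults.
function withDefaults(partial: Partial<ParseOptions> = {}): ParseOptions {
  return { ...DEFAULTS, ...partial };
}
```

For example, withDefaults({ resultType: "markdown" }) keeps language at en while switching the output format, which mirrors passing only resultType to the LlamaParseReader constructor.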

LlamaParse with SimpleDirectoryReader

Below is a full example of LlamaParse integrated into SimpleDirectoryReader with additional options.

import {
  LlamaParseReader,
  SimpleDirectoryReader,
  VectorStoreIndex,
} from "llamaindex";

async function main() {
  const reader = new SimpleDirectoryReader();

  const docs = await reader.loadData({
    directoryPath: "../data/parallel", // brk-2022.pdf split into 6 parts
    numWorkers: 2,
    // Set LlamaParse as the reader for all file types.
    // Set apiKey here or in the environment variable LLAMA_CLOUD_API_KEY.
    overrideReader: new LlamaParseReader({
      language: "en",
      resultType: "markdown",
      parsingInstruction:
        "The provided files are Berkshire Hathaway's 2022 Annual Report. They contain figures, tables and raw data. Capture the data in a structured format. Mathematical equations should be output as LaTeX markdown (between $$).",
    }),
  });

  const index = await VectorStoreIndex.fromDocuments(docs);

  // Query the index
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query:
      "What is the general strategy for shareholder safety outlined in the report? Use a concrete example with numbers",
  });

  // Output response
  console.log(response.toString());
}

main().catch(console.error);
