NodeParser
The NodeParser
in LlamaIndex is responsible for splitting Document
objects into more manageable Node
objects. When you call .fromDocuments()
, the NodeParser
from the Settings
is used to do this automatically for you. Alternatively, you can use it to split documents ahead of time.
import { Document, SimpleNodeParser } from "llamaindex";
const nodeParser = new SimpleNodeParser();
Settings.nodeParser = nodeParser;
TextSplitter
The underlying text splitter will split text by sentences. It can also be used as a standalone module for splitting raw text.
import { SentenceSplitter } from "llamaindex";
const splitter = new SentenceSplitter({ chunkSize: 1 });
const textSplits = splitter.splitText("Hello World");
MarkdownNodeParser
The MarkdownNodeParser
is a more advanced NodeParser
that can handle markdown documents. It will split the markdown into nodes and then parse the nodes into a Document
object.
import { MarkdownNodeParser } from "llamaindex";
const nodeParser = new MarkdownNodeParser();
const nodes = nodeParser.getNodesFromDocuments([
new Document({
text: `# Main Header
Main content
# Header 2
Header 2 content
## Sub-header
Sub-header content
`,
}),
]);
The output metadata will be something like:
[
TextNode {
id_: '008e41a8-b097-487c-bee8-bd88b9455844',
metadata: { 'Header 1': 'Main Header' },
excludedEmbedMetadataKeys: [],
excludedLlmMetadataKeys: [],
relationships: { PARENT: [Array] },
hash: 'KJ5e/um/RkHaNR6bonj9ormtZY7I8i4XBPVYHXv1A5M=',
text: 'Main Header\nMain content',
textTemplate: '',
metadataSeparator: '\n'
},
TextNode {
id_: '0f5679b3-ba63-4aff-aedc-830c4208d0b5',
metadata: { 'Header 1': 'Header 2' },
excludedEmbedMetadataKeys: [],
excludedLlmMetadataKeys: [],
relationships: { PARENT: [Array] },
hash: 'IP/g/dIld3DcbK+uHzDpyeZ9IdOXY4brxhOIe7wc488=',
text: 'Header 2\nHeader 2 content',
textTemplate: '',
metadataSeparator: '\n'
},
TextNode {
id_: 'e81e9bd0-121c-4ead-8ca7-1639d65fdf90',
metadata: { 'Header 1': 'Header 2', 'Header 2': 'Sub-header' },
excludedEmbedMetadataKeys: [],
excludedLlmMetadataKeys: [],
relationships: { PARENT: [Array] },
hash: 'B3kYNnxaYi9ghtAgwza0ZEVKF4MozobkNUlcekDL7JQ=',
text: 'Sub-header\nSub-header content',
textTemplate: '',
metadataSeparator: '\n'
}
]