lmstudio.js Code Examples - SDK (TypeScript) | LM Studio Docs

Heads up

The content below has not yet been updated to reflect the changes in a newer SDK release and still refers to the 0.0.12 API.

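Please bear with us while the published documentation and headers are updated 🐻👾🙏.

The following examples show how to use LM Studio's TypeScript client SDK to load models, unload them, generate text, and perform related operations.

Loading an LLM and Generating Text

This example loads the lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF model and uses it to predict text.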

import { LMStudioClient } from "@lmstudio/sdk";

async function main() {
  const client = new LMStudioClient();

  // Load a model
  const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
    config: { gpuOffload: "max" },
  });

  // Create a text completion prediction
  const prediction = llama3.complete("The meaning of life is");

  // Stream the response
  for await (const text of prediction) {
    process.stdout.write(text);
  }
}

main();

Pro tip

process.stdout.write is a Node.js-specific function that prints text without a trailing newline.

In a browser, you may need to do something like the following instead.

// Get the element where you want to display the output
const outputElement = document.getElementById("output");

for await (const text of prediction) {
  outputElement.textContent += text;
}

Using a Non-Default LM Studio Server Port

This example shows how to connect to an LM Studio instance running on a different port (for example, 8080).

import { LMStudioClient } from "@lmstudio/sdk";

async function main() {
  const client = new LMStudioClient({
    baseUrl: "ws://127.0.0.1:8080",
  });

  // client.llm.load(...);
}

main();
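If the port comes from configuration rather than being hard-coded, the URL can be assembled by a small helper. The following is a minimal sketch; lmStudioBaseUrl and portFromSetting are hypothetical names, not part of the SDK, and the fallback of 8080 is simply the example port used above.

```typescript
// Hypothetical helper (not part of the SDK): builds the WebSocket URL
// that is passed as `baseUrl` in the example above.
function lmStudioBaseUrl(port: number, host: string = "127.0.0.1"): string {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid port: ${port}`);
  }
  return `ws://${host}:${port}`;
}

// Hypothetical helper: parses a port from a settings string and falls back
// to the example port (8080) when the value is missing or not a number.
function portFromSetting(value: string | undefined, fallback: number = 8080): number {
  const parsed = Number.parseInt(value ?? "", 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

console.log(lmStudioBaseUrl(portFromSetting("8080"))); // ws://127.0.0.1:8080

// Usage with the SDK, assuming the constructor option shown above:
// const client = new LMStudioClient({ baseUrl: lmStudioBaseUrl(8080) });
```

Validating the port up front gives a clear error message before any connection attempt is made.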

Keeping Models Loaded After the Client Exits (Daemon Mode)

By default, when a client disconnects from LM Studio, all models loaded by that client are unloaded. You can prevent this by setting the noHup option to true.

await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: { gpuOffload: "max" },
  noHup: true,
});

// The model stays loaded even after the client disconnects

Giving a Loaded Model a Friendly Name

You can set an identifier when loading a model. The identifier can be used later to refer to the model.

await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: { gpuOffload: "max" },
  identifier: "my-model",
});

// You can refer to the model later using the identifier
const myModel = await client.llm.get("my-model");
// myModel.complete(...);

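Loading a Model with a Custom Configuration

By default, a model's load configuration comes from the preset associated with the model (which can be changed on the "My Models" page in LM Studio). To override individual settings, pass a config object when loading the model.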

const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: {
    contextLength: 1024,
    gpuOffload: 0.5, // Offloads 50% of the computation to the GPU
  },
});

// llama3.complete(...);

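Loading a Model with a Specific Preset

A preset determines the default load configuration and the default inference configuration for a model. By default, the preset associated with the model is used (this can be changed on the "My Models" page in LM Studio). You can use a different preset by specifying the preset option.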

const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: { gpuOffload: "max" }, // Overrides the preset
  preset: "My ChatML",
});

Custom Loading Progress

You can track model loading progress by providing an onProgress callback.

const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  config: { gpuOffload: "max" },
  verbose: false, // Disables the default progress logging
  onProgress: (progress) => {
    console.log(`Progress: ${(progress * 100).toFixed(1)}%`);
  },
});
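If the callback fires often, logging every update can flood the console. The sketch below wraps the logging in a helper that only reports when progress has advanced by a given step. makeSteppedProgressLogger is a hypothetical name, and the sketch assumes, as the example above suggests, that progress is a fraction between 0 and 1.

```typescript
// Hypothetical helper (not part of the SDK): returns a callback that reports
// progress only when it has advanced by at least `stepPercent` since the
// last report, plus one final report once loading reaches 100%.
function makeSteppedProgressLogger(
  stepPercent: number,
  log: (message: string) => void = console.log,
): (progress: number) => void {
  let lastReported = -Infinity;
  return (progress: number) => {
    const percent = progress * 100; // assumes progress is a fraction in [0, 1]
    const advancedEnough = percent - lastReported >= stepPercent;
    const justFinished = progress >= 1 && lastReported < 100;
    if (advancedEnough || justFinished) {
      lastReported = percent;
      log(`Progress: ${percent.toFixed(1)}%`);
    }
  };
}

// Simulated progress updates; only 0.0%, 30.0%, 60.0%, and 100.0% are logged.
const demoLogger = makeSteppedProgressLogger(25);
for (const value of [0, 0.1, 0.3, 0.5, 0.6, 1]) {
  demoLogger(value);
}

// Usage with the SDK, assuming the options shown above:
// await client.llm.load(path, { verbose: false, onProgress: makeSteppedProgressLogger(10) });
```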

Listing All Models That Can Be Loaded

To find all models available for loading, use the listDownloadedModels method on the system object.

const downloadedModels = await client.system.listDownloadedModels();
const downloadedLLMs = downloadedModels.filter((model) => model.type === "llm");

// Load the first model
const model = await client.llm.load(downloadedLLMs[0].path);
// model.complete(...);
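Indexing the first element fails when no LLM has been downloaded. As a rough sketch, a lookup helper can return undefined instead, so the caller can handle that case. findDownloadedLLM and the DownloadedModelLike interface are hypothetical; they assume only the type and path fields used in the example above, and the sample entries are made up.

```typescript
// Minimal shape of the entries used above. The real objects returned by
// listDownloadedModels() may carry more fields.
interface DownloadedModelLike {
  type: string;
  path: string;
}

// Hypothetical helper: returns the first downloaded LLM whose path contains
// `keyword` (case-insensitive), or undefined when nothing matches.
function findDownloadedLLM(
  models: DownloadedModelLike[],
  keyword: string,
): DownloadedModelLike | undefined {
  const needle = keyword.toLowerCase();
  return models.find(
    (model) => model.type === "llm" && model.path.toLowerCase().includes(needle),
  );
}

// Made-up sample data standing in for the result of listDownloadedModels().
const sampleModels: DownloadedModelLike[] = [
  { type: "other", path: "example/some-other-model" },
  { type: "llm", path: "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF" },
];

const match = findDownloadedLLM(sampleModels, "llama-3");
if (match === undefined) {
  console.log("No matching LLM has been downloaded");
} else {
  console.log(`Would load: ${match.path}`);
  // const model = await client.llm.load(match.path);
}
```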

Canceling a Load

You can cancel a load by using an AbortController.

const controller = new AbortController();

try {
  const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
    signal: controller.signal,
  });
  // llama3.complete(...);
} catch (error) {
  console.error(error);
}

// Somewhere else in your code:
controller.abort();

Info

AbortController is a standard JavaScript API for canceling asynchronous operations. It is supported in modern browsers and Node.js. For more information, see the MDN Web Docs.
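When a load is wrapped in try/catch, it can be useful to tell a deliberate cancellation apart from a genuine failure. The source does not say which error the SDK throws on abort, so this sketch checks the standard signal.aborted flag instead of inspecting the error. describeLoadFailure is a hypothetical helper.

```typescript
// Hypothetical helper: classifies an error caught around an abortable call
// by looking at the AbortSignal rather than at the error itself.
function describeLoadFailure(error: unknown, signal: AbortSignal): string {
  if (signal.aborted) {
    return "Load was cancelled";
  }
  const message = error instanceof Error ? error.message : String(error);
  return `Load failed: ${message}`;
}

const demoController = new AbortController();
console.log(demoController.signal.aborted); // false
demoController.abort();
console.log(demoController.signal.aborted); // true
console.log(describeLoadFailure(new Error("rejected"), demoController.signal)); // Load was cancelled

// Usage inside the catch block of a load call that was given the same signal:
// } catch (error) {
//   console.error(describeLoadFailure(error, controller.signal));
// }
```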

Unloading a Model

You can unload a model by calling the unload method.

const llama3 = await client.llm.load("lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF", {
  identifier: "my-model",
});

// ...Do stuff...

await client.llm.unload("my-model");

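Note: By default, all models loaded by a client are unloaded when the client disconnects, so there is no need to unload them manually unless you need precise control over a model's lifetime.

Pro tip

To keep a model loaded after the client disconnects, set the noHup option to true when loading the model.

Using a Specific Loaded Model

To look up an already-loaded model by its identifier, use the following.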

const myModel = await client.llm.get({ identifier: "my-model" });
// Or just
const myModel = await client.llm.get("my-model");

// myModel.complete(...);

To look up an already-loaded model by its path, use the following.

// Matches any quantization
const llama3 = await client.llm.get({ path: "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF" });

// Or if a specific quantization is desired:
const llama3 = await client.llm.get({
  path: "lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
});

// llama3.complete(...);

Using Any Loaded Model

If you do not need a particular model and simply want to use any model that is already loaded, pass an empty object to client.llm.get.

const anyModel = await client.llm.get({});
// anyModel.complete(...);

Listing All Loaded Models

To list all loaded models, use the client.llm.listLoaded method.

const loadedModels = await client.llm.listLoaded();

if (loadedModels.length === 0) {
  throw new Error("No models loaded");
}

// Use the first one
const firstModel = await client.llm.get({ identifier: loadedModels[0].identifier });
// firstModel.complete(...);

Text Completion

To perform text completion, use the complete method.

const prediction = model.complete("The meaning of life is");

for await (const text of prediction) {
  process.stdout.write(text);
}

By default, the inference parameters in the preset are used for the prediction. You can override them like this.

const prediction = anyModel.complete("Meaning of life is", {
  contextOverflowPolicy: "stopAtLimit",
  maxPredictedTokens: 100,
  prePrompt: "Some pre-prompt",
  stopStrings: ["\n"],
  temperature: 0.7,
});

// ...Do stuff with the prediction...

Chat Completion

To carry out a conversation, use the respond method.

const prediction = anyModel.respond([
  { role: "system", content: "Answer the following questions." },
  { role: "user", content: "What is the meaning of life?" },
]);

for await (const text of prediction) {
  process.stdout.write(text);
}

Similarly, you can override the inference parameters for a conversation (note that the available options differ from those for text completion).

const prediction = anyModel.respond(
  [
    { role: "system", content: "Answer the following questions." },
    { role: "user", content: "What is the meaning of life?" },
  ],
  {
    contextOverflowPolicy: "stopAtLimit",
    maxPredictedTokens: 100,
    stopStrings: ["\n"],
    temperature: 0.7,
    inputPrefix: "Q: ",
    inputSuffix: "\nA:",
  },
);

// ...Do stuff with the prediction...

Heads up

LLMs are stateless. They do not remember or retain information from previous inputs, so you must always provide the full history/context when making a prediction.
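Because the model keeps no state, a multi-turn chat has to carry its own history and resend all of it on every call. The following is a minimal sketch of that bookkeeping. appendTurn and the ChatMessage type are hypothetical, and the "assistant" role for model replies is an assumption; only "system" and "user" appear in the examples above.

```typescript
// Assumed message shape. The "assistant" role is an assumption, since only
// "system" and "user" appear in the examples above.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Hypothetical helper: returns a new history with one completed exchange
// appended, leaving the original array untouched.
function appendTurn(
  history: readonly ChatMessage[],
  userText: string,
  assistantText: string,
): ChatMessage[] {
  return [
    ...history,
    { role: "user", content: userText },
    { role: "assistant", content: assistantText },
  ];
}

const initialHistory: ChatMessage[] = [
  { role: "system", content: "Answer the following questions." },
];

// After each exchange, store both sides so the next request contains them.
const afterFirstTurn = appendTurn(initialHistory, "What is the meaning of life?", "(model reply)");
console.log(afterFirstTurn.length); // 3

// Usage with the SDK, assuming the respond method shown above:
// const { content } = await anyModel.respond([...afterFirstTurn, { role: "user", content: "Why?" }]);
```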

Getting Prediction Statistics

To get prediction statistics, await the prediction object to obtain a PredictionResult. The statistics are available through its stats property.

const prediction = model.complete("The meaning of life is");

for await (const text of prediction) {
  process.stdout.write(text);
}

const { stats } = await prediction;
console.log(stats);

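Info

If you have already consumed the prediction stream, awaiting the prediction object incurs no additional wait, because the result is cached inside the prediction object.

Conversely, if you only care about the final result, you do not need to iterate over the stream. You can await the prediction object directly to get the final result.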

const prediction = model.complete("The meaning of life is");
const result = await prediction;
const content = result.content;
const stats = result.stats;

// Or just:

const { content, stats } = await model.complete("The meaning of life is");

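Producing JSON (Structured Output)

LM Studio supports structured prediction, which forces the model to produce content that conforms to a specific structure. To enable it, set the structured field. It is available for both the complete and respond methods.

Here is an example of how to use structured prediction.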

const prediction = model.complete("Here is a joke in JSON:", {
  maxPredictedTokens: 100,
  structured: { type: "json" },
});

const result = await prediction;
try {
  // Although the LLM is guaranteed to only produce valid JSON, when it is interrupted, the
  // partial result might not be. Always check for errors. (See below)
  const parsed = JSON.parse(result.content);
  console.info(parsed);
} catch (e) {
  console.error(e);
}

Sometimes arbitrary JSON is not enough and you want to enforce a specific JSON schema. To do so, provide a JSON schema in the structured field. For more about JSON Schema, see json-schema.org.

const schema = {
  type: "object",
  properties: {
    setup: { type: "string" },
    punchline: { type: "string" },
  },
  required: ["setup", "punchline"],
};

const prediction = llama3.complete("Here is a joke in JSON:", {
  maxPredictedTokens: 100,
  structured: { type: "json", jsonSchema: schema },
});

const result = await prediction;
try {
  const parsed = JSON.parse(result.content);
  console.info("The setup is", parsed.setup);
  console.info("The punchline is", parsed.punchline);
} catch (e) {
  console.error(e);
}

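Heads up

Although the model is forced to generate predictions that conform to the specified structure, a prediction can be interrupted (for example, if the user stops it). In that case, the partial result may not conform to the structure. Always check the prediction result before using it, for example by wrapping JSON.parse in a try-catch block.
In certain cases, the model may get stuck. For example, when forced to generate valid JSON, it may produce an opening brace { but never the closing brace }. The prediction then keeps running until the context length is reached, which can take a very long time. It is therefore recommended to always set a maxPredictedTokens limit. This also ties in with the point above.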
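The two points above can be combined into one defensive parsing step: treat the content as untrusted, catch parse errors, and confirm the expected fields before use. This is only a sketch; parseJoke and the Joke interface are hypothetical and mirror the setup/punchline schema from the earlier example.

```typescript
// Mirrors the setup/punchline schema from the example above.
interface Joke {
  setup: string;
  punchline: string;
}

// Hypothetical helper: returns a Joke when `content` is complete, valid JSON
// with both string fields, and null otherwise (for example, when the
// prediction was cut off by maxPredictedTokens or canceled).
function parseJoke(content: string): Joke | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(content);
  } catch {
    return null; // truncated or otherwise invalid JSON
  }
  if (typeof parsed !== "object" || parsed === null) {
    return null;
  }
  const candidate = parsed as Record<string, unknown>;
  if (typeof candidate.setup !== "string" || typeof candidate.punchline !== "string") {
    return null;
  }
  return { setup: candidate.setup, punchline: candidate.punchline };
}

console.log(parseJoke('{"setup":"Why?","punchline":"Because."}') !== null); // true
console.log(parseJoke('{"setup":"Why?","punch')); // null

// Usage with the SDK, assuming the result shape shown above:
// const joke = parseJoke((await prediction).content);
```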

Canceling a Prediction

You can cancel a prediction by calling the cancel method on the prediction object.

const prediction = model.complete("The meaning of life is");

// ...Do stuff...

prediction.cancel();

When a prediction is canceled, it stops normally, but its stopReason is set to "userStopped". You can detect cancellation like this.

for await (const text of prediction) {
  process.stdout.write(text);
}
const { stats } = await prediction;
if (stats.stopReason === "userStopped") {
  console.log("Prediction was canceled by the user");
}