Running a Text Generation Script with the Llama 2 Model

Purpose

Build and use a chat environment with Llama 2, the large language model developed by Meta and distributed through Hugging Face.

References

https://pc.watch.impress.co.jp/docs/column/nishikawa/1519390.html

Main variants of the Llama 2 model

Llama 2 is available in 7B, 13B, and 70B parameter sizes.
This document evaluates the following models.

llama-2-7b-chat.ipynb
- github
  - https://github.com/camenduru/text-generation-webui-colab/blob/main/llama-2-7b-chat.ipynb
llama-2-13b-chat-GPTQ-4bit.ipynb
- github
  - https://github.com/camenduru/text-generation-webui-colab/blob/main/llama-2-13b-chat-GPTQ-4bit.ipynb
japanese-elyza-llama2-7b-instruct.ipynb (a model given additional training to extend its Japanese-language ability)
- github
  - https://github.com/camenduru/japanese-text-generation-webui-colab
llama-2-70b
- Hugging Face
  - https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
Llama-2-7b-chat-hf
- Hugging Face
  - https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
ELYZA-japanese-Llama-2-7b-fast-instruct
- Hugging Face
  - https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b-fast-instruct

Verification results

Model	Parameters	Verification method	Feedback
llama-2-7b-chat.ipynb	7B (7 billion)	Google Colab (https://colab.research.google.com/github/camenduru/text-generation-webui-colab/blob/main/llama-2-7b-chat.ipynb)	Response quality is good.
llama-2-13b-chat-GPTQ-4bit.ipynb	13B (13 billion)	Google Colab (https://colab.research.google.com/github/camenduru/text-generation-webui-colab/blob/main/llama-2-13b-chat-GPTQ-4bit.ipynb)	GPU RAM usage rises to 9.4 GB of 15 GB once in use. Response quality is good. Questions asked in Japanese are answered in English. Answers to technical questions are also accurate.
japanese-elyza-llama2-7b-instruct.ipynb	7B (7 billion)	Google Colab (https://colab.research.google.com/github/camenduru/japanese-text-generation-webui-colab/blob/main/japanese-elyza-llama2-7b-instruct.ipynb)	GPU RAM usage rises to 10.6 GB of 15 GB once in use. Response quality is good. Responses are typed out slowly. Replies are in Japanese.
llama-2-70b	70B (70 billion)	Browser-hosted demo site (https://llama.replicate.dev/)	Response quality is good.
llama-2-70b	70B (70 billion)	Browser-hosted demo site (https://www.llama2.ai/)	Response quality is good. Responses are slow.
Llama-2-7b-chat-hf	7B (7 billion)	Local machine	The text generation script that specifies this model failed.
ELYZA-japanese-Llama-2-7b-fast-instruct	7B (7 billion)	Local machine	Response quality is good. Accuracy is lower than ChatGPT or Bard.

How to launch the models using Google Colab

Google Colab

Multiple models cannot be run at the same time. Disconnect the active session before setting up the next model.

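Using the models in a local environment

Download the model to the local environment, then specify the model in Python to generate text.

To use a different model, change "model_name" in the Python scripts described later.

Text generation server specifications

Verification was carried out on the following machines.

Host	CPU	Memory	GPU	Result
192.168.111.223	24 cores	32 GB	[AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M]	Response quality is good.
192.168.112.210	8 cores	8 GB	Intel Corporation WhiskeyLake-U GT2 [UHD Graphics 620]	The process is killed due to insufficient memory.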
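
The out-of-memory result on the 8 GB host is consistent with a rough estimate of the weight size alone. The following is a minimal sketch, assuming memory use is dominated by the weights (parameter count times bytes per parameter) and ignoring activations and framework overhead. The function name and the rounded parameter counts are illustrative, not taken from the verification.

```python
# Rough estimate of the RAM needed just to hold model weights.
# Assumption: weights dominate; activations and framework overhead are ignored.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int4": 0.5}


def estimate_weight_gb(num_params: float, dtype: str = "float32") -> float:
    """Return the approximate weight size in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Rounded parameter counts for the three Llama 2 sizes.
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        for dtype in ("float32", "float16", "int4"):
            print(f"{name} {dtype}: ~{estimate_weight_gb(params, dtype):.1f} GB")
```

Under these assumptions, a 7B model loaded as float32 needs roughly 28 GB for the weights, which fits the 32 GB host but not the 8 GB one.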

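Requesting access to the models from Meta and Hugging Face

Download the model from the official Meta site

https://ai.meta.com/llama/

Account registration is required.

The URL needed when running the download script is sent by email.

Sign up for Hugging Face

https://huggingface.co/

Select the Read or Write role and issue a token; the token is then notified by email.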

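Downloading the Llama 2 model locally

During verification, an error occurred while downloading the model. Because the other models described later can be used instead, this section may be skipped.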

$ git clone https://github.com/facebookresearch/llama
$ cd llama

$ bash download.sh
Enter the URL from email
... Paste the URL from the reply email received when the Meta account was created, then press Enter

Example email

Model weights available:

Llama-2-7b
Llama-2-7b-chat
Llama-2-13b
Llama-2-13b-chat
Llama-2-70b
Llama-2-70b-chat
With each model download, you’ll receive a copy of the License and Acceptable Use Policy, and can find all other information on the model and code on GitHub.

How to download the models:

Visit the Llama repository in GitHub and follow the instructions in the README to run the download.sh script.
When asked for your unique custom URL, please insert the following:
https://download.llamameta.net/xxxxxxxxxx ★ the URL to use
Select which model weights to download
The unique custom URL provided will remain valid for model downloads for 24 hours, and requests can be submitted multiple times.

After entering the URL above, select the models to download.

Enter the list of models to download without spaces (7B, 13B, 70B, 7B-chat, 13B-chat, 70B-chat), or press Enter for all :
7B

Installing pip libraries in a virtual environment

Start from this step when specifying a different model in the script.

Create the virtual environment

$ cd llama
$ python3 -m venv myenv
$ source myenv/bin/activate

Install the pip libraries

$ sudo apt install python3-pip
$ pip3 install transformers sentencepiece accelerate huggingface-hub fugashi[unidic-lite] torch

Note: if a warning like the following appears, run the command as the root user.
WARNING: The directory '/home/adadmin/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
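
Before installing packages, it can help to confirm that the virtual environment is actually active, since a command run through sudo or as root may use the system Python rather than myenv. The following is a minimal sketch using only the standard library. The function name in_virtualenv is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside venv:", in_virtualenv())
```

If this reports False, packages installed with pip3 go to the system or user site-packages rather than to myenv.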

Log in to Hugging Face

(myenv) $ python3 -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('MY_HUGGINGFACE_TOKEN_HERE')"
(myenv) $ huggingface-cli login

Token:
... Enter the token issued on Hugging Face
Add token as git credential? (Y/n) y
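
The login can also be done from Python without typing the token into a command line. The following is a minimal sketch, assuming the token has been exported in an environment variable. The variable name HF_TOKEN and both helper names are choices made for this sketch. The login function is the one provided by huggingface_hub.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in the given environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a Hugging Face access token first")
    return token


def login_from_env(env_var: str = "HF_TOKEN") -> None:
    """Log in to the Hugging Face Hub with the token from the environment."""
    # Imported lazily so read_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_hf_token(env_var))
```

After exporting the token in the shell, calling login_from_env() stores the credential in the same way as the interactive prompt, and the token does not need to appear in the script itself.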

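Creating the text generation script

Create a script for text generation. It uses the model specified in model_name. To use a model hosted on the internet, specify "repository name/model name"; to use a repository downloaded locally, specify "local path/model name".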
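
The two forms of model_name can be told apart before any loading starts, which helps catch a mistyped local path early. The following is a minimal sketch. The helper name and its rules are assumptions of this sketch, not part of the Transformers API: an existing directory counts as local, and a bare name or an "owner/model" string counts as a Hub ID.

```python
from pathlib import Path


def describe_model_source(model_name: str) -> str:
    """Classify model_name as a local directory or a Hub repository ID."""
    if Path(model_name).is_dir():
        return "local"
    # Hub IDs are either a bare name or "owner/model"; anything with more
    # path segments that is not an existing directory is likely a typo.
    parts = model_name.split("/")
    if len(parts) in (1, 2) and all(parts):
        return "hub"
    raise ValueError(
        f"{model_name!r} is neither an existing directory nor a Hub ID"
    )


if __name__ == "__main__":
    for name in ("meta-llama/Llama-2-7b-chat-hf", "."):
        print(name, "->", describe_model_source(name))
```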

Example using "Llama-2-7b-chat-hf"

Create create-text.py

# Import AutoTokenizer and pipeline from the Transformers library
from transformers import AutoTokenizer, pipeline
import torch  # needed for the torch_dtype argument below

# Model to use
model_name = "meta-llama/Llama-2-7b-chat-hf"

# Load the tokenizer for the model (a tokenizer takes no torch_dtype)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set up the text generation pipeline (torch_dtype set to float32)
text_generator = pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.float32,
    device_map="auto",
)

# Set the prompt
prompt = """USER: What is a good way to learn data science?
SYSTEM:"""

# Run text generation
sequences = text_generator(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)

# Print each generated sequence
for seq in sequences:
    print(seq["generated_text"])
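
The prompt in the script uses plain USER: and SYSTEM: labels. Llama 2 chat models are generally prompted with [INST] and <<SYS>> markers instead. The following is a minimal sketch of a single-turn prompt builder in that format. The function name is illustrative, and the sketch assumes the tokenizer adds the BOS token itself.

```python
# Llama 2 chat markers for the instruction and the system prompt.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_llama2_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap one user turn (and an optional system prompt) in chat markers."""
    system = f"{B_SYS}{system_prompt}{E_SYS}" if system_prompt else ""
    # The BOS token is left for the tokenizer to add.
    return f"{B_INST} {system}{user_message.strip()} {E_INST}"


if __name__ == "__main__":
    print(build_llama2_prompt(
        "What is a good way to learn data science?",
        system_prompt="You are a helpful assistant.",
    ))
```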

Run the script

$ python3 create-text.py

(myenv) dmz@dmz:~/llama2/llama$ python3 create-text.py
Downloading (…)okenizer_config.json: 100%|██████████████████| 776/776 [00:00<00:00, 3.21MB/s]
Downloading tokenizer.model: 100%|████████████████████████| 500k/500k [00:00<00:00, 1.50MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████████| 1.84M/1.84M [00:01<00:00, 1.71MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████████████| 414/414 [00:00<00:00, 2.39MB/s]
Downloading (…)lve/main/config.json: 100%|██████████████████| 614/614 [00:00<00:00, 2.93MB/s]
Downloading (…)fetensors.index.json: 100%|██████████████| 26.8k/26.8k [00:00<00:00, 2.16MB/s]
Downloading (…)of-00002.safetensors: 100%|████████████| 9.98G/9.98G [1:48:03<00:00, 1.54MB/s]
Downloading (…)of-00002.safetensors: 100%|██████████████| 3.50G/3.50G [36:57<00:00, 1.58MB/s]
Downloading shards: 100%|██████████████████████████████████| 2/2 [2:25:02<00:00, 4351.05s/it]
Traceback (most recent call last):
  File "/home/dmz/llama2/llama/create-text.py", line 12, in <module>
    text_generator = pipeline(
  File "/home/dmz/llama2/llama/myenv/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 834, in pipeline
    framework, model = infer_framework_load_model(
  File "/home/dmz/llama2/llama/myenv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/home/dmz/llama2/llama/myenv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/dmz/llama2/llama/myenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/dmz/llama2/llama/myenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3633, in _load_pretrained_model
    offload_index = {
  File "/home/dmz/llama2/llama/myenv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3636, in <dictcomp>
    if param_device_map[p] == "disk"
KeyError: 'lm_head.weight'

As of 2023-10-04, the script fails with the error above.

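Example using the Japanese model "bert-large-japanese"

Complete the following steps described above before running this example:

Installing pip libraries in a virtual environment
Log in to Hugging Face

Create create-text_bert_1.py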


# Only the pipeline helper is needed for fill-mask inference
from transformers import pipeline

# Japanese BERT model on the Hugging Face Hub
model_name = "cl-tohoku/bert-large-japanese"

# Predict the token hidden behind [MASK] and print the candidates
unmasker = pipeline('fill-mask', model=model_name)
result = unmasker("今日の昼食は[MASK]でした。")
print(result)

Execution result


Some weights of the model checkpoint at cl-tohoku/bert-large-japanese were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'score': 0.04221123829483986, 'token': 32474, 'token_str': 'サラダ', 'sequence': '今日 の 昼食 は サラダ でし た 。'}, {'score': 0.036806270480155945, 'token': 18526, 'token_str': 'カレー', 'sequence': '今日 の 昼食 は カレー でし た 。'}, {'score': 0.031343549489974976, 'token': 31893, 'token_str': 'ご飯', 'sequence': '今日 の 昼食 は ご飯 でし た 。'}, {'score': 0.021632134914398193, 'token': 17540, 'token_str': '元気', 'sequence': '今日 の 昼食 は 元気 でし た 。'}, {'score': 0.020115580409765244, 'token': 23869, 'token_str': 'うどん', 'sequence': '今日 の 昼食 は うどん でし た 。'}]
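
The raw list printed above is hard to scan. The following is a minimal sketch of a formatter that keeps only the token and score of the top candidates. The helper name is illustrative, and the sample data is copied from the output above with the scores shortened to four decimals.

```python
def top_candidates(results, limit=3):
    """Return (token_str, score) pairs sorted by descending score."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:limit]]


# Sample taken from the fill-mask output above (scores shortened).
sample = [
    {"score": 0.0368, "token_str": "カレー"},
    {"score": 0.0422, "token_str": "サラダ"},
    {"score": 0.0313, "token_str": "ご飯"},
]

for token, score in top_candidates(sample):
    print(f"{token}\t{score}")
```

In the script itself, passing result to top_candidates in place of sample gives the same compact view.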

Create create-text_bert_2.py (the same test with a multilingual BERT model)


# Only the pipeline helper is needed for fill-mask inference
from transformers import pipeline

# Multilingual BERT model, used here for comparison
model_name = "bert-base-multilingual-uncased"

# Predict the token hidden behind [MASK] and print the candidates
unmasker = pipeline('fill-mask', model=model_name)
result = unmasker("今日の昼食は[MASK]でした。")
print(result)

Execution result


Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'score': 0.17987704277038574, 'token': 7753, 'token_str': '見', 'sequence': '今 日 の 昼 食 は 見 てした 。'}, {'score': 0.06706620752811432, 'token': 4080, 'token_str': '捨', 'sequence': '今 日 の 昼 食 は 捨 てした 。'}, {'score': 0.06436685472726822, 'token': 2073, 'token_str': '全', 'sequence': '今 日 の 昼 食 は 全 てした 。'}, {'score': 0.06041225045919418, 'token': 5216, 'token_str': '満', 'sequence': '今 日 の 昼 食 は 満 てした 。'}, {'score': 0.025420552119612694, 'token': 4518, 'token_str': '果', 'sequence': '今 日 の 昼 食 は 果 てした 。'}]

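Example using the Japanese model "ELYZA-japanese-Llama-2-7b"

Complete the following steps described above before running this example:

Installing pip libraries in a virtual environment
Log in to Hugging Face

Create create-text_elyza.py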


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Llama 2 chat prompt markers
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
# System prompt: "You are a sincere and excellent Japanese assistant."
DEFAULT_SYSTEM_PROMPT = "あなたは誠実で優秀な日本人のアシスタントです。"
# Request: a short story in which a crocodile goes to the seaside,
# befriends a seal, and finally returns home.
text = "ワニが海辺に行ってアザラシと友達になり、最終的には家に帰るというプロットの短編小説を書いてください。"

# Path to the locally downloaded model
model_name = "./models/ELYZA-japanese-Llama-2-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

# Move the model to the GPU when one is available
if torch.cuda.is_available():
    model = model.to("cuda")

# Build the prompt in the Llama 2 chat format
prompt = "{bos_token}{b_inst} {system}{prompt} {e_inst} ".format(
    bos_token=tokenizer.bos_token,
    b_inst=B_INST,
    system=f"{B_SYS}{DEFAULT_SYSTEM_PROMPT}{E_SYS}",
    prompt=text,
    e_inst=E_INST,
)

# Generate without tracking gradients
with torch.no_grad():
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=2048,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):], skip_special_tokens=True)
print(output)

Model location


(.venv) admin@host11:~/llama2_2th/text-generation-webui$ ll models/ELYZA-japanese-Llama-2-7b-instruct/
total 13163360
drwxrwxr-x 2 admin admin       4096 Nov  8 06:40 ./
drwxrwxr-x 3 admin admin       4096 Oct 10 23:59 ../
-rw-rw-r-- 1 admin admin        627 Oct 11 02:13 config.json
-rw-rw-r-- 1 admin admin       1385 Nov  8 06:40 create-text_elyza_localmodel.py
-rw-rw-r-- 1 admin admin        159 Oct 11 02:13 generation_config.json
-rw-rw-r-- 1 admin admin 9976570304 Oct 11 03:59 model-00001-of-00002.safetensors
-rw-rw-r-- 1 admin admin 3500294472 Oct 11 04:42 model-00002-of-00002.safetensors
-rw-rw-r-- 1 admin admin      23950 Oct 11 04:42 model.safetensors.index.json
-rw-rw-r-- 1 admin admin        437 Oct 11 04:42 special_tokens_map.json
-rw-rw-r-- 1 admin admin    1842866 Oct 11 04:42 tokenizer.json
-rw-rw-r-- 1 admin admin     499723 Oct 11 04:42 tokenizer.model
-rw-rw-r-- 1 admin admin        834 Oct 11 04:42 tokenizer_config.json
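
This document does not show how the directory above was populated. One way to fetch a model repository into such a directory might be snapshot_download from huggingface_hub. The following is a sketch under that assumption, not the method used in the verification. The helper names are illustrative, and the repository ID in the usage note is assumed from the directory name.

```python
from pathlib import Path


def local_model_dir(repo_id: str, base_dir: str = "models") -> Path:
    """Map a Hub repository ID such as 'owner/name' to base_dir/name."""
    return Path(base_dir) / repo_id.split("/")[-1]


def download_model(repo_id: str, base_dir: str = "models") -> Path:
    """Download every file of the repository into the local model directory."""
    # Imported lazily so local_model_dir works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, base_dir)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

For example, download_model("elyza/ELYZA-japanese-Llama-2-7b-instruct") would place the files under models/ELYZA-japanese-Llama-2-7b-instruct, which matches the model_name path used in create-text_elyza.py.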

Execution result


(.venv) admin@host11:~/llama2_2th/text-generation-webui$ python3 create-text_elyza.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.18s/it]
/home/admin/.local/lib/python3.11/site-packages/transformers/generation/utils.py:1473: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
ワニが海辺に行ってアザラシと友達になり、最終的には家に帰るというプロットの短編小説を以下に記述します。

ワニは海辺に来て、アザラシを見つけました。
アザラシはワニに対して恐怖を抱いていましたが、ワニは彼に対して友好的であり、彼は彼の家に招待しました。
アザラシはワニの家に泊まり、翌朝、ワニは彼を家に送りました。
アザラシは家に帰り、彼の友人にワニのことを話しました。
彼の友人は彼に対して、ワニは彼の家

In verification, the response quality was good.
