Download Hugging Face Models and Datasets Quickly and Conveniently
2024-09-13 06:00:05, Python资料
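Method 1: A CLI tool for downloading Hugging Face models and datasets with aria2/wget + git
Source: https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f.
How to use: copy hfd.sh to your machine, then follow the reference commands below to download a dataset or a model.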
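🤗 Huggingface Model Downloader
The official huggingface-cli lacks multi-threaded download support, and hf_transfer has inadequate error handling. This command-line tool works around both by using wget or aria2 to fetch the LFS files and git clone to fetch everything else.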
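To make the split between "LFS files" and "everything else" concrete: when a repository is cloned with GIT_LFS_SKIP_SMUDGE=1 (as the script further down does), each large file is left on disk as a small text pointer instead of its real content. The sketch below is only an illustration of how such a pointer can be recognized; the script itself relies on `git lfs ls-files` instead. The function name `parse_lfs_pointer` and the sample oid and size are made up for this example.

```python
# Illustrative sketch: recognize a Git LFS pointer file by its first line.
# parse_lfs_pointer and the sample values are hypothetical, not part of hfd.sh.
LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text):
    """Return {'oid': ..., 'size': ...} if text is an LFS pointer, else None."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != LFS_POINTER_PREFIX:
        return None
    # Remaining lines are "key value" pairs such as "oid sha256:..." and "size N".
    fields = dict(line.split(" ", 1) for line in lines[1:] if " " in line)
    if "oid" not in fields or "size" not in fields:
        return None
    return {"oid": fields["oid"], "size": int(fields["size"])}


sample = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393\n"
    "size 12345\n"
)
print(parse_lfs_pointer(sample))
print(parse_lfs_pointer('{"not": "a pointer"}'))
```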
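Features
- ⏯️ Resume from breakpoint: you can press Ctrl+C at any time and re-run the command to resume.
- 🚀 Multi-threaded download: uses multiple threads to speed up the download.
- 🚫 File exclusion: use --exclude or --include to skip or select files, which saves time on models published in duplicate formats (e.g., *.bin or *.safetensors).
- 🔐 Authentication support: for gated models that require a Hugging Face login, pass --hf_username and --hf_token.
- 🪞 Mirror site support: set the HF_ENDPOINT environment variable.
- 🌍 Proxy support: set the HTTPS_PROXY environment variable.
- 📦 Simple: depends only on git and aria2c/wget (the version of the script listed below also checks for curl and git-lfs).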
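The --include/--exclude behaviour can be sketched in Python as follows. This is a rough analogue, not the script's code: the script compares each LFS file path against a single shell glob, and `fnmatchcase` is used here as a stand-in because its `*` also matches `/`. The helper name `select_files` is an assumption made for this example.

```python
# Rough Python analogue of the script's include/exclude filtering.
# select_files is a hypothetical helper; the script applies this only to LFS files.
from fnmatch import fnmatchcase


def select_files(files, include=None, exclude=None):
    """Keep files that match `include` (if given) and do not match `exclude`."""
    selected = []
    for name in files:
        if include and not fnmatchcase(name, include):
            continue  # an include pattern is set and this file does not match it
        if exclude and fnmatchcase(name, exclude):
            continue  # this file matches the exclude pattern
        selected.append(name)
    return selected


files = ["flax_model.msgpack", "model.safetensors", "onnx/decoder_model.onnx"]
print(select_files(files, exclude="*.safetensors"))
print(select_files(files, include="onnx/*"))
```

When passing such patterns on the command line, quoting them (for example "*.safetensors") keeps the shell from expanding them against files in the current directory.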
Usage
First, download hfd.sh or clone this repository, then grant the script execute permission.
chmod a+x hfd.sh
For convenience, you can create an alias:
alias hfd="$PWD/hfd.sh"
Usage instructions:
$ ./hfd.sh -h
Usage:
hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]
Description:
Downloads a model or dataset from Hugging Face using the provided repo ID.
Parameters:
repo_id The Hugging Face repo ID in the format 'org/repo_name'.
--include (Optional) Flag to specify a string pattern to include files for downloading.
--exclude (Optional) Flag to specify a string pattern to exclude files from downloading.
include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
--hf_username (Optional) Hugging Face username for authentication. **NOT EMAIL**.
--hf_token (Optional) Hugging Face token for authentication.
--tool (Optional) Download tool to use. Can be aria2c (default) or wget.
-x (Optional) Number of download threads for aria2c. Defaults to 4.
--dataset (Optional) Flag to indicate downloading a dataset.
--local-dir (Optional) Local directory path where the model or dataset will be stored.
Example:
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset
Download a model:
hfd bigscience/bloom-560m
Download a model that requires login:
Get a Hugging Face token from https://huggingface.co/settings/tokens, then run:
hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN
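For the file downloads themselves, the script passes the token to wget/aria2c as an `Authorization: Bearer ...` HTTP header (the username is only used in the git clone URL). A minimal sketch of the same header with Python's standard library, assuming a placeholder token and a hypothetical helper name `authorized_request`; no request is actually sent here:

```python
# Minimal sketch: attach a Hugging Face token as a Bearer header.
# authorized_request is a hypothetical helper; the token is a placeholder.
import urllib.request


def authorized_request(url, token=None):
    """Build a Request object, adding the Authorization header when a token is given."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


url = "https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack"
req = authorized_request(url, token="YOUR_HF_TOKEN")
print(req.get_header("Authorization"))
```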
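Download a model and exclude certain files (e.g., .safetensors); quote the pattern so the shell does not expand it:
hfd bigscience/bloom-560m --exclude "*.safetensors"
Download with aria2c and multiple threads (aria2c with 4 threads is the default, so no extra flags are needed):
hfd bigscience/bloom-560m
Output:
During the download, the file URLs are displayed: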
$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
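The URLs in this output follow a fixed pattern that the script builds from the endpoint, the repo ID and the file path. A small sketch of that construction, assuming a hypothetical helper `resolve_url`; the `revision` parameter is an addition for illustration (the script always uses main), and the dataset file name in the second call is made up:

```python
# Sketch of how the download URLs are assembled.
# resolve_url is a hypothetical helper; hfd.sh hard-codes the "main" revision.
import os

DEFAULT_ENDPOINT = "https://huggingface.co"


def resolve_url(repo_id, filename, dataset=False, revision="main", endpoint=None):
    """Return <endpoint>/[datasets/]<repo_id>/resolve/<revision>/<filename>."""
    endpoint = endpoint or os.environ.get("HF_ENDPOINT", DEFAULT_ENDPOINT)
    prefix = "datasets/" if dataset else ""
    return f"{endpoint.rstrip('/')}/{prefix}{repo_id}/resolve/{revision}/{filename}"


print(resolve_url("bigscience/bloom-560m", "flax_model.msgpack",
                  endpoint="https://huggingface.co"))
print(resolve_url("lavita/medical-qa-shared-task-v1-toy", "example.parquet",
                  dataset=True, endpoint="https://hf-mirror.com"))
```

Setting HF_ENDPOINT to a mirror only changes the first component of the URL; the rest of the path stays the same.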
# Install packages
apt update
apt-get install aria2
apt-get install iftop
apt-get install git-lfs
# Reference command
bash /xxx/xxx/hfd.sh mmaaz60/ActivityNet-QA-Test-Videos --tool aria2c -x 16 --dataset --local-dir /xxx/xxx/ActivityNet
hfd.sh
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include      (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude      (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username  (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token     (Optional) Hugging Face token for authentication.
  --tool         (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x             (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset      (Optional) Flag to indicate downloading a dataset.
  --local-dir    (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
# This copy of the script defaults to the hf-mirror.com mirror; set HF_ENDPOINT to override.
HF_ENDPOINT=${HF_ENDPOINT:-"https://hf-mirror.com"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v "$1" &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi

echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"

    # Quote the URL and target directory so paths containing spaces survive word splitting
    GIT_LFS_SKIP_SMUDGE=1 git clone "$REPO_URL" "$LOCAL_DIR" && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    # Empty the LFS pointer files so the download tool can write the real content
    while IFS= read -r file; do
        truncate -s 0 "$file"
    done <<< "$(git lfs ls-files | cut -d ' ' -f 3-)"
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls

while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    # Files filtered out by --include/--exclude are printed as commented-out commands
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"
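The script gets its list of LFS files from `git lfs ls-files | cut -d ' ' -f 3-`, that is, it drops the first two space-separated fields (the short object ID and the `*`/`-` marker) and keeps the rest of the line as the path. A Python sketch of the same parsing, with a hypothetical helper name and made-up sample output; unlike cut, it simply skips lines that have fewer than three fields:

```python
# Sketch: extract file paths from `git lfs ls-files` output,
# mirroring `cut -d ' ' -f 3-`. lfs_paths and the sample text are hypothetical.
def lfs_paths(ls_files_output):
    """Return the path part of each '<oid> <marker> <path>' line."""
    paths = []
    for line in ls_files_output.splitlines():
        parts = line.split(" ", 2)  # split at most twice so spaces in paths survive
        if len(parts) == 3:
            paths.append(parts[2])
    return paths


sample = (
    "f9ae83a6a1 * flax_model.msgpack\n"
    "a2c9d5d4b2 - onnx/decoder model.onnx\n"
)
print(lfs_paths(sample))
```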
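Method 2: Model download (personal usage notes)
This code does not preserve the repository's directory structure; see the improved version below.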
import datetime
import os
import threading

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log its start and end time
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repo to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'

    # Fetch the repo metadata
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Collect the file names
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Get the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Resumable download command (every file lands directly in save_path)
        cmds.append(f'wget -c {url} -P {save_path}')
    print(cmds)

    print("Program started at %s" % datetime.datetime.now())
    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())
Preserving the directory structure
import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log its start and end time
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repo to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'
    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the repo metadata
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Collect the file names
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Get the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # Resumable download command, saved into the matching subdirectory
        cmds.append(f'wget -c {url} -P {local_dir}')
    print(cmds)

    print("Program started at %s" % datetime.datetime.now())
    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())
Dataset download
import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import HfApi, hf_hub_url
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log its start and end time
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # ID of the dataset to download
    dataset_id = "openai/webtext"
    # Local storage path
    save_path = './webtext'
    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the dataset metadata (the keyword argument is repo_id, not dataset_id)
    _api = HfApi()
    dataset_info = _api.dataset_info(
        repo_id=dataset_id,
        revision='main',
        token=None,
    )

    # Collect the file names
    filtered_dataset_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in dataset_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_dataset_files:
        # Get the download URL; repo_type="dataset" adds the datasets/ prefix
        url = hf_hub_url(repo_id=dataset_id, filename=file, repo_type="dataset")
        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # Resumable download command, saved into the matching subdirectory
        cmds.append(f'wget -c {url} -P {local_dir}')
    print(cmds)

    print("Program started at %s" % datetime.datetime.now())
    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    print("Program finished at %s" % datetime.datetime.now())
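Limitations
Repositories that require authorization are not supported.
The script starts one thread per file, so a repository with many files can spawn a very large number of threads.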
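Both limitations can be softened with small changes. The sketch below is one possible approach rather than a tested replacement for the scripts above: it caps the number of simultaneous downloads with a thread pool and optionally passes a token to wget as a Bearer header (the same header hfd.sh uses). The helper names `build_wget_cmd` and `run_all` and the default of 4 workers are assumptions; the demo at the end runs harmless no-op commands in place of real wget calls.

```python
# Sketch: bounded concurrency plus optional token for the wget-based scripts.
# build_wget_cmd and run_all are hypothetical helpers, not part of the original code.
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_wget_cmd(url, local_dir, token=None):
    """Return a resumable wget command as an argument list (no shell quoting needed)."""
    cmd = ["wget", "-c", url, "-P", local_dir]
    if token:
        cmd.insert(1, f"--header=Authorization: Bearer {token}")
    return cmd


def run_all(cmds, max_workers=4):
    """Run the commands with at most max_workers in flight; return their exit codes."""
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, cmds))


# Demo with no-op commands standing in for wget, so nothing is downloaded here.
codes = run_all([[sys.executable, "-c", "pass"]] * 3, max_workers=2)
print(codes)
```

In the scripts above, this would mean building each entry of cmds with build_wget_cmd(url, local_dir, token) and replacing the thread-per-command loop with a single run_all(cmds) call.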