OpenWebUI & Monitor

A souped-up OpenWebUI with a user management system.

First we need to adjust the compose file: switch to host networking and drop the port-mapping logic. This simplifies communication between the services and, later on, lets us use the ufw firewall to block direct IP+port access, improving security.

services:
  postgre:
    image: docker.1panel.live/library/postgres:latest
    container_name: postgre
    restart: always
    environment:
      - POSTGRES_USER=st
      - POSTGRES_PASSWORD=STshentong
      - POSTGRES_DB=openwebui
    volumes:
      - /data/postgres_data:/var/lib/postgresql/data
      - /data/postgres-init:/docker-entrypoint-initdb.d
    network_mode: host # port 5432

  open-webui:
    image: docker.1panel.live/dyrnq/open-webui:latest
    container_name: open-webui
    volumes:
      - /data/open-webui:/app/backend/data
    restart: always
    environment:
      - DATABASE_URL=postgresql://st:STshentong@localhost:5432/openwebui # connect to the openwebui database
      - ENABLE_WEBSOCKET_SUPPORT=True
      - WEBSOCKET_REDIS_URL=redis://localhost:6379
      - REDIS_URL=redis://localhost:6379
      - WEBUI_NAME=ChatST
      - AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=1
      - AIOHTTP_CLIENT_TIMEOUT_OPENAI_MODEL_LIST=1
      - USER_AGENT=${USER_AGENT:-Mozilla/5.0 (compatible; OpenWebUI/1.0; +https://github.com/open-webui)}
    build:
      args:
        USER_AGENT: $USER_AGENT
    network_mode: host # port 8080
    depends_on:
      - postgre
      - redis

  redis:
    image: docker.1panel.live/library/redis:latest
    container_name: redis
    restart: always
    network_mode: host # port 6379

  watchtower:
    image: docker.1panel.live/containrrr/watchtower
    container_name: open-webui-watchtower
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    command: --interval 3600 open-webui
    depends_on:
      - open-webui
    network_mode: host

I've annotated each service's default port in this file; note that OpenWebUI's default port is 8080, not the 3000 it was previously mapped to. To prepare for the monitoring stack below, we also need a database init script that creates the database monitor requires. First create a directory /data/postgres-init, then inside it create a SQL file named init-extra-db.sql with the following contents:

-- Create the extra database
CREATE DATABASE openwebui_monitor;
-- Optional: grant it to a specific user here if needed
-- GRANT ALL PRIVILEGES ON DATABASE openwebui_monitor TO st;

With this in place, Postgres will create both required databases at startup (note that scripts in /docker-entrypoint-initdb.d only run on first startup, while the data directory is still empty). Also, in this compose file I keep all data under directories inside /data rather than in named Docker volumes as before, which makes it easier to manage.
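To double-check that both databases were created after the first startup, you can list them from inside the container (a quick sanity check, assuming the container name and credentials from the compose file above):

```shell
# List all databases in the postgre container; both "openwebui" and
# "openwebui_monitor" should appear in the output
docker exec -i postgre psql -U st -d postgres -c '\l'
```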

Next we deploy the monitoring system. Create a monitor directory inside /data and put two files in it. The first is the compose file; this system uses the database we just deployed:

services:
  monitor:
    image: docker.1panel.live/variantconst/openwebui-monitor:latest
    container_name: openwebui-monitor
    env_file:
      - .env # the environment variables below can be overridden in .env
    environment:
      # By default, connect to the Postgres container on the host network (localhost:5432)
      - POSTGRES_HOST=${POSTGRES_HOST:-localhost}
      - POSTGRES_PORT=${POSTGRES_PORT:-5432}
      # Defaults match the Postgres user and password from the first compose file
      - POSTGRES_USER=${POSTGRES_USER:-st}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-STshentong}
      # Default to the openwebui_monitor database
      - POSTGRES_DATABASE=${POSTGRES_DATABASE:-openwebui_monitor}
    restart: always
    network_mode: host # port 3000

The second is an environment file named .env (note the leading dot):

# OpenWebUI Configuration
OPENWEBUI_DOMAIN=http://127.0.0.1:8080
OPENWEBUI_API_KEY=sk-dc2c745c8cd94faea80a5f3d9b348086 # OpenWebUI API key for fetching model list

# Access Control
ACCESS_TOKEN=STshentong # Used for Monitor page login
API_KEY=STshentong0722 # Used for authentication when sending requests to Monitor

# Price Configuration (Optional, $/million tokens)
INIT_BALANCE=5 # Initial balance for users, optional

# PostgreSQL Database Configuration (Optional, configure these if using external database)
# POSTGRES_HOST=172.21.0.2
# POSTGRES_PORT=5432
# POSTGRES_USER=postgres
# POSTGRES_PASSWORD=openwebui
# POSTGRES_DATABASE=openwebui_monitor

Note that the first key here requires enabling API keys in OpenWebUI and generating one; it is used to fetch the model list. The second key is what the OpenWebUI frontend function uses to authenticate against this backend module, and you can make it up yourself. There is also an initial-balance setting, which I set to 5; the script below can then reset every user's balance back to 5 on a daily schedule:

#!/bin/bash

# Container name
CONTAINER_NAME="postgre"

# Database connection info
DB_USER="st"
DB_PASSWORD="STshentong"
DB_NAME="openwebui_monitor"

# SQL update statement
UPDATE_SQL="UPDATE users SET balance = 5;"

echo "Connecting to database ${DB_NAME} in container ${CONTAINER_NAME}..."

# Run psql inside the container via docker exec
# -U: user
# -d: database
# -c: SQL command to execute
# -e PGPASSWORD: pass the password into the container for non-interactive auth
# (use -i rather than -it, so the script also works from cron without a TTY)
docker exec -i -e PGPASSWORD="${DB_PASSWORD}" "${CONTAINER_NAME}" psql -U "${DB_USER}" -d "${DB_NAME}" -c "${UPDATE_SQL}"

# Check the result
if [ $? -eq 0 ]; then
  echo "Successfully reset every balance in the users table to 5."
else
  echo "Error while running the SQL update; please check the output above!"
fi
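To make the reset actually run daily, the script above can be scheduled with cron (the path /www/sh/reset_balance.sh is just an example; adjust to wherever you saved it):

```shell
chmod +x /www/sh/reset_balance.sh
# Edit root's crontab with: crontab -e
# then add the line below to run the reset at 00:00 every day:
# 0 0 * * * /www/sh/reset_balance.sh >> /var/log/reset_balance.log 2>&1
```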

Once everything is configured, start the monitoring system (configure OpenWebUI first so you can obtain the API key). After it is up, add the function in OpenWebUI (original link: https://github.com/VariantConst/OpenWebUI-Monitor/blob/main/resources/functions/openwebui_monitor.py ):

"""
title: Usage Monitor
author: VariantConst & OVINC CN
git_url: https://github.com/VariantConst/OpenWebUI-Monitor.git
version: 0.3.6
requirements: httpx
license: MIT
"""

import logging
import time
from typing import Dict, Optional
from httpx import AsyncClient
from pydantic import BaseModel, Field
import json


logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

TRANSLATIONS = {
    "en": {
        "request_failed": "Request failed: {error_msg}",
        "insufficient_balance": "Insufficient balance: Current balance `{balance:.4f}`",
        "cost": "Cost: ${cost:.4f}",
        "balance": "Balance: ${balance:.4f}",
        "tokens": "Tokens: {input}+{output}",
        "time_spent": "Time: {time:.2f}s",
        "tokens_per_sec": "{tokens_per_sec:.2f} T/s",
    },
    "zh": {
        "request_failed": "请求失败: {error_msg}",
        "insufficient_balance": "余额不足: 当前余额 `{balance:.4f}`",
        "cost": "费用: ¥{cost:.4f}",
        "balance": "余额: ¥{balance:.4f}",
        "tokens": "Token: {input}+{output}",
        "time_spent": "耗时: {time:.2f}s",
        "tokens_per_sec": "{tokens_per_sec:.2f} T/s",
    },
}


class CustomException(Exception):
    pass


class Filter:
    class Valves(BaseModel):
        api_endpoint: str = Field(default="", description="openwebui-monitor's base url")
        api_key: str = Field(default="", description="openwebui-monitor's api key")
        priority: int = Field(default=5, description="filter priority")
        language: str = Field(default="zh", description="language (en/zh)")
        show_time_spent: bool = Field(default=True, description="show time spent")
        show_tokens_per_sec: bool = Field(default=True, description="show tokens per second")
        show_cost: bool = Field(default=True, description="show cost")
        show_balance: bool = Field(default=True, description="show balance")
        show_tokens: bool = Field(default=True, description="show tokens")

    def __init__(self):
        self.type = "filter"
        self.name = "OpenWebUI Monitor"
        self.valves = self.Valves()
        self.outage_map: Dict[str, bool] = {}
        self.start_time: Optional[float] = None

    def get_text(self, key: str, **kwargs) -> str:
        lang = self.valves.language if self.valves.language in TRANSLATIONS else "en"
        text = TRANSLATIONS[lang].get(key, TRANSLATIONS["en"][key])
        return text.format(**kwargs) if kwargs else text

    async def request(self, client: AsyncClient, url: str, headers: dict, json_data: dict):
        json_data = json.loads(json.dumps(json_data, default=lambda o: o.dict() if hasattr(o, "dict") else str(o)))

        response = await client.post(url=url, headers=headers, json=json_data)
        response.raise_for_status()
        response_data = response.json()
        if not response_data.get("success"):
            logger.error(self.get_text("request_failed", error_msg=response_data))
            raise CustomException(self.get_text("request_failed", error_msg=response_data))
        return response_data

    async def inlet(self, body: dict, __metadata__: Optional[dict] = None, __user__: Optional[dict] = None) -> dict:
        __user__ = __user__ or {}
        __metadata__ = __metadata__ or {}
        self.start_time = time.time()
        user_id = __user__.get("id", "default")

        client = AsyncClient()

        try:
            response_data = await self.request(
                client=client,
                url=f"{self.valves.api_endpoint}/api/v1/inlet",
                headers={"Authorization": f"Bearer {self.valves.api_key}"},
                json_data={"user": __user__, "body": body},
            )
            self.outage_map[user_id] = response_data.get("balance", 0) <= 0
            if self.outage_map[user_id]:
                logger.info(self.get_text("insufficient_balance", balance=response_data.get("balance", 0)))
                raise CustomException(self.get_text("insufficient_balance", balance=response_data.get("balance", 0)))
            return body

        except Exception as err:
            logger.exception(self.get_text("request_failed", error_msg=err))
            if isinstance(err, CustomException):
                raise err
            raise Exception(f"error calculating usage, {err}") from err

        finally:
            await client.aclose()

    async def outlet(
        self,
        body: dict,
        __metadata__: Optional[dict] = None,
        __user__: Optional[dict] = None,
        __event_emitter__: Optional[callable] = None,
    ) -> dict:
        __user__ = __user__ or {}
        __metadata__ = __metadata__ or {}
        user_id = __user__.get("id", "default")

        if self.outage_map.get(user_id, False):
            return body

        client = AsyncClient()

        try:
            response_data = await self.request(
                client=client,
                url=f"{self.valves.api_endpoint}/api/v1/outlet",
                headers={"Authorization": f"Bearer {self.valves.api_key}"},
                json_data={"user": __user__, "body": body},
            )

            stats_list = []
            if self.valves.show_tokens:
                stats_list.append(self.get_text("tokens", input=response_data["inputTokens"], output=response_data["outputTokens"]))
            if self.valves.show_cost:
                stats_list.append(self.get_text("cost", cost=response_data["totalCost"]))
            if self.valves.show_balance:
                stats_list.append(self.get_text("balance", balance=response_data["newBalance"]))
            if self.start_time and self.valves.show_time_spent:
                elapsed = time.time() - self.start_time
                stats_list.append(self.get_text("time_spent", time=elapsed))
                if self.valves.show_tokens_per_sec:
                    tokens_per_sec = (response_data["outputTokens"] / elapsed if elapsed > 0 else 0)
                    stats_list.append(self.get_text("tokens_per_sec", tokens_per_sec=tokens_per_sec))

            stats = " | ".join(stats_list)
            if __event_emitter__:
                await __event_emitter__({"type": "status", "data": {"description": stats, "done": True}})

            logger.info("usage_monitor: %s %s", user_id, stats)
            return body

        except Exception as err:
            logger.exception(self.get_text("request_failed", error_msg=err))
            raise Exception(self.get_text("request_failed", error_msg=err))
        finally:
            await client.aclose()

I also like to install beszel, a lightweight system monitor, to keep an eye on resource usage in real time. As before, create a /beszel directory under /data and use a compose file; I've switched it to host networking as well:

services:
  beszel:
    image: docker.1panel.live/henrygd/beszel:latest
    container_name: beszel
    restart: unless-stopped
    network_mode: host # port 8090
    volumes:
      - ./beszel_data:/beszel_data
      - ./beszel_socket:/beszel_socket

  beszel-agent:
    image: docker.1panel.live/henrygd/beszel-agent:latest
    container_name: beszel-agent
    restart: unless-stopped
    network_mode: host
    volumes:
      - ./beszel_socket:/beszel_socket
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      LISTEN: /beszel_socket/beszel.sock
      # Do not remove the quotes around the key
      KEY: 'replace with the public key copied from the "Add system" dialog'

Beszel stores its data in the compose file's directory by default, so nothing needs changing there. After starting it, add a system in host/socket mode using /beszel_socket/beszel.sock to obtain a key, substitute it for KEY in the compose file, and redeploy.

Because this monitoring setup requires enabling the API, it becomes very easy for outsiders to call our site directly, so a firewall is a must. We use BT CloudWAF (宝塔WAF), which seems to be the best-performing of the free, open-source WAFs. Install script:

URL=https://download.bt.cn/cloudwaf/scripts/install_cloudwaf.sh && if [ -f /usr/bin/curl ];then curl -sSO "$URL" ;else wget -O install_cloudwaf.sh "$URL";fi;bash install_cloudwaf.sh

Just run the one-line installer; when it finishes it prints the login URL and the initial username and password. Once the WAF is configured, we can enable the system's ufw firewall to block direct access to the ports.

sudo ufw status                # check the current firewall status
sudo ufw allow ssh             # allow the SSH port first
sudo ufw enable                # turn the firewall on
sudo ufw default deny incoming # deny all other incoming connections
sudo ufw allow 80
sudo ufw allow 443
sudo ufw allow 8888            # assuming the BT WAF panel has been configured on port 8888

Generally, allowing these ports is enough; everything else is reverse-proxied through the BT WAF. For external database connections I don't recommend opening port 5432, to avoid brute-force attempts; when you need access, connect to the database through an SSH tunnel instead. Rules added by mistake can be removed like this:

sudo ufw status numbered
sudo ufw delete 5
sudo ufw delete 13
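The SSH-tunnel approach mentioned above for reaching Postgres without opening port 5432 can look like this (user and host are placeholders; the local port 15432 is arbitrary):

```shell
# Forward local port 15432 to the server's Postgres, which stays closed in ufw
ssh -N -L 15432:127.0.0.1:5432 user@your-server

# In another terminal, connect through the tunnel:
# psql -h 127.0.0.1 -p 15432 -U st -d openwebui
```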

The fully reworked fork

Switched to 方块佬's fork, which adds a credit system, top-ups, and billing for external API calls. Project: https://github.com/U8F69/open-webui

Docker Compose changes:

services:
  postgre:
    image: docker.1panel.live/library/postgres:latest
    container_name: postgre
    restart: always
    environment:
      - POSTGRES_USER=st
      - POSTGRES_PASSWORD=STshentong
      - POSTGRES_DB=openwebui
    volumes:
      - ./postgres_data:/var/lib/postgresql/data
    network_mode: host # port 5432

  open-webui:
    image: ghcr.nju.edu.cn/u8f69/open-webui:latest
    container_name: open-webui
    volumes:
      - ./open-webui:/app/backend/data
    restart: always
    environment:
      - DATABASE_URL=postgresql://st:STshentong@localhost:5432/openwebui # connect to the openwebui database
      - ENABLE_WEBSOCKET_SUPPORT=True
      - WEBSOCKET_REDIS_URL=redis://localhost:6379
      - REDIS_URL=redis://localhost:6379
      - WEBUI_NAME=ChatST
      - AIOHTTP_CLIENT_TIMEOUT_MODEL_LIST=1
      - AIOHTTP_CLIENT_TIMEOUT_OPENAI_MODEL_LIST=1
      - USER_AGENT=${USER_AGENT:-Mozilla/5.0 (compatible; OpenWebUI/1.0; +https://github.com/open-webui)}
      - WEBUI_SECRET_KEY=85fafa5e-0992-4d9b-a84c-6679646040f3
      - LICENSE_KEY=enterprise
      - ORGANIZATION_NAME=ST-STUDIO
      - CUSTOM_NAME=ChatST
    build:
      args:
        USER_AGENT: $USER_AGENT
    network_mode: host # port 8080
    depends_on:
      - postgre
      - redis

  redis:
    image: docker.1panel.live/library/redis:latest
    container_name: redis
    restart: always
    network_mode: host # port 6379

Given how frequently OpenWebUI ships bugs, I removed the auto-update service; to update manually, run docker compose pull followed by docker compose up -d.
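The manual update boils down to the following (run from the directory holding the compose file; /data is just an example, and the prune step is optional housekeeping):

```shell
cd /data                 # adjust to wherever your compose file lives
docker compose pull open-webui
docker compose up -d open-webui
docker image prune -f    # optional: remove the now-unused old image layers
```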

Migrating BT CloudWAF. Official docs: https://www.kancloud.cn/kern123/cloudwaf/3209672

My consolidated and streamlined steps:

btw stop
# install a parallel compressor
sudo apt-get install pigz
cd /www/ && tar --use-compress-program="pigz" -cvf cloud_waf.tar.gz cloud_waf

Download the archive and upload it to the new server, then extract it (the commands below assume it was uploaded to the root home directory; adjust as needed):

tar --use-compress-program="pigz" -xvf cloud_waf.tar.gz
mkdir -pv /www/cloud_waf/nginx/conf.d/waf/
\cp -arpf /root/cloud_waf/nginx/conf.d/waf/mysql_default.pl /www/cloud_waf/nginx/conf.d/waf/mysql_default.pl
mv /root/cloud_waf/mysql /www/cloud_waf/mysql

Install BT CloudWAF on the new server (again assuming the root home directory; adjust if needed):

URL=https://download.bt.cn/cloudwaf/scripts/install_cloudwaf.sh && if [ -f /usr/bin/curl ];then curl -sSO "$URL" ;else wget -O install_cloudwaf.sh "$URL";fi;bash install_cloudwaf.sh
sleep 5 && btw stop
sleep 5 && \cp -arpf /root/cloud_waf/* /www/cloud_waf
btw start
btw 6

Assorted special-purpose scripts:

Database cleanup script:

#!/bin/bash
# Docker container name and PostgreSQL user
CONTAINER_NAME="postgre"
PG_USER="st"
DATABASE_NAME="openwebui"
PROTECTED_USERS="'wyk','wwf','syx'" # protected users, as a comma-separated SQL list

# Cleanup: users that are both inactive (role='pending') and have not logged in
# for 30 days (protected users excluded)
echo "Deleting pending users inactive for 30 days..."
# Note: use -i (not -it) so the heredoc is piped in correctly, and drop -c,
# since psql reads the SQL from stdin here
docker exec -i ${CONTAINER_NAME} psql -U ${PG_USER} -d ${DATABASE_NAME} <<EOSQL
DELETE FROM "user"
WHERE 
  role = 'pending'
  AND last_active_at < EXTRACT(EPOCH FROM NOW()) - (30 * 24 * 60 * 60)
  AND name NOT IN (${PROTECTED_USERS});
EOSQL

# Delete unpinned, unarchived chats older than 30 days
echo "Deleting unpinned, unarchived chats older than 30 days..."
docker exec -i ${CONTAINER_NAME} psql -U ${PG_USER} -d ${DATABASE_NAME} -c "DELETE FROM \"chat\" WHERE updated_at <= EXTRACT(EPOCH FROM DATE_TRUNC('day', NOW() - INTERVAL '29 day'))::INTEGER AND pinned = false AND archived = false;"

# Delete file-upload records older than 20 days
echo "Deleting file-upload records older than 20 days..."
docker exec -i ${CONTAINER_NAME} psql -U ${PG_USER} -d ${DATABASE_NAME} -c "DELETE FROM \"file\" WHERE updated_at < EXTRACT(EPOCH FROM NOW()) - (19 * 24 * 60 * 60);"

# Delete credit_log records older than 7 days
echo "Deleting credit_log records older than 7 days..."
docker exec -i ${CONTAINER_NAME} psql -U ${PG_USER} -d ${DATABASE_NAME} -c "DELETE FROM \"credit_log\" WHERE created_at < EXTRACT(EPOCH FROM NOW()) - (7 * 24 * 60 * 60);"

echo "Database cleanup finished."
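The retention cutoffs in the DELETE statements above are plain epoch arithmetic; a quick sketch of what `EXTRACT(EPOCH FROM NOW()) - (30 * 24 * 60 * 60)` works out to:

```shell
# 30 days expressed in seconds, as used in the user-cleanup cutoff
THIRTY_DAYS=$((30 * 24 * 60 * 60))   # 2592000
# Rows whose last_active_at is below this value are older than 30 days
CUTOFF=$(( $(date +%s) - THIRTY_DAYS ))
echo "$THIRTY_DAYS $CUTOFF"
```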

Container watchdog script (lately OpenWebUI has been inexplicably pinning the CPU at 100%, or dropping to 0% and then hanging or crashing, with nothing useful in the logs, so I wrote a monitor that restarts the container automatically when it detects abnormal usage):

#!/bin/bash

# --- Configuration ---
CONTAINER_NAME="open-webui"                     # container to monitor
CPU_HIGH_THRESHOLD=100.0                        # high CPU usage threshold
CPU_LOW_THRESHOLD=0.1                           # low CPU usage threshold
CHECK_INTERVAL=5                                # seconds between checks
CONSECUTIVE_CHECKS_BEFORE_RESTART=4             # consecutive abnormal readings before restarting
LOG_FILE="$HOME/open_webui_monitor.log"         # log file path (make sure the directory is writable)
# --- End configuration ---

# Make sure the log file's directory exists
mkdir -p "$(dirname "$LOG_FILE")"

log_message() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

# Initialize the consecutive high/low counters
consecutive_high_cpu_count=0
consecutive_low_cpu_count=0

log_message "Startup: monitoring container '$CONTAINER_NAME' (high CPU threshold: ${CPU_HIGH_THRESHOLD}%, low CPU threshold: ${CPU_LOW_THRESHOLD}%, interval: ${CHECK_INTERVAL}s, restart after ${CONSECUTIVE_CHECKS_BEFORE_RESTART} consecutive abnormal readings)"

# Check that bc is available
if ! command -v bc &> /dev/null; then
    log_message "Error: bc not found; it is required for floating-point comparisons. Install it first (e.g. sudo apt install bc or sudo yum install bc). Exiting."
    exit 1
fi

# Check that docker is available
if ! command -v docker &> /dev/null; then
    log_message "Error: docker not found. Make sure Docker is installed and on PATH. Exiting."
    exit 1
fi


while true; do
  # Is the container running?
  IS_RUNNING=$(docker ps -q -f name="^${CONTAINER_NAME}$" -f status=running --no-trunc)

  if [ -z "$IS_RUNNING" ]; then
    # Distinguish between a stopped container and one that doesn't exist at all
    DOES_EXIST=$(docker ps -a -q -f name="^${CONTAINER_NAME}$" --no-trunc)
    if [ -z "$DOES_EXIST" ]; then
        log_message "Info: container '$CONTAINER_NAME' does not exist. Will keep checking in case it is created later."
    else
        log_message "Info: container '$CONTAINER_NAME' is not running. Waiting for it to start."
    fi
    # Reset all counters while the container is down
    consecutive_high_cpu_count=0
    consecutive_low_cpu_count=0
    sleep "$CHECK_INTERVAL"
    continue
  fi

  # Read CPU usage
  CPU_USAGE_STR=$(docker stats "$CONTAINER_NAME" --no-stream --format "{{.CPUPerc}}")
  CPU_USAGE_NUM=$(echo "$CPU_USAGE_STR" | sed 's/%//')

  # Validate that we actually got a number
  if ! [[ "$CPU_USAGE_NUM" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
    log_message "Warning: could not read a valid CPU figure from '$CONTAINER_NAME' (got: '$CPU_USAGE_STR'). The container may be starting up or in trouble. Retrying."
    # Don't reset the counters: this may be a one-off read failure, and if the
    # container really is in trouble we don't want to interrupt the count
    sleep "$CHECK_INTERVAL"
    continue
  fi

  # --- Core decision logic ---

  # 1. Is CPU too high?
  if (( $(echo "$CPU_USAGE_NUM >= $CPU_HIGH_THRESHOLD" | bc -l) )); then
    consecutive_high_cpu_count=$((consecutive_high_cpu_count + 1))
    consecutive_low_cpu_count=0 # high CPU resets the low counter
    log_message "Warning (high): '$CONTAINER_NAME' CPU ($CPU_USAGE_NUM%) is at or above threshold (${CPU_HIGH_THRESHOLD}%). Consecutive reading $consecutive_high_cpu_count of $CONSECUTIVE_CHECKS_BEFORE_RESTART."

    if [ "$consecutive_high_cpu_count" -ge "$CONSECUTIVE_CHECKS_BEFORE_RESTART" ]; then
      log_message "Critical (high): '$CONTAINER_NAME' CPU has exceeded the threshold $consecutive_high_cpu_count times in a row. Attempting to restart the container..."
      if docker restart "$CONTAINER_NAME"; then
        log_message "Success: '$CONTAINER_NAME' restarted due to sustained high CPU."
        log_message "Info: waiting 30s after the restart to let the container settle before resuming monitoring."
        consecutive_high_cpu_count=0 # restart succeeded, reset counters
        consecutive_low_cpu_count=0
        sleep 30 # give the container time before the next check
      else
        log_message "Error: failed to restart container '$CONTAINER_NAME'."
        consecutive_high_cpu_count=0 # reset on failure too, to avoid hammering a persistently broken container with restarts
        consecutive_low_cpu_count=0
      fi
    fi

  # 2. Is CPU too low?
  elif (( $(echo "$CPU_USAGE_NUM < $CPU_LOW_THRESHOLD" | bc -l) )); then
    consecutive_low_cpu_count=$((consecutive_low_cpu_count + 1))
    consecutive_high_cpu_count=0 # low CPU resets the high counter
    log_message "Warning (low): '$CONTAINER_NAME' CPU ($CPU_USAGE_NUM%) is below threshold (${CPU_LOW_THRESHOLD}%). Consecutive reading $consecutive_low_cpu_count of $CONSECUTIVE_CHECKS_BEFORE_RESTART."

    if [ "$consecutive_low_cpu_count" -ge "$CONSECUTIVE_CHECKS_BEFORE_RESTART" ]; then
      log_message "Critical (low): '$CONTAINER_NAME' CPU has been below the threshold $consecutive_low_cpu_count times in a row. Attempting to restart the container..."
      if docker restart "$CONTAINER_NAME"; then
        log_message "Success: '$CONTAINER_NAME' restarted due to sustained low CPU."
        log_message "Info: waiting 30s after the restart to let the container settle before resuming monitoring."
        consecutive_high_cpu_count=0 # restart succeeded, reset counters
        consecutive_low_cpu_count=0
        sleep 30 # give the container time before the next check
      else
        log_message "Error: failed to restart container '$CONTAINER_NAME'."
        consecutive_high_cpu_count=0 # reset on failure too, to avoid hammering a persistently broken container with restarts
        consecutive_low_cpu_count=0
      fi
    fi

  # 3. CPU usage is normal
  else
    if [ "$consecutive_high_cpu_count" -gt 0 ] || [ "$consecutive_low_cpu_count" -gt 0 ]; then
      log_message "Info: '$CONTAINER_NAME' CPU ($CPU_USAGE_NUM%) is back to normal. Resetting abnormal counters (previously $consecutive_high_cpu_count consecutive high, $consecutive_low_cpu_count consecutive low)."
    else
      log_message "Info: '$CONTAINER_NAME' CPU ($CPU_USAGE_NUM%) is normal."
    fi
    # CPU is normal, reset all counters
    consecutive_high_cpu_count=0
    consecutive_low_cpu_count=0
  fi

  sleep "$CHECK_INTERVAL"
done

Once the script exists, run it in the background as a system service so it starts on boot. First create a unit file at /etc/systemd/system/openwebui-monitor.service :

[Unit]
Description=Open WebUI Container CPU Monitor and Restarter
After=docker.service
Requires=docker.service

[Service]
Type=simple
ExecStart=/www/sh/monitor.sh
Restart=always
User=root

[Install]
WantedBy=multi-user.target

Then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable openwebui-monitor.service
sudo systemctl start openwebui-monitor.service
sudo systemctl status openwebui-monitor.service # check status
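Since the watchdog logs through tee as well as stdout, you can watch it either through systemd's journal or through its own log file (the log path below matches the LOG_FILE default in the script; under this unit, $HOME resolves to root's home):

```shell
# Follow the service's output via systemd
journalctl -u openwebui-monitor.service -f
# Or tail the script's own log file
tail -f /root/open_webui_monitor.log
```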

Migration that leaves the chat/file/credit/credit_log tables behind:

#!/bin/bash
# --- Local PostgreSQL configuration ---
# Host of the database to back up (local)
LOCAL_PG_HOST="127.0.0.1"
# Port of the local PostgreSQL service
LOCAL_PG_PORT="5432"
# Database to back up
POSTGRESQL_DBNAME="openwebui"
# Username for the connection
POSTGRESQL_USERNAME="st"
# Password (a .pgpass file or environment variable would be safer)
POSTGRESQL_PASSWORD="STshentong"
# --- Tables to exclude ---
# Array of all table names to leave out of the dump
EXCLUDE_TABLES_ARRAY=("chat" "file" "credit" "credit_log")
# --- Backup destination ---
# Directory where backup files are stored
BACKUP_DEST_DIR="/www/backup/nochat" # matches my earlier logs; keep this path
# Backup filename format (with timestamp)
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Build the excluded-tables part of the filename (e.g. chat_file)
excluded_tables_filename_part=$(IFS=_; echo "${EXCLUDE_TABLES_ARRAY[*]}")
# Full path of the final backup file
BACKUP_FILE_PATH="${BACKUP_DEST_DIR}/openwebui_exclude_${excluded_tables_filename_part}_${TIMESTAMP}.dump"
# --- Main logic ---
# 1. Make sure the destination directory exists
echo "Checking/creating backup directory ${BACKUP_DEST_DIR}..."
mkdir -p "${BACKUP_DEST_DIR}"
if [ $? -ne 0 ]; then
    echo "Error: could not create backup directory ${BACKUP_DEST_DIR}!"
    exit 1
fi
echo "Backup directory ready."
# 2. Dump the local database with pg_dump
echo "Backing up local PostgreSQL database (${POSTGRESQL_DBNAME}) to ${BACKUP_FILE_PATH}..."
echo "Excluding tables: ${EXCLUDE_TABLES_ARRAY[*]}"
echo "Connection: Host=${LOCAL_PG_HOST}, Port=${LOCAL_PG_PORT}, User=${POSTGRESQL_USERNAME}"
# Set PGPASSWORD; pg_dump picks it up automatically
export PGPASSWORD="${POSTGRESQL_PASSWORD}"
# Build pg_dump's exclude-table arguments
pg_dump_exclude_args=()
for table_name in "${EXCLUDE_TABLES_ARRAY[@]}"; do
    pg_dump_exclude_args+=("-T" "$table_name")
done
# Run pg_dump
# Each line-continuation backslash must be the last character on its line,
# with no trailing spaces or comments
pg_dump \
    -h "${LOCAL_PG_HOST}" \
    -p "${LOCAL_PG_PORT}" \
    -U "${POSTGRESQL_USERNAME}" \
    -d "${POSTGRESQL_DBNAME}" \
    -Fc \
    "${pg_dump_exclude_args[@]}" \
    -f "${BACKUP_FILE_PATH}"
# Check pg_dump's exit status
PGDUMP_EXIT_CODE=$?
# Unset PGPASSWORD either way, for safety
unset PGPASSWORD
if [ ${PGDUMP_EXIT_CODE} -eq 0 ]; then
    echo "Backup succeeded (excluded tables: ${EXCLUDE_TABLES_ARRAY[*]})!"
    echo "Backup file saved to: ${BACKUP_FILE_PATH}"
else
    echo "Error: backup failed! (pg_dump exit code: ${PGDUMP_EXIT_CODE})"
    # Optional: remove the possibly incomplete backup file
    if [ -f "${BACKUP_FILE_PATH}" ]; then
        echo "Removing possibly incomplete backup file: ${BACKUP_FILE_PATH}"
        rm -f "${BACKUP_FILE_PATH}"
    fi
    exit 1
fi
# 3. Optional: prune old backups (e.g. delete backups older than 3 days)
echo "Pruning old backups (keeping the last 3 days)..."
# -mtime +2: files modified more than 3 days ago (keeps today, yesterday, the day before)
find "${BACKUP_DEST_DIR}" -name "openwebui_exclude_${excluded_tables_filename_part}_*.dump" -type f -mtime +2 -print -delete
echo "Old-backup pruning finished."
echo "--- Backup script finished ---"
exit 0

The restore side also needs adjustments (remember to update the backup file path):

#!/bin/bash
set -e # If any command fails, immediately exit the script

# --- Configuration (Specific to this target server) ---
POSTGRESQL_DBNAME="openwebui"
POSTGRESQL_USERNAME="st"
# !!! Ensure this password is correct for the 'st' DATABASE user !!!
POSTGRESQL_PASSWORD="STshentong"

# --- Connection Info (Running locally) ---
TARGET_HOST="127.0.0.1"
# !!! Port for the PostgreSQL instance on THIS server !!!
TARGET_PORT="5432"

# --- Backup File Path (Located on THIS server) ---
# !!! IMPORTANT: Update this path to your actual backup file !!!
# Example: BACKUP_FILE_PATH="/data/backup/nochat/openwebui_exclude_chat_file_20231027_100000.dump"
# You might want to make this an argument to the script, e.g., BACKUP_FILE_PATH="$1"
BACKUP_FILE_PATH="/data/openwebui_exclude_chat_file_credit_credit_log_20250623_103946.dump" # <<<--- !!! UPDATE THIS LINE !!!

# --- pg_restore Configuration ---
# Adjusted for an 8-core server, you can tweak this
PG_RESTORE_JOBS=4

# --- Restore Logic ---
echo "--- Starting local restore process on ${TARGET_HOST}:${TARGET_PORT} ---"
echo "Restoring database: ${POSTGRESQL_DBNAME}"
echo "Running script as OS user: $(whoami)"
echo "Connecting to DB as PostgreSQL user: ${POSTGRESQL_USERNAME}"
echo "Using backup file: ${BACKUP_FILE_PATH}"

# Check if backup file exists
if [ ! -f "${BACKUP_FILE_PATH}" ]; then
    echo "ERROR: Backup file not found at ${BACKUP_FILE_PATH}"
    echo "Please ensure the BACKUP_FILE_PATH variable is set correctly in the script."
    exit 1
fi

# --- Attempt to terminate existing connections ---
# This is crucial to avoid "database is being accessed by other users" errors during restore,
# especially when --clean is used.
echo "[${TARGET_HOST}:${TARGET_PORT}] Attempting to terminate existing connections to database ${POSTGRESQL_DBNAME}..."
TERMINATE_SQL="SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '${POSTGRESQL_DBNAME}' AND pid <> pg_backend_pid();"

export PGPASSWORD="${POSTGRESQL_PASSWORD}"
# Connect to 'postgres' or another maintenance DB to terminate connections to the target DB
if ! psql \
    --host "${TARGET_HOST}" \
    --port "${TARGET_PORT}" \
    --username "${POSTGRESQL_USERNAME}" \
    --dbname "postgres" \
    -tAc "${TERMINATE_SQL}" > /dev/null; then
    echo "[${TARGET_HOST}:${TARGET_PORT}] WARNING: Failed to execute terminate command. This might be due to:"
    echo "  - No active connections to '${POSTGRESQL_DBNAME}'."
    echo "  - Insufficient database permissions for user '${POSTGRESQL_USERNAME}' to terminate backends."
    echo "  - Failure to connect to the 'postgres' database."
    echo "Continuing with restore, but it might fail if connections persist."
else
    echo "[${TARGET_HOST}:${TARGET_PORT}] Terminate command sent successfully. Waiting a few seconds for connections to close..."
    sleep 5 # Give some time for connections to actually terminate
fi
unset PGPASSWORD

# --- IMPORTANT MODIFICATION: ---
# We DO NOT drop or create the database here.
# We are restoring into an EXISTING database and want to preserve the tables
# ('chat', 'file', 'credit', 'credit_log') that are NOT in the backup.
# The target database POSTGRESQL_DBNAME must already exist.
# If it doesn't exist, pg_restore will fail. You would need to create it manually
# or add a 'createdb --if-not-exists' equivalent logic if that's a desired scenario.
echo "[${TARGET_HOST}:${TARGET_PORT}] Ensuring database '${POSTGRESQL_DBNAME}' exists (not creating or dropping it)."
echo "The restore will proceed into the existing '${POSTGRESQL_DBNAME}' database."
echo "Tables 'chat', 'file', 'credit' and 'credit_log' (if they exist) will NOT be touched by this restore."
echo "Other tables present in the backup file will be dropped (if they exist) and then recreated."

# --- Execute Restore ---
echo "[${TARGET_HOST}:${TARGET_PORT}] Starting database restore from local file using ${PG_RESTORE_JOBS} parallel jobs..."
echo "Using --clean option to drop and recreate objects FROM THE BACKUP FILE."

export PGPASSWORD="${POSTGRESQL_PASSWORD}"
# Temporarily disable 'exit on error' so we can capture pg_restore's exit code
# ourselves (with set -e active, the failure branch below would never run)
set +e
# Using 'time' to measure duration of the restore operation
time pg_restore \
    --host "${TARGET_HOST}" \
    --port "${TARGET_PORT}" \
    --username "${POSTGRESQL_USERNAME}" \
    --dbname "${POSTGRESQL_DBNAME}" \
    --jobs "${PG_RESTORE_JOBS}" \
    --clean \
    --if-exists \
    --verbose \
    "${BACKUP_FILE_PATH}"

PGRESTORE_EXIT_CODE=$?
set -e
unset PGPASSWORD

if [ ${PGRESTORE_EXIT_CODE} -eq 0 ]; then
    echo "[${TARGET_HOST}:${TARGET_PORT}] Database restore completed successfully!"
    echo "Tables from backup were restored. Tables 'chat', 'file', 'credit' and 'credit_log' were not affected by the restore process."
else
    echo "[${TARGET_HOST}:${TARGET_PORT}] ERROR: Database restore failed! (pg_restore exit code: ${PGRESTORE_EXIT_CODE})"
    exit 1
fi

echo "--- Restore script execution finished ---"
exit 0
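After the restore finishes, a quick sanity check is to list the tables and confirm that the preserved ones are still present alongside the restored ones (credentials as configured above):

```shell
# List all tables in openwebui; chat, file, credit and credit_log should
# still be present, together with the tables restored from the dump
PGPASSWORD=STshentong psql -h 127.0.0.1 -p 5432 -U st -d openwebui -c '\dt'
```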
