当前位置: 首页 > news >正文

做国内贸易的网站win10 网站建设软件有哪些

做国内贸易的网站,win10 网站建设软件有哪些,c 网站开发工程师招聘,新手如何学会做网络销售返回的是文档解析分段内容组成的列表,分段内容默认chunk_size: int 250, chunk_overlap: int 50,250字分段,50分段处保留后面一段的前50字拼接即窗口包含下下一段前面50个字划分 from typing import Union, Listimport jieba import recla…

返回的是文档解析分段内容组成的列表,分段内容默认chunk_size: int = 250, chunk_overlap: int = 50,250字分段,50分段处保留后面一段的前50字拼接即窗口包含下下一段前面50个字划分

from typing import Union, Listimport jieba
import reclass SentenceSplitter:def __init__(self, chunk_size: int = 250, chunk_overlap: int = 50):self.chunk_size = chunk_sizeself.chunk_overlap = chunk_overlapdef split_text(self, text: str) -> List[str]:if self._is_has_chinese(text):return self._split_chinese_text(text)else:return self._split_english_text(text)def _split_chinese_text(self, text: str) -> List[str]:sentence_endings = {'\n', '。', '!', '?', ';', '…'}  # 句末标点符号chunks, current_chunk = [], ''for word in jieba.cut(text):if len(current_chunk) + len(word) > self.chunk_size:chunks.append(current_chunk.strip())current_chunk = wordelse:current_chunk += wordif word[-1] in sentence_endings and len(current_chunk) > self.chunk_size - self.chunk_overlap:chunks.append(current_chunk.strip())current_chunk = ''if current_chunk:chunks.append(current_chunk.strip())if self.chunk_overlap > 0 and len(chunks) > 1:chunks = self._handle_overlap(chunks)return chunksdef _split_english_text(self, text: str) -> List[str]:# 使用正则表达式按句子分割英文文本sentences = re.split(r'(?<=[.!?])\s+', text.replace('\n', ' '))chunks, current_chunk = [], ''for sentence in sentences:if len(current_chunk) + len(sentence) <= self.chunk_size or not current_chunk:current_chunk += (' ' if current_chunk else '') + sentenceelse:chunks.append(current_chunk)current_chunk = sentenceif current_chunk:  # Add the last chunkchunks.append(current_chunk)if self.chunk_overlap > 0 and len(chunks) > 1:chunks = self._handle_overlap(chunks)return chunksdef _is_has_chinese(self, text: str) -> bool:# check if contains chinese charactersif any("\u4e00" <= ch <= "\u9fff" for ch in text):return Trueelse:return Falsedef _handle_overlap(self, chunks: List[str]) -> List[str]:# 处理块间重叠overlapped_chunks = []for i in range(len(chunks) - 1):chunk = chunks[i] + ' ' + chunks[i + 1][:self.chunk_overlap]overlapped_chunks.append(chunk.strip())overlapped_chunks.append(chunks[-1])return overlapped_chunkstext_splitter = SentenceSplitter()def load_file(filepath):print("filepath:",filepath)if filepath.endswith(".md"):contents = extract_text_from_markdown(filepath)elif filepath.endswith(".pdf"):contents = extract_text_from_pdf(filepath)elif filepath.endswith('.docx'):contents = extract_text_from_docx(filepath)else:contents = extract_text_from_txt(filepath)return contentsdef extract_text_from_pdf(file_path: str):"""Extract text content from a PDF file."""import PyPDF2contents = []with open(file_path, 'rb') as f:pdf_reader = PyPDF2.PdfReader(f)for page in pdf_reader.pages:page_text = page.extract_text().strip()raw_text = [text.strip() for text in page_text.splitlines() if text.strip()]new_text = ''for text in raw_text:new_text += textif text[-1] in ['.', '!', '?', '。', '!', '?', '…', ';', ';', ':', ':', '”', '’', ')', '】', '》', '」','』', '〕', '〉', '》', '〗', '〞', '〟', '»', '"', "'", ')', ']', '}']:contents.append(new_text)new_text = ''if new_text:contents.append(new_text)return contentsdef extract_text_from_txt(file_path: str):"""Extract text content from a TXT file."""with open(file_path, 'r', encoding='utf-8') as f:contents = [text.strip() for text in f.readlines() if text.strip()]return contentsdef extract_text_from_docx(file_path: str):"""Extract text content from a DOCX file."""import docxdocument = docx.Document(file_path)contents = [paragraph.text.strip() for paragraph in document.paragraphs if paragraph.text.strip()]return contentsdef extract_text_from_markdown(file_path: str):"""Extract text content from a Markdown file."""import markdownfrom bs4 import BeautifulSoupwith open(file_path, 'r', encoding='utf-8') as f:markdown_text = f.read()html = markdown.markdown(markdown_text)soup = BeautifulSoup(html, 'html.parser')contents = [text.strip() for text in soup.get_text().splitlines() if text.strip()]return contentstexts = load_file(r"C:\Users\lo***山市城市建筑外立面管理条例.docx")
print(texts)

在这里插入图片描述

http://www.yayakq.cn/news/109678/

相关文章:

  • 做外贸网站那个平台好联盟文明网站建设有新突破
  • 银川网站建设网络网上购物商城系统论文
  • iis访问网站打开要很久1688网站入口
  • 百度个人网站申请网站分为哪些类型
  • 网站建设公司代理商优化排名seo
  • 网站页面设计价格邢台手机网站建设
  • 备案 非网站备案建设网站成都
  • 厦门手机网站建设公司哪家好怎么让网站收录在google
  • 做花酒的网站深圳的网站建设公司排名
  • 建设网站用什么服务器学校网站建设规范
  • 万网备案网站名称注册小公司要交税吗
  • 抚顺网站建设在哪请人做网站
  • ftp 企业网站网站标题符号有哪些
  • 汽车电商网站建设介绍东莞网站建设的ppt
  • 浙江手机版建站系统开发重庆比较好的广告公司
  • 个人网站创意北京到牡丹江
  • 设计社交网站海外推广引流
  • 建博客和建网站哪个好中国三北防护林体系建设网站
  • 什么软件可以看网站怎么将网站权重提上去
  • 企业网站建设中有哪几个重要点备案时网站服务内容
  • 网站建设微信公众号小程序制作国外开源网站建设软件
  • 手机网站开发哪个好seo与网站建设的关联
  • 抖音网站建设哪家好app开发和网站开发哪个好
  • 做直播网站软件有哪些软件下载建湖哪家专业做网站
  • 金昌市建设局网站wordpress七牛云储存
  • 做外贸一般上什么网站页面置换算法课程设计
  • 做外贸需要做网站吗wordpress文档预览
  • seo+网站排名wordpress 搬家500错误
  • 电子商务网站建设自服务器厦门设计公司有哪些
  • 济源网站优化网站建设费放什么科目