Research Resource

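Chinese Corpora

https://github.com/brightmart/nlp_chinese_corpus Large-scale Chinese corpora.
https://github.com/goto456/stopwords Common Chinese stopword lists (including the HIT stopword list).
https://github.com/CLUEbenchmark/CLUENER2020 CLUENER: fine-grained named entity recognition.
https://www.cluebenchmarks.com/ Chinese Language Understanding Evaluation benchmark (CLUE).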

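Chinese Data Augmentation

https://github.com/zhanlaoban/EDA_NLP_for_Chinese Chinese implementation of EDA; paper: https://arxiv.org/abs/1901.11196
https://github.com/425776024/nlpcda Chinese data augmentation via near-synonym character substitution, entity replacement, number rewriting, and similar methods.
https://github.com/thunlp/WantWords WantWords (万词王): a reverse dictionary, https://wantwords.thunlp.org/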
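
The EDA paper linked above describes four token-level operations: synonym replacement, random insertion, random swap, and random deletion. As a rough illustration only, the sketch below implements the two that need no synonym dictionary. The function names, the pre-segmented token list, and the seeded `random.Random` are assumptions made for this example; this is not code from any of the repositories listed here.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two distinct, randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p.

    If every token would be dropped, keep one random token so the
    augmented sentence is never empty.
    """
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


# Hypothetical pre-segmented sentence; in practice a word segmenter
# would produce this list.
rng = random.Random(0)
tokens = ["我", "喜欢", "自然", "语言", "处理"]
swapped = random_swap(tokens, 1, rng)
deleted = random_deletion(tokens, 0.2, rng)
print(swapped)
print(deleted)
```

Both functions take an explicit random generator so that augmented samples can be reproduced from a seed.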

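NLP Toolkits

https://github.com/fighting41love/funNLP A fairly comprehensive collection of NLP resources.
https://github.com/fastnlp/fastHan fastHan handles four tasks: Chinese word segmentation, part-of-speech tagging, dependency parsing, and named entity recognition.
https://github.com/fxsjy/jieba jieba Chinese word segmentation.
https://github.com/ckiplab/ckiptagger CkipTagger: an open-source Chinese processing toolkit covering word segmentation, part-of-speech tagging, and NER.
https://github.com/yongzhuo/Macropodus Macropodus: an NLP toolkit built on an Albert+BiLSTM+CRF architecture. It provides Chinese word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, text summarization, new word discovery, text similarity, a calculator, number conversion, pinyin conversion, Traditional/Simplified Chinese conversion, and other common NLP functions.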

Syntactic Analysis / Lexical Analysis

https://github.com/lemonhu/open-entity-relation-extraction Extracts knowledge graph triples from sentence dependency relations.
https://github.com/ibatra/BERT-Keyword-Extractor BERT-based keyphrase extraction.

(Chinese) Grammatical Error Diagnosis

https://github.com/shibing624/pycorrector pycorrector: a Chinese text error correction tool.
https://github.com/kpu/kenlm KenLM: an efficient language model toolkit.
https://github.com/iqiyi/FASPell FASPell: a Chinese spell checker.
https://github.com/ACL2020SpellGCN/SpellGCN SpellGCN: a Chinese spell checker.
https://github.com/whgaara/pytorch-soft-masked-bert Soft-Masked BERT.

Semantic Role Labeling

https://github.com/zxplkyy/BiRNN-SRL Chinese SRL (semantic role labeling) dataset.
https://github.com/Nrgeup/chinese_semantic_role_labeling Chinese SRL based on LSTM + CRF.

Natural Language Generation

https://github.com/UFAL-DSG/tgen Related paper: https://www.aclweb.org/anthology/P16-2008.pdf
https://github.com/microsoft/MASS Microsoft's MASS seq2seq model.

Text Similarity

https://github.com/ZhuiyiTechnology/roformer-sim SimBERTv2: a model combining NLU and NLG.
https://github.com/princeton-nlp/SimCSE SimCSE (Simple Contrastive Learning of Sentence Embeddings): a similarity model based on contrastive learning.
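
To make "contrastive learning" for sentence embeddings more concrete, here is a minimal pure-Python sketch of an in-batch contrastive (InfoNCE-style) loss over cosine similarities, the general kind of objective SimCSE trains with. The function names, the toy 2-d vectors, and the temperature default of 0.05 are assumptions for illustration; this is not code from the SimCSE repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean in-batch contrastive loss.

    anchors[i] should be close to positives[i]; every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, h in enumerate(anchors):
        logits = [cosine(h, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): aligned pairs vs. mismatched pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
loss_mismatched = info_nce_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(loss_aligned, loss_mismatched)
```

The loss is small when each anchor is most similar to its own positive and large when it is more similar to another sentence in the batch.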

Long-Text Techniques

https://github.com/allenai/longformer Official Longformer.
https://github.com/SCHENLIU/longformer-chinese Chinese Longformer.
https://github.com/LowinLi/chinese-bigbird Chinese BigBird.
https://github.com/Sleepychord/CogLTX CogLTX: Applying BERT to Long Texts.

Chinese Pretrained Models

https://github.com/google-research/bert Official Google BERT (TensorFlow).
https://github.com/huggingface/transformers Collection of Transformer NLP models.
https://github.com/brightmart/roberta_zh Chinese RoBERTa.
https://github.com/ymcui/Chinese-ELECTRA Chinese ELECTRA.
https://github.com/ymcui/Chinese-BERT-wwm BERT-wwm, RoBERTa-wwm, and other models trained by the Joint Laboratory of HIT and iFLYTEK.
https://github.com/ymcui/Chinese-XLNet Chinese XLNet.
https://github.com/ZhuiyiTechnology/WoBERT WoBERT: a word-based Chinese BERT.
https://github.com/ymcui/MacBERT Chinese MacBERT.
https://github.com/huawei-noah/Pretrained-Language-Model Various Chinese pretrained models from Huawei Noah's Ark Lab: NEZHA, TinyBERT, DynaBERT, BBPE, PMLM, and others.
https://github.com/sinovation/ZEN ZEN: a BERT-based Chinese (Z) text encoder Enhanced by N-gram representations.
https://github.com/ShannonAI/ChineseBert ChineseBERT from ShannonAI: a BERT that incorporates Chinese glyph and pinyin information.
https://github.com/ShannonAI/glyce Glyce from ShannonAI: a BERT that incorporates Chinese glyph information.
https://github.com/Tencent/Lichee Tencent's open-source pretrained model Lichee; paper: https://arxiv.org/pdf/2108.00801.pdf

Visually Similar Character Resources

https://github.com/lbneon/pyAndForm_v1/blob/master/models/getChineseStrokes/ChineseStrokes.dat Table of visually similar characters.
http://www.fantiz5.com/xingjinzi/ Lookup tool for visually similar characters.

Multimodal Resources

https://github.com/CryhanFang/CLIP2Video Tencent's open-source text-video retrieval model.

Other Pretrained-Model Techniques

https://github.com/timoschick/pet Pattern-Exploiting Training (PET), used for few-shot leaderboard results on SuperGLUE.
https://github.com/autoliuweijie/K-BERT K-BERT: BERT with knowledge graph integration.

Paper List

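https://paperswithcode.com/ Code implementations for state-of-the-art machine learning results.
https://openreview.net/ OpenReview for top conferences.
https://mp.weixin.qq.com/s/mPWoKh1UQ4HbgEbNDkjsiA NIPS 2019 | Commentary on key deep reinforcement learning papers
https://mp.weixin.qq.com/s/O0Q1XoTA-7Yshr1ZqOZ90w What is imitation learning?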

Other

https://www.codingfont.com/ Coding font picker.
