AgCNER


A publicly available dataset and code for Chinese agricultural diseases and pests

1 Note

While the paper is under review, the dataset has been uploaded to figshare, where it can be viewed by editors and reviewers. It will be fully released here as soon as the paper is accepted.

2 Dataset

The large-scale corpus for the Chinese agricultural diseases and pests NER (ADP-NER) task, named AgCNER, contains 13 categories, 206,992 entities, and 66,553 instances totalling 3,909,293 characters. Compared with other datasets, AgCNER is the largest in terms of the number of categories, entities, samples, and characters. Moreover, it is the first publicly available corpus for this domain-specific field.
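As a quick aid to working with the corpus, here is a minimal sketch of decoding character-level BIO tags into entity spans. It assumes the standard BIO scheme with labels such as B-DIS/I-DIS (disease) and B-CRO/I-CRO (crop), which appear in the repository's evaluation code:

```python
# Sketch: decoding a character-level BIO tag sequence into entity spans.
# Label names (DIS = disease, CRO = crop) follow the tag set used in the
# repository's evaluation code; positions are character indices.

def bio_to_spans(tags):
    """Return (label, start, end_exclusive) spans from a BIO tag sequence."""
    spans, label, start = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, start, i))
            label, start = tag[2:], i
        elif tag.startswith("I-") and label == tag[2:]:
            continue  # extend the current entity
        else:  # "O" (or an inconsistent I- tag) closes any open entity
            if label is not None:
                spans.append((label, start, i))
            label = None
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

tags = ["B-CRO", "I-CRO", "O", "B-DIS", "I-DIS", "I-DIS"]
print(bio_to_spans(tags))  # [('CRO', 0, 2), ('DIS', 3, 6)]
```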

2.1 Entity Tags

(figure: entity tag definitions)

2.2 Examples

(figure: annotated examples)

2.3 Statistics

(figure: corpus statistics)

2.4 Results

(figure: benchmark results)

3 Code

The full code for HMM, CRF, BiLSTM-CRF, IDCNN-CRF, BiLSTM-Attention-CRF, Lattice-LSTM, TENER, FLAT, NFLAT, Graph4CNER, BERT-BiLSTM-CRF, BERT-IDCNN-CRF, and HNER, together with their outputs on AgCNER, has been released.

3.1 HMM and CRF

The code for HMM and CRF is in the folder "HMM and CRF".

Before training, change data_dir to your own data path, then run:

python main.py
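As a reminder of what the HMM tagger computes, here is a minimal Viterbi-decoding sketch. The two-state model and its probabilities are toy values for illustration, not parameters learned from AgCNER:

```python
# Sketch: Viterbi decoding, the inference step of an HMM tagger.
# The states and probabilities below are hypothetical toy values.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for an observation sequence."""
    # dp[i][s] = best log-probability of a path ending in state s at step i
    dp = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        dp.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: dp[i - 1][p] + math.log(trans_p[p][s]))
            dp[i][s] = dp[i - 1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs[i]])
            back[i][s] = prev
    # backtrack from the best final state
    state = max(states, key=lambda s: dp[-1][s])
    path = [state]
    for i in range(len(obs) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return path[::-1]

# toy two-state model: "O" (outside) vs "DIS" (disease character)
states = ["O", "DIS"]
start_p = {"O": 0.8, "DIS": 0.2}
trans_p = {"O": {"O": 0.7, "DIS": 0.3}, "DIS": {"O": 0.4, "DIS": 0.6}}
emit_p = {"O": {"a": 0.9, "b": 0.1}, "DIS": {"a": 0.2, "b": 0.8}}
print(viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p))  # ['O', 'DIS', 'DIS']
```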

3.2 BiLSTM-CRF, BiLSTM-Attention-CRF, and IDCNN-CRF

The code is in the folder "BiLSTM-CRF". Before training, select the specific model in run.py, then run:

python run.py
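The CRF layer in these models scores tag transitions, which in particular rules out label sequences that are invalid under the BIO scheme. A minimal sketch of that transition constraint (the tag names are illustrative):

```python
# Sketch: the BIO transition constraint a CRF layer enforces.
# An I-X tag is only valid after B-X or I-X of the same entity type X.

def allowed_transition(prev_tag, next_tag):
    """Return True if prev_tag -> next_tag is a legal BIO transition."""
    if next_tag.startswith("I-"):
        etype = next_tag[2:]
        return prev_tag in ("B-" + etype, "I-" + etype)
    return True  # "O" and any "B-X" may follow any tag

tags = ["O", "B-DIS", "I-DIS", "B-CRO", "I-CRO"]
mask = {(p, n): allowed_transition(p, n) for p in tags for n in tags}
print(mask[("B-DIS", "I-DIS")])  # True
print(mask[("O", "I-DIS")])      # False
```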

3.3 Lattice-LSTM

Lattice-LSTM for Chinese NER: a character-based LSTM with lattice word embeddings as input.

The pretrained character and word embeddings are the same as those used in the baseline of RichWordSegmentor.

Character embeddings (gigaword_chn.all.a2b.uni.ite50.vec): Google Drive or Baidu Pan

Word(Lattice) embeddings (ctb.50d.vec): Google Drive or Baidu Pan

Run the code with:

python main.py
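The core preprocessing step of Lattice-LSTM matches lexicon words against character subsequences to build the lattice. A sketch with a hypothetical three-word lexicon; the real model draws candidate words from the ctb.50d.vec vocabulary:

```python
# Sketch: the lexicon-matching step that builds the word lattice in
# Lattice-LSTM. The three-word lexicon below is hypothetical.

def build_lattice(chars, lexicon, max_word_len=5):
    """Return (start, end_exclusive, word) edges for every lexicon match."""
    edges = []
    for i in range(len(chars)):
        for j in range(i + 2, min(i + max_word_len, len(chars)) + 1):
            word = "".join(chars[i:j])
            if word in lexicon:
                edges.append((i, j, word))
    return edges

# 小麦 = wheat, 条锈病 = stripe rust, 小麦条锈病 = wheat stripe rust
lexicon = {"小麦", "条锈病", "小麦条锈病"}
chars = list("小麦条锈病")
print(build_lattice(chars, lexicon))
# [(0, 2, '小麦'), (0, 5, '小麦条锈病'), (2, 5, '条锈病')]
```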

3.4 TENER

This project needs the natural language processing Python package fastNLP, which you can install with:

pip install fastNLP

You can run the code with:

python train_tener_cn.py --dataset agcner

3.5 FLAT

Code for the ACL 2020 paper FLAT: Chinese NER Using Flat-Lattice Transformer.

See the fastNLP project for more details.

How to run the code?

  1. Download the character embeddings and word embeddings.

    Character and Bigram embeddings (gigaword_chn.all.a2b.{'uni' or 'bi'}.ite50.vec) : Google Drive or Baidu Pan

    Word(Lattice) embeddings:

    yj, (ctb.50d.vec) : Google Drive or Baidu Pan

    ls, (sgns.merge.word.bz2) : Baidu Pan

  2. Modify paths.py to add the pretrained embeddings and the dataset.

  3. Run the following commands:

python preprocess.py (add '--clip_msra' if you need to train FLAT on MSRA NER dataset)
cd V0 (without Bert) / V1 (with Bert)
python flat_main.py --dataset <dataset_name> (agcner)
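FLAT converts the lattice into a flat sequence in which every token, character or matched word, carries head and tail character positions; self-attention then works over relative distances between these spans. A minimal sketch of the flattening step (the lexicon match is hypothetical):

```python
# Sketch: how FLAT flattens the character/word lattice into one sequence.
# Every token keeps a head (start) and tail (end) character position:
# characters get head == tail, matched words span several characters.

def flatten_lattice(chars, word_edges):
    """word_edges: (start, end_exclusive, word) tuples from lexicon matching."""
    tokens = [(c, i, i) for i, c in enumerate(chars)]    # character tokens
    tokens += [(w, s, e - 1) for s, e, w in word_edges]  # word tokens appended after
    return tokens

chars = list("条锈病")      # stripe rust (a disease)
edges = [(0, 3, "条锈病")]  # hypothetical lexicon match over chars 0..2
print(flatten_lattice(chars, edges))
# [('条', 0, 0), ('锈', 1, 1), ('病', 2, 2), ('条锈病', 0, 2)]
```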

3.6 NFLAT

Before training, fastNLP should also be installed:

pip install fastNLP

  1. Download the pretrained character embeddings and word embeddings and put them in the data folder.

  2. Modify utils/paths.py to add the pretrained embeddings and the dataset.

  3. Clip long sentences (for the MSRA and OntoNotes datasets, or your own dataset):

python sentence_clip.py

  4. Merge character embeddings and word embeddings:

python char_word_mix.py

  5. Model training and evaluation
    • agcner dataset
    python main.py --dataset agcner
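The sentence-clipping step above can be sketched as greedy packing of punctuation-delimited clauses; this is an illustrative policy, and the actual sentence_clip.py may cut differently:

```python
# Sketch of long-sentence clipping. Assumes the simple policy of cutting
# at sentence-internal punctuation so pieces stay within max_len characters;
# the real sentence_clip.py may use a different rule.

def clip_sentence(text, max_len=10, puncts="，。；！？"):
    # split into clauses, keeping each punctuation mark with its clause
    clauses, buf = [], ""
    for ch in text:
        buf += ch
        if ch in puncts:
            clauses.append(buf)
            buf = ""
    if buf:
        clauses.append(buf)
    # greedily pack clauses into pieces of at most max_len characters
    # (a single clause longer than max_len is kept whole)
    pieces, cur = [], ""
    for clause in clauses:
        if cur and len(cur) + len(clause) > max_len:
            pieces.append(cur)
            cur = ""
        cur += clause
    if cur:
        pieces.append(cur)
    return pieces

text = "小麦发病，叶片出现条锈，需及时防治。"  # "wheat is diseased, leaves show stripe rust, treat promptly"
print(clip_sentence(text, max_len=10))
```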

3.7 Graph4CNER

Pretrained Embeddings:

Character embeddings (gigaword_chn.all.a2b.uni.ite50.vec) can be downloaded in Google Drive or Baidu Pan.

Word embeddings (sgns.merge.word) can be downloaded in Google Drive or Baidu Pan.

Usage:

1️⃣ Download the character embeddings and word embeddings and put them in the data/embeddings folder.

2️⃣ Modify the run_main.sh by adding your train/dev/test file directory.

3️⃣ Run sh run_main.sh. Note that the default hyperparameters may not be optimal; you may need to adjust them.

4️⃣ Enjoy it! 😄

3.8 BERT-BiLSTM-CRF and BERT-IDCNN-CRF

The code for BERT-BiLSTM-CRF and BERT-IDCNN-CRF is in the folder BERT-BiLSTM-CRF.

  1. Pretrained model:
  • the pre-trained BERT language model for Chinese NER can be downloaded from Huggingface
  2. Select the model to train: BERT-BiLSTM-CRF or BERT-IDCNN-CRF.
  3. Model training and evaluation
    • agcner dataset
    python ner.py

3.9 HNER

Hierarchical Transformer Model for Scientific Named Entity Recognition


  1. Convert the .txt dataset into .pk format:
python convert_to_pk.py
  2. Model training and evaluation
    • agcner dataset
     name = "agcner_ner"
     dirpath= "."
     data_path = "data/agcner/agcner.pk"
     # create config
     config = create_config(name, dirpath=dirpath, data_path=data_path, max_epoch=30,
                            model_name='ckiplab/bert-base-chinese')
     # train model
     train_model(config)
    
  3. Testing and predicting
    import pickle

    import numpy as np
    from sklearn.metrics import confusion_matrix

    import conlleval  # the CoNLL evaluation script used below
    from seqlab.inference import load_model
    
     checkpoint = "agcner_ner-v2.ckpt"
    
     # load model
     model = load_model(checkpoint)
    
     with open(r'data/agcner/agcner.pk', 'rb') as f:
         data = pickle.load(f)
    
     sent_list = []
     for sent in data['test']:
         sent_list.append(sent['tokens'])
    
     gold_List = []
     for sent in data['test']:
         gold_List.append(sent['tags'])
    
     prediction = model.extract_predictions(sent_list)
     print(prediction)
     sents = []
     y_true = []
     y_pred = []
     for i in range(len(gold_List)):
         sents.extend(sent_list[i])
         y_true.extend(gold_List[i])
         y_pred.extend(prediction[i])
     labels = ['I-PART', 'B-BEL', 'I-STRAINS', 'B-DRUG', 'I-ORG', 'B-STRAINS', 'I-PET', 'B-ORG', 'B-COM', 'B-FER',
               'I-CLA',
               'I-PER', 'I-DRUG', 'B-DIS', 'I-BEL', 'B-PART', 'O', 'B-PER', 'I-REA', 'I-DIS', 'B-PET', 'B-CLA', 'I-FER',
               'B-CRO', 'I-CRO', 'B-REA', 'I-COM']
     cm = confusion_matrix(y_true, y_pred, labels=labels)
    
     np.savetxt('data/agcner/test_result_matrix.csv', cm, delimiter=',')
    
     with open('data/agcner/predict_results.txt', 'w', encoding='utf-8', ) as f:
         for i in range(len(sents)):
             f.write(sents[i] + ' ' + y_true[i] + ' ' + y_pred[i] + '\n')
    
     eval_list = []
     for ori_tokens, oril, prel in zip(sent_list, gold_List, prediction):
         for ot, ol, pl in zip(ori_tokens, oril, prel):
             eval_list.append(f"{ot} {ol} {pl}\n")
         eval_list.append("\n")
    
     # eval the model
     counts = conlleval.evaluate(eval_list)
     conlleval.report(counts)
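The conlleval report at the end of the snippet is an entity-level evaluation; a minimal sketch of the precision/recall/F1 it computes over BIO spans:

```python
# Sketch: the entity-level precision/recall/F1 that conlleval reports,
# computed by comparing gold and predicted BIO entity spans.

def spans(tags):
    """Extract (label, start, end_exclusive) entity spans from BIO tags."""
    out, label, start = [], None, None
    for i, t in enumerate(tags + ["O"]):  # the sentinel flushes a trailing entity
        inside = t.startswith("I-") and label == t[2:]
        if not inside and label is not None:
            out.append((label, start, i))
            label = None
        if t.startswith("B-"):
            label, start = t[2:], i
    return set(out)

def entity_f1(gold, pred):
    g, p = spans(gold), spans(pred)
    tp = len(g & p)  # a span counts only if label and both boundaries match
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["B-DIS", "I-DIS", "O", "B-CRO"]
pred = ["B-DIS", "I-DIS", "O", "O"]
print(entity_f1(gold, pred))  # (1.0, 0.5, 0.6666666666666666)
```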
    

4 AgBERT

A BERT model fine-tuned for named entity recognition of agricultural diseases and pests has also been released at https://github.com/guojson/AgBERT.git

5 Thanks
