Training data
GROBID-Dictionaries's the training data is encoded following the TEI P5. See the annotation guidelines page for detailed explanations and examples concerning the best practices fro annotating the data.
Generation of training data
Make sure that the current directory is grobid-dictionaries:
> cd PATH-TO-GROBID/grobid/grobid-dictionaries
For Dictionary Segmentation model:
> java -jar PATH-TO-GROBID/grobid/grobid-dictionaries/target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -dIn PATH_TO_THE_INPUT_PDF_FILE_OR_DIRECTORY -dOut PATH-TO-OUTPUT-DIRECTORY -exe createTrainingDictionarySegmentation
For Dictionary Body Segmentation model:
> java -jar PATH-TO-GROBID/grobid/grobid-dictionaries/target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -dIn PATH_TO_THE_INPUT_PDF_FILE_OR_DIRECTORY -dOut PATH-TO-OUTPUT-DIRECTORY -exe createTrainingDictionaryBodySegmentation
For Lexical Entry model:
> java -jar PATH-TO-GROBID/grobid/grobid-dictionaries/target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -dIn PATH_TO_THE_INPUT_PDF_FILE_OR_DIRECTORY -dOut PATH-TO-OUTPUT-DIRECTORY -exe createTrainingLexicalEntry
For Form model:
> java -jar PATH-TO-GROBID/grobid/grobid-dictionaries/target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -dIn PATH_TO_THE_INPUT_PDF_FILE_OR_DIRECTORY -dOut PATH-TO-OUTPUT-DIRECTORY -exe createTrainingForm
For Sense model:
> java -jar PATH-TO-GROBID/grobid/grobid-dictionaries/target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -dIn PATH_TO_THE_INPUT_PDF_FILE_OR_DIRECTORY -dOut PATH-TO-OUTPUT-DIRECTORY -exe createTrainingSense
For EtymQuote model:
> java -jar PATH-TO-GROBID/grobid/grobid-dictionaries/target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -dIn PATH_TO_THE_INPUT_PDF_FILE_OR_DIRECTORY -dOut PATH-TO-OUTPUT-DIRECTORY -exe createTrainingEtymQuote
For Etym model:
> java -jar PATH-TO-GROBID/grobid/grobid-dictionaries/target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -dIn PATH_TO_THE_INPUT_PDF_FILE_OR_DIRECTORY -dOut PATH-TO-OUTPUT-DIRECTORY -exe createTrainingEtym
The above commands create training data to be annotated from scratch (files ending with tei.xml). It is possible also to generate pre-annotations using the current model, to be corrected afterwards (this mode is recommended when the model to be trained is becoming more precise). To do so, the latest token of the above commands should include Annotated. For example: createTrainingDictionarySegmentation -> createAnnotatedTrainingDictionarySegmentation
The execution of any of the previous commands result in the generation of 5 files:
- inputFile .rawtxt: contains the raw text extracted from a the input file (not used for training)
- inputFile .tei.xml: contains gold standard segmentation of the input file (crucial for the training)
- inputFile .modelname: contains features corresponding to each line/token in the input file (crucial for the training). The beginning of each line in the feature matrix should be synchronised with each line/token in the tei.xml
- modelname .css: a stylesheet for a better rendering of .tei.xml elements in Oxygen's author mode (useful for annotation)
- modelname .rng: an xml syntax descriptor for a element suggestion applied to .tei.xml elements in Oxygen's author mode (useful for annotation)
The generated files should be included in the training dataset while the architecture of directories and files in the toy data directory is respected.