The tokenizer tool has a special output format, similar to other existing tools for tokenization of Vietnamese texts: it preserves all the original text and simply marks multi-syllable tokens with underscores instead of spaces.

```
Original         : Em út theo anh cả vào miền Nam.
coccoc-tokenizer : Em_út theo anh_cả vào miền_Nam.
RDRsegmenter     : Em_út theo anh_cả vào miền Nam.
Underthesea      : Em_út theo anh cả vào miền Nam.
```

The library provides high-speed tokenization, which is a requirement for performance-critical applications.

Speed: 15M characters / second, or 2.5M tokens / second.
Dataset: 1,203,165 Vietnamese Wikipedia articles (Link).
The benchmark was done on a typical laptop with an Intel Core i5-5200U processor.

Python usage:

```python
from CocCocTokenizer import PyTokenizer

# load_nontone_data is True by default
T = PyTokenizer(load_nontone_data=True)

print(T.word_tokenize("xin chào, tôi là người Việt Nam", tokenize_option=0))
# output:

# tokenize_option:
# 0: TOKENIZE_NORMAL (default)
# 1: TOKENIZE_HOST
# 2: TOKENIZE_URL
```

Other languages

Bindings for other languages are not yet implemented, but it would be nice if someone could help write them.
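Because the output format only joins multi-syllable tokens with underscores, downstream code needs no special library to consume it. The following is a minimal sketch in plain Python (independent of the tokenizer itself, and assuming the input text contains no literal underscores of its own) showing how to split the tokenized string into tokens and how to recover the original surface text:

```python
# Tokenized output in the underscore convention described above.
tokenized = "Em_út theo anh_cả vào miền_Nam."

# Each whitespace-separated chunk is one token; multi-syllable tokens
# keep their syllables joined by underscores.
tokens = tokenized.split()
print(tokens)  # ['Em_út', 'theo', 'anh_cả', 'vào', 'miền_Nam.']

# Replacing underscores with spaces restores the original text
# (valid only if the source text itself contained no underscores).
original = tokenized.replace("_", " ")
print(original)  # Em út theo anh cả vào miền Nam.
```

This round-trip property is what the format guarantees: no characters are added or removed, only spaces inside multi-syllable tokens are swapped for underscores.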