Challenges and limitations of Tokenization Task
challenges and limitations of the tokenization task.
In general, this task is used for text corpus written in English or French where these languages separate words by using white spaces, or punctuation marks to define the boundary of the sentences. Unfortunately, this method couldn’t be applicable for other languages like Chinese, Japanese, Korean Thai, Hindi, Urdu, Tamil, and others. This problem creates the need to develop a common tokenization tool that combines all languages.
Comments
Post a Comment