…is not to use your own brain but rather to ride on the shoulders of an expert. Surprisingly enough (for most), a popular ‘expert’ is Stanford CoreNLP.
Suppose you have the following paragraph (credit):
Born in Pretoria, South Africa, Musk taught himself computer programming at the age of 12. He moved to Canada when he was 17 to attend Queen’s University. He transferred to the University of Pennsylvania two years later.
Your first instinct as a programmer would be to split this text at each full stop. Easy-peasy:
Born in Pretoria, South Africa, Musk taught himself computer programming at the age of 12.
He moved to Canada when he was 17 to attend Queen’s University.
He transferred to the University of Pennsylvania two years later.
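The naive rule can be sketched in a few lines of Java (a minimal illustration, not from the original post; the class name and regex are my own, splitting on whitespace that follows a full stop):

```java
// Naive sentence splitter: break the text on whitespace that follows a full stop.
public class NaiveSplit {

    static String[] split(String text) {
        // The lookbehind keeps the period attached to the sentence it ends.
        return text.split("(?<=\\.)\\s+");
    }

    public static void main(String[] args) {
        String text = "Born in Pretoria, South Africa, Musk taught himself "
                + "computer programming at the age of 12. He moved to Canada "
                + "when he was 17 to attend Queen's University. He transferred "
                + "to the University of Pennsylvania two years later.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

On this paragraph the rule happens to work: every full stop really does end a sentence.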
How about this:
He began a Ph.D. in applied physics and material sciences at Stanford University in 1995 but dropped out after two days to pursue an entrepreneurial career. He co-founded Tesla, Inc., an electric vehicle and solar panel manufacturer, in 2003.
Notice the full stops in “Ph.D.” and “Tesla, Inc.”? Your oversimplified logic would completely blow up here.
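To see the rule misfire, run the same hypothetical splitter (my own sketch, not the post's code) over this paragraph. The full stop at the end of “Ph.D.” is followed by a space, so it is mistaken for a sentence boundary:

```java
// The naive splitter from before, now applied to text with abbreviations.
public class NaiveSplitFails {

    static String[] split(String text) {
        return text.split("(?<=\\.)\\s+");
    }

    public static void main(String[] args) {
        String text = "He began a Ph.D. in applied physics and material sciences "
                + "at Stanford University in 1995 but dropped out after two days "
                + "to pursue an entrepreneurial career. He co-founded Tesla, Inc., "
                + "an electric vehicle and solar panel manufacturer, in 2003.";
        // Two sentences in, three fragments out: the first fragment is the
        // bogus "sentence" "He began a Ph.D."
        for (String piece : split(text)) {
            System.out.println(piece);
        }
    }
}
```

A cruder split on every period would do even worse, shredding “Ph.D.” and “Tesla, Inc.” into fragments like “Ph” and “D”.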
What about Elon Musk’s story in Chinese? The language doesn’t use the Latin full stop at all; sentences end with the ideographic full stop (。):
马斯克在自己10岁那年买了第一台电脑，并自学了编程。12岁时，以500美元出售了自己的第一个名为Blastar（一个太空小游戏）的商业软件。17岁（1988年）高中毕业后，没有父母的资助，部分原因是因为义务兵役，离开了家庭。
(Translation: Musk bought his first computer at the age of 10 and taught himself programming. At 12, he sold his first piece of commercial software, a small space game called Blastar, for $500. After finishing high school at 17, in 1988, he left home without his parents’ support, partly because of compulsory military service.)
Here is a code snippet (credit) that uses CoreNLP to intelligently split into sentences:
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Build a minimal pipeline: tokenize the text, then split it into sentences.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = new Annotation("Hello, Dr. Gaurav. Umm, well... did you get my email?");
pipeline.annotate(annotation);

List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    System.out.println(sentence);
}
Result:
Hello, Dr. Gaurav.
Umm, well... did you get my email?