Best way to split a paragraph into sentences


…is not to roll your own logic but to stand on the shoulders of an expert. Perhaps surprisingly, a popular ‘expert’ for this job is Stanford CoreNLP.

Suppose you have the following paragraph (credit):

Born in Pretoria, South Africa, Musk taught himself computer programming at the age of 12. He moved to Canada when he was 17 to attend Queen’s University. He transferred to the University of Pennsylvania two years later.

Your first instinct as a programmer would be to split this text at each full stop. Easy-peasy:

Born in Pretoria, South Africa, Musk taught himself computer programming at the age of 12.
He moved to Canada when he was 17 to attend Queen’s University.
He transferred to the University of Pennsylvania two years later.

How about this:

He began a Ph.D. in applied physics and material sciences at Stanford University in 1995 but dropped out after two days to pursue an entrepreneurial career. He co-founded Tesla, Inc., an electric vehicle and solar panel manufacturer, in 2003.

Notice the full stops in “Ph.D.” and “Tesla, Inc.”? Your oversimplistic logic would completely blow up here.
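To make the failure concrete, here is a minimal sketch of the "split at each full stop" instinct in plain Java. The class and method names (NaiveSplitter, split) are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSplitter {
    // Naive approach: every full stop is assumed to end a sentence.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.split("\\.")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                sentences.add(trimmed);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "He began a Ph.D. in applied physics and material sciences "
                + "at Stanford University in 1995 but dropped out after two days "
                + "to pursue an entrepreneurial career. He co-founded Tesla, Inc., "
                + "an electric vehicle and solar panel manufacturer, in 2003.";
        // Prints one fragment per line.
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Instead of two sentences, this prints five fragments: "He began a Ph", "D", the rest of the first sentence, "He co-founded Tesla, Inc", and a final piece that starts with a comma.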

What about Elon Musk’s story in Chinese? This language doesn’t even use the full stop as we know it; sentences end with the ideographic full stop “。” instead:

马斯克在自己10岁那年买了第一台電腦,并自学了编程。12岁时,以500美元出售了自己的第一个名为Blastar(一个太空小游戏)的商业软件。17岁(1988年)高中毕业后,没有父母的资助,部分原因是因为义务兵役,离开了家庭。
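A quick sketch of why a hard-coded "." does not carry over: the Chinese text contains no ASCII full stop at all, only the ideographic one (U+3002). The class and method names below are hypothetical, and splitting on "。" is shown only for contrast, not as a recommended splitter.

```java
import java.util.regex.Pattern;

public class ChineseFullStopDemo {
    // Ideographic full stop that ends sentences in Chinese text (U+3002).
    static final String IDEOGRAPHIC_FULL_STOP = "\u3002";

    // Splits on a literal delimiter; String.split drops trailing empty pieces.
    static String[] splitOn(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // First two sentences of the paragraph above.
        String text = "马斯克在自己10岁那年买了第一台電腦,并自学了编程。"
                + "12岁时,以500美元出售了自己的第一个名为Blastar(一个太空小游戏)的商业软件。";
        // No ASCII full stop in the text, so nothing is split: prints 1
        System.out.println(splitOn(text, ".").length);
        // Splitting on the ideographic full stop finds both sentences: prints 2
        System.out.println(splitOn(text, IDEOGRAPHIC_FULL_STOP).length);
    }
}
```

Swapping one hard-coded delimiter for another only moves the problem, which is the argument for handing the job to a library that knows the language.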

Here is a code snippet (credit) that uses CoreNLP to split text into sentences intelligently:

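import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Only tokenization and sentence splitting are needed here.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation("Hello, Dr. Gaurav. Umm, well... did you get my email?");
pipeline.annotate(annotation);
// Each CoreMap is one detected sentence.
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    System.out.println(sentence);
}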

Result:

Hello, Dr. Gaurav.
Umm, well... did you get my email?
