Sunday, December 4, 2022

[Python Code] Extracting wh-questions from a Text

In this tutorial, we are going to extract wh-questions from a text. It is not as easy as it looks, because splitting a text into sentences is not trivial. Sentences do not necessarily coincide with lines, and splitting on punctuation marks alone is unreliable because a period, for example, can also mark an abbreviation. If a rough split is enough, splitting on punctuation marks is okay, but if you want a more precise result, here is a solution.

1. Text: Alice’s Adventures in Wonderland, by Lewis Carroll - can be obtained here

2. Logic: We basically split the text into sentences at punctuation marks, but we also look at what comes before and after each mark. For example, if a letter comes before a period and a white space comes after it, we can treat the period as a true sentence boundary. After splitting the text into sentences, we pick out the interrogative sentences by their final "?" mark, and then we only need to check whether the first word is a wh-word.

3. Code: The complete code is available here.
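
The logic in step 2 can be sketched roughly as follows. This is a minimal illustration and not the complete code linked above: the names split_sentences and wh_questions, the WH_WORDS list, and the sample string are all assumptions made for this example. It follows the rule described in step 2 (a letter before the mark, white space after it), so an abbreviation such as "Mr." followed by a space would still be split here and would need extra handling.

```python
import re

# Assumed list of wh-words for this sketch; adjust it to your needs.
WH_WORDS = {"what", "who", "whom", "whose", "which", "when", "where", "why", "how"}

# A boundary is white space that follows: a letter, then one of . ! ?,
# optionally followed by one closing quote or bracket.
BOUNDARY = re.compile(r"""(?:(?<=[A-Za-z][.!?])|(?<=[A-Za-z][.!?]["'”’)]))\s+""")


def split_sentences(text):
    """Split text into sentences using the before/after context of each mark."""
    # Line breaks do not mark sentence ends, so flatten all white space first.
    flat = " ".join(text.split())
    return [s for s in BOUNDARY.split(flat) if s]


def wh_questions(text):
    """Return the sentences that end in '?' and start with a wh-word."""
    found = []
    for sentence in split_sentences(text):
        # Ignore closing quotes or brackets when checking for the '?' mark.
        if not sentence.rstrip("\"'”’)").endswith("?"):
            continue
        # The first alphabetic run is the first word ("What's" gives "What").
        words = re.findall(r"[A-Za-z]+", sentence)
        if words and words[0].lower() in WH_WORDS:
            found.append(sentence)
    return found


if __name__ == "__main__":
    sample = (
        "Alice was tired. What is the use of a book?\n"
        "She paid $3.50 for it. Is it late? Why not!"
    )
    for question in wh_questions(sample):
        print(question)
```

The period in "$3.50" is not treated as a boundary because a digit, not a letter, comes before it and no white space comes after it. To run this on the whole book, read the file into a string and pass it to wh_questions.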
