In this tutorial, we are going to extract wh-questions from a text. This is harder than it looks, because splitting a text into sentences is not trivial. Sentences do not necessarily coincide with lines, and splitting on punctuation marks alone is unreliable because a period, for example, can also mark an abbreviation. If a rough split is all you need, splitting on punctuation marks is fine, but if you want more precise results, here is a solution.
1. Text: Alice’s Adventures in Wonderland, by Lewis Carroll - can be obtained here
2. Logic: We basically split the text into sentences at punctuation marks, but we also look at what comes before and after each mark. For example, if a letter comes before a period and whitespace comes after it, we treat the period as a true sentence boundary. After splitting the text into sentences, we pick out the interrogative sentences, i.e. those ending in "?", and then simply check whether the first word is a wh-word.
3. Code: The complete code is available here.
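To make the logic concrete, here is a minimal sketch in Python. It is not the complete code linked above, just an illustration of the same idea under a few assumptions of my own: the names `split_sentences` and `extract_wh_questions`, the small abbreviation list, and the choice to count "how" as a wh-word are all hypothetical and should be adjusted for your text.

```python
import re

# Assumption: a tiny illustrative abbreviation list; extend it for real texts.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "st", "etc"}
# Assumption: "how" is counted as a wh-word here.
WH_WORDS = {"what", "when", "where", "which", "who", "whom", "whose", "why", "how"}

# Closing quotes/brackets that may follow the punctuation mark.
CLOSERS = "\"'”’)]"
# Candidate boundary: a letter, then . ! or ?, optional closers, then whitespace.
BOUNDARY = re.compile(r"(?<=[A-Za-z])[.!?]+[" + re.escape(CLOSERS) + r"]*\s+")


def split_sentences(text):
    """Split text at punctuation marks that look like true sentence boundaries."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Look at the word right before the mark to skip abbreviations like "Mr."
        words = text[start:match.start()].split()
        last_word = words[-1].lower().strip("\"'“‘(") if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue
        # Normalize internal whitespace so line breaks do not matter.
        sentences.append(" ".join(text[start:match.end()].split()))
        start = match.end()
    tail = " ".join(text[start:].split())
    if tail:
        sentences.append(tail)
    return sentences


def extract_wh_questions(text):
    """Return the sentences that end in '?' and start with a wh-word."""
    questions = []
    for sentence in split_sentences(text):
        if not sentence.rstrip(CLOSERS).endswith("?"):
            continue
        # First alphabetic word, ignoring leading quotes or brackets.
        first = re.match(r"[^A-Za-z]*([A-Za-z]+)", sentence)
        if first and first.group(1).lower() in WH_WORDS:
            questions.append(sentence)
    return questions


sample = ('Mr. Rabbit was late. "Who are you?" said the Caterpillar. '
          'Why is a raven like a writing-desk? Is it tea time? Alice sighed.')
print(extract_wh_questions(sample))
# ['"Who are you?"', 'Why is a raven like a writing-desk?']
```

Note that a quoted question is cut off from the narration that follows it ("said the Caterpillar."), which is fine for collecting questions but may not be what you want for general sentence splitting.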