There’s a difference between ‘processing’ the text and ‘parsing’ it. The processing described in the section you posted it fine, and you can manage a similar level of processing on HTML. The tricky/impossible bit is parsing the languages. For instance you can’t write a regex that’ll relibly find the subject, object and verb in any english sentence, and you can’t write a regex that’ll break an HTML document down into a hierarchy of tags as regexs don’t support counting depth of recursion, and HTML is irregular anyway, meaning it can’t be reliably parsed with a regular parser.
For instance you can’t write a regex that’ll relibly find the subject, object and verb in any english sentence
Identifying parts of speech isn’t a requirement of the word parse. That’s the linguistic definition. In computer science identifying tokens is parsing.
That’s certainly one level of parsing, and sometimes alk you need, but as the article you posted says, it more usually refers to generating a parse tree. To do that in a natural language isn’t happening with a regex.
Thanks for all the explaining. I always wondered why you can’t parse HTML since I first saw the Stack Overflow post, when you can take any HTML code you find and write an expression to work against said set of data.
I never understood the word parse to mean understanding and building a structure based on any input.
There’s a difference between ‘processing’ the text and ‘parsing’ it. The processing described in the section you posted it fine, and you can manage a similar level of processing on HTML. The tricky/impossible bit is parsing the languages. For instance you can’t write a regex that’ll relibly find the subject, object and verb in any english sentence, and you can’t write a regex that’ll break an HTML document down into a hierarchy of tags as regexs don’t support counting depth of recursion, and HTML is irregular anyway, meaning it can’t be reliably parsed with a regular parser.
Identifying parts of speech isn’t a requirement of the word parse. That’s the linguistic definition. In computer science identifying tokens is parsing.
https://en.m.wikipedia.org/wiki/Parsing
That’s certainly one level of parsing, and sometimes alk you need, but as the article you posted says, it more usually refers to generating a parse tree. To do that in a natural language isn’t happening with a regex.
Thanks for all the explaining. I always wondered why you can’t parse HTML since I first saw the Stack Overflow post, when you can take any HTML code you find and write an expression to work against said set of data.
I never understood the word parse to mean understanding and building a structure based on any input.