HOW TO PROGRAM . IT - learn how to program - online interactive programming tutorial and programming course
Search engine web page text extractor

01. Although this program is in the advanced section, it is simple for an advanced program. It demonstrates, in a very simplified way, how one part of a search engine spider's indexing process works: after fetching a page, the search engine quickly extracts the text from that page. To some extent it also demonstrates how screen readers and braille devices extract text from web pages, and how web-scraping programs work.

Program Code:

The above example is by no means complete or optimised, and it does not demonstrate how search engines really work; professional search engines do things differently and better. It is intentionally a simplified version, for learning purposes. Only the main visible on-page text is extracted, including the page title, but not more 'hidden' text such as image alternative text.
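As a separate illustration of the same idea, here is a minimal sketch in Python. It is not the program from this page: it assumes Python's standard `html.parser` module, and the names `VisibleTextExtractor` and `extract_text` are made up for this example. Like the description above, it keeps the page title and visible text, and ignores attribute text such as image alternative text.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects on-page text, skipping script and style contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>
        self._chunks = []      # pieces of text found so far

    def handle_starttag(self, tag, attrs):
        # Attributes (for example alt="...") are deliberately ignored.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the title and visible text of an HTML string, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><title>My Page</title></head>"
        "<body><h1>Hello</h1><p>Some <b>bold</b> text.</p>"
        '<img src="a.png" alt="A picture"></body></html>'
    )
    print(extract_text(sample))
```

Each run of text between tags becomes its own line in the result, which is a simplification; a real indexer would treat inline and block elements differently.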

You can try the program on a more complex website with a much larger piece of HTML page code: use the browser's view-source option, copy the HTML source into the input box, and see whether the program can extract the text. If the HTML source is not well formed or W3C compliant, there is a chance that the program could get stuck in a loop. Most browsers will inform you when this happens, but it is wise to save any work first.
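To show why malformed HTML can trap a simple extractor, and one way to guard against it, here is a hedged sketch in Python of a hand-written tag-stripping loop. It is an assumption about how such a loop might look, not the code used on this page, and the function name `strip_tags` is invented for the example. The key point is that every pass through the loop either moves the position forward or stops, so an unclosed tag cannot cause an endless loop.

```python
def strip_tags(html):
    """Hypothetical helper: remove <...> tags using a scan that always makes progress."""
    pieces = []
    pos = 0
    while pos < len(html):
        start = html.find("<", pos)
        if start == -1:
            # No more tags: keep the remaining text and finish.
            pieces.append(html[pos:])
            break
        pieces.append(html[pos:start])
        end = html.find(">", start + 1)
        if end == -1:
            # Malformed input: the tag never closes, so stop here
            # instead of searching again from the same position.
            break
        pos = end + 1  # always strictly greater than the previous pos
    # Join with spaces and collapse whitespace so words from
    # neighbouring elements do not run together.
    return " ".join(" ".join(pieces).split())


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p>"))
    print(strip_tags("Hello <b unfinished"))
```

This sketch treats every `<` as the start of a tag and does not skip script or style contents, so it is only suitable for experimenting with small pieces of HTML.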

For now there is no explanation of what is going on or how the program works, not even comments, since this website as a whole is still very much at beginner level and we want to avoid showing you how to run before you can walk. When more tutorials are added and the beginner lessons progress to more advanced techniques, we will explain how this program works.


By running the tutorial programs on this site, you are agreeing to our terms and conditions (please check these first).
Thu 27 Jan 2022 © 2022 Abstract Worlds Ltd