
<html>
<head>
    <title>Introduction to Web Content Mining</title>
      <link rel="stylesheet" href="style.css">
    <style>

    </style>
</head>
<body>
    <table style="width:100%">
        <tr>
            <th><strong>Up:</strong> <a href="First.html">Index</a></th><th><strong>Next:</strong> <a href="part2.html">Tools Used for Web Content Mining</a></th>
        </tr>
    </table>
    <h1 class="centered-text"><strong>Introduction and Problem Definition</strong></h1>
    <h3>What Is Web Mining?</h3>
    <p class="tab">
    Most people are familiar with the concept of data mining through popular culture, where two kinds of actors do it. Social media companies like Facebook or Twitter use it to collect things such as your browser's metadata,
    the posts you make, and the profiles you visit most. They link these sources of information with other public data about you to build a pretty cohesive profile for targeting ads.
    Hackers, the more nefarious of the two, might use that data to guess your passwords, track your location, break into your bank accounts, and do other dastardly things. So data mining has a subversive reputation: it's assumed that it's being
    used to take advantage of us or invade our privacy. But sprouting from this subversive core is a subset of data mining that is not interested in <i>exploiting you for some gain</i> but rather in exploring other aspects of the Internet in a much more benign way.
    </p>

	<p class="tab">
	<b>Web mining</b> is one of these types: its goal is not necessarily to acquire user information, but rather to mine websites themselves and examine their patterns, good ol' fashioned 'looking under the hood'. Like data mining, web mining pursues
	pattern recognition by using automated programs to extract particular data about websites, such as browser activity, server logs, site and link structure, and page content. That extracted data is then analyzed to discover the aforementioned
	patterns. Commercially, it is used to examine a given website's design and diagnose where that site falls short of, or goes beyond, the patterns seen in successful websites. A web mining company can figure out which designs attract the most users and the most engagement
	and sell this data to companies looking to improve their websites. Though web mining is a subcategory of data mining, it has subcategories of its own.
	</p>

    <h3>Categories of Web Mining</h3>
    <div class="center">
    <figure style="text-align: center">
        <img src="webminingtypes.png" width="650" alt="Web mining categories">
        <figcaption style="text-align: center">Web mining categories</figcaption>
    </figure>
    </div>
    <p class="tab">
	Web mining can be split into three categories: web content mining, web structure mining, and web usage mining. <b>Web content mining</b> focuses on extracting useful information from the contents of web pages. <b>Web structure mining</b>
	examines the hyperlink structure of a page to find interesting patterns in how it routes to other pages. Lastly, <b>web usage mining</b> covers user access patterns within a webpage. This type of web mining is the one that most resembles our popular conception of data mining
	and is the most sought after commercially: there's not as much money in the other two as there is in finding out which web designs keep people clicking. For our project, we're covering web content mining because
	it's the category that relates most closely to the class topic of Web Design: it focuses on interpreting and categorizing the useful, distinct data found on each website.
    </p>
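    <p class="tab">
    To make the three categories a little more concrete, here is a minimal sketch using only Python's standard library. The sample page, the access log, and the PageMiner class are all made up for this illustration and don't come from any particular web mining tool; a real miner would fetch pages over HTTP and read actual server logs.
    </p>

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample page (assumption); a real miner would download this.
SAMPLE_PAGE = """
<h1>Hiking Trails</h1>
<p>Read our <a href="/trails/north.html">north trail</a> guide.</p>
<p>See also the <a href="/trails/south.html">south trail</a> guide.</p>
"""

# Hypothetical access log (assumption): (visitor, page) pairs standing in
# for real server logs.
SAMPLE_LOG = [
    ("alice", "/trails/north.html"),
    ("bob", "/trails/north.html"),
    ("alice", "/trails/south.html"),
]


class PageMiner(HTMLParser):
    """Collects visible text (content) and link targets (structure)."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record where each hyperlink points.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep every non-empty run of text between tags.
        stripped = data.strip()
        if stripped:
            self.text_chunks.append(stripped)


miner = PageMiner()
miner.feed(SAMPLE_PAGE)

page_text = " ".join(miner.text_chunks)               # web content mining
out_links = miner.links                               # web structure mining
page_hits = Counter(page for _, page in SAMPLE_LOG)   # web usage mining

print(page_text)
print(out_links)
print(page_hits.most_common(1))
```

    <p class="tab">
    The same page feeds the first two categories (its text and its links), while usage mining needs a separate source of data about visitors, which is part of why it sits closest to the popular idea of data mining.
    </p>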

	<h3>Categories of Web Content Mining</h3>
	<p class="tab">
	Web content mining is itself divided by two popular points of view. The way we perceive webpages as developers influences the sort of queries we might ask and how we ask them, and each view describes a different way of looking at a webpage.
	First, there's the <b>information retrieval view</b>, which sees a webpage as one big text file to parse. The IR view uses statistics and machine learning to categorize the information
	on a page based on the words, phrases, and concepts contained in that parsed text. The <b>database retrieval view</b>, on the other hand, conceives of the webpage as a database and uses proprietary algorithms and association rules to understand the data it uncovers.
	Rather than trying to categorize a webpage like IR does, DBR is more interested in finding frequent substructures of a webpage or discovering its underlying schema. Both views fall under the umbrella of web content mining, but because they differ in
	how they view a webpage in the abstract and in the methods they apply to it, they produce different results. Of these two points of view, we're focusing on the information retrieval view.
	</p>
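	<p class="tab">
	The difference between the two views can be sketched on a single page. The example below is only an illustration under our own assumptions: the sample markup and the TwoViews class are hypothetical, word counts stand in for the raw input that an IR-style classifier would work from, and counting repeated tag paths stands in for the search for frequent substructures. It uses only Python's standard library.
	</p>

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical sample markup (assumption), kept small and free of void
# tags such as <br> so the simple tag stack below stays accurate.
SAMPLE_PAGE = (
    "<div>"
    "<h1>Trail guide</h1>"
    "<ul><li>North trail</li><li>South trail</li><li>River trail</li></ul>"
    "</div>"
)


class TwoViews(HTMLParser):
    """Looks at one page both as a bag of words and as a tag structure."""

    def __init__(self):
        super().__init__()
        self.words = Counter()   # IR view: the page as text to parse
        self.paths = Counter()   # database view: how often each tag path occurs
        self._stack = []         # currently open tags, outermost first

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        self.paths["/".join(self._stack)] += 1

    def handle_endtag(self, tag):
        # Simplification: only pop when the closing tag matches the top.
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Lowercase the text and count alphabetic words.
        self.words.update(re.findall(r"[a-z]+", data.lower()))


views = TwoViews()
views.feed(SAMPLE_PAGE)

top_word = views.words.most_common(1)
repeated_paths = {path: n for path, n in views.paths.items() if n > 1}

print(top_word)        # most frequent word on the page
print(repeated_paths)  # tag paths that occur more than once
```

	<p class="tab">
	On this sample, the IR-style pass surfaces the word that dominates the text, while the database-style pass surfaces the repeated list-item structure. Neither pass is a full method on its own; they just show what kind of question each view tends to ask.
	</p>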

    </body>
    </html>