Abstract
As a huge data source, the internet contains a large amount of worthless information, and the useful data is usually embedded in semi-structured form in HTML web pages. This paper proposes a new method to perform the extraction task automatically. It consists of two steps: the first identifies individual data records in a page, and the second aligns and extracts data items from the identified data records. For the first step, we use a method based on visual information to segment data records, which gives more relevant results than earlier methods. For the second step, we use a novel partial alignment technique based on hierarchical parent-child matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched), and leave the remaining data fields unaligned. This approach enables more reliable alignment across multiple data records. Experimental results on a large number of Web pages from diverse domains show that the proposed two-step technique is able to segment data records, align them, and extract data from them, with output that matches the relevant results well. Precision, recall, and F-measure are used to evaluate the performance of the existing and proposed web data extraction methods. The results show that the proposed method performs better than the existing method.
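The three evaluation measures named above follow their standard definitions. The following is a minimal sketch, assuming the extracted and ground-truth data items can be represented as sets; the function name `precision_recall_f1` and the sample field names are illustrative assumptions, not taken from the paper.

```python
def precision_recall_f1(extracted, relevant):
    """Return (precision, recall, F-measure) for one extraction run.

    extracted: items produced by the extraction method (hypothetical input)
    relevant:  ground-truth items that should have been extracted
    """
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)

    # Precision: fraction of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of relevant items that were extracted.
    recall = true_positives / len(relevant) if relevant else 0.0

    # F-measure: harmonic mean of precision and recall.
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative data only: three of four extracted fields are correct,
# and one relevant field ("seller") was missed.
extracted_items = {"title", "price", "rating", "advert"}
relevant_items = {"title", "price", "rating", "seller"}
p, r, f = precision_recall_f1(extracted_items, relevant_items)
print(p, r, f)
```

Comparing two extraction methods then amounts to computing these three values for each method against the same ground truth.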