org.apache.nutch.indexer
public class DeleteDuplicates
java.lang.Object
   org.apache.hadoop.conf.Configured
      org.apache.nutch.indexer.DeleteDuplicates

All Implemented Interfaces:
    org.apache.hadoop.mapred.Mapper, org.apache.hadoop.util.Tool, org.apache.hadoop.mapred.OutputFormat, org.apache.hadoop.mapred.Reducer

Deletes duplicate documents in a set of Lucene indexes. Two documents are duplicates if they have either the same content (detected via MD5 hash) or the same URL.
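The two duplicate criteria can be illustrated with a small, self-contained sketch. It is not the tool's MapReduce implementation: `DuplicateSketch`, `Doc`, `md5Hex`, and `findDuplicates` are hypothetical names, and keeping the first occurrence of each URL or hash is an assumption made only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateSketch {

    /** Hypothetical stand-in for an indexed document; not Nutch's IndexDoc. */
    public static final class Doc {
        final String url;
        final String content;

        public Doc(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    /** Hex-encoded MD5 digest of the content, used as the content signature. */
    public static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns the positions of documents whose URL or content hash was
     * already seen earlier in the list. Keeping the first occurrence is an
     * assumption of this sketch.
     */
    public static List<Integer> findDuplicates(List<Doc> docs) {
        Set<String> seenUrls = new HashSet<>();
        Set<String> seenHashes = new HashSet<>();
        List<Integer> duplicates = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Doc d = docs.get(i);
            boolean newUrl = seenUrls.add(d.url);
            boolean newHash = seenHashes.add(md5Hex(d.content));
            if (!newUrl || !newHash) {
                duplicates.add(i);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("http://a.example/", "hello"));
        docs.add(new Doc("http://b.example/", "hello"));   // same content as doc 0
        docs.add(new Doc("http://a.example/", "changed")); // same URL as doc 0
        docs.add(new Doc("http://c.example/", "unique"));
        System.out.println(findDuplicates(docs)); // prints [1, 2]
    }
}
```

A single pass with two hash sets is enough for one in-memory list. Across many index parts the same grouping would have to be done per key (URL or hash), which is where a partitioned MapReduce job fits.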
Nested Class Summary:
public static class  DeleteDuplicates.IndexDoc   
public static class  DeleteDuplicates.InputFormat   
public static class  DeleteDuplicates.HashPartitioner   
public static class  DeleteDuplicates.UrlsReducer   
public static class  DeleteDuplicates.HashReducer   
Constructor:
 public DeleteDuplicates() 
 public DeleteDuplicates(Configuration conf) 
Method from org.apache.nutch.indexer.DeleteDuplicates Summary:
checkOutputSpecs,   close,   configure,   dedup,   getRecordWriter,   main,   map,   reduce,   run,   setConf
Methods from java.lang.Object:
equals,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from org.apache.nutch.indexer.DeleteDuplicates Detail:
 public void checkOutputSpecs(FileSystem fs, JobConf job)
 public void close()
 public void configure(JobConf job)
 public void dedup(Path[] indexDirs) throws IOException
 public RecordWriter getRecordWriter(FileSystem fs, JobConf job, String name, Progressable progress) throws IOException
    Write nothing.
 public static void main(String[] args) throws Exception
 public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException
    Map [*,IndexDoc] pairs to [index,doc] pairs.
 public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) throws IOException
    Delete docs named in values from index named in key.
 public int run(String[] args) throws Exception
 public void setConf(Configuration conf)
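
The data flow described for `map` and `reduce` above can be sketched in plain Java without Hadoop or Lucene. This is only an illustration: `DeletePhaseSketch`, `DocRef`, `mapAndGroup`, and the `keep` flag are hypothetical, and an index is modelled as a set of live document ids rather than a real Lucene index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class DeletePhaseSketch {

    /** Hypothetical record: the index a document lives in and its id there. */
    public static final class DocRef {
        final String index;
        final int doc;
        final boolean keep; // assumption: false means "scheduled for deletion"

        public DocRef(String index, int doc, boolean keep) {
            this.index = index;
            this.doc = doc;
            this.keep = keep;
        }
    }

    /**
     * "Map" step: the incoming key is ignored; each document to delete is
     * emitted as an [index, doc] pair. Grouping by index stands in for the
     * shuffle that a MapReduce framework performs between map and reduce.
     */
    public static Map<String, List<Integer>> mapAndGroup(List<DocRef> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (DocRef r : records) {
            if (!r.keep) {
                grouped.computeIfAbsent(r.index, k -> new ArrayList<>()).add(r.doc);
            }
        }
        return grouped;
    }

    /** "Reduce" step: delete the docs named in values from the index named in key. */
    public static void reduce(String index, List<Integer> docs,
                              Map<String, Set<Integer>> indexes) {
        Set<Integer> live = indexes.get(index);
        if (live != null) {
            live.removeAll(docs);
        }
    }

    public static void main(String[] args) {
        // Two toy "indexes", each holding the ids of its live documents.
        Map<String, Set<Integer>> indexes = new TreeMap<>();
        indexes.put("part-0", new TreeSet<>(Arrays.asList(0, 1, 2)));
        indexes.put("part-1", new TreeSet<>(Arrays.asList(0, 1)));

        List<DocRef> records = Arrays.asList(
            new DocRef("part-0", 1, false),
            new DocRef("part-0", 2, true),
            new DocRef("part-1", 0, false));

        for (Map.Entry<String, List<Integer>> e : mapAndGroup(records).entrySet()) {
            reduce(e.getKey(), e.getValue(), indexes);
        }
        System.out.println(indexes); // prints {part-0=[0, 2], part-1=[1]}
    }
}
```

Keying the intermediate pairs by index name means all deletions for one index arrive at the same reduce call, so each index only needs to be opened once in a real job.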