Module dataframe


Pandas-like dataframe & series.

§Series

§1. Declare Series

  • To declare a Series, prepare a Vec<T> where T is one of the following types.
    Primitive type   DType
    usize            USIZE
    u8               U8
    u16              U16
    u32              U32
    u64              U64
    isize            ISIZE
    i8               I8
    i16              I16
    i32              I32
    i64              I64
    f32              F32
    f64              F64
    bool             Bool
    char             Char
    String           Str
  • Once you have a Vec<T>, create the Series with Series::new(Vec<T>).

§2. Methods for Series

  • TypedVector<T> trait for Series

    pub trait TypedVector<T> {
        fn new(v: Vec<T>) -> Self;
        fn to_vec(&self) -> Vec<T>;
        fn as_slice(&self) -> &[T];
        fn as_slice_mut(&mut self) -> &mut [T];
        fn at_raw(&self, i: usize) -> T;
        fn push(&mut self, elem: T);
    }
  • Series methods

    impl Series {
        // Core
        pub fn at(&self, i: usize) -> Scalar;
        pub fn len(&self) -> usize;
        pub fn to_type(&self, dtype: DType) -> Series;
        pub fn as_type(&mut self, dtype: DType);
        pub fn select_indices(&self, indices: &[usize]) -> Series;
        pub fn to_f64_vec(&self) -> anyhow::Result<Vec<f64>>;
    
        // Statistics (numeric types only, except min/max)
        pub fn sum(&self) -> anyhow::Result<f64>;
        pub fn mean(&self) -> anyhow::Result<f64>;
        pub fn var(&self) -> anyhow::Result<f64>;
        pub fn sd(&self) -> anyhow::Result<f64>;
        pub fn min(&self) -> anyhow::Result<Scalar>;
        pub fn max(&self) -> anyhow::Result<Scalar>;
    }
    • at is a simple getter for Series. It returns a Scalar.
    • as_type is a method for mutable type casting.
      • All types can be changed to Str.
      • All integer & float types can be exchanged.
      • Bool and Char can be changed to Str or U8 only.
      • U8 can be changed to all types.
    • select_indices selects elements by indices, returning a new Series.
    • to_f64_vec converts a numeric Series to Vec<f64> (a bridge for statistics).
    • sum, mean, var, sd convert to f64 internally via to_f64_vec.
    • min and max preserve the original type and return a Scalar. They work on all ordered types, including Char and String.
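
The casting rules listed for as_type can be pictured as a small lookup. The block below is a minimal std-only sketch, not peroxide's implementation: the Kind enum and the can_cast function are hypothetical names that mirror the DType variants, and two points are assumptions because the list above does not state them: casting a type to itself is allowed, and casting from Str to anything else is treated as unsupported.

```rust
// Hypothetical stand-in for DType; only models the rules listed above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    USIZE, U8, U16, U32, U64,
    ISIZE, I8, I16, I32, I64,
    F32, F64,
    Bool, Char, Str,
}

// Integer and float kinds.
fn is_numeric(k: Kind) -> bool {
    !matches!(k, Kind::Bool | Kind::Char | Kind::Str)
}

// Direct (single-step) cast check following the listed rules.
fn can_cast(from: Kind, to: Kind) -> bool {
    // Assumption: identity casts are fine. Rule: every type can become Str.
    if from == to || to == Kind::Str {
        return true;
    }
    match from {
        // Rule: U8 can be changed to all types.
        Kind::U8 => true,
        // Rule: Bool and Char can be changed to Str or U8 only.
        Kind::Bool | Kind::Char => to == Kind::U8,
        // Assumption: Str to other types is not listed, so it is rejected here.
        Kind::Str => false,
        // Rule: integer and float types can be exchanged.
        _ => is_numeric(to),
    }
}

fn main() {
    for (from, to) in [
        (Kind::Bool, Kind::U8),
        (Kind::Bool, Kind::I32),
        (Kind::I64, Kind::F32),
        (Kind::U8, Kind::Char),
    ] {
        println!("{:?} -> {:?}: {}", from, to, can_cast(from, to));
    }
}
```

Under these rules a Bool column reaches a wider integer type in two steps, first to U8 and then to the target type.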

§3. Example

extern crate peroxide;
use peroxide::fuga::*;

fn main() {
    let a = Series::new(vec![1, 2, 3, 4]);
    let b = Series::new(vec!['a', 'b', 'c', 'd']);
    let mut c = Series::new(vec![true, false, false, true]);

    a.print();       // print for Series
    b.dtype.print(); // print for dtype of Series (=Char)
    c.as_type(U8);   // Bool => U8

    assert_eq!(c.dtype, U8);

    // Select by indices
    let d = a.select_indices(&[0, 2]);
    assert_eq!(d, Series::new(vec![1, 3]));

    // Statistics
    let e = Series::new(vec![1.0, 2.0, 3.0, 4.0, 5.0]);
    assert_eq!(e.sum().unwrap(), 15.0);
    assert_eq!(e.mean().unwrap(), 3.0);
    assert_eq!(e.min().unwrap(), Scalar::new(1.0f64));
    assert_eq!(e.max().unwrap(), Scalar::new(5.0f64));
}
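
The statistics calls in the example above are described as converting to f64 first and then reducing. The following is a rough std-only sketch of that idea, independent of peroxide. The free functions are hypothetical helpers, and the use of the sample variance (division by n - 1) is an assumption, since the documentation above does not say which convention var uses.

```rust
// Convert an integer column to f64 before any statistics (the "bridge" step).
fn to_f64_vec(v: &[i32]) -> Vec<f64> {
    v.iter().map(|&x| x as f64).collect()
}

fn sum(v: &[f64]) -> f64 {
    v.iter().sum()
}

fn mean(v: &[f64]) -> f64 {
    sum(v) / v.len() as f64
}

// Assumption: sample variance (n - 1 in the denominator).
fn var(v: &[f64]) -> f64 {
    let m = mean(v);
    v.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (v.len() as f64 - 1.0)
}

fn sd(v: &[f64]) -> f64 {
    var(v).sqrt()
}

fn main() {
    let ints = vec![1, 2, 3, 4, 5];
    let xs = to_f64_vec(&ints);
    println!("sum = {}, mean = {}, var = {}, sd = {}", sum(&xs), mean(&xs), var(&xs), sd(&xs));

    // min / max need only an ordering, so they can stay in the original type.
    let chars = ['d', 'a', 'c'];
    println!("min = {:?}, max = {:?}", chars.iter().min(), chars.iter().max());
}
```

Keeping min and max in the original type is what lets them apply to Char and String data, where a conversion to f64 would not make sense.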

§DataFrame

§1. Declare DataFrame

  • To declare a DataFrame, use the constructor.
    • DataFrame::new(Vec<Series>)
extern crate peroxide;
use peroxide::fuga::*;

fn main() {
    // 1-1. Empty DataFrame
    let mut df = DataFrame::new(vec![]);

    // 1-2. Push Series
    df.push("a", Series::new(vec![1, 2, 3, 4]));
    df.push("b", Series::new(vec![0.1, 0.2, 0.3, 0.4]));
    df.push("c", Series::new(vec!['a', 'b', 'c', 'd']));

    // 1-3. Print
    df.print();

    // 2-1. Construct Series first
    let a = Series::new(vec![1, 2, 3, 4]);
    let b = Series::new(vec![0.1, 0.2, 0.3, 0.4]);
    let c = Series::new(vec!['a', 'b', 'c', 'd']);

    // 2-2. Declare DataFrame with existing Series
    let mut dg = DataFrame::new(vec![a, b, c]);

    // 2-3. Print or Set header
    dg.print();                         // Default header: 0 1 2
    dg.set_header(vec!["a", "b", "c"]); // Change header
}

§2. Methods for DataFrame

  • DataFrame methods

    impl DataFrame {
        // Constructor & Basic
        pub fn new(v: Vec<Series>) -> Self;
        pub fn header(&self) -> &Vec<String>;
        pub fn header_mut(&mut self) -> &mut Vec<String>;
        pub fn set_header(&mut self, new_header: Vec<&str>);
        pub fn push(&mut self, name: &str, series: Series);
        pub fn drop(&mut self, col_header: &str);
        pub fn row(&self, i: usize) -> DataFrame;
        pub fn spread(&self) -> String;
        pub fn as_types(&mut self, dtypes: Vec<DType>);
        pub fn filter_by<F>(&self, column: &str, f: F) -> anyhow::Result<DataFrame>;
        pub fn mask(&self, mask: &Series) -> anyhow::Result<DataFrame>;
        pub fn select_rows(&self, indices: &[usize]) -> DataFrame;
    
        // Shape & Info
        pub fn nrow(&self) -> usize;
        pub fn ncol(&self) -> usize;
        pub fn shape(&self) -> (usize, usize);
        pub fn dtypes(&self) -> Vec<DType>;
        pub fn is_empty(&self) -> bool;
        pub fn contains(&self, col_header: &str) -> bool;
    
        // Row Operations
        pub fn head(&self, n: usize) -> DataFrame;
        pub fn tail(&self, n: usize) -> DataFrame;
        pub fn slice(&self, offset: usize, length: usize) -> DataFrame;
    
        // Column Operations
        pub fn select(&self, columns: &[&str]) -> DataFrame;
        pub fn rename(&mut self, old: &str, new: &str);
        pub fn column_names(&self) -> Vec<&str>;
        pub fn select_dtypes(&self, dtypes: &[DType]) -> DataFrame;
    
        // Statistics (numeric columns only)
        pub fn describe(&self) -> DataFrame;
        pub fn sum(&self) -> DataFrame;
        pub fn mean(&self) -> DataFrame;
    }
    • push(&mut self, name: &str, series: Series): push a header & Series pair
    • drop(&mut self, col_header: &str): drop a specific column by header
    • row(&self, i: usize) -> DataFrame : Extract the i-th row as a new DataFrame
    • filter_by(&self, column, f) : Filter DataFrame by specific column
    • mask(&self, mask: &Series) : Mask DataFrame by boolean Series
    • select_rows(&self, indices) : Select rows by indices
    • nrow, ncol, shape : Row count (max column length), column count, (nrow, ncol) tuple
    • dtypes : Vec<DType> of each column’s type
    • is_empty : true if no columns or no rows
    • contains(col_header) : true if the column exists
    • head(n), tail(n) : First / last n rows
    • slice(offset, length) : Row slice starting at offset
    • select(columns) : Select columns by name (panics on missing)
    • rename(old, new) : Rename a column in-place
    • column_names : Vec<&str> of all headers
    • select_dtypes(dtypes) : Select columns matching given DTypes
    • describe : Computes count / mean / sd / min / max for each numeric column
    • sum, mean : Single-row DataFrame with column-wise sum / mean
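
    The row-oriented methods above share one idea: compute a list of row indices, then pick those rows from every column. The block below is a minimal std-only sketch of that idea and not peroxide's code. Frame, sample and the free functions are hypothetical names that echo the method list, columns are restricted to f64 for brevity, and out-of-range indices are silently skipped here, which is an assumption about edge-case behavior.

```rust
// A toy frame of named f64 columns. Hypothetical; not peroxide's DataFrame.
type Frame = Vec<(String, Vec<f64>)>;

fn sample() -> Frame {
    vec![
        ("a".to_string(), vec![1.0, 2.0, 3.0, 4.0]),
        ("b".to_string(), vec![0.1, 0.2, 0.3, 0.4]),
    ]
}

// Row count is the longest column length.
fn nrow(df: &Frame) -> usize {
    df.iter().map(|(_, col)| col.len()).max().unwrap_or(0)
}

// Pick the given rows from every column (assumption: missing rows are skipped).
fn select_rows(df: &Frame, indices: &[usize]) -> Frame {
    df.iter()
        .map(|(name, col)| {
            let picked: Vec<f64> = indices.iter().filter_map(|&i| col.get(i).copied()).collect();
            (name.clone(), picked)
        })
        .collect()
}

// Keep the rows where the boolean mask is true.
fn mask(df: &Frame, keep: &[bool]) -> Frame {
    let idx: Vec<usize> = keep
        .iter()
        .enumerate()
        .filter_map(|(i, &k)| if k { Some(i) } else { None })
        .collect();
    select_rows(df, &idx)
}

// Keep the rows where predicate f holds on one column; None if the column is missing.
fn filter_by<F: Fn(f64) -> bool>(df: &Frame, column: &str, f: F) -> Option<Frame> {
    let (_, col) = df.iter().find(|(name, _)| name.as_str() == column)?;
    let idx: Vec<usize> = col
        .iter()
        .enumerate()
        .filter_map(|(i, &x)| if f(x) { Some(i) } else { None })
        .collect();
    Some(select_rows(df, &idx))
}

// Rows offset .. offset + length, clamped to the frame.
fn slice(df: &Frame, offset: usize, length: usize) -> Frame {
    let end = (offset + length).min(nrow(df));
    let idx: Vec<usize> = (offset.min(end)..end).collect();
    select_rows(df, &idx)
}

fn head(df: &Frame, n: usize) -> Frame {
    slice(df, 0, n)
}

fn tail(df: &Frame, n: usize) -> Frame {
    slice(df, nrow(df).saturating_sub(n), n)
}

fn main() {
    let df = sample();
    println!("{:?}", filter_by(&df, "a", |x| x > 2.0));
    println!("{:?}", mask(&df, &[true, false, false, true]));
    println!("{:?}", head(&df, 2));
    println!("{:?}", tail(&df, 1));
}
```

    Expressing head, tail, mask and filter_by through one select_rows helper keeps the per-column copying logic in a single place.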
  • WithCSV trait

    pub trait WithCSV: Sized {
        fn write_csv(&self, file_path: &str) -> Result<(), Box<dyn Error>>;
        fn read_csv(file_path: &str, delimiter: char) -> Result<Self, Box<dyn Error>>;
    }
    • Requires the csv feature.
    // Example for CSV
    #[macro_use]
    extern crate peroxide;
    use peroxide::fuga::*;
    
    fn main() -> Result<(), Box<dyn Error>> {
        // Write CSV
        let mut df = DataFrame::new(vec![]);
        df.push("a", Series::new(vec!['x', 'y', 'z']));
        df.push("b", Series::new(vec![0, 1, 2]));
        df.push("c", Series::new(c!(0.1, 0.2, 0.3)));
        df.write_csv("example_data/doc_csv.csv")?;
    
        // Read CSV
        let mut dg = DataFrame::read_csv("example_data/doc_csv.csv", ',')?;
        dg.as_types(vec![Char, I32, F64]);
    
        assert_eq!(df, dg);
    
        Ok(())
    }
  • WithNetCDF trait

    pub trait WithNetCDF: Sized {
        fn write_nc(&self, file_path: &str) -> Result<(), Box<dyn Error>>;
        fn read_nc(file_path: &str) -> Result<Self, Box<dyn Error>>;
        fn read_nc_by_header(file_path: &str, header: Vec<&str>) -> Result<Self, Box<dyn Error>>;
    }
    • Requires the nc feature.
    • Requires the libnetcdf dependency.
    • Char and Bool are saved as U8. Thus, reading a Char or Bool column from an nc file requires an explicit type cast.
    #[macro_use]
    extern crate peroxide;
    use peroxide::fuga::*;
    
    fn main() -> Result<(), Box<dyn Error>> {
        // Write netcdf
        let mut df = DataFrame::new(vec![]);
        df.push("a", Series::new(vec!['x', 'y', 'z']));
        df.push("b", Series::new(vec![0, 1, 2]));
        df.push("c", Series::new(c!(0.1, 0.2, 0.3)));
        df.write_nc("example_data/doc_nc.nc")?;
    
        // Read netcdf
        let mut dg = DataFrame::read_nc("example_data/doc_nc.nc")?;
        dg["a"].as_type(Char); // Char, Bool are only read/written as U8 type
    
        assert_eq!(df, dg);
    
        Ok(())
    }
  • WithParquet trait

    pub trait WithParquet: Sized {
        fn write_parquet(&self, file_path: &str, compression: Compression) -> Result<(), Box<dyn Error>>;
        fn read_parquet(file_path: &str) -> Result<Self, Box<dyn Error>>;
    }
    • Requires the parquet feature.
    • Char is saved as String. Thus, a Char column read back from a parquet file has String type.
    • Caution: for Bool columns of differing length, missing values are filled with false.
    #[macro_use]
    extern crate peroxide;
    use peroxide::fuga::*;
     
    fn main() -> Result<(), Box<dyn Error>> {
        // Write parquet
        let mut df = DataFrame::new(vec![]);
        df.push("a", Series::new(vec!['x', 'y', 'z']));
        df.push("b", Series::new(vec![0, 1, 2]));
        df.push("c", Series::new(c!(0.1, 0.2, 0.3)));
        df.write_parquet("example_data/doc_pq.parquet", SNAPPY)?;
    
        // Read parquet
        let mut dg = DataFrame::read_parquet("example_data/doc_pq.parquet")?;
        dg["a"].as_type(Char); // Char is only read/written as String type
    
        assert_eq!(df, dg);
    
        Ok(())
    }
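
The storage notes for NetCDF and Parquet explain why both examples cast column "a" back to Char after reading. The block below is a small std-only sketch of those conversions and is not peroxide's code. All function names are hypothetical, the char to u8 step assumes ASCII data, and pad_bools only illustrates the "missing values are filled with false" caution.

```rust
// NetCDF note: Char is stored as U8 (assumes ASCII; wider chars would be truncated).
fn chars_to_u8(v: &[char]) -> Vec<u8> {
    v.iter().map(|&c| c as u8).collect()
}

// Explicit cast back after reading, like dg["a"].as_type(Char) in the example.
fn u8_to_chars(v: &[u8]) -> Vec<char> {
    v.iter().map(|&b| b as char).collect()
}

// Parquet note: Char is stored as String.
fn chars_to_strings(v: &[char]) -> Vec<String> {
    v.iter().map(|c| c.to_string()).collect()
}

// Cast back by taking the first character (empty strings are dropped in this sketch).
fn strings_to_chars(v: &[String]) -> Vec<char> {
    v.iter().filter_map(|s| s.chars().next()).collect()
}

// Parquet caution: a short Bool column is padded with false up to the target length.
fn pad_bools(v: &[bool], len: usize) -> Vec<bool> {
    let mut out = v.to_vec();
    out.resize(len.max(v.len()), false);
    out
}

fn main() {
    let a = ['x', 'y', 'z'];
    println!("{:?}", chars_to_u8(&a));
    println!("{:?}", u8_to_chars(&chars_to_u8(&a)));
    println!("{:?}", strings_to_chars(&chars_to_strings(&a)));
    println!("{:?}", pad_bools(&[true, true], 4));
}
```

Because padding uses false, a padded Bool column cannot be told apart from one that really contained false in those rows, so equal-length columns are the safer choice before writing.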

Structs§

DataFrame
Generic DataFrame structure
Scalar
Generic Scalar
Series
Generic Series

Enums§

DType
Data Type enum
DTypeArray
Vector with DType
DTypeValue
Scalar with DType

Traits§

TypedScalar
TypedVector
WithCSV
To handle CSV file format
WithNetCDF
To handle NetCDF file format
WithParquet
To handle parquet format