Answer

  1. What happens when we execute dbt build --vars '{'is_test_run':'true'}' You'll need to have completed the "Build the first dbt models" video.

    Because it is set that if test it should only apply in 100 records, if not test like in production then apply to all records. The default value is set to true to only use 100 records.

  2. What is the code that our CI job will run? Where is this code coming from?

    When our improvement from branch want to main/master it would invoke Pull Request. Then because DBT already connected to github then dbt automate testing the code to perform check

    PREPARATION

    to import data in Big Query we can use external table from GCS files but to make sure the data types like we wanted is difficult. So i’m use another option by creating BigQuery table directly from csv files using python script. In this case the csv are have been downloaded in my local directory.

    from google.cloud import bigquery
    
    # Replace with your project ID, dataset ID, table name, and folder path
    
    init_url = "<https://github.com/DataTalksClub/nyc-tlc-data/releases/download/>"
    # Replace with your project ID, dataset ID, table name, and folder path
    project_id = "de-tf-411908"
    dataset_id = "trips_data_all"
    table_name = "fhv_tripsdata"
    folder_path = "../"
    
    # Create a BigQuery client
    client = bigquery.Client(project=project_id)
    
    schema = [
        bigquery.SchemaField("dispatching_base_num", "STRING"),
        bigquery.SchemaField("pickup_datetime", "DATETIME"),  # Changed to DATETIME
        bigquery.SchemaField("dropOff_datetime", "DATETIME"),  # Changed to DATETIME
        bigquery.SchemaField("PUlocationID", "INT64"),  # Changed to INT64
        bigquery.SchemaField("DOlocationID", "INT64"),  # Changed to INT64
        bigquery.SchemaField(
            "SR_Flag", "INT64"
        ),  # Changed to INT64 (assuming integer values)
        bigquery.SchemaField("Affiliated_base_number", "STRING"),
    ]
    
    def web_to_bq(year, service):
        for i in range(12):
    
            # sets the month part of the file_name string
            month = "0" + str(i + 1)
            month = month[-2:]
    
            # csv file_name
            file_name = f"{service}_tripdata_{year}-{month}.csv.gz"
    
            job_config = bigquery.LoadJobConfig(
                schema=schema,
                skip_leading_rows=1,
                source_format=bigquery.SourceFormat.CSV,
                field_delimiter=",",
            )
            with open(file_name, "rb") as f:
                load_job = client.load_table_from_file(
                    f,
                    destination=f"{dataset_id}.{table_name}",
                    job_config=job_config,
                )
            # Wait for the job to finish
            load_job.result()  # Raises an exception if the job fails
            print(f"Data from {file_name} loaded successfully")
    
    if __name__ == "__main__":
        web_to_bq(2019, "fhv")
    
    

Next we create staging for fhv_tripsdata table in DBT

  1. What is the count of records in the model fact_fhv_trips after running all dependencies with the test run variable disabled (:false)?

    Create a staging model for the fhv data, similar to the ones made for yellow and green data. Add an additional filter for keeping only records with pickup time in year 2019. Do not add a deduplication step. Run this models without limits (is_test_run: false).

    Create a core model similar to fact trips, but selecting from stg_fhv_tripdata and joining with dim_zones. Similar to what we've done in fact_trips, keep only records with known pickup and dropoff locations entries for pickup and dropoff locations. Run the dbt model without limits (is_test_run: false).

Untitled

Untitled

  1. What is the service that had the most rides during the month of July 2019 month with the biggest amount of rides after building a tile for the fact_fhv_trips table?

    Create a dashboard with some tiles that you find interesting to explore the data. One tile should show the amount of trips per month, as done in the videos for fact_trips, including the fact_fhv_trips data.

    after build fact_trips and fhv_trips wihout using limit os use var is test run equal false then we make dasboard or report to see rides by service type in july 2019. Personally i am using Power BI because it is the most easy tool i have in my computer.

    so after connect to Google Bigquery and make few transformation here the result:

    Untitled